Thread: Minimal logical decoding on standbys

Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

Craig previously worked on $subject, see thread [1].  A bunch of the
prerequisite features from that and other related threads have been
integrated into PG.  What's missing is actually allowing logical
decoding on a standby.  The latest patch from that thread does that [2],
but unfortunately hasn't been updated after slipping v10.

The biggest remaining issue to allow it is that the catalog xmin on the
primary has to be above the catalog xmin horizon of all slots on the
standby. The patch in [2] does so by periodically logging a new record
that announces the current catalog xmin horizon.   Additionally it
checks that hot_standby_feedback is enabled when doing logical decoding
from a standby.

I don't like the approach of managing the catalog horizon via those
periodically logged catalog xmin announcements.  I think we instead
should build on top of the records we already have and use to compute
snapshot conflicts.  As of HEAD we don't know whether such tables are
catalog tables, but that's just a bool that we need to include in the
records, a basically immeasurable overhead given the size of those
records.
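
To illustrate what I mean, here's a rough sketch (not the attached
patch) of how e.g. the btree delete record could carry that
information; the onCatalogTable field name matches the code quoted
further down, the rest of the struct is abbreviated:

    typedef struct xl_btree_delete
    {
        RelFileNode hnode;          /* heap relation the index points at */
        int         nitems;         /* number of index tuples deleted */
        bool        onCatalogTable; /* new: is the underlying heap relation a
                                     * (user) catalog table, i.e. relevant
                                     * for logical decoding? */

        /* TARGET OFFSET NUMBERS FOLLOW AT THE END */
    } xl_btree_delete;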

I also don't think we should actually enforce hot_standby_feedback being
enabled - there are use-cases where that's not convenient, and it's not
bulletproof anyway (it can be enabled/disabled without using logical
decoding in between).  I think when there's a conflict we should have the
HINT mention that hs_feedback can be used to prevent such conflicts,
that ought to be enough.
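
To make that concrete, something along these lines is what I have in
mind for the conflict error on the standby (just a sketch, message
wording and errcode are made up here):

    ereport(ERROR,
            (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
             errmsg("canceling logical decoding session due to conflict with recovery"),
             errdetail("Catalog rows needed by the replication slot have been removed."),
             errhint("Enable hot_standby_feedback on the standby, ideally together with a physical replication slot on the primary, to prevent such conflicts.")));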

Attached is a rough draft patch. If we were to go for this approach,
we'd obviously need to improve the actual conflict handling against
slots - right now it just logs a WARNING and retries shortly after.

I think there's currently one hole in this approach. Nbtree (and other
index types, which are pretty unlikely to matter here) has this logic
to handle snapshot conflicts for single-page deletions:


    /*
     * If we have any conflict processing to do, it must happen before we
     * update the page.
     *
     * Btree delete records can conflict with standby queries.  You might
     * think that vacuum records would conflict as well, but we've handled
     * that already.  XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
     * cleaned by the vacuum of the heap and so we can resolve any conflicts
     * just once when that arrives.  After that we know that no conflicts
     * exist from individual btree vacuum records on that index.
     */
    if (InHotStandby)
    {
        TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);
        RelFileNode rnode;

        XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);

        ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
                                            xlrec->onCatalogTable, rnode);
    }

I.e. we get the latest removed xid from the heap, which has the
following logic:
    /*
     * If there's nothing running on the standby we don't need to derive a
     * full latestRemovedXid value, so use a fast path out of here.  This
     * returns InvalidTransactionId, and so will conflict with all HS
     * transactions; but since we just worked out that that's zero people,
     * it's OK.
     *
     * XXX There is a race condition here, which is that a new backend might
     * start just after we look.  If so, it cannot need to conflict, but this
     * coding will result in throwing a conflict anyway.
     */
    if (CountDBBackends(InvalidOid) == 0)
        return latestRemovedXid;

    /*
     * In what follows, we have to examine the previous state of the index
     * page, as well as the heap page(s) it points to.  This is only valid if
     * WAL replay has reached a consistent database state; which means that
     * the preceding check is not just an optimization, but is *necessary*. We
     * won't have let in any user sessions before we reach consistency.
     */
    if (!reachedConsistency)
        elog(PANIC, "btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent data");

so we wouldn't get a correct xid if nobody is connected to a database
(and, by implication, when we're not yet consistent).


I'm wondering if it's time to move the latestRemovedXid computation for
this type of record to the primary - it's likely to be cheaper there and
avoids this kind of complication. Secondarily, it'd have the advantage
of making pluggable storage integration easier - there we have the
problem that we don't know which type of relation we're dealing with
during recovery, so such lookups make pluggability harder (zheap just
adds extra flags to signal that, but that's not extensible).
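
A sketch of what the redo side could then look like - the
latestRemovedXid field in the record is my assumption for illustration
here, the attached prototype may differ in detail:

    if (InHotStandby)
    {
        xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
        RelFileNode rnode;

        XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);

        /*
         * latestRemovedXid was computed on the primary and logged with the
         * record, so redo doesn't need to look at heap pages at all.
         */
        ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
                                            xlrec->onCatalogTable, rnode);
    }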

Another alternative would be to just prevent such index deletions for
catalog tables when wal_level = logical.
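
That could be as simple as something like the following in the code
doing the opportunistic deletion (sketch only; heapRel is assumed to be
available at that point):

    /* don't remove catalog rows needed by logical decoding on standbys */
    if (RelationIsAccessibleInLogicalDecoding(heapRel))
        return;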


If we were to go with this approach, there'd be at least the following
tasks:
- adapt tests from [2]
- enforce hot-standby to be enabled on the standby when logical slots
  are created, and at startup if a logical slot exists
- fix issue around btree_xlog_delete_get_latestRemovedXid etc mentioned
  above.
- Have a nicer conflict handling than what I implemented here.  Craig's
  approach deleted the slots, but I'm not sure I like that.  Blocking
  seems more appropriate here; after all, it's likely that the
  replication topology would be broken afterwards.
- get_rel_logical_catalog() shouldn't be in lsyscache.[ch], and can be
  optimized (e.g. check wal_level before opening rel etc; see the rough
  sketch below).
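
For the last point, a very rough sketch of the kind of fast path I
mean (the function body is illustrative only, not the code in the
attached patch):

    bool
    get_rel_logical_catalog(Oid relid)
    {
        Relation    rel;
        bool        res;

        /* fast path: without wal_level = logical nothing is affected */
        if (wal_level < WAL_LEVEL_LOGICAL)
            return false;

        rel = RelationIdGetRelation(relid);
        res = RelationIsAccessibleInLogicalDecoding(rel);
        RelationClose(rel);

        return res;
    }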


Once we have this logic, it can be used to implement something like
failover slots on top, by having a mechanism that occasionally
forwards slots on standbys using pg_replication_slot_advance().

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/CAMsr+YEVmBJ=dyLw=+kTihmUnGy5_EW4Mig5T0maieg_Zu=XCg@mail.gmail.com
[2]
https://archives.postgresql.org/message-id/CAMsr%2BYEbS8ZZ%2Bw18j7OPM2MZEeDtGN9wDVF68%3DMzpeW%3DKRZZ9Q%40mail.gmail.com

Attachment

Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Wed, Dec 12, 2018 at 3:41 PM Andres Freund <andres@anarazel.de> wrote:
> I don't like the approach of managing the catalog horizon via those
> periodically logged catalog xmin announcements.  I think we instead
> should build ontop of the records we already have and use to compute
> snapshot conflicts.  As of HEAD we don't know whether such tables are
> catalog tables, but that's just a bool that we need to include in the
> records, a basically immeasurable overhead given the size of those
> records.

To me, this paragraph appears to say that you don't like Craig's
approach without quite explaining why you don't like it.  Could you be
a bit more explicit about that?

> I also don't think we should actually enforce hot_standby_feedback being
> enabled - there's use-cases where that's not convenient, and it's not
> bullet proof anyway (can be enabled/disabled without using logical
> decoding inbetween).  I think when there's a conflict we should have the
> HINT mention that hs_feedback can be used to prevent such conflicts,
> that ought to be enough.

If we can make that work, +1 from me.

> I'm wondering if it's time to move the latestRemovedXid computation for
> this type of record to the primary - it's likely to be cheaper there and
> avoids this kind of complication. Secondarily, it'd have the advantage
> of making pluggable storage integration easier - there we have the
> problem that we don't know which type of relation we're dealing with
> during recovery, so such lookups make pluggability harder (zheap just
> adds extra flags to signal that, but that's not extensible).

That doesn't look trivial.  It seems like _bt_delitems_delete() would
need to get an array of XIDs, but that gets called from
_bt_vacuum_one_page(), which doesn't have that information available.
It doesn't look like there is a particularly cheap way of getting it,
either.  What do you have in mind?

> Another alternative would be to just prevent such index deletions for
> catalog tables when wal_level = logical.

That doesn't sound like a very nice idea.

> If we were to go with this approach, there'd be at least the following
> tasks:
> - adapt tests from [2]

OK.

> - enforce hot-standby to be enabled on the standby when logical slots
>   are created, and at startup if a logical slot exists

Why do we need this?

> - fix issue around btree_xlog_delete_get_latestRemovedXid etc mentioned
>   above.

OK.

> - Have a nicer conflict handling than what I implemented here.  Craig's
>   approach deleted the slots, but I'm not sure I like that.  Blocking
>   seems more appropriately here, after all it's likely that the
>   replication topology would be broken afterwards.

I guess the viable options are approximately -- (1) drop the slot, (2)
advance the slot, (3) mark the slot as "failed" but leave it in
existence as a tombstone, (4) wait until something changes.  I like
(3) better than (1).  (4) seems pretty unfortunate unless there's some
other system for having the slot advance automatically.  Seems like a
way for replication to hang indefinitely without anybody understanding
why it's happened (or, maybe, noticing).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2018-12-13 19:32:19 -0500, Robert Haas wrote:
> On Wed, Dec 12, 2018 at 3:41 PM Andres Freund <andres@anarazel.de> wrote:
> > I don't like the approach of managing the catalog horizon via those
> > periodically logged catalog xmin announcements.  I think we instead
> > should build ontop of the records we already have and use to compute
> > snapshot conflicts.  As of HEAD we don't know whether such tables are
> > catalog tables, but that's just a bool that we need to include in the
> > records, a basically immeasurable overhead given the size of those
> > records.
> 
> To me, this paragraph appears to say that you don't like Craig's
> approach without quite explaining why you don't like it.  Could you be
> a bit more explicit about that?

I think the conflict system introduced in Craig's patch is quite
complicated: it relies on logging new WAL records on a regular basis and
needs to be more conservative about the xmin horizon, which is obviously
not great for performance.

If you look at Craig's patch, it currently relies on blocking out
concurrent checkpoints:
               /*
                * We must prevent a concurrent checkpoint, otherwise the catalog xmin
                * advance xlog record with the new value might be written before the
                * checkpoint but the checkpoint may still see the old
                * oldestCatalogXmin value.
                */
               if (!LWLockConditionalAcquire(CheckpointLock, LW_SHARED))
                       /* Couldn't get checkpointer lock; will retry later */
                       return;
which on its own seems unacceptable, given that CheckpointLock can be
held by checkpointer for a very long time. While that's ongoing the
catalog xmin horizon doesn't advance.

Looking at the code it seems hard, to me, to make that approach work
nicely. But I might just be tired.


> > I'm wondering if it's time to move the latestRemovedXid computation for
> > this type of record to the primary - it's likely to be cheaper there and
> > avoids this kind of complication. Secondarily, it'd have the advantage
> > of making pluggable storage integration easier - there we have the
> > problem that we don't know which type of relation we're dealing with
> > during recovery, so such lookups make pluggability harder (zheap just
> > adds extra flags to signal that, but that's not extensible).
> 
> That doesn't look trivial.  It seems like _bt_delitems_delete() would
> need to get an array of XIDs, but that gets called from
> _bt_vacuum_one_page(), which doesn't have that information available.
> It doesn't look like there is a particularly cheap way of getting it,
> either.  What do you have in mind?

I've a prototype attached, but let's discuss the details in a separate
thread. This also needs to be changed for pluggable storage, as we don't
know about table access methods in the startup process, so we can't
determine which AM the heap is from during
btree_xlog_delete_get_latestRemovedXid() (and sibling routines).

Writing that message right now.


> > - enforce hot-standby to be enabled on the standby when logical slots
> >   are created, and at startup if a logical slot exists
> 
> Why do we need this?

Currently the conflict routines are only called when hot standby is
on. There's also no way to use logical decoding (including just advancing the
slot), without hot-standby being enabled, so I think that'd be a pretty
harmless restriction.
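
I.e. roughly something like the following at logical slot creation
time (sketch only, placement and wording are mine; the startup-time
check for pre-existing slots would be analogous):

    if (RecoveryInProgress() && !EnableHotStandby)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("logical decoding on a standby requires hot_standby to be enabled")));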


> > - Have a nicer conflict handling than what I implemented here.  Craig's
> >   approach deleted the slots, but I'm not sure I like that.  Blocking
> >   seems more appropriately here, after all it's likely that the
> >   replication topology would be broken afterwards.
> 
> I guess the viable options are approximately --

> (1) drop the slot

Doable.


> (2)  advance the slot

That's not realistically possible, I think. We'd need to be able to use
most of the logical decoding infrastructure in that context, and we
don't have that available.  It's also possible to deadlock, where
advancing the slot's xmin horizon would need further WAL, but WAL replay
is blocked on advancing the slot.


> (3) mark the slot as "failed" but leave it in existence as a tombstone

We currently don't have that, but it'd be doable, I think.

> (4) wait until something changes.
> (4) seems pretty unfortunate unless there's some other system for
> having the slot advance automatically.  Seems like a way for
> replication to hang indefinitely without anybody understanding why
> it's happened (or, maybe, noticing).

On the other hand, it would often allow whatever is using the slot to
continue using it until the conflict is "resolved". To me it seems about
as easy to debug physical replication being blocked as the slot somehow
being magically deleted or marked as invalid.


Thanks for looking,

Andres Freund

Attachment

Re: Minimal logical decoding on standbys

From
Petr Jelinek
Date:
Hi,

On 12/12/2018 21:41, Andres Freund wrote:
> 
> I don't like the approach of managing the catalog horizon via those
> periodically logged catalog xmin announcements.  I think we instead
> should build ontop of the records we already have and use to compute
> snapshot conflicts.  As of HEAD we don't know whether such tables are
> catalog tables, but that's just a bool that we need to include in the
> records, a basically immeasurable overhead given the size of those
> records.

IIRC I was originally advocating adding that xmin announcement to the
standby snapshot message, but this seems better.

> 
> If we were to go with this approach, there'd be at least the following
> tasks:
> - adapt tests from [2]
> - enforce hot-standby to be enabled on the standby when logical slots
>   are created, and at startup if a logical slot exists
> - fix issue around btree_xlog_delete_get_latestRemovedXid etc mentioned
>   above.
> - Have a nicer conflict handling than what I implemented here.  Craig's
>   approach deleted the slots, but I'm not sure I like that.  Blocking
>   seems more appropriately here, after all it's likely that the
>   replication topology would be broken afterwards.
> - get_rel_logical_catalog() shouldn't be in lsyscache.[ch], and can be
>   optimized (e.g. check wal_level before opening rel etc).
> 
> 
> Once we have this logic, it can be used to implement something like
> failover slots on-top, by having having a mechanism that occasionally
> forwards slots on standbys using pg_replication_slot_advance().
> 

Looking at this from the failover slots perspective: wouldn't blocking
on conflict mean that we stop physical replication on catalog xmin
advance when there is lagging logical replication on the primary? It
might not be too big a deal, as in that use-case it should only happen
if hs_feedback was off at some point, but I just wanted to point out
this potential problem.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


Re: Minimal logical decoding on standbys

From
tushar
Date:
Hi,

While testing this feature, I found that if lots of inserts happen on
the master cluster, then pg_recvlogical does not show the DATA
information for the logical replication slot created on the SLAVE.

Please refer this scenario -

1)
Create a Master cluster with wal_level=logical and create a logical
replication slot -
  SELECT * FROM pg_create_logical_replication_slot('master_slot', 
'test_decoding');

2)
Create a Standby  cluster using pg_basebackup ( ./pg_basebackup -D 
slave/ -v -R)  and create logical replication slot -
SELECT * FROM pg_create_logical_replication_slot('standby_slot', 
'test_decoding');

3)
X terminal - start  pg_recvlogical  , provide port=5555 ( slave 
cluster)  and specify slot=standby_slot
./pg_recvlogical -d postgres  -p 5555 -s 1 -F 1  -v --slot=standby_slot  
--start -f -

Y terminal - start  pg_recvlogical  , provide port=5432 ( master 
cluster)  and specify slot=master_slot
./pg_recvlogical -d postgres  -p 5432 -s 1 -F 1  -v --slot=master_slot  
--start -f -

Z terminal - run pgbench against the Master cluster ( ./pgbench -i -s 10
postgres)

Able to see DATA information on Y terminal  but not on X.

but the same data can be seen by firing the below query on the SLAVE cluster -

SELECT * FROM pg_logical_slot_get_changes('standby_slot', NULL, NULL);

Is this expected?

regards,
tushar

On 12/17/2018 10:46 PM, Petr Jelinek wrote:
> Hi,
>
> On 12/12/2018 21:41, Andres Freund wrote:
>> I don't like the approach of managing the catalog horizon via those
>> periodically logged catalog xmin announcements.  I think we instead
>> should build ontop of the records we already have and use to compute
>> snapshot conflicts.  As of HEAD we don't know whether such tables are
>> catalog tables, but that's just a bool that we need to include in the
>> records, a basically immeasurable overhead given the size of those
>> records.
> IIRC I was originally advocating adding that xmin announcement to the
> standby snapshot message, but this seems better.
>
>> If we were to go with this approach, there'd be at least the following
>> tasks:
>> - adapt tests from [2]
>> - enforce hot-standby to be enabled on the standby when logical slots
>>    are created, and at startup if a logical slot exists
>> - fix issue around btree_xlog_delete_get_latestRemovedXid etc mentioned
>>    above.
>> - Have a nicer conflict handling than what I implemented here.  Craig's
>>    approach deleted the slots, but I'm not sure I like that.  Blocking
>>    seems more appropriately here, after all it's likely that the
>>    replication topology would be broken afterwards.
>> - get_rel_logical_catalog() shouldn't be in lsyscache.[ch], and can be
>>    optimized (e.g. check wal_level before opening rel etc).
>>
>>
>> Once we have this logic, it can be used to implement something like
>> failover slots on-top, by having having a mechanism that occasionally
>> forwards slots on standbys using pg_replication_slot_advance().
>>
> Looking at this from the failover slots perspective. Wouldn't blocking
> on conflict mean that we stop physical replication on catalog xmin
> advance when there is lagging logical replication on primary? It might
> not be too big deal as in that use-case it should only happen if
> hs_feedback was off at some point, but just wanted to point out this
> potential problem.
>

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2019-03-01 13:33:23 +0530, tushar wrote:
> While testing  this feature  found that - if lots of insert happened on the
> master cluster then pg_recvlogical is not showing the DATA information  on
> logical replication slot which created on SLAVE.
> 
> Please refer this scenario -
> 
> 1)
> Create a Master cluster with wal_level=logcal and create logical replication
> slot -
>  SELECT * FROM pg_create_logical_replication_slot('master_slot',
> 'test_decoding');
> 
> 2)
> Create a Standby  cluster using pg_basebackup ( ./pg_basebackup -D slave/ -v
> -R)  and create logical replication slot -
> SELECT * FROM pg_create_logical_replication_slot('standby_slot',
> 'test_decoding');

So, if I understand correctly you do *not* have a physical replication
slot for this standby? For the feature to work reliably that needs to
exist, and you need to have hot_standby_feedback enabled. Does having
that fix the issue?

Thanks,

Andres


Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Fri, 14 Dec 2018 at 06:25, Andres Freund <andres@anarazel.de> wrote:
> I've a prototype attached, but let's discuss the details in a separate
> thread. This also needs to be changed for pluggable storage, as we don't
> know about table access methods in the startup process, so we can't call
> can't determine which AM the heap is from during
> btree_xlog_delete_get_latestRemovedXid() (and sibling routines).

Attached is a WIP test patch
0003-WIP-TAP-test-for-logical-decoding-on-standby.patch that has a
modified version of Craig Ringer's test cases
(012_logical_decoding_on_replica.pl) that he had attached in [1].
Here, I have also attached his original file
(Craigs_012_logical_decoding_on_replica.pl).

Also attached are rebased versions of a couple of Andres's implementation patches.

I have added a new test scenario :
DROP TABLE from master *before* the logical records of the table
insertions are retrieved from standby. The logical records should be
successfully retrieved.


Regarding the test result failures, I could see that when we drop a
logical replication slot at the standby server, the catalog_xmin of the
physical replication slot becomes NULL, whereas the test expects it to
be equal to xmin; that's the reason a couple of test scenarios are
failing :

ok 33 - slot on standby dropped manually
Waiting for replication conn replica's replay_lsn to pass '0/31273E0' on master
done
not ok 34 - physical catalog_xmin still non-null
not ok 35 - xmin and catalog_xmin equal after slot drop
#   Failed test 'xmin and catalog_xmin equal after slot drop'
#   at t/016_logical_decoding_on_replica.pl line 272.
#          got:
#     expected: 2584



Other than the above, there is this test scenario which I had to remove :

#########################################################
# Conflict with recovery: xmin cancels decoding session
#########################################################
#
# Start a transaction on the replica then perform work that should cause a
# recovery conflict with it. We'll check to make sure the client gets
# terminated with recovery conflict.
#
# Temporarily disable hs feedback so we can test recovery conflicts.
# It's fine to continue using a physical slot, the xmin should be
# cleared. We only check hot_standby_feedback when establishing
# a new decoding session so this approach circumvents the safeguards
# in place and forces a conflict.

This test starts pg_recvlogical, and expects it to be terminated due
to a recovery conflict because hs feedback is disabled.
But that does not happen; instead, pg_recvlogical does not return.

I am not sure why it does not terminate with Andres's patch; it
was expected to terminate with Craig Ringer's patch.

Further, there are subsequent test scenarios that test pg_recvlogical
with hs_feedback disabled, which I have removed because pg_recvlogical
does not return. I have yet to clearly understand why that happens. I
suspect it is only because hs_feedback is disabled.

Also, the testcases verify pg_controldata's oldestCatalogXmin values,
which are now not present with Andres's patch; so I removed tracking
of oldestCatalogXmin.

[1] https://www.postgresql.org/message-id/CAMsr+YEVmBJ=dyLw=+kTihmUnGy5_EW4Mig5T0maieg_Zu=XCg@mail.gmail.com

Thanks
-Amit Khandekar

Attachment

Re: Minimal logical decoding on standbys

From
tushar
Date:
On 03/01/2019 11:16 PM, Andres Freund wrote:
> So, if I understand correctly you do *not* have a physical replication
> slot for this standby? For the feature to work reliably that needs to
> exist, and you need to have hot_standby_feedback enabled. Does having
> that fix the issue?

Ok, this time around I performed it like this -

.)Master cluster (set wal_level=logical and  hot_standby_feedback=on in postgresql.conf) , start the server  and create a physical replication slot

postgres=# SELECT * FROM pg_create_physical_replication_slot('decoding_standby');
    slot_name     | lsn
------------------+-----
 decoding_standby |
(1 row)

.)Perform pg_basebackup using --slot=decoding_standby  with option -R . modify port=5555 , start the server

.)Connect to slave and create a logical replication slot

postgres=# create table t(n int);
ERROR:  cannot execute CREATE TABLE in a read-only transaction
postgres=#

postgres=# SELECT * FROM pg_create_logical_replication_slot('standby_slot', 'test_decoding');
  slot_name   |    lsn   
--------------+-----------
 standby_slot | 0/2000060
(1 row)

run pgbench (./pgbench -i -s 10 postgres) against the master and simultaneously start pg_recvlogical, providing port=5555 (slave cluster) and specifying slot=standby_slot:
./pg_recvlogical -d postgres  -p 5555 -s 1 -F 1  -v --slot=standby_slot  --start -f -


[centos@centos-cpula bin]$ ./pg_recvlogical -d postgres  -p 5555 -s 1 -F 1  -v --slot=standby_slot  --start -f -
pg_recvlogical: starting log streaming at 0/0 (slot standby_slot)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot standby_slot)
pg_recvlogical: confirming write up to 0/30194E8, flush to 0/30194E8 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 (slot standby_slot)
pg_recvlogical: confirming write up to 0/301D558, flush to 0/301D558 (slot standby_slot)
BEGIN 476
COMMIT 476
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 (slot standby_slot)
BEGIN 477
COMMIT 477

If we do the same for the logical replication slot created on the Master cluster:
[centos@centos-cpula bin]$ ./pg_recvlogical -d postgres  -s 1 -F 1  -v --slot=master_slot  --start -f -
pg_recvlogical: starting log streaming at 0/0 (slot master_slot)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot master_slot)
table public.pgbench_accounts: INSERT: aid[integer]:65057 bid[integer]:1 abalance[integer]:0 filler[character]:'                                                                                    '
table public.pgbench_accounts: INSERT: aid[integer]:65058 bid[integer]:1 abalance[integer]:0 filler[character]:'                                                                                    '
table public.pgbench_accounts: INSERT: aid[integer]:65059 bid[integer]:1 abalance[integer]:0 filler[character]:'                                                                                    '
table public.pgbench_accounts: INSERT: aid[integer]:65060 bid[integer]:1 abalance[integer]:0 filler[character]:'                                                                                    '
table public.pgbench_accounts: INSERT: aid[integer]:65061 bid[integer]:1 abalance[integer]:0 filler[character]:'                                                                                    '
table public.pgbench_accounts: INSERT: aid[integer]:65062 bid[integer]:1 abalance[integer]:0 filler[character]:'                                                                                    '
table public.pgbench_accounts: INSERT: aid[integer]:65063 bid[integer]:1 abalance[integer]:0 filler[character]:'                                                                                    '
table public.pgbench_accounts: INSERT: aid[integer]:65064 bid[integer]:1 abalance[integer]:0 filler[character]:'                                                                                    '

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company

Re: Minimal logical decoding on standbys

From
tushar
Date:
On 03/04/2019 04:54 PM, tushar wrote:
> .)Perform pg_basebackup using --slot=decoding_standby  with option -R 
> . modify port=5555 , start the server 

set primary_slot_name = 'decoding_standby'  in the postgresql.conf file 
of slave.

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2019-03-04 16:54:32 +0530, tushar wrote:
> On 03/01/2019 11:16 PM, Andres Freund wrote:
> > So, if I understand correctly you do*not*  have a phyiscal replication
> > slot for this standby? For the feature to work reliably that needs to
> > exist, and you need to have hot_standby_feedback enabled. Does having
> > that fix the issue?
> 
> Ok, This time around  - I performed like this -
> 
> .)Master cluster (set wal_level=logical and hot_standby_feedback=on in
> postgresql.conf) , start the server and create a physical replication slot

Note that hot_standby_feedback=on needs to be set on a standby, not on
the primary (although it doesn't do any harm there).

Thanks,

Andres


Re: Minimal logical decoding on standbys

From
tushar
Date:
On 03/04/2019 10:57 PM, Andres Freund wrote:
> Note that hot_standby_feedback=on needs to be set on a standby, not on
> the primary (although it doesn't do any harm there).

Right, this parameter was enabled on both the Master and the slave.

Is anyone able to reproduce this issue?

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



Re: Minimal logical decoding on standbys

From
tushar
Date:
There is another issue, where I am getting an error while executing
"pg_logical_slot_get_changes" on the SLAVE.

Master (running on port=5432) - run "make installcheck" after setting
PATH=<installation>/bin:$PATH and export PGDATABASE=postgres from the
regress/ folder
Slave (running on port=5555) - connect to the regression database and
select pg_logical_slot_get_changes

[centos@mail-arts bin]$ ./psql postgres -p 5555 -f t.sql
You are now connected to database "regression" as user "centos".
  slot_name |    lsn
-----------+-----------
  m61       | 1/D437AD8
(1 row)

psql:t.sql:3: ERROR:  could not resolve cmin/cmax of catalog tuple

[centos@mail-arts bin]$ cat t.sql
\c regression
SELECT * from   pg_create_logical_replication_slot('m61', 'test_decoding');
select * from pg_logical_slot_get_changes('m61',null,null);

regards,

On 03/04/2019 10:57 PM, Andres Freund wrote:
> Hi,
>
> On 2019-03-04 16:54:32 +0530, tushar wrote:
>> On 03/01/2019 11:16 PM, Andres Freund wrote:
>>> So, if I understand correctly you do*not*  have a phyiscal replication
>>> slot for this standby? For the feature to work reliably that needs to
>>> exist, and you need to have hot_standby_feedback enabled. Does having
>>> that fix the issue?
>> Ok, This time around  - I performed like this -
>>
>> .)Master cluster (set wal_level=logical and hot_standby_feedback=on in
>> postgresql.conf) , start the server and create a physical replication slot
> Note that hot_standby_feedback=on needs to be set on a standby, not on
> the primary (although it doesn't do any harm there).
>
> Thanks,
>
> Andres
>

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Mon, 4 Mar 2019 at 14:09, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Fri, 14 Dec 2018 at 06:25, Andres Freund <andres@anarazel.de> wrote:
> > I've a prototype attached, but let's discuss the details in a separate
> > thread. This also needs to be changed for pluggable storage, as we don't
> > know about table access methods in the startup process, so we can't call
> > can't determine which AM the heap is from during
> > btree_xlog_delete_get_latestRemovedXid() (and sibling routines).
>
> Attached is a WIP test patch
> 0003-WIP-TAP-test-for-logical-decoding-on-standby.patch that has a
> modified version of Craig Ringer's test cases

Hi Andres,

I am trying to come up with new testcases to test the recovery
conflict handling. Before that I have some queries :

With Craig Ringer's approach, the way to reproduce the recovery
conflict was, I believe, easy : Do a checkpoint, which will log the
global-catalog-xmin-advance WAL record, due to which the standby -
while replaying the message - may find out that it's a recovery
conflict. But with your approach, the latestRemovedXid is passed only
during specific vacuum-related WAL records, so to reproduce the
recovery conflict error, we need to make sure some specific WAL
records are logged, such as XLOG_BTREE_DELETE. So we need to create a
testcase such that while creating an index tuple, it erases dead
tuples from a page, so that it eventually calls
_bt_vacuum_one_page()=>_bt_delitems_delete(), thus logging a
XLOG_BTREE_DELETE record.

I tried to come up with this reproducible testcase, without success.
This seems difficult. Do you have an easier option? Maybe we can use
some other WAL records that would give an easier, more reliable test
case for producing a recovery conflict?

Further, with your patch, in ResolveRecoveryConflictWithSlots(), it
just emits a WARNING; so the wal receiver would not make
the backends throw an error; hence the test case won't catch the
error. Is that right?


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Minimal logical decoding on standbys

From
tushar
Date:
Hi ,

I am getting a server crash on the standby while executing the
pg_logical_slot_get_changes function; please refer to this scenario -

Master cluster( ./initdb -D master)
set wal_level='hot_standby' in master/postgresql.conf file
start the server , connect to  psql terminal and create a physical 
replication slot ( SELECT * from 
pg_create_physical_replication_slot('p1');)

perform pg_basebackup using --slot 'p1'  (./pg_basebackup -D slave/ -R
--slot p1 -v)
set wal_level='logical' , hot_standby_feedback=on, 
primary_slot_name='p1' in slave/postgresql.conf file
start the server , connect to psql terminal and create a logical 
replication slot (  SELECT * from 
pg_create_logical_replication_slot('t','test_decoding');)

run pgbench ( ./pgbench -i -s 10 postgres) on master and select 
pg_logical_slot_get_changes on Slave database

postgres=# select * from pg_logical_slot_get_changes('t',null,null);
2019-03-13 20:34:50.274 IST [26817] LOG:  starting logical decoding for 
slot "t"
2019-03-13 20:34:50.274 IST [26817] DETAIL:  Streaming transactions 
committing after 0/6C000060, reading WAL from 0/6C000028.
2019-03-13 20:34:50.274 IST [26817] STATEMENT:  select * from 
pg_logical_slot_get_changes('t',null,null);
2019-03-13 20:34:50.275 IST [26817] LOG:  logical decoding found 
consistent point at 0/6C000028
2019-03-13 20:34:50.275 IST [26817] DETAIL:  There are no running 
transactions.
2019-03-13 20:34:50.275 IST [26817] STATEMENT:  select * from 
pg_logical_slot_get_changes('t',null,null);
TRAP: FailedAssertion("!(data == tupledata + tuplelen)", File: 
"decode.c", Line: 977)
server closed the connection unexpectedly
     This probably means the server terminated abnormally
     before or while processing the request.
The connection to the server was lost. Attempting reset: 2019-03-13 
20:34:50.276 IST [26809] LOG:  server process (PID 26817) was terminated 
by signal 6: Aborted

Stack trace -

(gdb) bt
#0  0x00007f370e673277 in raise () from /lib64/libc.so.6
#1  0x00007f370e674968 in abort () from /lib64/libc.so.6
#2  0x0000000000a30edf in ExceptionalCondition (conditionName=0xc36090 
"!(data == tupledata + tuplelen)", errorType=0xc35f5c "FailedAssertion", 
fileName=0xc35d70 "decode.c",
     lineNumber=977) at assert.c:54
#3  0x0000000000843c6f in DecodeMultiInsert (ctx=0x2ba1ac8, 
buf=0x7ffd7a5136d0) at decode.c:977
#4  0x0000000000842b32 in DecodeHeap2Op (ctx=0x2ba1ac8, 
buf=0x7ffd7a5136d0) at decode.c:375
#5  0x00000000008424dd in LogicalDecodingProcessRecord (ctx=0x2ba1ac8, 
record=0x2ba1d88) at decode.c:125
#6  0x000000000084830d in pg_logical_slot_get_changes_guts 
(fcinfo=0x2b95838, confirm=true, binary=false) at logicalfuncs.c:307
#7  0x000000000084846a in pg_logical_slot_get_changes (fcinfo=0x2b95838) 
at logicalfuncs.c:376
#8  0x00000000006e5b9f in ExecMakeTableFunctionResult 
(setexpr=0x2b93ee8, econtext=0x2b93d98, argContext=0x2b99940, 
expectedDesc=0x2b97970, randomAccess=false) at execSRF.c:233
#9  0x00000000006fb738 in FunctionNext (node=0x2b93c80) at 
nodeFunctionscan.c:94
#10 0x00000000006e52b1 in ExecScanFetch (node=0x2b93c80, 
accessMtd=0x6fb67b <FunctionNext>, recheckMtd=0x6fba77 
<FunctionRecheck>) at execScan.c:93
#11 0x00000000006e5326 in ExecScan (node=0x2b93c80, accessMtd=0x6fb67b 
<FunctionNext>, recheckMtd=0x6fba77 <FunctionRecheck>) at execScan.c:143
#12 0x00000000006fbac1 in ExecFunctionScan (pstate=0x2b93c80) at 
nodeFunctionscan.c:270
#13 0x00000000006e3293 in ExecProcNodeFirst (node=0x2b93c80) at 
execProcnode.c:445
#14 0x00000000006d8253 in ExecProcNode (node=0x2b93c80) at 
../../../src/include/executor/executor.h:241
#15 0x00000000006daa4e in ExecutePlan (estate=0x2b93a28, 
planstate=0x2b93c80, use_parallel_mode=false, operation=CMD_SELECT, 
sendTuples=true, numberTuples=0,
     direction=ForwardScanDirection, dest=0x2b907e0, execute_once=true) 
at execMain.c:1643
#16 0x00000000006d8865 in standard_ExecutorRun (queryDesc=0x2afff28, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
execMain.c:362
#17 0x00000000006d869b in ExecutorRun (queryDesc=0x2afff28, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
execMain.c:306
#18 0x00000000008ccef1 in PortalRunSelect (portal=0x2b36168, 
forward=true, count=0, dest=0x2b907e0) at pquery.c:929
#19 0x00000000008ccb90 in PortalRun (portal=0x2b36168, 
count=9223372036854775807, isTopLevel=true, run_once=true, 
dest=0x2b907e0, altdest=0x2b907e0, completionTag=0x7ffd7a513e90 "")
     at pquery.c:770
#20 0x00000000008c6b58 in exec_simple_query (query_string=0x2adc1e8 
"select * from pg_logical_slot_get_changes('t',null,null);") at 
postgres.c:1215
#21 0x00000000008cae88 in PostgresMain (argc=1, argv=0x2b06590, 
dbname=0x2b063d0 "postgres", username=0x2ad8da8 "centos") at postgres.c:4256
#22 0x0000000000828464 in BackendRun (port=0x2afe3b0) at postmaster.c:4399
#23 0x0000000000827c42 in BackendStartup (port=0x2afe3b0) at 
postmaster.c:4090
#24 0x0000000000824036 in ServerLoop () at postmaster.c:1703
#25 0x00000000008238ec in PostmasterMain (argc=3, argv=0x2ad6d00) at 
postmaster.c:1376
#26 0x0000000000748aab in main (argc=3, argv=0x2ad6d00) at main.c:228
(gdb)

regards,


On 03/07/2019 09:03 PM, tushar wrote:
> There is an another issue , where i am getting error while executing 
> "pg_logical_slot_get_changes" on SLAVE
>
> Master (running on port=5432) -  run "make installcheck"  after 
> setting  PATH=<installation/bin:$PATH )  and export 
> PGDATABASE=postgres from regress/ folder
> Slave (running on port=5555)  -  Connect to regression database and 
> select pg_logical_slot_get_changes
>
> [centos@mail-arts bin]$ ./psql postgres -p 5555 -f t.sql
> You are now connected to database "regression" as user "centos".
>  slot_name |    lsn
> -----------+-----------
>  m61       | 1/D437AD8
> (1 row)
>
> psql:t.sql:3: ERROR:  could not resolve cmin/cmax of catalog tuple
>
> [centos@mail-arts bin]$ cat t.sql
> \c regression
> SELECT * from   pg_create_logical_replication_slot('m61', 
> 'test_decoding');
> select * from pg_logical_slot_get_changes('m61',null,null);
>
> regards,
>
> On 03/04/2019 10:57 PM, Andres Freund wrote:
>> Hi,
>>
>> On 2019-03-04 16:54:32 +0530, tushar wrote:
>>> On 03/01/2019 11:16 PM, Andres Freund wrote:
>>>> So, if I understand correctly you do*not*  have a phyiscal replication
>>>> slot for this standby? For the feature to work reliably that needs to
>>>> exist, and you need to have hot_standby_feedback enabled. Does having
>>>> that fix the issue?
>>> Ok, This time around  - I performed like this -
>>>
>>> .)Master cluster (set wal_level=logical and hot_standby_feedback=on in
>>> postgresql.conf) , start the server and create a physical 
>>> replication slot
>> Note that hot_standby_feedback=on needs to be set on a standby, not on
>> the primary (although it doesn't do any harm there).
>>
>> Thanks,
>>
>> Andres
>>
>

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Fri, 8 Mar 2019 at 20:59, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Mon, 4 Mar 2019 at 14:09, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > On Fri, 14 Dec 2018 at 06:25, Andres Freund <andres@anarazel.de> wrote:
> > > I've a prototype attached, but let's discuss the details in a separate
> > > thread. This also needs to be changed for pluggable storage, as we don't
> > > know about table access methods in the startup process, so we can't call
> > > can't determine which AM the heap is from during
> > > btree_xlog_delete_get_latestRemovedXid() (and sibling routines).
> >
> > Attached is a WIP test patch
> > 0003-WIP-TAP-test-for-logical-decoding-on-standby.patch that has a
> > modified version of Craig Ringer's test cases
>
> Hi Andres,
>
> I am trying to come up with new testcases to test the recovery
> conflict handling. Before that I have some queries :
>
> With Craig Ringer's approach, the way to reproduce the recovery
> conflict was, I believe, easy : Do a checkpoint, which will log the
> global-catalog-xmin-advance WAL record, due to which the standby -
> while replaying the message - may find out that it's a recovery
> conflict. But with your approach, the latestRemovedXid is passed only
> during specific vacuum-related WAL records, so to reproduce the
> recovery conflict error, we need to make sure some specific WAL
> records are logged, such as XLOG_BTREE_DELETE. So we need to create a
> testcase such that while creating an index tuple, it erases dead
> tuples from a page, so that it eventually calls
> _bt_vacuum_one_page()=>_bt_delitems_delete(), thus logging a
> XLOG_BTREE_DELETE record.
>
> I tried to come up with this reproducible testcase without success.
> This seems difficult. Do you have an easier option ? May be we can use
> some other WAL records that may have easier more reliable test case
> for showing up recovery conflict ?
>

I managed to get a recovery conflict by :
1. Setting hot_standby_feedback to off
2. Creating a logical replication slot on standby
3. Creating a table on master, and inserting some data.
4. Running : VACUUM FULL;

This gives WARNING messages in the standby log file.
2019-03-14 14:57:56.833 IST [40076] WARNING:  slot decoding_standby w/
catalog xmin 474 conflicts with removed xid 477
2019-03-14 14:57:56.833 IST [40076] CONTEXT:  WAL redo at 0/3069E98
for Heap2/CLEAN: remxid 477

But I did not add such a testcase into the test file, because with the
current patch, it does not do anything with the slot; it just keeps on
emitting WARNING in the log file; so we can't test this scenario as of
now using the tap test.


> Further, with your patch, in ResolveRecoveryConflictWithSlots(), it
> just throws a WARNING error level; so the wal receiver would not make
> the backends throw an error; hence the test case won't catch the
> error. Is that right ?

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Thu, 14 Mar 2019 at 15:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I managed to get a recovery conflict by :
> 1. Setting hot_standby_feedback to off
> 2. Creating a logical replication slot on standby
> 3. Creating a table on master, and insert some data.
> 2. Running : VACUUM FULL;
>
> This gives WARNING messages in the standby log file.
> 2019-03-14 14:57:56.833 IST [40076] WARNING:  slot decoding_standby w/
> catalog xmin 474 conflicts with removed xid 477
> 2019-03-14 14:57:56.833 IST [40076] CONTEXT:  WAL redo at 0/3069E98
> for Heap2/CLEAN: remxid 477
>
> But I did not add such a testcase into the test file, because with the
> current patch, it does not do anything with the slot; it just keeps on
> emitting WARNING in the log file; so we can't test this scenario as of
> now using the tap test.

I am going ahead with the drop-the-slot way of handling the recovery
conflict. I am trying out ReplicationSlotDropPtr() to drop the
slot. It seems the required locks are already in place inside the for
loop of ResolveRecoveryConflictWithSlots(), so we can directly call
ReplicationSlotDropPtr() when the slot xmin conflict is found.

As explained above, the only way I could reproduce the conflict is by
turning hot_standby_feedback off on the slave, creating and inserting into
a table on the master and then running VACUUM FULL. But after doing this,
I am not able to verify whether the slot is dropped, because on the slave,
any simple psql command thereon waits on a lock acquired on a system
catalog, e.g. pg_authid. Working on it.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2019-04-02 15:26:52 +0530, Amit Khandekar wrote:
> On Thu, 14 Mar 2019 at 15:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > I managed to get a recovery conflict by :
> > 1. Setting hot_standby_feedback to off
> > 2. Creating a logical replication slot on standby
> > 3. Creating a table on master, and insert some data.
> > 2. Running : VACUUM FULL;
> >
> > This gives WARNING messages in the standby log file.
> > 2019-03-14 14:57:56.833 IST [40076] WARNING:  slot decoding_standby w/
> > catalog xmin 474 conflicts with removed xid 477
> > 2019-03-14 14:57:56.833 IST [40076] CONTEXT:  WAL redo at 0/3069E98
> > for Heap2/CLEAN: remxid 477
> >
> > But I did not add such a testcase into the test file, because with the
> > current patch, it does not do anything with the slot; it just keeps on
> > emitting WARNING in the log file; so we can't test this scenario as of
> > now using the tap test.
> 
> I am going ahead with drop-the-slot way of handling the recovery
> conflict. I am trying out using ReplicationSlotDropPtr() to drop the
> slot. It seems the required locks are already in place inside the for
> loop of ResolveRecoveryConflictWithSlots(), so we can directly call
> ReplicationSlotDropPtr() when the slot xmin conflict is found.

Cool.


> As explained above, the only way I could reproduce the conflict is by
> turning hot_standby_feedback off on slave, creating and inserting into
> a table on master and then running VACUUM FULL. But after doing this,
> I am not able to verify whether the slot is dropped, because on slave,
> any simple psql command thereon, waits on a lock acquired on sys
> catache, e.g. pg_authid. Working on it.

I think that indicates a bug somewhere. If replay progressed, it should
have killed the slot, and continued replaying past the VACUUM
FULL. Those symptoms suggest replay is stuck somewhere. I suggest a)
compiling with WAL_DEBUG enabled, and turning on wal_debug=1, b) looking
at a backtrace of the startup process.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Tue, 2 Apr 2019 at 21:34, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-04-02 15:26:52 +0530, Amit Khandekar wrote:
> > On Thu, 14 Mar 2019 at 15:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > > I managed to get a recovery conflict by :
> > > 1. Setting hot_standby_feedback to off
> > > 2. Creating a logical replication slot on standby
> > > 3. Creating a table on master, and insert some data.
> > > 2. Running : VACUUM FULL;
> > >
> > > This gives WARNING messages in the standby log file.
> > > 2019-03-14 14:57:56.833 IST [40076] WARNING:  slot decoding_standby w/
> > > catalog xmin 474 conflicts with removed xid 477
> > > 2019-03-14 14:57:56.833 IST [40076] CONTEXT:  WAL redo at 0/3069E98
> > > for Heap2/CLEAN: remxid 477
> > >
> > > But I did not add such a testcase into the test file, because with the
> > > current patch, it does not do anything with the slot; it just keeps on
> > > emitting WARNING in the log file; so we can't test this scenario as of
> > > now using the tap test.
> >
> > I am going ahead with drop-the-slot way of handling the recovery
> > conflict. I am trying out using ReplicationSlotDropPtr() to drop the
> > slot. It seems the required locks are already in place inside the for
> > loop of ResolveRecoveryConflictWithSlots(), so we can directly call
> > ReplicationSlotDropPtr() when the slot xmin conflict is found.
>
> Cool.
>
>
> > As explained above, the only way I could reproduce the conflict is by
> > turning hot_standby_feedback off on slave, creating and inserting into
> > a table on master and then running VACUUM FULL. But after doing this,
> > I am not able to verify whether the slot is dropped, because on slave,
> > any simple psql command thereon, waits on a lock acquired on sys
> > catache, e.g. pg_authid. Working on it.
>
> I think that indicates a bug somewhere. If replay progressed, it should
> have killed the slot, and continued replaying past the VACUUM
> FULL. Those symptoms suggest replay is stuck somewhere. I suggest a)
> compiling with WAL_DEBUG enabled, and turning on wal_debug=1, b) looking
> at a backtrace of the startup process.

Oops, it was my own change that caused the hang. Sorry for the noise.
After using wal_debug, I found out that after replaying the LOCK records
for the catalog pg_authid, replay was not releasing them because it had
actually got stuck in ReplicationSlotDropPtr() itself. In
ResolveRecoveryConflictWithSlots(), the shared
ReplicationSlotControlLock was already held before iterating through
the slots, and ReplicationSlotDropPtr() then tries to take the
same lock in exclusive mode for setting slot->in_use, leading to a
deadlock. I fixed that by releasing the shared lock before calling
ReplicationSlotDropPtr(), and then restarting the scan over the slots
since we released the lock. We do a similar thing for
ReplicationSlotCleanup().

Attached is a rebased version of your patch
logical-decoding-on-standby.patch. This v2 version also has the above
changes. It also includes the tap test file which is still in WIP
state, mainly because I have yet to add the conflict recovery handling
scenarios.

I see that you have already committed the
move-latestRemovedXid-computation-for-nbtree-xlog related changes.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Wed, 3 Apr 2019 at 19:57, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Oops, it was my own change that caused the hang. Sorry for the noise.
> After using wal_debug, found out that after replaying the LOCK records
> for the catalog pg_auth, it was not releasing it because it had
> actually got stuck in ReplicationSlotDropPtr() itself. In
> ResolveRecoveryConflictWithSlots(), a shared
> ReplicationSlotControlLock was already held before iterating through
> the slots, and now ReplicationSlotDropPtr() again tries to take the
> same lock in exclusive mode for setting slot->in_use, leading to a
> deadlock. I fixed that by releasing the shared lock before calling
> ReplicationSlotDropPtr(), and then re-starting the slots' scan over
> again since we released it. We do similar thing for
> ReplicationSlotCleanup().
>
> Attached is a rebased version of your patch
> logical-decoding-on-standby.patch. This v2 version also has the above
> changes. It also includes the tap test file which is still in WIP
> state, mainly because I have yet to add the conflict recovery handling
> scenarios.

Attached v3 patch includes a new scenario to test conflict recovery
handling by verifying that the conflicting slot gets dropped.

With this, I am done with the test changes, except for the below
question that I had posted earlier, on which I would like inputs:

Regarding the test result failures, I could see that when we drop a
logical replication slot at the standby server, the catalog_xmin of the
physical replication slot becomes NULL, whereas the test expects it to
be equal to xmin; that's the reason a couple of test scenarios are
failing :

ok 33 - slot on standby dropped manually
Waiting for replication conn replica's replay_lsn to pass '0/31273E0' on master
done
not ok 34 - physical catalog_xmin still non-null
not ok 35 - xmin and catalog_xmin equal after slot drop
#   Failed test 'xmin and catalog_xmin equal after slot drop'
#   at t/016_logical_decoding_on_replica.pl line 272.
#          got:
#     expected: 2584

I am not sure what is expected. What actually happens is: the
physical slot catalog_xmin remains NULL initially, but becomes
non-NULL after the logical replication slot is created on the standby.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

Thanks for the new version of the patch.  Btw, could you add Craig as a
co-author in the commit message of the next version of the patch? Don't
want to forget him.

On 2019-04-05 17:08:39 +0530, Amit Khandekar wrote:
> Regarding the test result failures, I could see that when we drop a
> logical replication slot at standby server, then the catalog_xmin of
> physical replication slot becomes NULL, whereas the test expects it to
> be equal to xmin; and that's the reason a couple of test scenarios are
> failing :
> 
> ok 33 - slot on standby dropped manually
> Waiting for replication conn replica's replay_lsn to pass '0/31273E0' on master
> done
> not ok 34 - physical catalog_xmin still non-null
> not ok 35 - xmin and catalog_xmin equal after slot drop
> #   Failed test 'xmin and catalog_xmin equal after slot drop'
> #   at t/016_logical_decoding_on_replica.pl line 272.
> #          got:
> #     expected: 2584
> 
> I am not sure what is expected. What actually happens is : the
> physical xlot catalog_xmin remains NULL initially, but becomes
> non-NULL after the logical replication slot is created on standby.

That seems like the correct behaviour to me - why would we still have a
catalog xmin if there's no logical slot?


> diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> index 006446b..5785d2f 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -1064,6 +1064,85 @@ ReplicationSlotReserveWal(void)
>      }
>  }
>  
> +void
> +ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
> +{
> +    int            i;
> +    bool        found_conflict = false;
> +
> +    if (max_replication_slots <= 0)
> +        return;
> +
> +restart:
> +    if (found_conflict)
> +    {
> +        CHECK_FOR_INTERRUPTS();
> +        /*
> +         * Wait awhile for them to die so that we avoid flooding an
> +         * unresponsive backend when system is heavily loaded.
> +         */
> +        pg_usleep(100000);
> +        found_conflict = false;
> +    }
> +
> +    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> +    for (i = 0; i < max_replication_slots; i++)
> +    {
> +        ReplicationSlot *s;
> +        NameData    slotname;
> +        TransactionId slot_xmin;
> +        TransactionId slot_catalog_xmin;
> +
> +        s = &ReplicationSlotCtl->replication_slots[i];
> +
> +        /* cannot change while ReplicationSlotCtlLock is held */
> +        if (!s->in_use)
> +            continue;
> +
> +        /* not our database, skip */
> +        if (s->data.database != InvalidOid && s->data.database != dboid)
> +            continue;
> +
> +        SpinLockAcquire(&s->mutex);
> +        slotname = s->data.name;
> +        slot_xmin = s->data.xmin;
> +        slot_catalog_xmin = s->data.catalog_xmin;
> +        SpinLockRelease(&s->mutex);
> +
> +        if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
> +        {
> +            found_conflict = true;
> +
> +            ereport(WARNING,
> +                    (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
> +                            NameStr(slotname), slot_xmin, xid)));
> +        }
> +
> +        if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
> +        {
> +            found_conflict = true;
> +
> +            ereport(WARNING,
> +                    (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
> +                            NameStr(slotname), slot_catalog_xmin, xid)));
> +        }
> +
> +
> +        if (found_conflict)
> +        {
> +            elog(WARNING, "Dropping conflicting slot %s", s->data.name.data);
> +            LWLockRelease(ReplicationSlotControlLock);    /* avoid deadlock */
> +            ReplicationSlotDropPtr(s);
> +
> +            /* We released the lock above; so re-scan the slots. */
> +            goto restart;
> +        }
> +    }
>
I think this should be refactored so that the two found_conflict cases
set a 'reason' variable (perhaps an enum?) to the particular reason, and
then only one warning should be emitted.  I also think that LOG might be
more appropriate than WARNING - as confusing as that is, LOG is more
severe than WARNING (see docs about log_min_messages).
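
Roughly what I have in mind, as an untested sketch - the enum, its values
and the message wording are only placeholders, not a finished patch:

    typedef enum SlotConflictReason
    {
        SLOT_CONFLICT_NONE,
        SLOT_CONFLICT_XMIN,
        SLOT_CONFLICT_CATALOG_XMIN
    } SlotConflictReason;

    ...

    SlotConflictReason reason = SLOT_CONFLICT_NONE;

    if (TransactionIdIsValid(slot_xmin) &&
        TransactionIdPrecedesOrEquals(slot_xmin, xid))
        reason = SLOT_CONFLICT_XMIN;
    else if (TransactionIdIsValid(slot_catalog_xmin) &&
             TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
        reason = SLOT_CONFLICT_CATALOG_XMIN;

    if (reason != SLOT_CONFLICT_NONE)
    {
        found_conflict = true;

        /* single message, LOG rather than WARNING */
        ereport(LOG,
                (errmsg("dropping replication slot \"%s\" because it conflicts with recovery",
                        NameStr(slotname)),
                 errdetail("Slot %s %u precedes removed xid %u.",
                           reason == SLOT_CONFLICT_XMIN ? "xmin" : "catalog_xmin",
                           reason == SLOT_CONFLICT_XMIN ? slot_xmin : slot_catalog_xmin,
                           xid)));
    }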


> @@ -0,0 +1,386 @@
> +# Demonstrate that logical can follow timeline switches.
> +#
> +# Test logical decoding on a standby.
> +#
> +use strict;
> +use warnings;
> +use 5.8.0;
> +
> +use PostgresNode;
> +use TestLib;
> +use Test::More tests => 55;
> +use RecursiveCopy;
> +use File::Copy;
> +
> +my ($stdin, $stdout, $stderr, $ret, $handle, $return);
> +my $backup_name;
> +
> +# Initialize master node
> +my $node_master = get_new_node('master');
> +$node_master->init(allows_streaming => 1, has_archiving => 1);
> +$node_master->append_conf('postgresql.conf', q{
> +wal_level = 'logical'
> +max_replication_slots = 4
> +max_wal_senders = 4
> +log_min_messages = 'debug2'
> +log_error_verbosity = verbose
> +# send status rapidly so we promptly advance xmin on master
> +wal_receiver_status_interval = 1
> +# very promptly terminate conflicting backends
> +max_standby_streaming_delay = '2s'
> +});
> +$node_master->dump_info;
> +$node_master->start;
> +
> +$node_master->psql('postgres', q[CREATE DATABASE testdb]);
> +
> +$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
> +$backup_name = 'b1';
> +my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
> +TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'),
'--slot=decoding_standby');
> +
> +sub print_phys_xmin
> +{
> +    my $slot = $node_master->slot('decoding_standby');
> +    return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
> +}
> +
> +my ($xmin, $catalog_xmin) = print_phys_xmin();
> +# After slot creation, xmins must be null
> +is($xmin, '', "xmin null");
> +is($catalog_xmin, '', "catalog_xmin null");
> +
> +my $node_replica = get_new_node('replica');
> +$node_replica->init_from_backup(
> +    $node_master, $backup_name,
> +    has_streaming => 1,
> +    has_restoring => 1);
> +$node_replica->append_conf('postgresql.conf',
> +    q[primary_slot_name = 'decoding_standby']);
> +
> +$node_replica->start;
> +$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
> +
> +# with hot_standby_feedback off, xmin and catalog_xmin must still be null
> +($xmin, $catalog_xmin) = print_phys_xmin();
> +is($xmin, '', "xmin null after replica join");
> +is($catalog_xmin, '', "catalog_xmin null after replica join");
> +
> +$node_replica->append_conf('postgresql.conf',q[
> +hot_standby_feedback = on
> +]);
> +$node_replica->restart;
> +sleep(2); # ensure walreceiver feedback sent

Can we make this more robust? E.g. by waiting till pg_stat_replication
shows the change on the primary? Because I can guarantee that this'll
fail on slow buildfarm machines (say the valgrind animals).




> +$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
> +sleep(2); # ensure walreceiver feedback sent

Similar.


Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Sat, 6 Apr 2019 at 04:45, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> Thanks for the new version of the patch.  Btw, could you add Craig as a
> co-author in the commit message of the next version of the patch? Don't
> want to forget him.

I had put his name in the earlier patch. But now I have made it easier to spot.

>
> On 2019-04-05 17:08:39 +0530, Amit Khandekar wrote:
> > Regarding the test result failures, I could see that when we drop a
> > logical replication slot at standby server, then the catalog_xmin of
> > physical replication slot becomes NULL, whereas the test expects it to
> > be equal to xmin; and that's the reason a couple of test scenarios are
> > failing :
> >
> > ok 33 - slot on standby dropped manually
> > Waiting for replication conn replica's replay_lsn to pass '0/31273E0' on master
> > done
> > not ok 34 - physical catalog_xmin still non-null
> > not ok 35 - xmin and catalog_xmin equal after slot drop
> > #   Failed test 'xmin and catalog_xmin equal after slot drop'
> > #   at t/016_logical_decoding_on_replica.pl line 272.
> > #          got:
> > #     expected: 2584
> >
> > I am not sure what is expected. What actually happens is : the
> > physical xlot catalog_xmin remains NULL initially, but becomes
> > non-NULL after the logical replication slot is created on standby.
>
> That seems like the correct behaviour to me - why would we still have a
> catalog xmin if there's no slot logical slot?

Yeah ... In the earlier implementation, maybe it was different, that's
why the catalog_xmin didn't become NULL. Not sure. Anyways, I have
changed this check. Details in the following sections.

>
>
> > diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> > index 006446b..5785d2f 100644
> > --- a/src/backend/replication/slot.c
> > +++ b/src/backend/replication/slot.c
> > @@ -1064,6 +1064,85 @@ ReplicationSlotReserveWal(void)
> >       }
> >  }
> >
> > +void
> > +ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
> > +{
> > +     int                     i;
> > +     bool            found_conflict = false;
> > +
> > +     if (max_replication_slots <= 0)
> > +             return;
> > +
> > +restart:
> > +     if (found_conflict)
> > +     {
> > +             CHECK_FOR_INTERRUPTS();
> > +             /*
> > +              * Wait awhile for them to die so that we avoid flooding an
> > +              * unresponsive backend when system is heavily loaded.
> > +              */
> > +             pg_usleep(100000);
> > +             found_conflict = false;
> > +     }
> > +
> > +     LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> > +     for (i = 0; i < max_replication_slots; i++)
> > +     {
> > +             ReplicationSlot *s;
> > +             NameData        slotname;
> > +             TransactionId slot_xmin;
> > +             TransactionId slot_catalog_xmin;
> > +
> > +             s = &ReplicationSlotCtl->replication_slots[i];
> > +
> > +             /* cannot change while ReplicationSlotCtlLock is held */
> > +             if (!s->in_use)
> > +                     continue;
> > +
> > +             /* not our database, skip */
> > +             if (s->data.database != InvalidOid && s->data.database != dboid)
> > +                     continue;
> > +
> > +             SpinLockAcquire(&s->mutex);
> > +             slotname = s->data.name;
> > +             slot_xmin = s->data.xmin;
> > +             slot_catalog_xmin = s->data.catalog_xmin;
> > +             SpinLockRelease(&s->mutex);
> > +
> > +             if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
> > +             {
> > +                     found_conflict = true;
> > +
> > +                     ereport(WARNING,
> > +                                     (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
> > +                                                     NameStr(slotname), slot_xmin, xid)));
> > +             }
> > +
> > +             if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin,
xid))
> > +             {
> > +                     found_conflict = true;
> > +
> > +                     ereport(WARNING,
> > +                                     (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
> > +                                                     NameStr(slotname), slot_catalog_xmin, xid)));
> > +             }
> > +
> > +
> > +             if (found_conflict)
> > +             {
> > +                     elog(WARNING, "Dropping conflicting slot %s", s->data.name.data);
> > +                     LWLockRelease(ReplicationSlotControlLock);      /* avoid deadlock */
> > +                     ReplicationSlotDropPtr(s);
> > +
> > +                     /* We released the lock above; so re-scan the slots. */
> > +                     goto restart;
> > +             }
> > +     }
> >
> I think this should be refactored so that the two found_conflict cases
> set a 'reason' variable (perhaps an enum?) to the particular reason, and
> then only one warning should be emitted.  I also think that LOG might be
> more appropriate than WARNING - as confusing as that is, LOG is more
> severe than WARNING (see docs about log_min_messages).

What I have in mind is :

ereport(LOG,
(errcode(ERRCODE_INTERNAL_ERROR),
errmsg("Dropping conflicting slot %s", s->data.name.data),
errdetail("%s, removed xid %d.", conflict_str, xid)));
where conflict_str is a dynamically generated string containing
something like : "slot xmin : 1234, slot catalog_xmin: 5678"
So for the user, the errdetail will look like :
"slot xmin: 1234, catalog_xmin: 5678, removed xid : 9012"
I think the user can figure out whether it was xmin or catalog_xmin or
both that conflicted with removed xid.
If we don't do this way, we may not be able to show in a single
message if both xmin and catalog_xmin are conflicting at the same
time.
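
Just to illustrate how I would build that string (an untested sketch,
variable names are placeholders), something with a StringInfo like:

    StringInfoData conflict_str;

    initStringInfo(&conflict_str);

    if (TransactionIdIsValid(slot_xmin) &&
        TransactionIdPrecedesOrEquals(slot_xmin, xid))
        appendStringInfo(&conflict_str, "slot xmin: %u", slot_xmin);

    if (TransactionIdIsValid(slot_catalog_xmin) &&
        TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
        appendStringInfo(&conflict_str, "%sslot catalog_xmin: %u",
                         conflict_str.len > 0 ? ", " : "",
                         slot_catalog_xmin);

    ereport(LOG,
            (errcode(ERRCODE_INTERNAL_ERROR),
             errmsg("Dropping conflicting slot %s", NameStr(slotname)),
             errdetail("%s, removed xid %u.", conflict_str.data, xid)));

    pfree(conflict_str.data);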

Does this message look good to you, or did you have in mind something
quite different ?

>
>
> > @@ -0,0 +1,386 @@
> > +# Demonstrate that logical can follow timeline switches.
> > +#
> > +# Test logical decoding on a standby.
> > +#
> > +use strict;
> > +use warnings;
> > +use 5.8.0;
> > +
> > +use PostgresNode;
> > +use TestLib;
> > +use Test::More tests => 55;
> > +use RecursiveCopy;
> > +use File::Copy;
> > +
> > +my ($stdin, $stdout, $stderr, $ret, $handle, $return);
> > +my $backup_name;
> > +
> > +# Initialize master node
> > +my $node_master = get_new_node('master');
> > +$node_master->init(allows_streaming => 1, has_archiving => 1);
> > +$node_master->append_conf('postgresql.conf', q{
> > +wal_level = 'logical'
> > +max_replication_slots = 4
> > +max_wal_senders = 4
> > +log_min_messages = 'debug2'
> > +log_error_verbosity = verbose
> > +# send status rapidly so we promptly advance xmin on master
> > +wal_receiver_status_interval = 1
> > +# very promptly terminate conflicting backends
> > +max_standby_streaming_delay = '2s'
> > +});
> > +$node_master->dump_info;
> > +$node_master->start;
> > +
> > +$node_master->psql('postgres', q[CREATE DATABASE testdb]);
> > +
> > +$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
> > +$backup_name = 'b1';
> > +my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
> > +TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'),
'--slot=decoding_standby');
> > +
> > +sub print_phys_xmin
> > +{
> > +     my $slot = $node_master->slot('decoding_standby');
> > +     return ($slot->{'xmin'}, $slot->{'catalog_xmin'});
> > +}
> > +
> > +my ($xmin, $catalog_xmin) = print_phys_xmin();
> > +# After slot creation, xmins must be null
> > +is($xmin, '', "xmin null");
> > +is($catalog_xmin, '', "catalog_xmin null");
> > +
> > +my $node_replica = get_new_node('replica');
> > +$node_replica->init_from_backup(
> > +     $node_master, $backup_name,
> > +     has_streaming => 1,
> > +     has_restoring => 1);
> > +$node_replica->append_conf('postgresql.conf',
> > +     q[primary_slot_name = 'decoding_standby']);
> > +
> > +$node_replica->start;
> > +$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
> > +
> > +# with hot_standby_feedback off, xmin and catalog_xmin must still be null
> > +($xmin, $catalog_xmin) = print_phys_xmin();
> > +is($xmin, '', "xmin null after replica join");
> > +is($catalog_xmin, '', "catalog_xmin null after replica join");
> > +
> > +$node_replica->append_conf('postgresql.conf',q[
> > +hot_standby_feedback = on
> > +]);
> > +$node_replica->restart;
> > +sleep(2); # ensure walreceiver feedback sent
>
> Can we make this more robust? E.g. by waiting till pg_stat_replication
> shows the change on the primary? Because I can guarantee that this'll
> fail on slow buildfarm machines (say the valgrind animals).
>
>
>
>
> > +$node_master->wait_for_catchup($node_replica, 'replay', $node_master->lsn('flush'));
> > +sleep(2); # ensure walreceiver feedback sent
>
> Similar.

Ok. I have put a copy of the get_slot_xmins() function from
t/001_stream_rep.pl into 016_logical_decoding_on_replica.pl. Renamed
it to wait_for_phys_mins(). And used this to wait for the
hot_standby_feedback change to propagate to master. This function
waits for the physical slot's xmin and catalog_xmin to get the right
values depending on whether there is a logical slot in standby and
whether hot_standby_feedback is on on standby.

I was not sure how pg_stat_replication could be used to detect that the
hot_standby_feedback change has reached the master. So I did it the
above way, which I think pretty much does what we want.

Attached v4 patch only has the testcase change, and some minor cleanup
in the test file.




--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
tushar
Date:
On 03/13/2019 08:40 PM, tushar wrote:
> Hi ,
>
> I am getting a server crash on standby while executing 
> pg_logical_slot_get_changes function   , please refer this scenario
>
> Master cluster( ./initdb -D master)
> set wal_level='hot_standby in master/postgresql.conf file
> start the server , connect to  psql terminal and create a physical 
> replication slot ( SELECT * from 
> pg_create_physical_replication_slot('p1');)
>
> perform pg_basebackup using --slot 'p1'  (./pg_basebackup -D slave/ -R 
> --slot p1 -v))
> set wal_level='logical' , hot_standby_feedback=on, 
> primary_slot_name='p1' in slave/postgresql.conf file
> start the server , connect to psql terminal and create a logical 
> replication slot (  SELECT * from 
> pg_create_logical_replication_slot('t','test_decoding');)
>
> run pgbench ( ./pgbench -i -s 10 postgres) on master and select 
> pg_logical_slot_get_changes on Slave database
>
> postgres=# select * from pg_logical_slot_get_changes('t',null,null);
> 2019-03-13 20:34:50.274 IST [26817] LOG:  starting logical decoding 
> for slot "t"
> 2019-03-13 20:34:50.274 IST [26817] DETAIL:  Streaming transactions 
> committing after 0/6C000060, reading WAL from 0/6C000028.
> 2019-03-13 20:34:50.274 IST [26817] STATEMENT:  select * from 
> pg_logical_slot_get_changes('t',null,null);
> 2019-03-13 20:34:50.275 IST [26817] LOG:  logical decoding found 
> consistent point at 0/6C000028
> 2019-03-13 20:34:50.275 IST [26817] DETAIL:  There are no running 
> transactions.
> 2019-03-13 20:34:50.275 IST [26817] STATEMENT:  select * from 
> pg_logical_slot_get_changes('t',null,null);
> TRAP: FailedAssertion("!(data == tupledata + tuplelen)", File: 
> "decode.c", Line: 977)
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: 2019-03-13 
> 20:34:50.276 IST [26809] LOG:  server process (PID 26817) was 
> terminated by signal 6: Aborted
>
Andres - do you think this is an issue which needs to be fixed ?

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2019-04-10 12:11:21 +0530, tushar wrote:
> 
> On 03/13/2019 08:40 PM, tushar wrote:
> > Hi ,
> > 
> > I am getting a server crash on standby while executing
> > pg_logical_slot_get_changes function   , please refer this scenario
> > 
> > Master cluster( ./initdb -D master)
> > set wal_level='hot_standby in master/postgresql.conf file
> > start the server , connect to  psql terminal and create a physical
> > replication slot ( SELECT * from
> > pg_create_physical_replication_slot('p1');)
> > 
> > perform pg_basebackup using --slot 'p1'  (./pg_basebackup -D slave/ -R
> > --slot p1 -v))
> > set wal_level='logical' , hot_standby_feedback=on,
> > primary_slot_name='p1' in slave/postgresql.conf file
> > start the server , connect to psql terminal and create a logical
> > replication slot (  SELECT * from
> > pg_create_logical_replication_slot('t','test_decoding');)
> > 
> > run pgbench ( ./pgbench -i -s 10 postgres) on master and select
> > pg_logical_slot_get_changes on Slave database
> > 
> > postgres=# select * from pg_logical_slot_get_changes('t',null,null);
> > 2019-03-13 20:34:50.274 IST [26817] LOG:  starting logical decoding for
> > slot "t"
> > 2019-03-13 20:34:50.274 IST [26817] DETAIL:  Streaming transactions
> > committing after 0/6C000060, reading WAL from 0/6C000028.
> > 2019-03-13 20:34:50.274 IST [26817] STATEMENT:  select * from
> > pg_logical_slot_get_changes('t',null,null);
> > 2019-03-13 20:34:50.275 IST [26817] LOG:  logical decoding found
> > consistent point at 0/6C000028
> > 2019-03-13 20:34:50.275 IST [26817] DETAIL:  There are no running
> > transactions.
> > 2019-03-13 20:34:50.275 IST [26817] STATEMENT:  select * from
> > pg_logical_slot_get_changes('t',null,null);
> > TRAP: FailedAssertion("!(data == tupledata + tuplelen)", File:
> > "decode.c", Line: 977)
> > server closed the connection unexpectedly
> >     This probably means the server terminated abnormally
> >     before or while processing the request.
> > The connection to the server was lost. Attempting reset: 2019-03-13
> > 20:34:50.276 IST [26809] LOG:  server process (PID 26817) was terminated
> > by signal 6: Aborted
> > 
> Andres - Do you think - this is an issue which needs to  be fixed ?

Yes, it definitely needs to be fixed. I just haven't had sufficient time
to look into it. Have you reproduced this with Amit's latest version?

Amit, have you spent any time looking into it? I know that you're not
that deeply steeped into the internals of logical decoding, but perhaps
there's something obvious going on.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
tushar
Date:
On 04/10/2019 09:39 PM, Andres Freund wrote:
>   Have you reproduced this with Amit's latest version?

Yes-it is very much reproducible.

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Wed, 10 Apr 2019 at 21:39, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-04-10 12:11:21 +0530, tushar wrote:
> >
> > On 03/13/2019 08:40 PM, tushar wrote:
> > > Hi ,
> > >
> > > I am getting a server crash on standby while executing
> > > pg_logical_slot_get_changes function   , please refer this scenario
> > >
> > > Master cluster( ./initdb -D master)
> > > set wal_level='hot_standby in master/postgresql.conf file
> > > start the server , connect to  psql terminal and create a physical
> > > replication slot ( SELECT * from
> > > pg_create_physical_replication_slot('p1');)
> > >
> > > perform pg_basebackup using --slot 'p1'  (./pg_basebackup -D slave/ -R
> > > --slot p1 -v))
> > > set wal_level='logical' , hot_standby_feedback=on,
> > > primary_slot_name='p1' in slave/postgresql.conf file
> > > start the server , connect to psql terminal and create a logical
> > > replication slot (  SELECT * from
> > > pg_create_logical_replication_slot('t','test_decoding');)
> > >
> > > run pgbench ( ./pgbench -i -s 10 postgres) on master and select
> > > pg_logical_slot_get_changes on Slave database
> > >
> > > postgres=# select * from pg_logical_slot_get_changes('t',null,null);
> > > 2019-03-13 20:34:50.274 IST [26817] LOG:  starting logical decoding for
> > > slot "t"
> > > 2019-03-13 20:34:50.274 IST [26817] DETAIL:  Streaming transactions
> > > committing after 0/6C000060, reading WAL from 0/6C000028.
> > > 2019-03-13 20:34:50.274 IST [26817] STATEMENT:  select * from
> > > pg_logical_slot_get_changes('t',null,null);
> > > 2019-03-13 20:34:50.275 IST [26817] LOG:  logical decoding found
> > > consistent point at 0/6C000028
> > > 2019-03-13 20:34:50.275 IST [26817] DETAIL:  There are no running
> > > transactions.
> > > 2019-03-13 20:34:50.275 IST [26817] STATEMENT:  select * from
> > > pg_logical_slot_get_changes('t',null,null);
> > > TRAP: FailedAssertion("!(data == tupledata + tuplelen)", File:
> > > "decode.c", Line: 977)
> > > server closed the connection unexpectedly
> > >     This probably means the server terminated abnormally
> > >     before or while processing the request.
> > > The connection to the server was lost. Attempting reset: 2019-03-13
> > > 20:34:50.276 IST [26809] LOG:  server process (PID 26817) was terminated
> > > by signal 6: Aborted
> > >
> > Andres - Do you think - this is an issue which needs to  be fixed ?
>
> Yes, it definitely needs to be fixed. I just haven't had sufficient time
> to look into it. Have you reproduced this with Amit's latest version?
>
> Amit, have you spent any time looking into it? I know that you're not
> that deeply steeped into the internals of logical decoding, but perhaps
> there's something obvious going on.

I tried to see if I can quickly understand what's going on.

Here, master wal_level is hot_standby, not logical, though slave
wal_level is logical.

On the slave, when pg_logical_slot_get_changes() is run,
DecodeMultiInsert() does not see any WAL records having
XLH_INSERT_CONTAINS_NEW_TUPLE set. So the data pointer is never
incremented; it remains at tupledata. So at the end of the function,
this assertion fails :
Assert(data == tupledata + tuplelen);
because data is still at tupledata.
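
For reference, the relevant part of DecodeMultiInsert() looks roughly
like this (simplified, I am paraphrasing decode.c from memory):

    tupledata = XLogRecGetBlockData(r, 0, &tuplelen);
    data = tupledata;

    for (i = 0; i < xlrec->ntuples; i++)
    {
        /* tuple payload is decoded only when the flag is set ... */
        if (xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE)
        {
            xl_multi_insert_tuple *xlhdr;

            xlhdr = (xl_multi_insert_tuple *) SHORTALIGN(data);
            data = ((char *) xlhdr) + SizeOfMultiInsertTuple;

            /* ... and only then does data advance past the tuple */
            data += xlhdr->datalen;
        }

        /* queue the change into the reorder buffer, etc. */
    }

    /* with the flag never set, data is still == tupledata here */
    Assert(data == tupledata + tuplelen);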

Not sure why this is happening. On the slave, wal_level is logical, so
logical records should have tuple data. Not sure what that has to do
with the wal_level of the master. Everything should be there on the
slave after it replays the inserts; and the slave's wal_level is
logical as well.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2019-04-12 23:34:02 +0530, Amit Khandekar wrote:
> I tried to see if I can quickly understand what's going on.
> 
> Here, master wal_level is hot_standby, not logical, though slave
> wal_level is logical.

Oh, that's well diagnosed.  Cool.  Also nicely tested - this'd be ugly
in production.

I assume the problem isn't present if you set the primary to wal_level =
logical?


> Not sure why this is happening. On slave, wal_level is logical, so
> logical records should have tuple data. Not sure what does that have
> to do with wal_level of master. Everything should be there on slave
> after it replays the inserts; and also slave wal_level is logical.

The standby doesn't write its own WAL, only primaries do. I thought we
forbade running with wal_level=logical on a standby, when the primary is
only set to replica.  But that's not what we do, see
CheckRequiredParameterValues().

I've not yet thought this through, but I think we'll have to somehow
error out in this case.  I guess we could just check at the start of
decoding what ControlFile->wal_level is set to, and then raise an error
in decode.c when we pass an XLOG_PARAMETER_CHANGE record that sets
wal_level to something lower?
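Very roughly, something like this in decode.c's DecodeXLogOp() - an
untested sketch, the errcode and message wording are only illustrative:

    case XLOG_PARAMETER_CHANGE:
        {
            xl_parameter_change *xlrec =
                (xl_parameter_change *) XLogRecGetData(buf->record);

            /*
             * WAL written while the primary had wal_level below logical
             * lacks the information logical decoding needs.
             */
            if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
                ereport(ERROR,
                        (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                         errmsg("logical decoding on a standby requires wal_level >= logical on the primary")));
            break;
        }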

Could you try to implement that?

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Sat, 13 Apr 2019 at 00:57, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-04-12 23:34:02 +0530, Amit Khandekar wrote:
> > I tried to see if I can quickly understand what's going on.
> >
> > Here, master wal_level is hot_standby, not logical, though slave
> > wal_level is logical.
>
> Oh, that's well diagnosed.  Cool.  Also nicely tested - this'd be ugly
> in production.

Tushar had made me aware of the fact that this reproduces only when
master wal_level is hot_standby.

>
> I assume the problem isn't present if you set the primary to wal_level =
> logical?

Right.

>
>
> > Not sure why this is happening. On slave, wal_level is logical, so
> > logical records should have tuple data. Not sure what does that have
> > to do with wal_level of master. Everything should be there on slave
> > after it replays the inserts; and also slave wal_level is logical.
>
> The standby doesn't write its own WAL, only primaries do. I thought we
> forbade running with wal_level=logical on a standby, when the primary is
> only set to replica.  But that's not what we do, see
> CheckRequiredParameterValues().
>
> I've not yet thought this through, but I think we'll have to somehow
> error out in this case.  I guess we could just check at the start of
> decoding what ControlFile->wal_level is set to,

By "start of decoding", I didn't get where exactly. Do you mean
CheckLogicalDecodingRequirements() ?

> and then raise an error
> in decode.c when we pass an XLOG_PARAMETER_CHANGE record that sets
> wal_level to something lower?

Didn't get where exactly we should error out. We don't do
XLOG_PARAMETER_CHANGE handling in decode.c , so obviously you meant
something else, which I didn't understand.

What I am thinking is :
In CheckLogicalDecodingRequirements(), besides checking wal_level,
also check ControlFile->wal_level when InHotStandby. I mean, when we
are InHotStandby, both wal_level and ControlFile->wal_level should be
>= WAL_LEVEL_LOGICAL. This will allow us to error out when using logical
slot when master has incompatible wal_level.

ControlFile is not accessible outside xlog.c so need to have an API to
extract this field.
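Something like this tiny accessor in xlog.c is what I have in mind (the
function name here is just a placeholder):

    /* Return the wal_level last recorded in the control file. */
    WalLevel
    GetControlFileWalLevel(void)
    {
        return (WalLevel) ControlFile->wal_level;
    }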


>
> Could you try to implement that?
>
> Greetings,
>
> Andres Freund


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

Sorry for the late response.

On 2019-04-16 12:27:46 +0530, Amit Khandekar wrote:
> On Sat, 13 Apr 2019 at 00:57, Andres Freund <andres@anarazel.de> wrote:
> > > Not sure why this is happening. On slave, wal_level is logical, so
> > > logical records should have tuple data. Not sure what does that have
> > > to do with wal_level of master. Everything should be there on slave
> > > after it replays the inserts; and also slave wal_level is logical.
> >
> > The standby doesn't write its own WAL, only primaries do. I thought we
> > forbade running with wal_level=logical on a standby, when the primary is
> > only set to replica.  But that's not what we do, see
> > CheckRequiredParameterValues().
> >
> > I've not yet thought this through, but I think we'll have to somehow
> > error out in this case.  I guess we could just check at the start of
> > decoding what ControlFile->wal_level is set to,
> 
> By "start of decoding", I didn't get where exactly. Do you mean
> CheckLogicalDecodingRequirements() ?

Right.


> > and then raise an error
> > in decode.c when we pass an XLOG_PARAMETER_CHANGE record that sets
> > wal_level to something lower?
> 
> Didn't get where exactly we should error out. We don't do
> XLOG_PARAMETER_CHANGE handling in decode.c , so obviously you meant
> something else, which I didn't understand.

I was indeed thinking of checking XLOG_PARAMETER_CHANGE in
decode.c. Adding handling for that, and just checking wal_level, ought
to be fairly doable? But, see below:


> What I am thinking is :
> In CheckLogicalDecodingRequirements(), besides checking wal_level,
> also check ControlFile->wal_level when InHotStandby. I mean, when we
> are InHotStandby, both wal_level and ControlFile->wal_level should be
> >= WAL_LEVEL_LOGICAL. This will allow us to error out when using logical
> slot when master has incompatible wal_level.

That still allows the primary to change wal_level after logical decoding
has started, so we need the additional checks.

I'm not yet sure how to best deal with the fact that wal_level might be
changed by the primary at basically all times. We would eventually get
an error when logical decoding reaches the XLOG_PARAMETER_CHANGE. But
that's not necessarily sufficient - if a primary changes its wal_level
to lower, it could remove information logical decoding needs *before*
logical decoding reaches the XLOG_PARAMETER_CHANGE record.

So I suspect we need conflict handling in xlog_redo's
XLOG_PARAMETER_CHANGE case. If we there check against existing logical
slots, we ought to be safe.
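
As a rough, untested sketch for xlog_redo()'s XLOG_PARAMETER_CHANGE
branch (the conflict-resolution call is a placeholder for whatever the
patch ends up providing):

    else if (info == XLOG_PARAMETER_CHANGE)
    {
        xl_parameter_change xlrec;

        memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));

        /*
         * If the primary dropped wal_level below logical, any logical
         * slots on this standby can no longer be decoded safely, so
         * resolve the conflict (e.g. drop them) before updating the
         * control file.
         */
        if (xlrec.wal_level < WAL_LEVEL_LOGICAL)
            ResolveRecoveryConflictWithLogicalSlots();  /* placeholder name */

        /* ... existing code: copy the new values into ControlFile ... */
    }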

Therefore I think the check in CheckLogicalDecodingRequirements() needs
to be something like:

if (RecoveryInProgress())
{
    if (!InHotStandby)
        ereport(ERROR, "logical decoding on a standby requires hot_standby to be enabled");
    /*
     * This check is racy, but whenever XLOG_PARAMETER_CHANGE indicates that
     * wal_level has changed, we verify that there are no existing logical
     * replication slots. And to avoid races around creating a new slot,
     * CheckLogicalDecodingRequirements() is called once before creating the slot,
     * and once when logical decoding is initially starting up.
     */
    if (ControlFile->wal_level != LOGICAL)
        ereport(ERROR, "...");
}

And then add a second CheckLogicalDecodingRequirements() call into
CreateInitDecodingContext().

What do you think?

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
Hi,

I am going through you comments. Meanwhile, attached is a rebased
version of the v4 patch.

On Tue, 21 May 2019 at 21:49, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> Sorry for the late response.
>
> On 2019-04-16 12:27:46 +0530, Amit Khandekar wrote:
> > On Sat, 13 Apr 2019 at 00:57, Andres Freund <andres@anarazel.de> wrote:
> > > > Not sure why this is happening. On slave, wal_level is logical, so
> > > > logical records should have tuple data. Not sure what does that have
> > > > to do with wal_level of master. Everything should be there on slave
> > > > after it replays the inserts; and also slave wal_level is logical.
> > >
> > > The standby doesn't write its own WAL, only primaries do. I thought we
> > > forbade running with wal_level=logical on a standby, when the primary is
> > > only set to replica.  But that's not what we do, see
> > > CheckRequiredParameterValues().
> > >
> > > I've not yet thought this through, but I think we'll have to somehow
> > > error out in this case.  I guess we could just check at the start of
> > > decoding what ControlFile->wal_level is set to,
> >
> > By "start of decoding", I didn't get where exactly. Do you mean
> > CheckLogicalDecodingRequirements() ?
>
> Right.
>
>
> > > and then raise an error
> > > in decode.c when we pass an XLOG_PARAMETER_CHANGE record that sets
> > > wal_level to something lower?
> >
> > Didn't get where exactly we should error out. We don't do
> > XLOG_PARAMETER_CHANGE handling in decode.c , so obviously you meant
> > something else, which I didn't understand.
>
> I was indeed thinking of checking XLOG_PARAMETER_CHANGE in
> decode.c. Adding handling for that, and just checking wal_level, ought
> to be fairly doable? But, see below:
>
>
> > What I am thinking is :
> > In CheckLogicalDecodingRequirements(), besides checking wal_level,
> > also check ControlFile->wal_level when InHotStandby. I mean, when we
> > are InHotStandby, both wal_level and ControlFile->wal_level should be
> > >= WAL_LEVEL_LOGICAL. This will allow us to error out when using logical
> > slot when master has incompatible wal_level.
>
> That still allows the primary to change wal_level after logical decoding
> has started, so we need the additional checks.
>
> I'm not yet sure how to best deal with the fact that wal_level might be
> changed by the primary at basically all times. We would eventually get
> an error when logical decoding reaches the XLOG_PARAMETER_CHANGE. But
> that's not necessarily sufficient - if a primary changes its wal_level
> to lower, it could remove information logical decoding needs *before*
> logical decoding reaches the XLOG_PARAMETER_CHANGE record.
>
> So I suspect we need conflict handling in xlog_redo's
> XLOG_PARAMETER_CHANGE case. If we there check against existing logical
> slots, we ought to be safe.
>
> Therefore I think the check in CheckLogicalDecodingRequirements() needs
> to be something like:
>
> if (RecoveryInProgress())
> {
>     if (!InHotStandby)
>         ereport(ERROR, "logical decoding on a standby required hot_standby to be enabled");
>     /*
>      * This check is racy, but whenever XLOG_PARAMETER_CHANGE indicates that
>      * wal_level has changed, we verify that there are no existin glogical
>      * replication slots. And to avoid races around creating a new slot,
>      * CheckLogicalDecodingRequirements() is called once before creating the slot,
>      * andd once when logical decoding is initially starting up.
>      */
>     if (ControlFile->wal_level != LOGICAL)
>         ereport(ERROR, "...");
> }
>
> And then add a second CheckLogicalDecodingRequirements() call into
> CreateInitDecodingContext().
>
> What do you think?
>
> Greetings,
>
> Andres Freund



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Tue, 9 Apr 2019 at 22:23, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Sat, 6 Apr 2019 at 04:45, Andres Freund <andres@anarazel.de> wrote:
> > > diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> > > index 006446b..5785d2f 100644
> > > --- a/src/backend/replication/slot.c
> > > +++ b/src/backend/replication/slot.c
> > > @@ -1064,6 +1064,85 @@ ReplicationSlotReserveWal(void)
> > >       }
> > >  }
> > >
> > > +void
> > > +ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
> > > +{
> > > +     int                     i;
> > > +     bool            found_conflict = false;
> > > +
> > > +     if (max_replication_slots <= 0)
> > > +             return;
> > > +
> > > +restart:
> > > +     if (found_conflict)
> > > +     {
> > > +             CHECK_FOR_INTERRUPTS();
> > > +             /*
> > > +              * Wait awhile for them to die so that we avoid flooding an
> > > +              * unresponsive backend when system is heavily loaded.
> > > +              */
> > > +             pg_usleep(100000);
> > > +             found_conflict = false;
> > > +     }
> > > +
> > > +     LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> > > +     for (i = 0; i < max_replication_slots; i++)
> > > +     {
> > > +             ReplicationSlot *s;
> > > +             NameData        slotname;
> > > +             TransactionId slot_xmin;
> > > +             TransactionId slot_catalog_xmin;
> > > +
> > > +             s = &ReplicationSlotCtl->replication_slots[i];
> > > +
> > > +             /* cannot change while ReplicationSlotCtlLock is held */
> > > +             if (!s->in_use)
> > > +                     continue;
> > > +
> > > +             /* not our database, skip */
> > > +             if (s->data.database != InvalidOid && s->data.database != dboid)
> > > +                     continue;
> > > +
> > > +             SpinLockAcquire(&s->mutex);
> > > +             slotname = s->data.name;
> > > +             slot_xmin = s->data.xmin;
> > > +             slot_catalog_xmin = s->data.catalog_xmin;
> > > +             SpinLockRelease(&s->mutex);
> > > +
> > > +             if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
> > > +             {
> > > +                     found_conflict = true;
> > > +
> > > +                     ereport(WARNING,
> > > +                                     (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
> > > +                                                     NameStr(slotname), slot_xmin, xid)));
> > > +             }
> > > +
> > > +             if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin,
xid))
> > > +             {
> > > +                     found_conflict = true;
> > > +
> > > +                     ereport(WARNING,
> > > +                                     (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
> > > +                                                     NameStr(slotname), slot_catalog_xmin, xid)));
> > > +             }
> > > +
> > > +
> > > +             if (found_conflict)
> > > +             {
> > > +                     elog(WARNING, "Dropping conflicting slot %s", s->data.name.data);
> > > +                     LWLockRelease(ReplicationSlotControlLock);      /* avoid deadlock */
> > > +                     ReplicationSlotDropPtr(s);
> > > +
> > > +                     /* We released the lock above; so re-scan the slots. */
> > > +                     goto restart;
> > > +             }
> > > +     }
> > >
> > I think this should be refactored so that the two found_conflict cases
> > set a 'reason' variable (perhaps an enum?) to the particular reason, and
> > then only one warning should be emitted.  I also think that LOG might be
> > more appropriate than WARNING - as confusing as that is, LOG is more
> > severe than WARNING (see docs about log_min_messages).
>
> What I have in mind is :
>
> ereport(LOG,
> (errcode(ERRCODE_INTERNAL_ERROR),
> errmsg("Dropping conflicting slot %s", s->data.name.data),
> errdetail("%s, removed xid %d.", conflict_str, xid)));
> where conflict_str is a dynamically generated string containing
> something like : "slot xmin : 1234, slot catalog_xmin: 5678"
> So for the user, the errdetail will look like :
> "slot xmin: 1234, catalog_xmin: 5678, removed xid : 9012"
> I think the user can figure out whether it was xmin or catalog_xmin or
> both that conflicted with removed xid.
> If we don't do this way, we may not be able to show in a single
> message if both xmin and catalog_xmin are conflicting at the same
> time.
>
> Does this message look good to you, or you had in mind something quite
> different ?

The above is yet another point that needs to be concluded on. Until
then, I will use the above way to display the error message in the
upcoming patch version.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Tue, 21 May 2019 at 21:49, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> Sorry for the late response.
>
> On 2019-04-16 12:27:46 +0530, Amit Khandekar wrote:
> > On Sat, 13 Apr 2019 at 00:57, Andres Freund <andres@anarazel.de> wrote:
> > > > Not sure why this is happening. On slave, wal_level is logical, so
> > > > logical records should have tuple data. Not sure what does that have
> > > > to do with wal_level of master. Everything should be there on slave
> > > > after it replays the inserts; and also slave wal_level is logical.
> > >
> > > The standby doesn't write its own WAL, only primaries do. I thought we
> > > forbade running with wal_level=logical on a standby, when the primary is
> > > only set to replica.  But that's not what we do, see
> > > CheckRequiredParameterValues().
> > >
> > > I've not yet thought this through, but I think we'll have to somehow
> > > error out in this case.  I guess we could just check at the start of
> > > decoding what ControlFile->wal_level is set to,
> >
> > By "start of decoding", I didn't get where exactly. Do you mean
> > CheckLogicalDecodingRequirements() ?
>
> Right.
>
>
> > > and then raise an error
> > > in decode.c when we pass an XLOG_PARAMETER_CHANGE record that sets
> > > wal_level to something lower?
> >
> > Didn't get where exactly we should error out. We don't do
> > XLOG_PARAMETER_CHANGE handling in decode.c , so obviously you meant
> > something else, which I didn't understand.
>
> I was indeed thinking of checking XLOG_PARAMETER_CHANGE in
> decode.c. Adding handling for that, and just checking wal_level, ought
> to be fairly doable? But, see below:
>
>
> > What I am thinking is :
> > In CheckLogicalDecodingRequirements(), besides checking wal_level,
> > also check ControlFile->wal_level when InHotStandby. I mean, when we
> > are InHotStandby, both wal_level and ControlFile->wal_level should be
> > >= WAL_LEVEL_LOGICAL. This will allow us to error out when using logical
> > slot when master has incompatible wal_level.
>
> That still allows the primary to change wal_level after logical decoding
> has started, so we need the additional checks.
>
> I'm not yet sure how to best deal with the fact that wal_level might be
> changed by the primary at basically all times. We would eventually get
> an error when logical decoding reaches the XLOG_PARAMETER_CHANGE. But
> that's not necessarily sufficient - if a primary changes its wal_level
> to lower, it could remove information logical decoding needs *before*
> logical decoding reaches the XLOG_PARAMETER_CHANGE record.
>
> So I suspect we need conflict handling in xlog_redo's
> XLOG_PARAMETER_CHANGE case. If we there check against existing logical
> slots, we ought to be safe.
>
> Therefore I think the check in CheckLogicalDecodingRequirements() needs
> to be something like:
>
> if (RecoveryInProgress())
> {
>     if (!InHotStandby)
>         ereport(ERROR, "logical decoding on a standby required hot_standby to be enabled");
>     /*
>      * This check is racy, but whenever XLOG_PARAMETER_CHANGE indicates that
>      * wal_level has changed, we verify that there are no existin glogical
>      * replication slots. And to avoid races around creating a new slot,
>      * CheckLogicalDecodingRequirements() is called once before creating the slot,
>      * andd once when logical decoding is initially starting up.
>      */
>     if (ControlFile->wal_level != LOGICAL)
>         ereport(ERROR, "...");
> }
>
> And then add a second CheckLogicalDecodingRequirements() call into
> CreateInitDecodingContext().
>
> What do you think?

Yeah, I agree we should add such checks to minimize the possibility of
reading logical records from a master that has insufficient wal_level.
So to summarize :
a. CheckLogicalDecodingRequirements() : Add Controlfile wal_level checks
b. Call this function in CreateInitDecodingContext() as well.
c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
error if there is an existing logical slot.

This made me think more of the race conditions. For instance, in
pg_create_logical_replication_slot(), just after
CheckLogicalDecodingRequirements and before actually creating the
slot, suppose concurrently Controlfile->wal_level is changed from
logical to replica.  So suppose a new slot does get created. Later the
slot is read, so in pg_logical_slot_get_changes_guts(),
CheckLogicalDecodingRequirements() is called where it checks
ControlFile->wal_level value. But just before it does that,
ControlFile->wal_level concurrently changes back to logical, because
of replay of another param-change record. So this logical reader will
think that the wal_level is sufficient, and will proceed to read the
records, but those records are *before* the wal_level change, so these
records don't have logical data.

Do you think this is possible, or am I missing something? If it is
possible, I was considering some other mechanisms. For instance, when a
logical reader reads a wal_level-change record, it could save the value
in ReplicationSlotPersistentData. Then, while reading the WAL records,
the reader knows whether the records have logical data; if they don't,
error out. But I am not sure how the reader would know the status of the
very first records, i.e. before it gets to the wal_level-change record.

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Thu, May 23, 2019 at 8:08 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> This made me think more of the race conditions. For instance, in
> pg_create_logical_replication_slot(), just after
> CheckLogicalDecodingRequirements and before actually creating the
> slot, suppose concurrently Controlfile->wal_level is changed from
> logical to replica.  So suppose a new slot does get created. Later the
> slot is read, so in pg_logical_slot_get_changes_guts(),
> CheckLogicalDecodingRequirements() is called where it checks
> ControlFile->wal_level value. But just before it does that,
> ControlFile->wal_level concurrently changes back to logical, because
> of replay of another param-change record. So this logical reader will
> think that the wal_level is sufficient, and will proceed to read the
> records, but those records are *before* the wal_level change, so these
> records don't have logical data.
>
> Do you think this is possible, or I am missing something?

wal_level is PGC_POSTMASTER.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Minimal logical decoding on standbys

From
Sergei Kornilov
Date:
Hello

> wal_level is PGC_POSTMASTER.

But the primary can be restarted without a restart on the standby. We require
wal_level replica or higher (currently only logical) on the standby. So an
online change from logical to replica wal_level is possible in the standby's
controlfile.
 

regards, Sergei



Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Thu, May 23, 2019 at 9:30 AM Sergei Kornilov <sk@zsrv.org> wrote:
> > wal_level is PGC_POSTMASTER.
>
> But primary can be restarted without restart on standby. We require wal_level replica or highter (currently only
logical)on standby. So online change from logical to replica wal_level is possible on standby's controlfile.
 

That's true, but Amit's scenario involved a change in wal_level during
the execution of pg_create_logical_replication_slot(), which I think
can't happen.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2019-05-23 17:39:21 +0530, Amit Khandekar wrote:
> On Tue, 21 May 2019 at 21:49, Andres Freund <andres@anarazel.de> wrote:
> Yeah, I agree we should add such checks to minimize the possibility of
> reading logical records from a master that has insufficient wal_level.
> So to summarize :
> a. CheckLogicalDecodingRequirements() : Add Controlfile wal_level checks
> b. Call this function call in CreateInitDecodingContext() as well.
> c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
> error if there is an existing logical slot.
> 
> This made me think more of the race conditions. For instance, in
> pg_create_logical_replication_slot(), just after
> CheckLogicalDecodingRequirements and before actually creating the
> slot, suppose concurrently Controlfile->wal_level is changed from
> logical to replica.  So suppose a new slot does get created. Later the
> slot is read, so in pg_logical_slot_get_changes_guts(),
> CheckLogicalDecodingRequirements() is called where it checks
> ControlFile->wal_level value. But just before it does that,
> ControlFile->wal_level concurrently changes back to logical, because
> of replay of another param-change record. So this logical reader will
> think that the wal_level is sufficient, and will proceed to read the
> records, but those records are *before* the wal_level change, so these
> records don't have logical data.

I don't think that's an actual problem, because there's no decoding
before the slot exists and CreateInitDecodingContext() has determined
the start LSN. And by that point the slot exists, slo
XLOG_PARAMETER_CHANGE replay can error out.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2019-05-23 09:37:50 -0400, Robert Haas wrote:
> On Thu, May 23, 2019 at 9:30 AM Sergei Kornilov <sk@zsrv.org> wrote:
> > > wal_level is PGC_POSTMASTER.
> >
> > But primary can be restarted without restart on standby. We require wal_level replica or highter (currently only
logical)on standby. So online change from logical to replica wal_level is possible on standby's controlfile.
 
> 
> That's true, but Amit's scenario involved a change in wal_level during
> the execution of pg_create_logical_replication_slot(), which I think
> can't happen.

I don't see why not - we're talking about the wal_level in the WAL
stream, not the setting on the standby. And that can change during the
execution of pg_create_logical_replication_slot(), if a PARAMETER_CHANGE
record is replayed. I don't think it's actually a problem, as I
outlined in my response to Amit, though.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Thu, 23 May 2019 at 21:29, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-05-23 17:39:21 +0530, Amit Khandekar wrote:
> > On Tue, 21 May 2019 at 21:49, Andres Freund <andres@anarazel.de> wrote:
> > Yeah, I agree we should add such checks to minimize the possibility of
> > reading logical records from a master that has insufficient wal_level.
> > So to summarize :
> > a. CheckLogicalDecodingRequirements() : Add Controlfile wal_level checks
> > b. Call this function call in CreateInitDecodingContext() as well.
> > c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
> > error if there is an existing logical slot.
> >
> > This made me think more of the race conditions. For instance, in
> > pg_create_logical_replication_slot(), just after
> > CheckLogicalDecodingRequirements and before actually creating the
> > slot, suppose concurrently Controlfile->wal_level is changed from
> > logical to replica.  So suppose a new slot does get created. Later the
> > slot is read, so in pg_logical_slot_get_changes_guts(),
> > CheckLogicalDecodingRequirements() is called where it checks
> > ControlFile->wal_level value. But just before it does that,
> > ControlFile->wal_level concurrently changes back to logical, because
> > of replay of another param-change record. So this logical reader will
> > think that the wal_level is sufficient, and will proceed to read the
> > records, but those records are *before* the wal_level change, so these
> > records don't have logical data.
>
> I don't think that's an actual problem, because there's no decoding
> before the slot exists and CreateInitDecodingContext() has determined
> the start LSN. And by that point the slot exists, slo
> XLOG_PARAMETER_CHANGE replay can error out.

So between the start lsn and the lsn of the parameter-change
(logical=>replica) record, there can be some records, and these don't
have logical data. So the slot created will read from the start lsn,
and proceed to read these records, before reading the parameter-change
record.

Can you re-write the below phrase please? I suspect there are some
letters missing there :
"And by that point the slot exists, slo XLOG_PARAMETER_CHANGE replay
can error out"

Are you saying we want to error out when postgres replays the
param-change record and there is an existing logical slot ? I thought
you were suggesting earlier that it's the decode.c code which should
error out when reading the param-change record.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2019-05-23 23:08:55 +0530, Amit Khandekar wrote:
> On Thu, 23 May 2019 at 21:29, Andres Freund <andres@anarazel.de> wrote:
> > On 2019-05-23 17:39:21 +0530, Amit Khandekar wrote:
> > > On Tue, 21 May 2019 at 21:49, Andres Freund <andres@anarazel.de> wrote:
> > > Yeah, I agree we should add such checks to minimize the possibility of
> > > reading logical records from a master that has insufficient wal_level.
> > > So to summarize :
> > > a. CheckLogicalDecodingRequirements() : Add Controlfile wal_level checks
> > > b. Call this function call in CreateInitDecodingContext() as well.
> > > c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
> > > error if there is an existing logical slot.
> > >
> > > This made me think more of the race conditions. For instance, in
> > > pg_create_logical_replication_slot(), just after
> > > CheckLogicalDecodingRequirements and before actually creating the
> > > slot, suppose concurrently Controlfile->wal_level is changed from
> > > logical to replica.  So suppose a new slot does get created. Later the
> > > slot is read, so in pg_logical_slot_get_changes_guts(),
> > > CheckLogicalDecodingRequirements() is called where it checks
> > > ControlFile->wal_level value. But just before it does that,
> > > ControlFile->wal_level concurrently changes back to logical, because
> > > of replay of another param-change record. So this logical reader will
> > > think that the wal_level is sufficient, and will proceed to read the
> > > records, but those records are *before* the wal_level change, so these
> > > records don't have logical data.
> >
> > I don't think that's an actual problem, because there's no decoding
> > before the slot exists and CreateInitDecodingContext() has determined
> > the start LSN. And by that point the slot exists, slo
> > XLOG_PARAMETER_CHANGE replay can error out.
> 
> So between the start lsn and the lsn for
> parameter-change(logical=>replica) record, there can be some records ,
> and these don't have logical data. So the slot created will read from
> the start lsn, and proceed to read these records, before reading the
> parameter-change record.

I don't think that's possible. By the time CreateInitDecodingContext()
is called, the slot *already* exists (but in a state that'll cause it to
be thrown away on error). But the restart point has not yet been
determined. Thus, if there is a XLOG_PARAMETER_CHANGE with a wal_level
change it can error out. And to handle the race of wal_level changing
between CheckLogicalDecodingRequirements() and the slot creation, we
recheck in CreateInitDecodingContext().

I think we might need to change ReplicationSlotReserveWal() to use the
replay pointer, rather than the redo pointer, for logical slots though.
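I.e. roughly this, as an untested sketch:

    /* in ReplicationSlotReserveWal() */
    if (SlotIsLogical(slot) && RecoveryInProgress())
    {
        /* on a standby, start from what has been replayed, not the redo ptr */
        restart_lsn = GetXLogReplayRecPtr(NULL);
    }
    else if (SlotIsLogical(slot))
    {
        /* existing behaviour on a primary */
        restart_lsn = GetXLogInsertRecPtr();
    }
    else
        restart_lsn = GetRedoRecPtr();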


> Can you re-write the below phrase please ? I suspect there is some
> letters missing there :
> "And by that point the slot exists, slo XLOG_PARAMETER_CHANGE replay
> can error out"

I think it's just one additional letter, namely s/slo/so/


> Are you saying we want to error out when postgres replays the
> param-change record and there is an existing logical slot? I thought you
> were suggesting earlier that it's the decode.c code which should
> error out when reading the param-change record.

Yes, that's what I'm saying. See this portion of my previous email on
the topic:

On 2019-05-21 09:19:37 -0700, Andres Freund wrote:
> On 2019-04-16 12:27:46 +0530, Amit Khandekar wrote:
> > What I am thinking is :
> > In CheckLogicalDecodingRequirements(), besides checking wal_level,
> > also check ControlFile->wal_level when InHotStandby. I mean, when we
> > are InHotStandby, both wal_level and ControlFile->wal_level should be
> > >= WAL_LEVEL_LOGICAL. This will allow us to error out when using logical
> > slot when master has incompatible wal_level.
> 
> That still allows the primary to change wal_level after logical decoding
> has started, so we need the additional checks.
> 
> I'm not yet sure how to best deal with the fact that wal_level might be
> changed by the primary at basically all times. We would eventually get
> an error when logical decoding reaches the XLOG_PARAMETER_CHANGE. But
> that's not necessarily sufficient - if a primary changes its wal_level
> to lower, it could remove information logical decoding needs *before*
> logical decoding reaches the XLOG_PARAMETER_CHANGE record.
> 
> So I suspect we need conflict handling in xlog_redo's
> XLOG_PARAMETER_CHANGE case. If we there check against existing logical
> slots, we ought to be safe.
> 
> Therefore I think the check in CheckLogicalDecodingRequirements() needs
> to be something like:
> 
> if (RecoveryInProgress())
> {
>     if (!InHotStandby)
>         ereport(ERROR, "logical decoding on a standby requires hot_standby to be enabled");
>     /*
>      * This check is racy, but whenever XLOG_PARAMETER_CHANGE indicates that
>      * wal_level has changed, we verify that there are no existing logical
>      * replication slots. And to avoid races around creating a new slot,
>      * CheckLogicalDecodingRequirements() is called once before creating the slot,
>      * and once when logical decoding is initially starting up.
>      */
>     if (ControlFile->wal_level != LOGICAL)
>         ereport(ERROR, "...");
> }
> 
> And then add a second CheckLogicalDecodingRequirements() call into
> CreateInitDecodingContext().
> 
> What do you think?


Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Thu, 23 May 2019 at 23:18, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-05-23 23:08:55 +0530, Amit Khandekar wrote:
> > On Thu, 23 May 2019 at 21:29, Andres Freund <andres@anarazel.de> wrote:
> > > On 2019-05-23 17:39:21 +0530, Amit Khandekar wrote:
> > > > On Tue, 21 May 2019 at 21:49, Andres Freund <andres@anarazel.de> wrote:
> > > > Yeah, I agree we should add such checks to minimize the possibility of
> > > > reading logical records from a master that has insufficient wal_level.
> > > > So to summarize :
> > > > a. CheckLogicalDecodingRequirements() : Add Controlfile wal_level checks
> > > > b. Call this function in CreateInitDecodingContext() as well.
> > > > c. While decoding XLOG_PARAMETER_CHANGE record, emit recovery conflict
> > > > error if there is an existing logical slot.
> > > >
> > > > This made me think more of the race conditions. For instance, in
> > > > pg_create_logical_replication_slot(), just after
> > > > CheckLogicalDecodingRequirements and before actually creating the
> > > > slot, suppose concurrently Controlfile->wal_level is changed from
> > > > logical to replica.  So suppose a new slot does get created. Later the
> > > > slot is read, so in pg_logical_slot_get_changes_guts(),
> > > > CheckLogicalDecodingRequirements() is called where it checks
> > > > ControlFile->wal_level value. But just before it does that,
> > > > ControlFile->wal_level concurrently changes back to logical, because
> > > > of replay of another param-change record. So this logical reader will
> > > > think that the wal_level is sufficient, and will proceed to read the
> > > > records, but those records are *before* the wal_level change, so these
> > > > records don't have logical data.
> > >
> > > I don't think that's an actual problem, because there's no decoding
> > > before the slot exists and CreateInitDecodingContext() has determined
> > > the start LSN. And by that point the slot exists, slo
> > > XLOG_PARAMETER_CHANGE replay can error out.
> >
> > So between the start lsn and the lsn for
> > parameter-change(logical=>replica) record, there can be some records ,
> > and these don't have logical data. So the slot created will read from
> > the start lsn, and proceed to read these records, before reading the
> > parameter-change record.
>
> I don't think that's possible. By the time CreateInitDecodingContext()
> is called, the slot *already* exists (but in a state that'll cause it to
> > be thrown away on error). But the restart point has not yet been
> determined. Thus, if there is a XLOG_PARAMETER_CHANGE with a wal_level
> change it can error out. And to handle the race of wal_level changing
> between CheckLogicalDecodingRequirements() and the slot creation, we
> recheck in CreateInitDecodingContext().

Ok, got it now. I was concerned that there might be some such cases
left unhandled because we are not using locks to handle these
concurrency conditions. But as you have explained, the checks we are
adding will avoid this race condition.

>
> Think we might need to change ReplicationSlotReserveWal() to use the
> replay, rather than the redo pointer for logical slots though.

Not thought of this; will get back.

Working on the patch now ....

> > Are you saying we want to error out when postgres replays the
> > param-change record and there is an existing logical slot? I thought you
> > were suggesting earlier that it's the decode.c code which should
> > error out when reading the param-change record.
>
> Yes, that's what I'm saying. See this portion of my previous email on
> the topic:
Yeah, thanks for pointing that.
>
> On 2019-05-21 09:19:37 -0700, Andres Freund wrote:
> > So I suspect we need conflict handling in xlog_redo's
> > XLOG_PARAMETER_CHANGE case. If we there check against existing logical
> > slots, we ought to be safe.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Fri, 24 May 2019 at 19:26, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Working on the patch now ....

Attached is an incremental WIP patch
handle_wal_level_changes_WIP.patch to be applied over the earlier main
patch logical-decoding-on-standby_v4_rebased.patch.

> >
> > On 2019-05-21 09:19:37 -0700, Andres Freund wrote:
> > > So I suspect we need conflict handling in xlog_redo's
> > > XLOG_PARAMETER_CHANGE case. If we there check against existing logical
> > > slots, we ought to be safe.

Yet to do this. Andres, how do you want to handle this scenario ? Just
drop all the existing logical slots like what we decided for conflict
recovery for conflicting xids ?


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Fri, 24 May 2019 at 21:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Fri, 24 May 2019 at 19:26, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Working on the patch now ....
>
> Attached is an incremental WIP patch
> handle_wal_level_changes_WIP.patch to be applied over the earlier main
> patch logical-decoding-on-standby_v4_rebased.patch.

I found an issue with these changes : When we change master wal_level
from logical to hot_standby, and again back to logical, and then
create a logical replication slot on slave, it gets created; but when
I do pg_logical_slot_get_changes() with that slot, it seems to read
records *before* I created the logical slot, so it encounters
parameter-change(logical=>hot_standby) record, so returns an error as
per the patch, because now in DecodeXLogOp() I error out when
XLOG_PARAMETER_CHANGE is found :

@@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
             * can restart from there.
             */
            break;
+         case XLOG_PARAMETER_CHANGE:
+         {
+           xl_parameter_change *xlrec =
+             (xl_parameter_change *) XLogRecGetData(buf->record);
+
+           /* Cannot proceed if master itself does not have logical data */
+           if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+             ereport(ERROR,
+                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                  errmsg("logical decoding on standby requires "
+                     "wal_level >= logical on master")));
+           break;
+         }

I thought it won't read records *before* the slot was created. Am I
missing something ?

>
> > >
> > > On 2019-05-21 09:19:37 -0700, Andres Freund wrote:
> > > > So I suspect we need conflict handling in xlog_redo's
> > > > XLOG_PARAMETER_CHANGE case. If we there check against existing logical
> > > > slots, we ought to be safe.
>
> Yet to do this. Andres, how do you want to handle this scenario ? Just
> drop all the existing logical slots like what we decided for conflict
> recovery for conflicting xids ?

I went ahead and added handling that drops existing slots when we
encounter XLOG_PARAMETER_CHANGE in xlog_redo().

Attached is logical-decoding-on-standby_v5.patch, that contains all
the changes so far.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
On 2019-05-27 17:04:44 +0530, Amit Khandekar wrote:
> On Fri, 24 May 2019 at 21:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > On Fri, 24 May 2019 at 19:26, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > > Working on the patch now ....
> >
> > Attached is an incremental WIP patch
> > handle_wal_level_changes_WIP.patch to be applied over the earlier main
> > patch logical-decoding-on-standby_v4_rebased.patch.
> 
> I found an issue with these changes : When we change master wal_level
> from logical to hot_standby, and again back to logical, and then
> create a logical replication slot on slave, it gets created; but when
> I do pg_logical_slot_get_changes() with that slot, it seems to read
> records *before* I created the logical slot, so it encounters
> parameter-change(logical=>hot_standby) record, so returns an error as
> per the patch, because now in DecodeXLogOp() I error out when
> XLOG_PARAMETER_CHANGE is found :


> @@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf)
>              * can restart from there.
>              */
>             break;
> +         case XLOG_PARAMETER_CHANGE:
> +         {
> +           xl_parameter_change *xlrec =
> +             (xl_parameter_change *) XLogRecGetData(buf->record);
> +
> +           /* Cannot proceed if master itself does not have logical data */
> +           if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
> +             ereport(ERROR,
> +                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                  errmsg("logical decoding on standby requires "
> +                     "wal_level >= logical on master")));
> +           break;
> +         }
> 
> I thought it won't read records *before* the slot was created. Am I
> missing something ?

That's why I had mentioned that you'd need to adapt
ReplicationSlotReserveWal(), to use the replay LSN or such.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Mon, 27 May 2019 at 19:26, Andres Freund <andres@anarazel.de> wrote:
>
> On 2019-05-27 17:04:44 +0530, Amit Khandekar wrote:
> > On Fri, 24 May 2019 at 21:00, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > >
> > > On Fri, 24 May 2019 at 19:26, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > > > Working on the patch now ....
> > >
> > > Attached is an incremental WIP patch
> > > handle_wal_level_changes_WIP.patch to be applied over the earlier main
> > > patch logical-decoding-on-standby_v4_rebased.patch.
> >
> > I found an issue with these changes : When we change master wal_level
> > from logical to hot_standby, and again back to logical, and then
> > create a logical replication slot on slave, it gets created; but when
> > I do pg_logical_slot_get_changes() with that slot, it seems to read
> > records *before* I created the logical slot, so it encounters
> > parameter-change(logical=>hot_standby) record, so returns an error as
> > per the patch, because now in DecodeXLogOp() I error out when
> > XLOG_PARAMETER_CHANGE is found :
>
>
> > @@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx,
> > XLogRecordBuffer *buf)
> >              * can restart from there.
> >              */
> >             break;
> > +         case XLOG_PARAMETER_CHANGE:
> > +         {
> > +           xl_parameter_change *xlrec =
> > +             (xl_parameter_change *) XLogRecGetData(buf->record);
> > +
> > +           /* Cannot proceed if master itself does not have logical data */
> > +           if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
> > +             ereport(ERROR,
> > +                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > +                  errmsg("logical decoding on standby requires "
> > +                     "wal_level >= logical on master")));
> > +           break;
> > +         }
> >
> > I thought it won't read records *before* the slot was created. Am I
> > missing something ?
>
> That's why I had mentioned that you'd need to adapt
> ReplicationSlotReserveWal(), to use the replay LSN or such.

Yeah ok. I tried to do this :

@@ -1042,7 +1042,8 @@ ReplicationSlotReserveWal(void)
 if (!RecoveryInProgress() && SlotIsLogical(slot))
 {
    ....
 }
 else
 {
-   restart_lsn = GetRedoRecPtr();
+   restart_lsn = SlotIsLogical(slot) ?
+                        GetXLogReplayRecPtr(&ThisTimeLineID) : GetRedoRecPtr();

But then when I do pg_create_logical_replication_slot(), it hangs in
DecodingContextFindStartpoint(), waiting to find new records
(XLogReadRecord).

Working on it ...



--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2019-05-30 19:46:26 +0530, Amit Khandekar wrote:
> @@ -1042,7 +1042,8 @@ ReplicationSlotReserveWal(void)
>  if (!RecoveryInProgress() && SlotIsLogical(slot))
>  {
>     ....
>  }
>  else
>  {
> -   restart_lsn = GetRedoRecPtr();
> +   restart_lsn = SlotIsLogical(slot) ?
> +                        GetXLogReplayRecPtr(&ThisTimeLineID) : GetRedoRecPtr();
> 
> But then when I do pg_create_logical_replication_slot(), it hangs in
> DecodingContextFindStartpoint(), waiting to find new records
> (XLogReadRecord).

But just till the primary has logged the necessary WAL records? If you
just do CHECKPOINT; or such on the primary, it should succeed quickly?

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Thu, 30 May 2019 at 20:13, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-05-30 19:46:26 +0530, Amit Khandekar wrote:
> > @@ -1042,7 +1042,8 @@ ReplicationSlotReserveWal(void)
> >  if (!RecoveryInProgress() && SlotIsLogical(slot))
> >  {
> >     ....
> >  }
> >  else
> >  {
> > -   restart_lsn = GetRedoRecPtr();
> > +   restart_lsn = SlotIsLogical(slot) ?
> > +                        GetXLogReplayRecPtr(&ThisTimeLineID) : GetRedoRecPtr();
> >
> > But then when I do pg_create_logical_replication_slot(), it hangs in
> > DecodingContextFindStartpoint(), waiting to find new records
> > (XLogReadRecord).
>
> But just till the primary has logged the necessary WAL records? If you
> just do CHECKPOINT; or such on the primary, it should succeed quickly?

Yes, it waits until there is a commit record, or (just tried) until a
checkpoint command.

>
> Greetings,
>
> Andres Freund



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Fri, 31 May 2019 at 11:08, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Thu, 30 May 2019 at 20:13, Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2019-05-30 19:46:26 +0530, Amit Khandekar wrote:
> > > @@ -1042,7 +1042,8 @@ ReplicationSlotReserveWal(void)
> > >  if (!RecoveryInProgress() && SlotIsLogical(slot))
> > >  {
> > >     ....
> > >  }
> > >  else
> > >  {
> > > -   restart_lsn = GetRedoRecPtr();
> > > +   restart_lsn = SlotIsLogical(slot) ?
> > > +                        GetXLogReplayRecPtr(&ThisTimeLineID) : GetRedoRecPtr();
> > >
> > > But then when I do pg_create_logical_replication_slot(), it hangs in
> > > DecodingContextFindStartpoint(), waiting to find new records
> > > (XLogReadRecord).
> >
> > But just till the primary has logged the necessary WAL records? If you
> > just do CHECKPOINT; or such on the primary, it should succeed quickly?
>
> Yes, it waits until there is a commit record, or (just tried) until a
> checkpoint command.

Is XLOG_RUNNING_XACTS record essential for the logical decoding to
build a consistent snapshot ?
Since the restart_lsn is now ReplayRecPtr, there is no
XLOG_RUNNING_XACTS record, and so the snapshot state is not yet
SNAPBUILD_CONSISTENT. And so
DecodingContextFindStartpoint()=>DecodingContextReady() never returns
true, and hence DecodingContextFindStartpoint() goes in an infinite
loop, until it gets XLOG_RUNNING_XACTS.
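
For reference, the wait loop in DecodingContextFindStartpoint() looks
roughly like this (paraphrased, not the exact source):

    XLogRecPtr  startptr = ctx->slot->data.restart_lsn;

    /* Wait for a consistent starting point */
    for (;;)
    {
        XLogRecord *record;
        char       *err = NULL;

        /* the read_page callback blocks until more WAL is available */
        record = XLogReadRecord(ctx->reader, startptr, &err);
        if (err)
            elog(ERROR, "%s", err);

        /* after the first read, continue from the reader's position */
        startptr = InvalidXLogRecPtr;

        /* feeds snapbuild.c; xl_running_xacts records advance its state */
        LogicalDecodingProcessRecord(ctx, ctx->reader);

        /* true only once the snapshot builder is SNAPBUILD_CONSISTENT */
        if (DecodingContextReady(ctx))
            break;

        CHECK_FOR_INTERRUPTS();
    }

That's also why a CHECKPOINT on the primary unblocks it: the checkpoint
code logs a running-xacts record via LogStandbySnapshot().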

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Fri, 31 May 2019 at 17:31, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Fri, 31 May 2019 at 11:08, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > On Thu, 30 May 2019 at 20:13, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > On 2019-05-30 19:46:26 +0530, Amit Khandekar wrote:
> > > > @@ -1042,7 +1042,8 @@ ReplicationSlotReserveWal(void)
> > > >  if (!RecoveryInProgress() && SlotIsLogical(slot))
> > > >  {
> > > >     ....
> > > >  }
> > > >  else
> > > >  {
> > > > -   restart_lsn = GetRedoRecPtr();
> > > > +   restart_lsn = SlotIsLogical(slot) ?
> > > > +                        GetXLogReplayRecPtr(&ThisTimeLineID) : GetRedoRecPtr();
> > > >
> > > > But then when I do pg_create_logical_replication_slot(), it hangs in
> > > > DecodingContextFindStartpoint(), waiting to find new records
> > > > (XLogReadRecord).
> > >
> > > But just till the primary has logged the necessary WAL records? If you
> > > just do CHECKPOINT; or such on the primary, it should succeed quickly?
> >
> > Yes, it waits until there is a commit record, or (just tried) until a
> > checkpoint command.
>
> Is XLOG_RUNNING_XACTS record essential for the logical decoding to
> build a consistent snapshot ?
> Since the restart_lsn is now ReplayRecPtr, there is no
> XLOG_RUNNING_XACTS record, and so the snapshot state is not yet
> SNAPBUILD_CONSISTENT. And so
> DecodingContextFindStartpoint()=>DecodingContextReady() never returns
> true, and hence DecodingContextFindStartpoint() goes in an infinite
> loop, until it gets XLOG_RUNNING_XACTS.

After giving more thought on this, I think it might make sense to
arrange for the xl_running_xact record to be sent from master to the
standby, when a logical slot is to be created on standby. How about
standby sending a new message type to the master, requesting for
xl_running_xact record ? Then on master, ProcessStandbyMessage() will
process this new message type and call LogStandbySnapshot().
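
To make that concrete, the handler on the master could be as simple as
the following (the message byte 'S' and its handling are only a sketch
of the idea, not an existing protocol message):

    /* hypothetical new case in ProcessStandbyMessage()'s message switch */
    case 'S':
        {
            /* log a snapshot of currently running transactions */
            XLogRecPtr  recptr = LogStandbySnapshot();

            /* flush it, so the standby can replay it promptly */
            XLogFlush(recptr);
            break;
        }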


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2019-05-31 17:31:34 +0530, Amit Khandekar wrote:
> On Fri, 31 May 2019 at 11:08, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > On Thu, 30 May 2019 at 20:13, Andres Freund <andres@anarazel.de> wrote:
> > Yes, it waits until there is a commit record, or (just tried) until a
> > checkpoint command.

That's fine with me.


> Is XLOG_RUNNING_XACTS record essential for the logical decoding to
> build a consistent snapshot ?

Yes.


> Since the restart_lsn is now ReplayRecPtr, there is no
> XLOG_RUNNING_XACTS record, and so the snapshot state is not yet
> SNAPBUILD_CONSISTENT. And so
> DecodingContextFindStartpoint()=>DecodingContextReady() never returns
> true, and hence DecodingContextFindStartpoint() goes in an infinite
> loop, until it gets XLOG_RUNNING_XACTS.

These seem like conflicting statements? Infinite loops don't terminate
until a record is logged?

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2019-06-04 15:51:01 +0530, Amit Khandekar wrote:
> After giving more thought on this, I think it might make sense to
> arrange for the xl_running_xact record to be sent from master to the
> standby, when a logical slot is to be created on standby. How about
> standby sending a new message type to the master, requesting for
> xl_running_xact record ? Then on master, ProcessStandbyMessage() will
> process this new message type and call LogStandbySnapshot().

I think that should be a secondary feature. You don't necessarily know
the upstream master, as the setup could be cascading one. I think for
now just having to wait, perhaps with a comment to manually start a
checkpoint, ought to suffice?

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Tue, 4 Jun 2019 at 21:28, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-06-04 15:51:01 +0530, Amit Khandekar wrote:
> > After giving more thought on this, I think it might make sense to
> > arrange for the xl_running_xact record to be sent from master to the
> > standby, when a logical slot is to be created on standby. How about
> > standby sending a new message type to the master, requesting for
> > xl_running_xact record ? Then on master, ProcessStandbyMessage() will
> > process this new message type and call LogStandbySnapshot().
>
> I think that should be a secondary feature. You don't necessarily know
> the upstream master, as the setup could be cascading one.
Oh yeah, cascading setup makes it more complicated.

> I think for
> now just having to wait, perhaps with a comment to manually start a
> checkpoint, ought to suffice?

Ok.

Since this requires the test to handle the
fire-create-slot-and-then-fire-checkpoint-from-master actions, I was
modifying the test file to do this. After doing that, I found that the
slave gets an assertion failure in XLogReadRecord()=>XRecOffIsValid().
This happens only when the restart_lsn is set to ReplayRecPtr.
Somehow, this does not happen when I manually create the logical slot.
It happens only while running testcase. Working on it ...



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Mon, 10 Jun 2019 at 10:37, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Tue, 4 Jun 2019 at 21:28, Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2019-06-04 15:51:01 +0530, Amit Khandekar wrote:
> > > After giving more thought on this, I think it might make sense to
> > > arrange for the xl_running_xact record to be sent from master to the
> > > standby, when a logical slot is to be created on standby. How about
> > > standby sending a new message type to the master, requesting for
> > > xl_running_xact record ? Then on master, ProcessStandbyMessage() will
> > > process this new message type and call LogStandbySnapshot().
> >
> > I think that should be a secondary feature. You don't necessarily know
> > the upstream master, as the setup could be cascading one.
> Oh yeah, cascading setup makes it more complicated.
>
> > I think for
> > now just having to wait, perhaps with a comment to manually start a
> > checkpoint, ought to suffice?
>
> Ok.
>
> Since this requires the test to handle the
> fire-create-slot-and-then-fire-checkpoint-from-master actions, I was
> modifying the test file to do this. After doing that, I found that the
> slave gets an assertion failure in XLogReadRecord()=>XRecOffIsValid().
> This happens only when the restart_lsn is set to ReplayRecPtr.
> Somehow, this does not happen when I manually create the logical slot.
> It happens only while running testcase. Working on it ...

Like I mentioned above, I get an assertion failure for
Assert(XRecOffIsValid(RecPtr)) while reading WAL records looking for a
start position (DecodingContextFindStartpoint()). This is because in
CreateInitDecodingContext()=>ReplicationSlotReserveWal(), I now set
the logical slot's restart_lsn to XLogCtl->lastReplayedEndRecPtr. And
just after bringing up slave, lastReplayedEndRecPtr's initial values
are in this order : 0/2000028, 0/2000060, 0/20000D8, 0/2000100,
0/3000000, 0/3000060. You can see that 0/3000000 is not a valid value
because it points to the start of a WAL block, meaning it points to
the XLog page header (I think it's possible because it is 1 + end of
the last replayed record, which can be the start of the next block). So when we
try to create a slot when it's in that position, then XRecOffIsValid()
fails while looking for a starting point.

One option I considered was : If lastReplayedEndRecPtr points to XLog
page header, get a position of the first record on that WAL block,
probably with XLogFindNextRecord(). But it is not trivial because
while in ReplicationSlotReserveWal(), XLogReaderState is not created
yet. Or else, do you think we can just increment the record pointer by
doing something like (lastReplayedEndRecPtr % XLOG_BLCKSZ) +
SizeOfXLogShortPHD() ?

Do you think that we can solve this using some other approach ? I am
not sure whether it's only the initial conditions that cause
lastReplayedEndRecPtr value to *not* point to a valid record, or is it
just a coincidence and that lastReplayedEndRecPtr can also have such a
value any time afterwards. If it's only possible initially, we can
just use GetRedoRecPtr() instead of lastReplayedEndRecPtr if
lastReplayedEndRecPtr is invalid.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Alvaro Herrera
Date:
On 2019-May-23, Andres Freund wrote:

> On 2019-05-23 09:37:50 -0400, Robert Haas wrote:
> > On Thu, May 23, 2019 at 9:30 AM Sergei Kornilov <sk@zsrv.org> wrote:
> > > > wal_level is PGC_POSTMASTER.
> > >
> > > But primary can be restarted without restart on standby. We require
> > > wal_level replica or higher (currently only logical) on standby. So
> > > online change from logical to replica wal_level is possible on
> > > standby's controlfile.
 
> > 
> > That's true, but Amit's scenario involved a change in wal_level during
> > the execution of pg_create_logical_replication_slot(), which I think
> > can't happen.
> 
> I don't see why not - we're talking about the wal_level in the WAL
> stream, not the setting on the standby. And that can change during the
> execution of pg_create_logical_replication_slot(), if a PARAMTER_CHANGE
> record is replayed. I don't think it's actually a problem, as I
> outlined in my response to Amit, though.

I don't know if this is directly relevant, but in commit_ts.c we go to
great lengths to ensure that things continue to work across restarts and
changes of the GUC in the primary, by decoupling activation and
deactivation of the module from start-time initialization.  Maybe that
idea is applicable for this too?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Tue, 11 Jun 2019 at 12:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Mon, 10 Jun 2019 at 10:37, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > On Tue, 4 Jun 2019 at 21:28, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > On 2019-06-04 15:51:01 +0530, Amit Khandekar wrote:
> > > > After giving more thought on this, I think it might make sense to
> > > > arrange for the xl_running_xact record to be sent from master to the
> > > > standby, when a logical slot is to be created on standby. How about
> > > > standby sending a new message type to the master, requesting for
> > > > xl_running_xact record ? Then on master, ProcessStandbyMessage() will
> > > > process this new message type and call LogStandbySnapshot().
> > >
> > > I think that should be a secondary feature. You don't necessarily know
> > > the upstream master, as the setup could be cascading one.
> > Oh yeah, cascading setup makes it more complicated.
> >
> > > I think for
> > > now just having to wait, perhaps with a comment to manually start a
> > > checkpoint, ought to suffice?
> >
> > Ok.
> >
> > Since this requires the test to handle the
> > fire-create-slot-and-then-fire-checkpoint-from-master actions, I was
> > modifying the test file to do this. After doing that, I found that the
> > slave gets an assertion failure in XLogReadRecord()=>XRecOffIsValid().
> > This happens only when the restart_lsn is set to ReplayRecPtr.
> > Somehow, this does not happen when I manually create the logical slot.
> > It happens only while running testcase. Working on it ...
>
> Like I mentioned above, I get an assertion failure for
> Assert(XRecOffIsValid(RecPtr)) while reading WAL records looking for a
> start position (DecodingContextFindStartpoint()). This is because in
> CreateInitDecodingContext()=>ReplicationSlotReserveWal(), I now set
> the logical slot's restart_lsn to XLogCtl->lastReplayedEndRecPtr. And
> just after bringing up slave, lastReplayedEndRecPtr's initial values
> are in this order : 0/2000028, 0/2000060, 0/20000D8, 0/2000100,
> 0/3000000, 0/3000060. You can see that 0/3000000 is not a valid value
> because it points to the start of a WAL block, meaning it points to
> > the XLog page header (I think it's possible because it is 1 + end of
> > the last replayed record, which can be the start of the next block). So when we
> try to create a slot when it's in that position, then XRecOffIsValid()
> fails while looking for a starting point.
>
> One option I considered was : If lastReplayedEndRecPtr points to XLog
> page header, get a position of the first record on that WAL block,
> probably with XLogFindNextRecord(). But it is not trivial because
> while in ReplicationSlotReserveWal(), XLogReaderState is not created
> yet.

In the attached v6 version of the patch, I did the above. That is, I
used XLogFindNextRecord() to bump up the restart_lsn of the slot to
the first valid record. But since XLogReaderState is not available in
ReplicationSlotReserveWal(), I did this in
DecodingContextFindStartpoint(). And then updated the slot restart_lsn
with this corrected position.

Since XLogFindNextRecord() is currently disabled using #if 0, removed
this directive.

> Or else, do you think we can just increment the record pointer by
> doing something like (lastReplayedEndRecPtr % XLOG_BLCKSZ) +
> SizeOfXLogShortPHD() ?

I found out that we can't do this, because we don't know whether the
xlog header is SizeOfXLogShortPHD or SizeOfXLogLongPHD. In fact, in
our context, it is SizeOfXLogLongPHD. So we indeed need the
XLogReaderState handle.

>
> Do you think that we can solve this using some other approach ? I am
> not sure whether it's only the initial conditions that cause
> lastReplayedEndRecPtr value to *not* point to a valid record, or is it
> just a coincidence and that lastReplayedEndRecPtr can also have such a
> value any time afterwards. If it's only possible initially, we can
> just use GetRedoRecPtr() instead of lastReplayedEndRecPtr if
> lastReplayedEndRecPtr is invalid.

So now as the v6 patch stands, lastReplayedEndRecPtr is used to set
the restart_lsn, but its position is later adjusted in
DecodingContextFindStartpoint().

Also, modified the test to handle the requirement that logical slot
creation on standby requires a checkpoint (or some other transaction
commit) to be issued on the master. For that, in
src/test/perl/PostgresNode.pm, added a new function
create_logical_slot_on_standby() which does the required steps.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Wed, 22 May 2019 at 15:05, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Tue, 9 Apr 2019 at 22:23, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > On Sat, 6 Apr 2019 at 04:45, Andres Freund <andres@anarazel.de> wrote:
> > > > diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> > > > index 006446b..5785d2f 100644
> > > > --- a/src/backend/replication/slot.c
> > > > +++ b/src/backend/replication/slot.c
> > > > @@ -1064,6 +1064,85 @@ ReplicationSlotReserveWal(void)
> > > >       }
> > > >  }
> > > >
> > > > +void
> > > > +ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
> > > > +{
> > > > +     int                     i;
> > > > +     bool            found_conflict = false;
> > > > +
> > > > +     if (max_replication_slots <= 0)
> > > > +             return;
> > > > +
> > > > +restart:
> > > > +     if (found_conflict)
> > > > +     {
> > > > +             CHECK_FOR_INTERRUPTS();
> > > > +             /*
> > > > +              * Wait awhile for them to die so that we avoid flooding an
> > > > +              * unresponsive backend when system is heavily loaded.
> > > > +              */
> > > > +             pg_usleep(100000);
> > > > +             found_conflict = false;
> > > > +     }
> > > > +
> > > > +     LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> > > > +     for (i = 0; i < max_replication_slots; i++)
> > > > +     {
> > > > +             ReplicationSlot *s;
> > > > +             NameData        slotname;
> > > > +             TransactionId slot_xmin;
> > > > +             TransactionId slot_catalog_xmin;
> > > > +
> > > > +             s = &ReplicationSlotCtl->replication_slots[i];
> > > > +
> > > > +             /* cannot change while ReplicationSlotCtlLock is held */
> > > > +             if (!s->in_use)
> > > > +                     continue;
> > > > +
> > > > +             /* not our database, skip */
> > > > +             if (s->data.database != InvalidOid && s->data.database != dboid)
> > > > +                     continue;
> > > > +
> > > > +             SpinLockAcquire(&s->mutex);
> > > > +             slotname = s->data.name;
> > > > +             slot_xmin = s->data.xmin;
> > > > +             slot_catalog_xmin = s->data.catalog_xmin;
> > > > +             SpinLockRelease(&s->mutex);
> > > > +
> > > > +             if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
> > > > +             {
> > > > +                     found_conflict = true;
> > > > +
> > > > +                     ereport(WARNING,
> > > > +                                     (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
> > > > +                                                     NameStr(slotname), slot_xmin, xid)));
> > > > +             }
> > > > +
> > > > +             if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
> > > > +             {
> > > > +                     found_conflict = true;
> > > > +
> > > > +                     ereport(WARNING,
> > > > +                                     (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
> > > > +                                                     NameStr(slotname), slot_catalog_xmin, xid)));
> > > > +             }
> > > > +
> > > > +
> > > > +             if (found_conflict)
> > > > +             {
> > > > +                     elog(WARNING, "Dropping conflicting slot %s", s->data.name.data);
> > > > +                     LWLockRelease(ReplicationSlotControlLock);      /* avoid deadlock */
> > > > +                     ReplicationSlotDropPtr(s);
> > > > +
> > > > +                     /* We released the lock above; so re-scan the slots. */
> > > > +                     goto restart;
> > > > +             }
> > > > +     }
> > > >
> > > I think this should be refactored so that the two found_conflict cases
> > > set a 'reason' variable (perhaps an enum?) to the particular reason, and
> > > then only one warning should be emitted.  I also think that LOG might be
> > > more appropriate than WARNING - as confusing as that is, LOG is more
> > > severe than WARNING (see docs about log_min_messages).
> >
> > What I have in mind is :
> >
> > ereport(LOG,
> > (errcode(ERRCODE_INTERNAL_ERROR),
> > errmsg("Dropping conflicting slot %s", s->data.name.data),
> > errdetail("%s, removed xid %d.", conflict_str, xid)));
> > where conflict_str is a dynamically generated string containing
> > something like : "slot xmin : 1234, slot catalog_xmin: 5678"
> > So for the user, the errdetail will look like :
> > "slot xmin: 1234, catalog_xmin: 5678, removed xid : 9012"
> > I think the user can figure out whether it was xmin or catalog_xmin or
> > both that conflicted with removed xid.
> > If we don't do this way, we may not be able to show in a single
> > message if both xmin and catalog_xmin are conflicting at the same
> > time.
> >
> > Does this message look good to you, or you had in mind something quite
> > different ?
>
> The above one is yet another point that needs to be concluded on. Till
> then I will use the above way to display the error message in the
> upcoming patch version.

Attached is the v7 version that has the above changes regarding having a
single error message.

>
> --
> Thanks,
> -Amit Khandekar
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Wed, 12 Jun 2019 at 00:06, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2019-May-23, Andres Freund wrote:
>
> > On 2019-05-23 09:37:50 -0400, Robert Haas wrote:
> > > On Thu, May 23, 2019 at 9:30 AM Sergei Kornilov <sk@zsrv.org> wrote:
> > > > > wal_level is PGC_POSTMASTER.
> > > >
> > > > But primary can be restarted without restart on standby. We require
> > > > wal_level replica or higher (currently only logical) on standby. So
> > > > online change from logical to replica wal_level is possible on
> > > > standby's controlfile.
> > >
> > > That's true, but Amit's scenario involved a change in wal_level during
> > > the execution of pg_create_logical_replication_slot(), which I think
> > > can't happen.
> >
> > I don't see why not - we're talking about the wal_level in the WAL
> > stream, not the setting on the standby. And that can change during the
> > execution of pg_create_logical_replication_slot(), if a PARAMTER_CHANGE
> > record is replayed. I don't think it's actually a problem, as I
> > outlined in my response to Amit, though.
>
> I don't know if this is directly relevant, but in commit_ts.c we go to
> great lengths to ensure that things continue to work across restarts and
> changes of the GUC in the primary, by decoupling activation and
> deactivation of the module from start-time initialization.  Maybe that
> idea is applicable for this too?

We do kind of handle a change in wal_level differently at run-time
versus at initialization. E.g. we drop the existing slots if the
wal_level becomes less than logical. But I don't think we have to do
significant work, unlike what seems to have been done in
ActivateCommitTs() when commit_ts is activated.

>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2019-06-12 17:30:02 +0530, Amit Khandekar wrote:
> On Tue, 11 Jun 2019 at 12:24, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > On Mon, 10 Jun 2019 at 10:37, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > > Since this requires the test to handle the
> > > fire-create-slot-and-then-fire-checkpoint-from-master actions, I was
> > > modifying the test file to do this. After doing that, I found that the
> > > slave gets an assertion failure in XLogReadRecord()=>XRecOffIsValid().
> > > This happens only when the restart_lsn is set to ReplayRecPtr.
> > > Somehow, this does not happen when I manually create the logical slot.
> > > It happens only while running testcase. Working on it ...
> >
> > Like I mentioned above, I get an assertion failure for
> > Assert(XRecOffIsValid(RecPtr)) while reading WAL records looking for a
> > start position (DecodingContextFindStartpoint()). This is because in
> > CreateInitDecodingContext()=>ReplicationSlotReserveWal(), I now set
> > the logical slot's restart_lsn to XLogCtl->lastReplayedEndRecPtr. And
> > just after bringing up slave, lastReplayedEndRecPtr's initial values
> > are in this order : 0/2000028, 0/2000060, 0/20000D8, 0/2000100,
> > 0/3000000, 0/3000060. You can see that 0/3000000 is not a valid value
> > because it points to the start of a WAL block, meaning it points to
> > the XLog page header (I think it's possible because it is 1 + end of
> > the last replayed record, which can be the start of the next block). So when we
> > try to create a slot when it's in that position, then XRecOffIsValid()
> > fails while looking for a starting point.
> >
> > One option I considered was : If lastReplayedEndRecPtr points to XLog
> > page header, get a position of the first record on that WAL block,
> > probably with XLogFindNextRecord(). But it is not trivial because
> > while in ReplicationSlotReserveWal(), XLogReaderState is not created
> > yet.
>
> In the attached v6 version of the patch, I did the above. That is, I
> used XLogFindNextRecord() to bump up the restart_lsn of the slot to
> the first valid record. But since XLogReaderState is not available in
> ReplicationSlotReserveWal(), I did this in
> DecodingContextFindStartpoint(). And then updated the slot restart_lsn
> with this corrected position.

> Since XLogFindNextRecord() is currently disabled using #if 0, removed
> this directive.

Well, it's #ifdef FRONTEND. I don't think that's a problem. It's a bit
overkill here, because I think we know the address has to be on a record
boundary (rather than being in the middle of a page-spanning WAL
record). So we could just add the size of the header manually - but
I think that's not worth doing.


> > Or else, do you think we can just increment the record pointer by
> > doing something like (lastReplayedEndRecPtr % XLOG_BLCKSZ) +
> > SizeOfXLogShortPHD() ?
>
> I found out that we can't do this, because we don't know whether the
> xlog header is SizeOfXLogShortPHD or SizeOfXLogLongPHD. In fact, in
> our context, it is SizeOfXLogLongPHD. So we indeed need the
> XLogReaderState handle.

Well, we can determine whether a long or a short header is going to be
used, as that's solely dependent on the LSN:

        /*
         * If first page of an XLOG segment file, make it a long header.
         */
        if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
        {
            XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;

            NewLongPage->xlp_sysid = ControlFile->system_identifier;
            NewLongPage->xlp_seg_size = wal_segment_size;
            NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
            NewPage->xlp_info |= XLP_LONG_HEADER;
        }

but I don't think that's worth it.


> > Do you think that we can solve this using some other approach ? I am
> > not sure whether it's only the initial conditions that cause
> > lastReplayedEndRecPtr value to *not* point to a valid record, or is it
> > just a coincidence and that lastReplayedEndRecPtr can also have such a
> > value any time afterwards.

It's always possible. All that means is that the last record filled the
entire last WAL page.


> > If it's only possible initially, we can
> > just use GetRedoRecPtr() instead of lastReplayedEndRecPtr if
> > lastReplayedEndRecPtr is invalid.

I don't think so? The redo pointer will point to something *much*
earlier, where we'll not yet have done all the necessary conflict
handling during recovery? So we'd not necessarily notice that a slot
is not actually usable for decoding.

We could instead just handle that by starting decoding at the redo
pointer, and just ignore all WAL records until they're after
lastReplayedEndRecPtr, but that has no advantages, and will read a lot
more WAL.




>  static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
> @@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
>       */
>
>      /* XLOG stuff */
> +    xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
>      xlrec_reuse.node = rel->rd_node;
>      xlrec_reuse.block = blkno;
>      xlrec_reuse.latestRemovedXid = latestRemovedXid;
> @@ -1140,6 +1142,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
>          XLogRecPtr    recptr;
>          xl_btree_delete xlrec_delete;
>
> +        xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
>          xlrec_delete.latestRemovedXid = latestRemovedXid;
>          xlrec_delete.nitems = nitems;

Can we instead pass the heap rel down to here? I think there's only one
caller, and it has the heap relation available these days (it didn't at
the time of the prototype, possibly).  There's a few other users of
get_rel_logical_catalog() where that might be harder, but it's easy
here.
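
I.e. roughly the following (just a sketch; it assumes a heapRel
parameter gets added to _bt_delitems_delete(), and whether to keep
get_rel_logical_catalog() or use the existing
RelationIsAccessibleInLogicalDecoding() macro on the heap relation is a
separate question):

    xlrec_delete.onCatalogTable =
        RelationIsAccessibleInLogicalDecoding(heapRel);
    xlrec_delete.latestRemovedXid = latestRemovedXid;
    xlrec_delete.nitems = nitems;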


> @@ -27,6 +27,7 @@
>  #include "storage/indexfsm.h"
>  #include "storage/lmgr.h"
>  #include "utils/snapmgr.h"
> +#include "utils/lsyscache.h"
>
>
>  /* Entry in pending-list of TIDs we need to revisit */
> @@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
>      OffsetNumber itemnos[MaxIndexTuplesPerPage];
>      spgxlogVacuumRedirect xlrec;
>
> +    xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
>      xlrec.nToPlaceholder = 0;
>      xlrec.newestRedirectXid = InvalidTransactionId;

This one seems harder, but I'm not actually sure why we make it so
hard. It seems like we just ought to add the table to IndexVacuumInfo.


>  /*
> + * Get the wal_level from the control file.
> + */
> +int
> +ControlFileWalLevel(void)
> +{
> +    return ControlFile->wal_level;
> +}

Any reason not to return the type enum WalLevel instead?  I'm not sure I
like the function name - perhaps something like GetActiveWalLevel() or
such? The fact that it's in the control file doesn't seem relevant
here.  I think it should be close to DataChecksumsEnabled() etc, which
all return information from the control file.
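
I.e. something like (sketch only):

    /* next to DataChecksumsEnabled() etc. in xlog.c */
    WalLevel
    GetActiveWalLevel(void)
    {
        return (WalLevel) ControlFile->wal_level;
    }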


> +/*
>   * Initialization of shared memory for XLOG
>   */
>  Size
> @@ -9843,6 +9852,17 @@ xlog_redo(XLogReaderState *record)
>          /* Update our copy of the parameters in pg_control */
>          memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
>
> +        /*
> +         * Drop logical slots if we are in hot standby and master does not have
> +         * logical data. Don't bother to search for the slots if standby is
> +         * running with wal_level lower than logical, because in that case,
> +         * we would have disallowed creation of logical slots.
> +         */

s/disallowed creation/disallowed creation or previously dropped/

> +        if (InRecovery && InHotStandby &&
> +            xlrec.wal_level < WAL_LEVEL_LOGICAL &&
> +            wal_level >= WAL_LEVEL_LOGICAL)
> +            ResolveRecoveryConflictWithSlots(InvalidOid, InvalidTransactionId);
> +
>          LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
>          ControlFile->MaxConnections = xlrec.MaxConnections;
>          ControlFile->max_worker_processes =
>          xlrec.max_worker_processes;

Not for this patch, but I kinda feel the individual replay routines
ought to be broken out of xlog_redo().


>  /* ----------------------------------------
>   * Functions for decoding the data and block references in a record.
>   * ----------------------------------------
> diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
> index 151c3ef..c1bd028 100644
> --- a/src/backend/replication/logical/decode.c
> +++ b/src/backend/replication/logical/decode.c
> @@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
>               * can restart from there.
>               */
>              break;
> +        case XLOG_PARAMETER_CHANGE:
> +        {
> +            xl_parameter_change *xlrec =
> +                (xl_parameter_change *) XLogRecGetData(buf->record);
> +
> +            /* Cannot proceed if master itself does not have logical data */
> +            if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
> +                ereport(ERROR,
> +                        (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                         errmsg("logical decoding on standby requires "
> +                                "wal_level >= logical on master")));
> +            break;
> +        }

This should also HINT to drop the replication slot.
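
I.e. something like (hint wording is just a suggestion):

    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("logical decoding on standby requires "
                    "wal_level >= logical on master"),
             errhint("Drop this replication slot, or increase wal_level "
                     "on the master and recreate it.")));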


> +    /*
> +     * It is not guaranteed that the restart_lsn points to a valid
> +     * record location. E.g. on standby, restart_lsn initially points to lastReplayedEndRecPtr,
> +     * which is 1 + the end of last replayed record, which means it can point the next
> +     * block header start. So bump it to the next valid record.
> +     */

I'd rephrase this as something like:

restart_lsn initially may point one past the end of the record. If that
is an XLOG page boundary, it will not be a valid LSN for the start of a
record. If that's the case, look for the start of the first record.


> +    if (!XRecOffIsValid(startptr))
> +    {

Hm, could you before this add an Assert(startptr != InvalidXLogRecPtr)
or such?


> +        elog(DEBUG1, "Invalid restart lsn %X/%X",
> +                     (uint32) (startptr >> 32), (uint32) startptr);
> +        startptr = XLogFindNextRecord(ctx->reader, startptr);
> +
> +        SpinLockAcquire(&slot->mutex);
> +        slot->data.restart_lsn = startptr;
> +        SpinLockRelease(&slot->mutex);
> +        elog(DEBUG1, "Moved slot restart lsn to %X/%X",
> +                     (uint32) (startptr >> 32), (uint32) startptr);
> +    }

Minor nit: normally debug messages don't start with upper case.


>      /* Wait for a consistent starting point */
>      for (;;)
>      {
> diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> index 55c306e..7ffd264 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -1016,37 +1016,37 @@ ReplicationSlotReserveWal(void)
>          /*
>           * For logical slots log a standby snapshot and start logical decoding
>           * at exactly that position. That allows the slot to start up more
> -         * quickly.
> +         * quickly. But on a standby we cannot do WAL writes, so just use the
> +         * replay pointer; effectively, an attempt to create a logical slot on
> +         * standby will cause it to wait for an xl_running_xact record so that
> +         * a snapshot can be built using the record.

I'd add "to be logged independently on the primary" after "wait for an
xl_running_xact record".


> -         * That's not needed (or indeed helpful) for physical slots as they'll
> -         * start replay at the last logged checkpoint anyway. Instead return
> -         * the location of the last redo LSN. While that slightly increases
> -         * the chance that we have to retry, it's where a base backup has to
> -         * start replay at.
> +         * None of this is needed (or indeed helpful) for physical slots as
> +         * they'll start replay at the last logged checkpoint anyway. Instead
> +         * return the location of the last redo LSN. While that slightly
> +         * increases the chance that we have to retry, it's where a base backup
> +         * has to start replay at.
>           */
> +
> +        restart_lsn =
> +            (SlotIsPhysical(slot) ? GetRedoRecPtr() :
> +            (RecoveryInProgress() ? GetXLogReplayRecPtr(NULL) :
> +                                    GetXLogInsertRecPtr()));

Please rewrite this to use normal if blocks. I'm also not convinced that
it's useful to have this if block, and then another if block that
basically tests the same conditions again.
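
I.e. something like:

    if (SlotIsPhysical(slot))
        restart_lsn = GetRedoRecPtr();
    else if (RecoveryInProgress())
        restart_lsn = GetXLogReplayRecPtr(NULL);
    else
        restart_lsn = GetXLogInsertRecPtr();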


> +        SpinLockAcquire(&slot->mutex);
> +        slot->data.restart_lsn = restart_lsn;
> +        SpinLockRelease(&slot->mutex);
> +
>          if (!RecoveryInProgress() && SlotIsLogical(slot))
>          {
>              XLogRecPtr    flushptr;
>
> -            /* start at current insert position */
> -            restart_lsn = GetXLogInsertRecPtr();
> -            SpinLockAcquire(&slot->mutex);
> -            slot->data.restart_lsn = restart_lsn;
> -            SpinLockRelease(&slot->mutex);
> -
>              /* make sure we have enough information to start */
>              flushptr = LogStandbySnapshot();
>
>              /* and make sure it's fsynced to disk */
>              XLogFlush(flushptr);
>          }
> -        else
> -        {
> -            restart_lsn = GetRedoRecPtr();
> -            SpinLockAcquire(&slot->mutex);
> -            slot->data.restart_lsn = restart_lsn;
> -            SpinLockRelease(&slot->mutex);
> -        }



>  /*
> + * Resolve recovery conflicts with slots.
> + *
> + * When xid is valid, it means it's a removed-xid kind of conflict, so need to
> + * drop the appropriate slots whose xmin conflicts with removed xid.

I don't think "removed-xid kind of conflict" is that descriptive. I'd
suggest something like "When xid is valid, it means that rows older than
xid might have been removed. Therefore we need to drop slots that depend
on seeing those rows."


> + * When xid is invalid, drop all logical slots. This is required when the
> + * master wal_level is set back to replica, so existing logical slots need to
> + * be dropped.
> + */
> +void
> +ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
> +{
> +    int            i;
> +    bool        found_conflict = false;
> +
> +    if (max_replication_slots <= 0)
> +        return;
> +
> +restart:
> +    if (found_conflict)
> +    {
> +        CHECK_FOR_INTERRUPTS();
> +        /*
> +         * Wait awhile for them to die so that we avoid flooding an
> +         * unresponsive backend when system is heavily loaded.
> +         */
> +        pg_usleep(100000);
> +        found_conflict = false;
> +    }

Hm, I wonder if we could use the condition variable the slot
infrastructure has these days for this instead.


> +    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> +    for (i = 0; i < max_replication_slots; i++)
> +    {
> +        ReplicationSlot *s;
> +        NameData    slotname;
> +        TransactionId slot_xmin;
> +        TransactionId slot_catalog_xmin;
> +
> +        s = &ReplicationSlotCtl->replication_slots[i];
> +
> +        /* cannot change while ReplicationSlotCtlLock is held */
> +        if (!s->in_use)
> +            continue;
> +
> +        /* Invalid xid means caller is asking to drop all logical slots */
> +        if (!TransactionIdIsValid(xid) && SlotIsLogical(s))
> +            found_conflict = true;

I'd just add

if (!SlotIsLogical(s))
   continue;

because all of this doesn't need to happen for slots that aren't
logical.

> +        else
> +        {
> +            /* not our database, skip */
> +            if (s->data.database != InvalidOid && s->data.database != dboid)
> +                continue;
> +
> +            SpinLockAcquire(&s->mutex);
> +            slotname = s->data.name;
> +            slot_xmin = s->data.xmin;
> +            slot_catalog_xmin = s->data.catalog_xmin;
> +            SpinLockRelease(&s->mutex);
> +
> +            if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
> +            {
> +                found_conflict = true;
> +
> +                ereport(LOG,
> +                        (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
> +                                NameStr(slotname), slot_xmin, xid)));
> +            }

s/removed xid/xid horizon being increased to %u/


> +            if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
> +            {
> +                found_conflict = true;
> +
> +                ereport(LOG,
> +                        (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
> +                                NameStr(slotname), slot_catalog_xmin, xid)));
> +            }
> +
> +        }
> +        if (found_conflict)
> +        {


Hm, as far as I can tell you just ignore that the slot might currently
be in use. You can't just drop a slot that somebody is using. I think
you need to send a recovery conflict to that backend.

I guess the easiest way to do that would be something roughly like:

    SetInvalidVirtualTransactionId(vxid);

    LWLockAcquire(ProcArrayLock, LW_SHARED);
    cancel_proc = BackendPidGetProcWithLock(active_pid);
    if (cancel_proc)
        vxid = GET_VXID_FROM_PGPROC(cancel_proc);
    LWLockRelease(ProcArrayLock);

    if (VirtualTransactionIdIsValid(vxid))
    {
        CancelVirtualTransaction(vxid);

        /* Wait here until we get signaled, and then restart */
        ConditionVariableSleep(&slot->active_cv,
                               WAIT_EVENT_REPLICATION_SLOT_DROP);
    }
    ConditionVariableCancelSleep();

when the slot is currently active.  Part of this would need to be split
into a procarray.c helper function (mainly all the stuff dealing with
ProcArrayLock).
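
For illustration, such a helper could look roughly like this (the function
name and exact shape are invented here, just as a sketch):

    VirtualTransactionId
    GetVirtualXIDOfSlotOwner(int pid)
    {
        VirtualTransactionId vxid;
        PGPROC     *proc;

        SetInvalidVirtualTransactionId(vxid);

        /* map the pid currently holding the slot to its vxid */
        LWLockAcquire(ProcArrayLock, LW_SHARED);
        proc = BackendPidGetProcWithLock(pid);
        if (proc != NULL)
            GET_VXID_FROM_PGPROC(vxid, *proc);
        LWLockRelease(ProcArrayLock);

        return vxid;
    }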


> + elog(LOG, "Dropping conflicting slot %s", s->data.name.data);

This definitely needs to be expanded, and follow the message style
guideline.


> +            LWLockRelease(ReplicationSlotControlLock);    /* avoid deadlock */

Instead of saying "deadlock" I'd just say that ReplicationSlotDropPtr()
will acquire that lock.


> +            ReplicationSlotDropPtr(s);

But more importantly, I don't think this is
correct. ReplicationSlotDropPtr() assumes that the to-be-dropped slot is
acquired by the current backend - without that somebody else could
concurrently acquire that slot.

So I think you need to do something like ReplicationSlotsDropDBSlots()
does:

        /* acquire slot, so ReplicationSlotDropAcquired can be reused  */
        SpinLockAcquire(&s->mutex);
        /* can't change while ReplicationSlotControlLock is held */
        slotname = NameStr(s->data.name);
        active_pid = s->active_pid;
        if (active_pid == 0)
        {
            MyReplicationSlot = s;
            s->active_pid = MyProcPid;
        }
        SpinLockRelease(&s->mutex);


Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
I have yet to work on Andres's latest detailed review comments, but
before that I wanted to submit a patch for the issue reported below,
because I was almost ready with the fix. I will now start working on
Andres's comments and will reply to them separately.

On Fri, 1 Mar 2019 at 13:33, tushar <tushar.ahuja@enterprisedb.com> wrote:
>
> Hi,
>
> While testing  this feature  found that - if lots of insert happened on
> the master cluster then pg_recvlogical is not showing the DATA
> information  on logical replication slot which created on SLAVE.
>
> Please refer this scenario -
>
> 1)
> Create a Master cluster with wal_level=logcal and create logical
> replication slot -
>   SELECT * FROM pg_create_logical_replication_slot('master_slot',
> 'test_decoding');
>
> 2)
> Create a Standby  cluster using pg_basebackup ( ./pg_basebackup -D
> slave/ -v -R)  and create logical replication slot -
> SELECT * FROM pg_create_logical_replication_slot('standby_slot',
> 'test_decoding');
>
> 3)
> X terminal - start  pg_recvlogical  , provide port=5555 ( slave
> cluster)  and specify slot=standby_slot
> ./pg_recvlogical -d postgres  -p 5555 -s 1 -F 1  -v --slot=standby_slot
> --start -f -
>
> Y terminal - start  pg_recvlogical  , provide port=5432 ( master
> cluster)  and specify slot=master_slot
> ./pg_recvlogical -d postgres  -p 5432 -s 1 -F 1  -v --slot=master_slot
> --start -f -
>
> Z terminal - run pg_bench  against Master cluster ( ./pg_bench -i -s 10
> postgres)
>
> Able to see DATA information on Y terminal  but not on X.
>
> but same able to see by firing this below query on SLAVE cluster -
>
> SELECT * FROM pg_logical_slot_get_changes('standby_slot', NULL, NULL);
>
> Is it expected ?

Actually, the records do show up, but only after quite a long time. In
general, the walsender on the standby sends each record only after a
significant delay (1 sec), and pg_recvlogical shows all the inserted
records only after the commit, so for huge inserts it looks like it is
hanging forever.

In XLogSendLogical(), GetFlushRecPtr() was used to get the flush
point. On a standby, GetFlushRecPtr() does not give a valid value, so
it was wrongly determined that the sent record is beyond the flush
point; as a result, WalSndCaughtUp was set to true, causing
WalSndLoop() to sleep for some duration after every record. This is
why pg_recvlogical appears to hang forever when a huge number of rows
is inserted.

Fix: Use GetStandbyFlushRecPtr() if am_cascading_walsender is true.
Attached patch v8.
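
To be explicit, the change in XLogSendLogical() conceptually amounts to
this (just a sketch; the attached patch is the authoritative version):

    XLogRecPtr  flushPtr;

    /*
     * On a cascading walsender the primary's flush pointer is not
     * meaningful, so use the standby's flush/replay position to decide
     * whether we have caught up.
     */
    if (am_cascading_walsender)
        flushPtr = GetStandbyFlushRecPtr();
    else
        flushPtr = GetFlushRecPtr();

    if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
        WalSndCaughtUp = true;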


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Thu, 20 Jun 2019 at 00:31, Andres Freund <andres@anarazel.de> wrote:
> On 2019-06-12 17:30:02 +0530, Amit Khandekar wrote:
> > In the attached v6 version of the patch, I did the above. That is, I
> > used XLogFindNextRecord() to bump up the restart_lsn of the slot to
> > the first valid record. But since XLogReaderState is not available in
> > ReplicationSlotReserveWal(), I did this in
> > DecodingContextFindStartpoint(). And then updated the slot restart_lsn
> > with this corrected position.
>
> > Since XLogFindNextRecord() is currently disabled using #if 0, removed
> > this directive.
>
> Well, ifdef FRONTEND. I don't think that's a problem. It's a bit
> overkill here, because I think we know the address has to be on a record
> boundary (rather than being in the middle of a page spanning WAL
> record). So we could just add add the size of the header manually
> - but I think that's not worth doing.
>
>
> > > Or else, do you think we can just increment the record pointer by
> > > doing something like (lastReplayedEndRecPtr % XLOG_BLCKSZ) +
> > > SizeOfXLogShortPHD() ?
> >
> > I found out that we can't do this, because we don't know whether the
> > xlog header is SizeOfXLogShortPHD or SizeOfXLogLongPHD. In fact, in
> > our context, it is SizeOfXLogLongPHD. So we indeed need the
> > XLogReaderState handle.
>
> Well, we can determine whether a long or a short header is going to be
> used, as that's solely dependent on the LSN:
>
>                 /*
>                  * If first page of an XLOG segment file, make it a long header.
>                  */
>                 if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
>                 {
>                         XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
>
>                         NewLongPage->xlp_sysid = ControlFile->system_identifier;
>                         NewLongPage->xlp_seg_size = wal_segment_size;
>                         NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
>                         NewPage->xlp_info |= XLP_LONG_HEADER;
>                 }
>
> but I don't think that's worth it.

Ok, so what you are saying is: for ReplayRecPtr it is always possible
to know whether it points at a long or a short header just by looking
at its value, and then we simply increment it by that header size. Why
do you think it is not worth it? In fact, I thought we *have* to
increment it to get to the next record; I don't understand what other
option we have.

>
>
> > > Do you think that we can solve this using some other approach ? I am
> > > not sure whether it's only the initial conditions that cause
> > > lastReplayedEndRecPtr value to *not* point to a valid record, or is it
> > > just a coincidence and that lastReplayedEndRecPtr can also have such a
> > > value any time afterwards.
>
> It's always possible. All that means is that the last record filled the
> entire last WAL page.

Ok that means we *have* to bump the pointer ahead.

>
>
> > > If it's only possible initially, we can
> > > just use GetRedoRecPtr() instead of lastReplayedEndRecPtr if
> > > lastReplayedEndRecPtr is invalid.
>
> I don't think so? The redo pointer will point to something *much*
> earlier, where we'll not yet have done all the necessary conflict
> handling during recovery? So we'd not necessarily notice that a slot
> is not actually usable for decoding.
>
> We could instead just handle that by starting decoding at the redo
> pointer, and just ignore all WAL records until they're after
> lastReplayedEndRecPtr, but that has no advantages, and will read a lot
> more WAL.

Yeah, I agree: just doing this for the initial case is a bad idea.

>
>
>
>
> >  static void _bt_cachemetadata(Relation rel, BTMetaPageData *input);
> > @@ -773,6 +774,7 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
> >        */
> >
> >       /* XLOG stuff */
> > +     xlrec_reuse.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
> >       xlrec_reuse.node = rel->rd_node;
> >       xlrec_reuse.block = blkno;
> >       xlrec_reuse.latestRemovedXid = latestRemovedXid;
> > @@ -1140,6 +1142,7 @@ _bt_delitems_delete(Relation rel, Buffer buf,
> >               XLogRecPtr      recptr;
> >               xl_btree_delete xlrec_delete;
> >
> > +             xlrec_delete.onCatalogTable = get_rel_logical_catalog(rel->rd_index->indrelid);
> >               xlrec_delete.latestRemovedXid = latestRemovedXid;
> >               xlrec_delete.nitems = nitems;
>
> Can we instead pass the heap rel down to here? I think there's only one
> caller, and it has the heap relation available these days (it didn't at
> the time of the prototype, possibly).  There's a few other users of
> get_rel_logical_catalog() where that might be harder, but it's easy
> here.

For _bt_log_reuse_page(), its only caller is _bt_getbuf(), which does
not have a heapRel parameter. Let me know which caller you were
referring to that has heapRel.

_bt_delitems_delete() itself already has a heapRel param, so I will use
that for onCatalogTable.

>
>
> > @@ -27,6 +27,7 @@
> >  #include "storage/indexfsm.h"
> >  #include "storage/lmgr.h"
> >  #include "utils/snapmgr.h"
> > +#include "utils/lsyscache.h"
> >
> >
> >  /* Entry in pending-list of TIDs we need to revisit */
> > @@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
> >       OffsetNumber itemnos[MaxIndexTuplesPerPage];
> >       spgxlogVacuumRedirect xlrec;
> >
> > +     xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
> >       xlrec.nToPlaceholder = 0;
> >       xlrec.newestRedirectXid = InvalidTransactionId;
>
> This one seems harder, but I'm not actually sure why we make it so
> hard. It seems like we just ought to add the table to IndexVacuumInfo.

This means we have to add a heapRel assignment wherever we initialize
the IndexVacuumInfo structure, namely in lazy_vacuum_index(),
lazy_cleanup_index(), validate_index() and analyze_rel(), and make sure
these functions have a heap rel handle. Do you think we should do this
as part of this patch?
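
For illustration, the change could look roughly like this (field and
variable names are assumptions for the sketch, not the actual patch):

    /* In genam.h: add the heap relation to IndexVacuumInfo (sketch only) */
    typedef struct IndexVacuumInfo
    {
        Relation    index;          /* the index being vacuumed */
        Relation    heaprel;        /* NEW: heap relation the index belongs to */
        bool        analyze_only;   /* ANALYZE (without any actual vacuum) */
        bool        report_progress;    /* emit progress.h status reports */
        bool        estimated_count;    /* num_heap_tuples is an estimate */
        int         message_level;  /* ereport level for progress messages */
        double      num_heap_tuples;    /* tuples remaining in heap */
        BufferAccessStrategy strategy;  /* access strategy for reads */
    } IndexVacuumInfo;

    /* ...and at each initialization site, e.g. in lazy_vacuum_index(): */
    ivinfo.heaprel = onerel;    /* assuming the heap rel gets passed down */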


> > +                     if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
> > +                     {
> > +                             found_conflict = true;
> > +
> > +                             ereport(LOG,
> > +                                             (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
> > +                                                             NameStr(slotname), slot_catalog_xmin, xid)));
> > +                     }
> > +
> > +             }
> > +             if (found_conflict)
> > +             {

The above changes seem to be from the older version (v6) of the patch.
Just wanted to make sure you are using the v8 patch.

>
>
> Hm, as far as I can tell you just ignore that the slot might currently
> be in use. You can't just drop a slot that somebody is using. I think
> you need to send a recovery conflict to that backend.

Yeah, I am currently working on this. As you suggested, I am going to
call CancelVirtualTransaction() and for its sigmode parameter, I will
pass a new ProcSignalReason value PROCSIG_RECOVERY_CONFLICT_SLOT.

>
>
>
> > + elog(LOG, "Dropping conflicting slot %s", s->data.name.data);
>
> This definitely needs to be expanded, and follow the message style
> guideline.

This message, with the v8 patch, looks like this:
ereport(LOG,
(errmsg("Dropping conflicting slot %s", NameStr(slotname)),
errdetail("%s", reason)));
where reason is a char string.



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Thu, 20 Jun 2019 at 00:31, Andres Freund <andres@anarazel.de> wrote:
>
> > > Or else, do you think we can just increment the record pointer by
> > > doing something like (lastReplayedEndRecPtr % XLOG_BLCKSZ) +
> > > SizeOfXLogShortPHD() ?
> >
> > I found out that we can't do this, because we don't know whether the
> > xlog header is SizeOfXLogShortPHD or SizeOfXLogLongPHD. In fact, in
> > our context, it is SizeOfXLogLongPHD. So we indeed need the
> > XLogReaderState handle.
>
> Well, we can determine whether a long or a short header is going to be
> used, as that's solely dependent on the LSN:

Discussion of this point (plus some more points) is in a separate
reply. You can reply to my comments there :
https://www.postgresql.org/message-id/CAJ3gD9f_HjQ6qP%3D%2B1jwzwy77fwcbT4-M3UvVsqpAzsY-jqM8nw%40mail.gmail.com

>
>
> >  /*
> > + * Get the wal_level from the control file.
> > + */
> > +int
> > +ControlFileWalLevel(void)
> > +{
> > +     return ControlFile->wal_level;
> > +}
>
> Any reason not to return the type enum WalLevel instead?  I'm not sure I
> like the function name - perhaps something like GetActiveWalLevel() or
> such? The fact that it's in the control file doesn't seem relevant
> here.  I think it should be close to DataChecksumsEnabled() etc, which
> all return information from the control file.

Done.

>
>
> > +/*
> >   * Initialization of shared memory for XLOG
> >   */
> >  Size
> > @@ -9843,6 +9852,17 @@ xlog_redo(XLogReaderState *record)
> >               /* Update our copy of the parameters in pg_control */
> >               memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
> >
> > +             /*
> > +              * Drop logical slots if we are in hot standby and master does not have
> > +              * logical data. Don't bother to search for the slots if standby is
> > +              * running with wal_level lower than logical, because in that case,
> > +              * we would have disallowed creation of logical slots.
> > +              */
>
> s/disallowed creation/disallowed creation or previously dropped/

Did this :
* we would have either disallowed creation of logical slots or dropped
* existing ones.

>
> > +             if (InRecovery && InHotStandby &&
> > +                     xlrec.wal_level < WAL_LEVEL_LOGICAL &&
> > +                     wal_level >= WAL_LEVEL_LOGICAL)
> > +                     ResolveRecoveryConflictWithSlots(InvalidOid, InvalidTransactionId);
> > +
> >               LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> >               ControlFile->MaxConnections = xlrec.MaxConnections;
> >               ControlFile->max_worker_processes =
> >               xlrec.max_worker_processes;
>
> Not for this patch, but I kinda feel the individual replay routines
> ought to be broken out of xlog_redo().
Yeah, agree.

>
>
> >  /* ----------------------------------------
> >   * Functions for decoding the data and block references in a record.
> >   * ----------------------------------------
> > diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
> > index 151c3ef..c1bd028 100644
> > --- a/src/backend/replication/logical/decode.c
> > +++ b/src/backend/replication/logical/decode.c
> > @@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
> >                        * can restart from there.
> >                        */
> >                       break;
> > +             case XLOG_PARAMETER_CHANGE:
> > +             {
> > +                     xl_parameter_change *xlrec =
> > +                             (xl_parameter_change *) XLogRecGetData(buf->record);
> > +
> > +                     /* Cannot proceed if master itself does not have logical data */
> > +                     if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
> > +                             ereport(ERROR,
> > +                                             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > +                                              errmsg("logical decoding on standby requires "
> > +                                                             "wal_level >= logical on master")));
> > +                     break;
> > +             }
>
> This should also HINT to drop the replication slot.

In this case, DecodeXLogOp() is being called because somebody is using
the slot itself. I am not sure it makes sense to hint that the user
should drop the very slot they are using. A hint to drop the slot
would make more sense if the user were doing something else that does
not require a slot, and the slot had become a nuisance, so dropping it
avoids the error. What do you say? Arguably the error message itself
already hints at setting wal_level back to logical.

>
>
> > +     /*
> > +      * It is not guaranteed that the restart_lsn points to a valid
> > +      * record location. E.g. on standby, restart_lsn initially points to lastReplayedEndRecPtr,
> > +      * which is 1 + the end of last replayed record, which means it can point the next
> > +      * block header start. So bump it to the next valid record.
> > +      */
>
> I'd rephrase this as something like:
>
> restart_lsn initially may point one past the end of the record. If that
> is a XLOG page boundary, it will not be a valid LSN for the start of a
> record. If that's the case, look for the start of the first record.

Done.

>
>
> > +     if (!XRecOffIsValid(startptr))
> > +     {
>
> Hm, could you before this add an Assert(startptr != InvalidXLogRecPtr)
> or such?

Yeah, done

>
>
> > +             elog(DEBUG1, "Invalid restart lsn %X/%X",
> > +                                      (uint32) (startptr >> 32), (uint32) startptr);
> > +             startptr = XLogFindNextRecord(ctx->reader, startptr);
> > +
> > +             SpinLockAcquire(&slot->mutex);
> > +             slot->data.restart_lsn = startptr;
> > +             SpinLockRelease(&slot->mutex);
> > +             elog(DEBUG1, "Moved slot restart lsn to %X/%X",
> > +                                      (uint32) (startptr >> 32), (uint32) startptr);
> > +     }
>
> Minor nit: normally debug messages don't start with upper case.

Done.

>
>
> >       /* Wait for a consistent starting point */
> >       for (;;)
> >       {
> > diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> > index 55c306e..7ffd264 100644
> > --- a/src/backend/replication/slot.c
> > +++ b/src/backend/replication/slot.c
> > @@ -1016,37 +1016,37 @@ ReplicationSlotReserveWal(void)
> >               /*
> >                * For logical slots log a standby snapshot and start logical decoding
> >                * at exactly that position. That allows the slot to start up more
> > -              * quickly.
> > +              * quickly. But on a standby we cannot do WAL writes, so just use the
> > +              * replay pointer; effectively, an attempt to create a logical slot on
> > +              * standby will cause it to wait for an xl_running_xact record so that
> > +              * a snapshot can be built using the record.
>
> I'd add "to be logged independently on the primary" after "wait for an
> xl_running_xact record".

Done.

>
>
> > -              * That's not needed (or indeed helpful) for physical slots as they'll
> > -              * start replay at the last logged checkpoint anyway. Instead return
> > -              * the location of the last redo LSN. While that slightly increases
> > -              * the chance that we have to retry, it's where a base backup has to
> > -              * start replay at.
> > +              * None of this is needed (or indeed helpful) for physical slots as
> > +              * they'll start replay at the last logged checkpoint anyway. Instead
> > +              * return the location of the last redo LSN. While that slightly
> > +              * increases the chance that we have to retry, it's where a base backup
> > +              * has to start replay at.
> >                */
> > +
> > +             restart_lsn =
> > +                     (SlotIsPhysical(slot) ? GetRedoRecPtr() :
> > +                     (RecoveryInProgress() ? GetXLogReplayRecPtr(NULL) :
> > +                                                                     GetXLogInsertRecPtr()));
>
> Please rewrite this to use normal if blocks. I'm also not convinced that
> it's useful to have this if block, and then another if block that
> basically tests the same conditions again.

Will check and get back on this one.

>
>
> >  /*
> > + * Resolve recovery conflicts with slots.
> > + *
> > + * When xid is valid, it means it's a removed-xid kind of conflict, so need to
> > + * drop the appropriate slots whose xmin conflicts with removed xid.
>
> I don't think "removed-xid kind of conflict" is that descriptive. I'd
> suggest something like "When xid is valid, it means that rows older than
> xid might have been removed. Therefore we need to drop slots that depend
> on seeing those rows."

Done.

>
>
> > + * When xid is invalid, drop all logical slots. This is required when the
> > + * master wal_level is set back to replica, so existing logical slots need to
> > + * be dropped.
> > + */
> > +void
> > +ResolveRecoveryConflictWithSlots(Oid dboid, TransactionId xid)
> > +{
> > +     int                     i;
> > +     bool            found_conflict = false;
> > +
> > +     if (max_replication_slots <= 0)
> > +             return;
> > +
> > +restart:
> > +     if (found_conflict)
> > +     {
> > +             CHECK_FOR_INTERRUPTS();
> > +             /*
> > +              * Wait awhile for them to die so that we avoid flooding an
> > +              * unresponsive backend when system is heavily loaded.
> > +              */
> > +             pg_usleep(100000);
> > +             found_conflict = false;
> > +     }
>
> Hm, I wonder if we could use the condition variable the slot
> infrastructure has these days for this instead.

Removed the pg_usleep, since in the attached patch we now sleep on
the condition variable just after the recovery conflict signal is
sent. Details down below.

>
>
> > +     LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> > +     for (i = 0; i < max_replication_slots; i++)
> > +     {
> > +             ReplicationSlot *s;
> > +             NameData        slotname;
> > +             TransactionId slot_xmin;
> > +             TransactionId slot_catalog_xmin;
> > +
> > +             s = &ReplicationSlotCtl->replication_slots[i];
> > +
> > +             /* cannot change while ReplicationSlotCtlLock is held */
> > +             if (!s->in_use)
> > +                     continue;
> > +
> > +             /* Invalid xid means caller is asking to drop all logical slots */
> > +             if (!TransactionIdIsValid(xid) && SlotIsLogical(s))
> > +                     found_conflict = true;
>
> I'd just add
>
> if (!SlotIsLogical(s))
>    continue;
>
> because all of this doesn't need to happen for slots that aren't
> logical.

Yeah right. Done. Also renamed the function to
ResolveRecoveryConflictWithLogicalSlots() to emphasize that it is only
for logical slots.

>
> > +             else
> > +             {
> > +                     /* not our database, skip */
> > +                     if (s->data.database != InvalidOid && s->data.database != dboid)
> > +                             continue;
> > +
> > +                     SpinLockAcquire(&s->mutex);
> > +                     slotname = s->data.name;
> > +                     slot_xmin = s->data.xmin;
> > +                     slot_catalog_xmin = s->data.catalog_xmin;
> > +                     SpinLockRelease(&s->mutex);
> > +
> > +                     if (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid))
> > +                     {
> > +                             found_conflict = true;
> > +
> > +                             ereport(LOG,
> > +                                             (errmsg("slot %s w/ xmin %u conflicts with removed xid %u",
> > +                                                             NameStr(slotname), slot_xmin, xid)));
> > +                     }
>
> s/removed xid/xid horizon being increased to %u/

BTW, this message belongs to an older patch. Check v7 onwards for the
latest way I generate the message. Anyway, I have used the above
suggestion. Now the message detail will look like:
"slot xmin: 1234, slot catalog_xmin: 5678, conflicts with xid horizon
being increased to 9012"

>
>
> > +                     if (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
> > +                     {
> > +                             found_conflict = true;
> > +
> > +                             ereport(LOG,
> > +                                             (errmsg("slot %s w/ catalog xmin %u conflicts with removed xid %u",
> > +                                                             NameStr(slotname), slot_catalog_xmin, xid)));
> > +                     }
> > +
> > +             }
> > +             if (found_conflict)
> > +             {
>
>
> Hm, as far as I can tell you just ignore that the slot might currently
> be in use. You can't just drop a slot that somebody is using.

Yeah, I missed that.

> I think
> you need to send a recovery conflict to that backend.
>
> I guess the easiest way to do that would be something roughly like:
>
>     SetInvalidVirtualTransactionId(vxid);
>
>     LWLockAcquire(ProcArrayLock, LW_SHARED);
>     cancel_proc = BackendPidGetProcWithLock(active_pid);
>     if (cancel_proc)
>         vxid = GET_VXID_FROM_PGPROC(cancel_proc);
>     LWLockRelease(ProcArrayLock);
>
>     if (VirtualTransactionIdIsValid(vxid))
>     {
>         CancelVirtualTransaction(vxid);
>
>         /* Wait here until we get signaled, and then restart */
>         ConditionVariableSleep(&slot->active_cv,
>                                WAIT_EVENT_REPLICATION_SLOT_DROP);
>     }
>     ConditionVariableCancelSleep();
>
> when the slot is currently active.

Did that now. Check the new function ReplicationSlotDropConflicting().

Also, the below code is something that I added:

/*
 * Note: Even if vxid.localTransactionId is invalid, we need to cancel
 * that backend, because there is no other way to make it release the
 * slot. So don't bother to validate vxid.localTransactionId.
 */
if (vxid.backendId == InvalidBackendId)
    continue;

This was done so that we can kill the walsender in case pg_recvlogical
made it acquire the slot that we want to drop. It seems a walsender
does not have a local transaction id, but CancelVirtualTransaction()
also works if vxid.localTransactionId is invalid. I have added
comments in CancelVirtualTransaction() to explain this.


> Part of this would need to be split
> into a procarray.c helper function (mainly all the stuff dealing with
> ProcArrayLock).

I didn't have to split it, by the way.

>
>
> > + elog(LOG, "Dropping conflicting slot %s", s->data.name.data);
>
> This definitely needs to be expanded, and follow the message style
> guideline.

From the v7 patch onwards, the message looks like this:
ereport(LOG,
(errmsg("Dropping conflicting slot %s", NameStr(slotname)),
errdetail("%s", conflict_reason)));
Does that suffice?

>
>
> > +                     LWLockRelease(ReplicationSlotControlLock);      /* avoid deadlock */
>
> Instead of saying "deadlock" I'd just say that ReplicationSlotDropPtr()
> will acquire that lock.

Done

>
>
> > +                     ReplicationSlotDropPtr(s);
>
> But more importantly, I don't think this is
> correct. ReplicationSlotDropPtr() assumes that the to-be-dropped slot is
> acquired by the current backend - without that somebody else could
> concurrently acquire that slot.
>
> So I think you need to do something like ReplicationSlotsDropDBSlots()
> does:
>
>                 /* acquire slot, so ReplicationSlotDropAcquired can be reused  */
>                 SpinLockAcquire(&s->mutex);
>                 /* can't change while ReplicationSlotControlLock is held */
>                 slotname = NameStr(s->data.name);
>                 active_pid = s->active_pid;
>                 if (active_pid == 0)
>                 {
>                         MyReplicationSlot = s;
>                         s->active_pid = MyProcPid;
>                 }
>                 SpinLockRelease(&s->mutex);

I have now done this in ReplicationSlotDropConflicting() itself.

>
>
> Greetings,
>
> Andres Freund

I have also removed the code inside #ifdef NOT_ANYMORE that errors out
with "logical decoding cannot be used while in recovery".

I have introduced a new procsignal reason
PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT so that when the conflicting
logical slot is dropped, a new error detail will be shown : "User was
using the logical slot that must be dropped".
Accordingly, added PgStat_StatDBEntry.n_conflict_logicalslot field.

Also, in RecoveryConflictInterrupt(), had to do some special handling
for am_cascading_walsender, so that a conflicting walsender on standby
will be terminated irrespective of the transaction status.

Attached v9 patch.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Mon, 24 Jun 2019 at 23:58, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Thu, 20 Jun 2019 at 00:31, Andres Freund <andres@anarazel.de> wrote:
> >
> > > > Or else, do you think we can just increment the record pointer by
> > > > doing something like (lastReplayedEndRecPtr % XLOG_BLCKSZ) +
> > > > SizeOfXLogShortPHD() ?
> > >
> > > I found out that we can't do this, because we don't know whether the
> > > xlog header is SizeOfXLogShortPHD or SizeOfXLogLongPHD. In fact, in
> > > our context, it is SizeOfXLogLongPHD. So we indeed need the
> > > XLogReaderState handle.
> >
> > Well, we can determine whether a long or a short header is going to be
> > used, as that's solely dependent on the LSN:
>
> Discussion of this point (plus some more points) is in a separate
> reply. You can reply to my comments there :
> https://www.postgresql.org/message-id/CAJ3gD9f_HjQ6qP%3D%2B1jwzwy77fwcbT4-M3UvVsqpAzsY-jqM8nw%40mail.gmail.com
>

As you suggested, I have used XLogSegmentOffset() to determine the
header size, and bumped the restart_lsn in ReplicationSlotReserveWal()
rather than in DecodingContextFindStartpoint(). As I mentioned at the
above link, I am not sure why you said this is not worth doing.
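
Concretely, the bump looks roughly like this (a sketch of the idea, not
the exact patch text):

    /*
     * restart_lsn may point exactly at a WAL page boundary, i.e. at a
     * page header rather than a record. Skip the header; only the first
     * page of a segment carries the long header.
     */
    if (restart_lsn % XLOG_BLCKSZ == 0)
    {
        if (XLogSegmentOffset(restart_lsn, wal_segment_size) == 0)
            restart_lsn += SizeOfXLogLongPHD;
        else
            restart_lsn += SizeOfXLogShortPHD;
    }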

> >
> >
> > > -              * That's not needed (or indeed helpful) for physical slots as they'll
> > > -              * start replay at the last logged checkpoint anyway. Instead return
> > > -              * the location of the last redo LSN. While that slightly increases
> > > -              * the chance that we have to retry, it's where a base backup has to
> > > -              * start replay at.
> > > +              * None of this is needed (or indeed helpful) for physical slots as
> > > +              * they'll start replay at the last logged checkpoint anyway. Instead
> > > +              * return the location of the last redo LSN. While that slightly
> > > +              * increases the chance that we have to retry, it's where a base backup
> > > +              * has to start replay at.
> > >                */
> > > +
> > > +             restart_lsn =
> > > +                     (SlotIsPhysical(slot) ? GetRedoRecPtr() :
> > > +                     (RecoveryInProgress() ? GetXLogReplayRecPtr(NULL) :
> > > +                                                                     GetXLogInsertRecPtr()));
> >
> > Please rewrite this to use normal if blocks.

Ok, done.

> > I'm also not convinced that
> > it's useful to have this if block, and then another if block that
> > basically tests the same conditions again.
>
> Will check and get back on this one.
Those conditions are not exactly the same. restart_lsn is assigned one
of three different pointers depending on three different conditions,
and LogStandbySnapshot() is to be done only for a combination of two
specific conditions, so we need two separate condition blocks.
Also, it's better to have the "assign slot restart_lsn under spinlock"
part in common code, rather than repeating it in two different blocks.

We can do something like :

if (!RecoveryInProgress() && SlotIsLogical(slot))
{
   restart_lsn = GetXLogInsertRecPtr();
   /* Assign restart_lsn to slot restart_lsn under Spinlock */
   /* Log standby snapshot and fsync to disk */
}
else
{
   if (SlotIsPhysical(slot))
      restart_lsn = GetRedoRecPtr();
   else if (RecoveryInProgress())
       restart_lsn = GetXLogReplayRecPtr(NULL);
   else
      restart_lsn = GetXLogInsertRecPtr();

   /* Assign restart_lsn to slot restart_lsn under Spinlock */
}

But I think a better/simpler approach would be to take the
assign-slot-restart_lsn part out of the two condition blocks into a
common location, like this:

if (SlotIsPhysical(slot))
   restart_lsn = GetRedoRecPtr();
else if (RecoveryInProgress())
   restart_lsn = GetXLogReplayRecPtr(NULL);
else
   restart_lsn = GetXLogInsertRecPtr();

/* Assign restart_lsn to slot restart_lsn under Spinlock */

if (!RecoveryInProgress() && SlotIsLogical(slot))
{
   /* Log standby snapshot and fsync to disk */
}

So in the updated patch (v10), I have done as above.

Attachment

Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Fri, Jun 21, 2019 at 11:50 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > This definitely needs to be expanded, and follow the message style
> > guideline.
>
> This message , with the v8 patch, looks like this :
> ereport(LOG,
> (errmsg("Dropping conflicting slot %s", NameStr(slotname)),
> errdetail("%s", reason)));
> where reason is a char string.

That does not follow the message style guideline.

https://www.postgresql.org/docs/12/error-style-guide.html

From the grammar and punctuation section:

"Primary error messages: Do not capitalize the first letter. Do not
end a message with a period. Do not even think about ending a message
with an exclamation point.

Detail and hint messages: Use complete sentences, and end each with a
period. Capitalize the first word of sentences. Put two spaces after
the period if another sentence follows (for English text; might be
inappropriate in other languages)."


-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Tue, 25 Jun 2019 at 19:14, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jun 21, 2019 at 11:50 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > > This definitely needs to be expanded, and follow the message style
> > > guideline.
> >
> > This message , with the v8 patch, looks like this :
> > ereport(LOG,
> > (errmsg("Dropping conflicting slot %s", NameStr(slotname)),
> > errdetail("%s", reason)));
> > where reason is a char string.
>
> That does not follow the message style guideline.
>
> https://www.postgresql.org/docs/12/error-style-guide.html
>
> From the grammar and punctuation section:
>
> "Primary error messages: Do not capitalize the first letter. Do not
> end a message with a period. Do not even think about ending a message
> with an exclamation point.
>
> Detail and hint messages: Use complete sentences, and end each with a
> period. Capitalize the first word of sentences. Put two spaces after
> the period if another sentence follows (for English text; might be
> inappropriate in other languages)."

Thanks. In the updated patch, I changed the message style. Now it
looks like this:

primary message: dropped conflicting slot slot_name
error detail: Slot conflicted with xid horizon which was being
increased to 9012 (slot xmin: 1234, slot catalog_xmin: 5678).
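
A minimal sketch of the corresponding ereport() call (variable names
assumed, the patch itself is authoritative):

    ereport(LOG,
            (errmsg("dropped conflicting slot %s", NameStr(slotname)),
             errdetail("Slot conflicted with xid horizon which was being increased "
                       "to %u (slot xmin: %u, slot catalog_xmin: %u).",
                       xid, slot_xmin, slot_catalog_xmin)));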

--------------------

Also, in the updated patch (v11), I have added some scenarios that
verify that the slot is dropped when either the master wal_level is
insufficient or the slot is conflicting. I also organized the test
file a bit.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
tushar
Date:
On 07/01/2019 11:04 AM, Amit Khandekar wrote:
> Also, in the updated patch (v11), I have added some scenarios that
> verify that slot is dropped when either master wal_level is
> insufficient, or when slot is conflicting. Also organized the test
> file a bit.

Here is one scenario where the replication slot is removed even after fixing the problem (which the error message suggested to do).

Please refer this below scenario
Master cluster-
postgresql.conf file
wal_level=logical
hot_standby_feedback = on
port=5432

Standby cluster-
postgresql.conf file
wal_level=logical
hot_standby_feedback = on
port=5433

Both Master/Slave clusters are up and running and are in SYNC with each other.
Create a logical replication slot on SLAVE ( SELECT * from   pg_create_logical_replication_slot('m', 'test_decoding'); )

change wal_level='hot_standby' on Master postgresql.conf file / restart the server
Run get_changes function on Standby -
postgres=# select * from pg_logical_slot_get_changes('m',null,null);
ERROR:  logical decoding on standby requires wal_level >= logical on master

Correct it in the Master postgresql.conf file, i.e. set wal_level='logical' again / restart the server
and again fire  get_changes function on Standby -
postgres=# select * from pg_logical_slot_get_changes('m',null,null);
ERROR:  replication slot "m" does not exist

This looks a little weird, as the slot got dropped/removed internally. I guess it should become invalid rather than being removed automatically.
Let users delete the slot themselves rather than have it removed automatically as a surprise.

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Thu, 4 Jul 2019 at 15:52, tushar <tushar.ahuja@enterprisedb.com> wrote:
>
> On 07/01/2019 11:04 AM, Amit Khandekar wrote:
>
> Also, in the updated patch (v11), I have added some scenarios that
> verify that slot is dropped when either master wal_level is
> insufficient, or when slot is conflicting. Also organized the test
> file a bit.
>
> One scenario where replication slot removed even after fixing the problem (which Error message suggested to do)

Which specific problem are you referring to? Removing a conflicting
slot is itself part of the fix for the conflicting-slot problem.

>
> Please refer this below scenario
>
> Master cluster-
> postgresql,conf file
> wal_level=logical
> hot_standby_feedback = on
> port=5432
>
> Standby cluster-
> postgresql,conf file
> wal_level=logical
> hot_standby_feedback = on
> port=5433
>
> both Master/Slave cluster are up and running and are in SYNC with each other
> Create a logical replication slot on SLAVE ( SELECT * from pg_create_logical_replication_slot('m', 'test_decoding'); )
>
>
> change wal_level='hot_standby' on Master postgresql.conf file / restart the server
> Run get_changes function on Standby -
> postgres=# select * from pg_logical_slot_get_changes('m',null,null);
> ERROR:  logical decoding on standby requires wal_level >= logical on master
>
> Correct it on Master postgresql.conf file ,i.e set  wal_level='logical'  again / restart the server
> and again fire  get_changes function on Standby -
> postgres=# select * from pg_logical_slot_get_changes('m',null,null);
> ERROR:  replication slot "m" does not exist
>
> This looks little weird as slot got dropped/removed internally . i guess it should get invalid rather than removed automatically.
> Lets user's  delete the slot themself rather than automatically removed  as a surprise.

It was discussed earlier what action should be taken when we find
conflicting slots. One of the options was to drop the slot, and we
chose that because it was simple. See this:
https://www.postgresql.org/message-id/flat/20181212204154.nsxf3gzqv3gesl32%40alap3.anarazel.de

By the way, you get the "logical decoding on standby requires
wal_level >= logical on master" error while using the slot because we
reject the command even before checking the existence of the slot; the
slot has actually already been dropped due to the master wal_level.
Then, when you correct the master wal_level, the command is no longer
rejected, proceeds to use the slot, and only then finds that the slot
does not exist.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Thu, 4 Jul 2019 at 17:21, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Thu, 4 Jul 2019 at 15:52, tushar <tushar.ahuja@enterprisedb.com> wrote:
> >
> > On 07/01/2019 11:04 AM, Amit Khandekar wrote:
> >
> > Also, in the updated patch (v11), I have added some scenarios that
> > verify that slot is dropped when either master wal_level is
> > insufficient, or when slot is conflicting. Also organized the test
> > file a bit.
> >
> > One scenario where replication slot removed even after fixing the problem (which Error message suggested to do)
>
> Which specific problem are you referring to ? Removing a conflicting
> slot, itself is the part of the fix for the conflicting slot problem.
>
> >
> > Please refer this below scenario
> >
> > Master cluster-
> > postgresql,conf file
> > wal_level=logical
> > hot_standby_feedback = on
> > port=5432
> >
> > Standby cluster-
> > postgresql,conf file
> > wal_level=logical
> > hot_standby_feedback = on
> > port=5433
> >
> > both Master/Slave cluster are up and running and are in SYNC with each other
> > Create a logical replication slot on SLAVE ( SELECT * from pg_create_logical_replication_slot('m', 'test_decoding'); )
> >
> >
> > change wal_level='hot_standby' on Master postgresql.conf file / restart the server
> > Run get_changes function on Standby -
> > postgres=# select * from pg_logical_slot_get_changes('m',null,null);
> > ERROR:  logical decoding on standby requires wal_level >= logical on master
> >
> > Correct it on Master postgresql.conf file ,i.e set  wal_level='logical'  again / restart the server
> > and again fire  get_changes function on Standby -
> > postgres=# select * from pg_logical_slot_get_changes('m',null,null);
> > ERROR:  replication slot "m" does not exist
> >
> > This looks little weird as slot got dropped/removed internally . i guess it should get invalid rather than removed automatically.
> > Lets user's  delete the slot themself rather than automatically removed  as a surprise.
>
> It was earlier discussed about what action should be taken when we
> find conflicting slots. Out of the options, one was to drop the slot,
> and we chose that because that was simple. See this :
> https://www.postgresql.org/message-id/flat/20181212204154.nsxf3gzqv3gesl32%40alap3.anarazel.de

Sorry, the above link is not the one I wanted to refer to. The correct one is this:

https://www.postgresql.org/message-id/20181214005521.jaty2d24lz4nroil%40alap3.anarazel.de

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

Thanks for the new version! Looks like we're making progress towards
something committable here.

I think it'd be good to split the patch into a few pieces. I'd maybe do
that like:
1) WAL format changes (plus required other changes)
2) Recovery conflicts with slots
3) logical decoding on standby
4) tests


> @@ -589,6 +590,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
>       */
>
>      /* XLOG stuff */
> +    xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
>      xlrec_reuse.node = rel->rd_node;
>      xlrec_reuse.block = blkno;
>      xlrec_reuse.latestRemovedXid = latestRemovedXid;

Hm. I think we otherwise only ever use
RelationIsAccessibleInLogicalDecoding() on tables, not on indexes.  And
while I think this would mostly work for builtin catalog tables, it
won't work for "user catalog tables" as RelationIsUsedAsCatalogTable()
won't perform any useful checks for indexes.

So I think we either need to look up the table, or pass it down.


> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index d768b9b..10b7857 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -7149,12 +7149,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
>   * see comments for vacuum_log_cleanup_info().
>   */
>  XLogRecPtr
> -log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
> +log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
>  {
>      xl_heap_cleanup_info xlrec;
>      XLogRecPtr    recptr;
>
> -    xlrec.node = rnode;
> +    xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
> +    xlrec.node = rel->rd_node;
>      xlrec.latestRemovedXid = latestRemovedXid;
>
>      XLogBeginInsert();
> @@ -7190,6 +7191,7 @@ log_heap_clean(Relation reln, Buffer buffer,
>      /* Caller should not call me on a non-WAL-logged relation */
>      Assert(RelationNeedsWAL(reln));
>
> +    xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);

It'd probably be a good idea to add a comment to
RelationIsUsedAsCatalogTable() that it better never invoke anything
performing catalog accesses. Otherwise there's quite a danger of
recursion (some operation does RelationIsAccessibleInLogicalDecoding(),
which then accesses the catalog, which in turn could again need to
perform said operation, and so on in a loop).



>  /* Entry in pending-list of TIDs we need to revisit */
> @@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
>      OffsetNumber itemnos[MaxIndexTuplesPerPage];
>      spgxlogVacuumRedirect xlrec;
>
> +    xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
>      xlrec.nToPlaceholder = 0;
>      xlrec.newestRedirectXid = InvalidTransactionId;

We should document that it is safe to do catalog accesses here, because
spgist is never used to back catalogs. Otherwise there would be a
danger of endless recursion here.

Did you check how hard it would be to just pass down the heap relation?


>  /*
> + * Get the wal_level from the control file.
> + */
> +WalLevel
> +GetActiveWalLevel(void)
> +{
> +    return ControlFile->wal_level;
> +}

What does "Active" mean here? I assume it's supposed to indicate that it
could be different than what's configured in postgresql.conf, for a
replica? If so, that should be mentioned.


> +/*
>   * Initialization of shared memory for XLOG
>   */
>  Size
> @@ -9843,6 +9852,19 @@ xlog_redo(XLogReaderState *record)
>          /* Update our copy of the parameters in pg_control */
>          memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
>
> +        /*
> +         * Drop logical slots if we are in hot standby and master does not have
> +         * logical data.

nitpick: s/master/the primary/ (mostly adding the "the", but I
personally also prefer primary over master)

s/logical data/a WAL level sufficient for logical decoding/


> Don't bother to search for the slots if standby is
> +         * running with wal_level lower than logical, because in that case,
> +         * we would have either disallowed creation of logical slots or dropped
> +         * existing ones.

s/Don't bother/No need/
s/slots/potentially conflicting logical slots/

> +        if (InRecovery && InHotStandby &&
> +            xlrec.wal_level < WAL_LEVEL_LOGICAL &&
> +            wal_level >= WAL_LEVEL_LOGICAL)
> +            ResolveRecoveryConflictWithLogicalSlots(InvalidOid, InvalidTransactionId,
> +                gettext_noop("Logical decoding on standby requires wal_level >= logical on master."));



> diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
> index 151c3ef..c1bd028 100644
> --- a/src/backend/replication/logical/decode.c
> +++ b/src/backend/replication/logical/decode.c
> @@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
>               * can restart from there.
>               */
>              break;
> +        case XLOG_PARAMETER_CHANGE:
> +        {
> +            xl_parameter_change *xlrec =
> +                (xl_parameter_change *) XLogRecGetData(buf->record);
> +            /* Cannot proceed if master itself does not have logical data */

This needs an explanation as to how this is reachable...


> +            if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
> +                ereport(ERROR,
> +                        (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                         errmsg("logical decoding on standby requires "
> +                                "wal_level >= logical on master")));
> +            break;

Hm, this strikes me as a not quite good enough error message (same in
other copies of the message). Perhaps something roughly like "could not
continue with logical decoding, the primary's wal level is now too low
(%u)"?


>      if (RecoveryInProgress())
> -        ereport(ERROR,
> -                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -                 errmsg("logical decoding cannot be used while in recovery")));
> +    {
> +        /*
> +         * This check may have race conditions, but whenever
> +         * XLOG_PARAMETER_CHANGE indicates that wal_level has changed, we
> +         * verify that there are no existing logical replication slots. And to
> +         * avoid races around creating a new slot,
> +         * CheckLogicalDecodingRequirements() is called once before creating
> +         * the slot, and once when logical decoding is initially starting up.
> +         */
> +        if (GetActiveWalLevel() < WAL_LEVEL_LOGICAL)
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                     errmsg("logical decoding on standby requires "
> +                            "wal_level >= logical on master")));
> +    }
>  }
>
>  /*
> @@ -241,6 +240,8 @@ CreateInitDecodingContext(char *plugin,
>      LogicalDecodingContext *ctx;
>      MemoryContext old_context;
>
> +    CheckLogicalDecodingRequirements();
> +

This should reference the above explanation.




>  /*
> + * Permanently drop a conflicting replication slot. If it's already active by
> + * another backend, send it a recovery conflict signal, and then try again.
> + */
> +static void
> +ReplicationSlotDropConflicting(ReplicationSlot *slot)


> +void
> +ResolveRecoveryConflictWithLogicalSlots(Oid dboid, TransactionId xid,
> +                                        char *conflict_reason)
> +{
> +            /*
> +             * Build the conflict_str which will look like :
> +             * "Slot conflicted with xid horizon which was being increased
> +             * to 9012 (slot xmin: 1234, slot catalog_xmin: 5678)."
> +             */
> +            initStringInfo(&conflict_xmins);
> +            if (TransactionIdIsValid(slot_xmin) &&
> +                TransactionIdPrecedesOrEquals(slot_xmin, xid))
> +            {
> +                appendStringInfo(&conflict_xmins, "slot xmin: %d", slot_xmin);
> +            }
> +            if (TransactionIdIsValid(slot_catalog_xmin) &&
> +                TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
> +                appendStringInfo(&conflict_xmins, "%sslot catalog_xmin: %d",
> +                                 conflict_xmins.len > 0 ? ", " : "",
> +                                 slot_catalog_xmin);
> +
> +            if (conflict_xmins.len > 0)
> +            {
> +                initStringInfo(&conflict_str);
> +                appendStringInfo(&conflict_str, "%s %d (%s).",
> +                                 conflict_sentence, xid, conflict_xmins.data);
> +                found_conflict = true;
> +                conflict_reason = conflict_str.data;
> +            }
> +        }


I think this is going to be a nightmare for translators, no? I'm not
clear as to why any of this is needed?



> +            /* ReplicationSlotDropPtr() would acquire the lock below */
> +            LWLockRelease(ReplicationSlotControlLock);

"would acquire"? I think it *does* acquire, right?



> @@ -2879,6 +2882,25 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
>              case PROCSIG_RECOVERY_CONFLICT_LOCK:
>              case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
>              case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
> +            case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
> +                /*
> +                 * For conflicts that require a logical slot to be dropped, the
> +                 * requirement is for the signal receiver to release the slot,
> +                 * so that it could be dropped by the signal sender. So for
> +                 * normal backends, the transaction should be aborted, just
> +                 * like for other recovery conflicts. But if it's walsender on
> +                 * standby, then it has to be killed so as to release an
> +                 * acquired logical slot.
> +                 */
> +                if (am_cascading_walsender &&
> +                    reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
> +                    MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
> +                {
> +                    RecoveryConflictPending = true;
> +                    QueryCancelPending = true;
> +                    InterruptPending = true;
> +                    break;
> +                }

Huh, I'm not following as to why that's needed for walsenders?


> @@ -1499,6 +1499,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
>                            dbentry->n_conflict_tablespace +
>                            dbentry->n_conflict_lock +
>                            dbentry->n_conflict_snapshot +
> +                          dbentry->n_conflict_logicalslot +
>                            dbentry->n_conflict_bufferpin +
>                            dbentry->n_conflict_startup_deadlock);

I think this probably needs adjustments in a few more places,
e.g. monitoring.sgml...

Thanks!

Andres Freund



Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Tue, Jul 9, 2019 at 11:14 PM Andres Freund <andres@anarazel.de> wrote:
> > +                     if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
> > +                             ereport(ERROR,
> > +                                             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > +                                              errmsg("logical decoding on standby requires "
> > +                                                             "wal_level >= logical on master")));
> > +                     break;
>
> Hm, this strikes me as a not quite good enough error message (same in
> other copies of the message). Perhaps something roughly like "could not
> continue with logical decoding, the primary's wal level is now too low
> (%u)"?

For what it's worth, I dislike that wording on grammatical grounds --
it sounds like two complete sentences joined by a comma, which is poor
style -- and think Amit's wording is probably fine.  We could fix the
grammatical issue by replacing the comma in your version with the word
"because," but that seems unnecessarily wordy to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Wed, 10 Jul 2019 at 08:44, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> Thanks for the new version! Looks like we're making progress towards
> something committable here.
>
> I think it'd be good to split the patch into a few pieces. I'd maybe do
> that like:
> 1) WAL format changes (plus required other changes)
> 2) Recovery conflicts with slots
> 3) logical decoding on standby
> 4) tests

All right. Will do that in the next patch set. For now, I have quickly
done the below changes in a single patch again (attached), in order to
get early comments if any.

>
>
> > @@ -589,6 +590,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
> >        */
> >
> >       /* XLOG stuff */
> > +     xlrec_reuse.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
> >       xlrec_reuse.node = rel->rd_node;
> >       xlrec_reuse.block = blkno;
> >       xlrec_reuse.latestRemovedXid = latestRemovedXid;
>
> Hm. I think we otherwise only ever use
> RelationIsAccessibleInLogicalDecoding() on tables, not on indexes.  And
> while I think this would mostly work for builtin catalog tables, it
> won't work for "user catalog tables" as RelationIsUsedAsCatalogTable()
> won't perform any useful checks for indexes.
>
> So I think we either need to look up the table, or pass it down.

Done. Passed down the heap rel.

>
>
> > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> > index d768b9b..10b7857 100644
> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
> > @@ -7149,12 +7149,13 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
> >   * see comments for vacuum_log_cleanup_info().
> >   */
> >  XLogRecPtr
> > -log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid)
> > +log_heap_cleanup_info(Relation rel, TransactionId latestRemovedXid)
> >  {
> >       xl_heap_cleanup_info xlrec;
> >       XLogRecPtr      recptr;
> >
> > -     xlrec.node = rnode;
> > +     xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(rel);
> > +     xlrec.node = rel->rd_node;
> >       xlrec.latestRemovedXid = latestRemovedXid;
> >
> >       XLogBeginInsert();
> > @@ -7190,6 +7191,7 @@ log_heap_clean(Relation reln, Buffer buffer,
> >       /* Caller should not call me on a non-WAL-logged relation */
> >       Assert(RelationNeedsWAL(reln));
> >
> > +     xlrec.onCatalogTable = RelationIsAccessibleInLogicalDecoding(reln);
>
> It'd probably be a good idea to add a comment to
> RelationIsUsedAsCatalogTable() that it better never invoke anything
> performing catalog accesses. Otherwise there's quite the danger with
> recursion (some operation doing RelationIsAccessibleInLogicalDecoding(),
> that then accessing the catalog, which in turn could again need to
> perform said operation, loop).

Added comments in RelationIsUsedAsCatalogTable() as well as
RelationIsAccessibleInLogicalDecoding() :

 * RelationIsAccessibleInLogicalDecoding
 * True if we need to log enough information to have access via
 * decoding snapshot.
 * This definition should not invoke anything that performs catalog
 * access. Otherwise, e.g. logging a WAL entry for catalog relation may
 * invoke this function, which will in turn do catalog access, which may
 * in turn cause another similar WAL entry to be logged, leading to
 * infinite recursion.

> >  /* Entry in pending-list of TIDs we need to revisit */
> > @@ -502,6 +503,7 @@ vacuumRedirectAndPlaceholder(Relation index, Buffer buffer)
> >       OffsetNumber itemnos[MaxIndexTuplesPerPage];
> >       spgxlogVacuumRedirect xlrec;
> >
> > +     xlrec.onCatalogTable = get_rel_logical_catalog(index->rd_index->indrelid);
> >       xlrec.nToPlaceholder = 0;
> >       xlrec.newestRedirectXid = InvalidTransactionId;
>
> We should document that it is safe to do catalog acceses here, because
> spgist is never used to back catalogs. Otherwise there would be an a
> endless recursion danger here.

Comments added.

>
> Did you check how hard it we to just pass down the heap relation?

It does look hard. Check my comments from an earlier reply, which I
have pasted below :

> This one seems harder, but I'm not actually sure why we make it so
> hard. It seems like we just ought to add the table to IndexVacuumInfo.

This means we have to add heapRel assignment wherever we initialize
IndexVacuumInfo structure, namely in lazy_vacuum_index(),
lazy_cleanup_index(), validate_index(), analyze_rel(), and make sure
these functions have a heap rel handle. Do you think we should do this
as part of this patch ?

>
>
> >  /*
> > + * Get the wal_level from the control file.
> > + */
> > +WalLevel
> > +GetActiveWalLevel(void)
> > +{
> > +     return ControlFile->wal_level;
> > +}
>
> What does "Active" mean here? I assume it's supposed to indicate that it
> could be different than what's configured in postgresql.conf, for a
> replica? If so, that should be mentioned.

Done. Here are the new comments :
 * Get the wal_level from the control file. For a standby, this value should be
 * considered as its active wal_level, because it may be different from what
 * was originally configured on standby.

>
>
> > +/*
> >   * Initialization of shared memory for XLOG
> >   */
> >  Size
> > @@ -9843,6 +9852,19 @@ xlog_redo(XLogReaderState *record)
> >               /* Update our copy of the parameters in pg_control */
> >               memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
> >
> > +             /*
> > +              * Drop logical slots if we are in hot standby and master does not have
> > +              * logical data.
>
> nitpick: s/master/the primary/ (mostly adding the "the", but I
> personally also prefer primary over master)
>
> s/logical data/a WAL level sufficient for logical decoding/
>
>
> > Don't bother to search for the slots if standby is
> > +              * running with wal_level lower than logical, because in that case,
> > +              * we would have either disallowed creation of logical slots or dropped
> > +              * existing ones.
>
> s/Don't bother/No need/
> s/slots/potentially conflicting logically slots/

Done.

>
> > +             if (InRecovery && InHotStandby &&
> > +                     xlrec.wal_level < WAL_LEVEL_LOGICAL &&
> > +                     wal_level >= WAL_LEVEL_LOGICAL)
> > +                     ResolveRecoveryConflictWithLogicalSlots(InvalidOid, InvalidTransactionId,
> > +                             gettext_noop("Logical decoding on standby requires wal_level >= logical on
master."));
>
>
>
> > diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
> > index 151c3ef..c1bd028 100644
> > --- a/src/backend/replication/logical/decode.c
> > +++ b/src/backend/replication/logical/decode.c
> > @@ -190,11 +190,23 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
> >                        * can restart from there.
> >                        */
> >                       break;
> > +             case XLOG_PARAMETER_CHANGE:
> > +             {
> > +                     xl_parameter_change *xlrec =
> > +                             (xl_parameter_change *) XLogRecGetData(buf->record);
> > +                     /* Cannot proceed if master itself does not have logical data */
>
> This needs an explanation as to how this is reachable...

Done. Here are the comments :
 * If wal_level on primary is reduced to less than logical, then we
 * want to prevent existing logical slots from being used.
 * Existing logical slot on standby gets dropped when this WAL
 * record is replayed; and further, slot creation fails when the
 * wal level is not sufficient; but all these operations are not
 * synchronized, so a logical slot may creep in while the wal_level
 * is being reduced.  Hence this extra check.

>
>
> > +                     if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
> > +                             ereport(ERROR,
> > +                                             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > +                                              errmsg("logical decoding on standby requires "
> > +                                                             "wal_level >= logical on master")));
> > +                     break;
>
> Hm, this strikes me as a not quite good enough error message (same in
> other copies of the message). Perhaps something roughly like "could not
> continue with logical decoding, the primary's wal level is now too low
> (%u)"?

Haven't changed this; there is another reply from Robert about it. I
think what you want to emphasize is that we can't *continue*. I am not
sure why the user can't infer that logical decoding could not continue
when we say "logical decoding requires wal_level >= ....".

>
>
> >       if (RecoveryInProgress())
> > -             ereport(ERROR,
> > -                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > -                              errmsg("logical decoding cannot be used while in recovery")));
> > +     {
> > +             /*
> > +              * This check may have race conditions, but whenever
> > +              * XLOG_PARAMETER_CHANGE indicates that wal_level has changed, we
> > +              * verify that there are no existing logical replication slots. And to
> > +              * avoid races around creating a new slot,
> > +              * CheckLogicalDecodingRequirements() is called once before creating
> > +              * the slot, and once when logical decoding is initially starting up.
> > +              */
> > +             if (GetActiveWalLevel() < WAL_LEVEL_LOGICAL)
> > +                     ereport(ERROR,
> > +                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > +                                      errmsg("logical decoding on standby requires "
> > +                                                     "wal_level >= logical on master")));
> > +     }
> >  }
> >
> >  /*
> > @@ -241,6 +240,8 @@ CreateInitDecodingContext(char *plugin,
> >       LogicalDecodingContext *ctx;
> >       MemoryContext old_context;
> >
> > +     CheckLogicalDecodingRequirements();
> > +
>
> This should reference the above explanation.

Done.

>
>
>
>
> >  /*
> > + * Permanently drop a conflicting replication slot. If it's already active by
> > + * another backend, send it a recovery conflict signal, and then try again.
> > + */
> > +static void
> > +ReplicationSlotDropConflicting(ReplicationSlot *slot)
>
>
> > +void
> > +ResolveRecoveryConflictWithLogicalSlots(Oid dboid, TransactionId xid,
> > +                                                                             char *conflict_reason)
> > +{
> > +                     /*
> > +                      * Build the conflict_str which will look like :
> > +                      * "Slot conflicted with xid horizon which was being increased
> > +                      * to 9012 (slot xmin: 1234, slot catalog_xmin: 5678)."
> > +                      */
> > +                     initStringInfo(&conflict_xmins);
> > +                     if (TransactionIdIsValid(slot_xmin) &&
> > +                             TransactionIdPrecedesOrEquals(slot_xmin, xid))
> > +                     {
> > +                             appendStringInfo(&conflict_xmins, "slot xmin: %d", slot_xmin);
> > +                     }
> > +                     if (TransactionIdIsValid(slot_catalog_xmin) &&
> > +                             TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))
> > +                             appendStringInfo(&conflict_xmins, "%sslot catalog_xmin: %d",
> > +                                                              conflict_xmins.len > 0 ? ", " : "",
> > +                                                              slot_catalog_xmin);
> > +
> > +                     if (conflict_xmins.len > 0)
> > +                     {
> > +                             initStringInfo(&conflict_str);
> > +                             appendStringInfo(&conflict_str, "%s %d (%s).",
> > +                                                              conflict_sentence, xid, conflict_xmins.data);
> > +                             found_conflict = true;
> > +                             conflict_reason = conflict_str.data;
> > +                     }
> > +             }
>
>
> I think this is going to be a nightmare for translators, no?

For translators, I think the .po files will have the required text,
because I have used gettext_noop() for both conflict_sentence and the
passed-in conflict_reason parameter. And "dropped conflicting slot." is
passed to ereport() as usual.  The remaining portion of the errdetail
is not language-specific; e.g. "slot" remains "slot".

> I'm not clear as to why any of this is needed?

The conflict can happen for either xmin or catalog_xmin or both, right
? The purpose of the above is to show only the conflicting xmin(s) out
of the two.

>
>
>
> > +                     /* ReplicationSlotDropPtr() would acquire the lock below */
> > +                     LWLockRelease(ReplicationSlotControlLock);
>
> "would acquire"? I think it *does* acquire, right?

Yes, changed to "will".

>
>
>
> > @@ -2879,6 +2882,25 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
> >                       case PROCSIG_RECOVERY_CONFLICT_LOCK:
> >                       case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
> >                       case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
> > +                     case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
> > +                             /*
> > +                              * For conflicts that require a logical slot to be dropped, the
> > +                              * requirement is for the signal receiver to release the slot,
> > +                              * so that it could be dropped by the signal sender. So for
> > +                              * normal backends, the transaction should be aborted, just
> > +                              * like for other recovery conflicts. But if it's walsender on
> > +                              * standby, then it has to be killed so as to release an
> > +                              * acquired logical slot.
> > +                              */
> > +                             if (am_cascading_walsender &&
> > +                                     reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
> > +                                     MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
> > +                             {
> > +                                     RecoveryConflictPending = true;
> > +                                     QueryCancelPending = true;
> > +                                     InterruptPending = true;
> > +                                     break;
> > +                             }
>
> Huh, I'm not following as to why that's needed for walsenders?

For normal backends, we ignore this signal if we aren't in a
transaction (block). But a walsender has no transaction, and yet we
cannot ignore the signal, because the walsender can keep a logical slot
acquired when it was spawned by "pg_recvlogical --start". So the only
way we can make it release the acquired slot is to kill it.

>
>
> > @@ -1499,6 +1499,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
> >                                                 dbentry->n_conflict_tablespace +
> >                                                 dbentry->n_conflict_lock +
> >                                                 dbentry->n_conflict_snapshot +
> > +                                               dbentry->n_conflict_logicalslot +
> >                                                 dbentry->n_conflict_bufferpin +
> >                                                 dbentry->n_conflict_startup_deadlock);
>
> I think this probably needs adjustments in a few more places,
> e.g. monitoring.sgml...

Oops, yeah, to search for similar additions, I had looked for
"conflict_snapshot" using cscope. I should have done the same using
"git grep".
Done now.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
tushar
Date:
On 07/10/2019 05:12 PM, Amit Khandekar wrote:
All right. Will do that in the next patch set. For now, I have quickly
done the below changes in a single patch again (attached), in order to
get early comments if any.
Thanks Amit for your patch. I am able to see an issue on the standby server (where the logical replication slot is created):
a) the size of the pg_wal folder is NOT decreasing even after firing the get_changes function
b) pg_wal files are not being recycled, and new files keep getting created after firing the get_changes function

Here are the detailed steps -
Create a directory named 'archive_dir' under /tmp (mkdir /tmp/archive_dir)
SR setup -
Master
.) Perform initdb (./initdb -D master --wal-segsize=2)
.) Open the postgresql.conf file and add these parameters at the end of the file:
wal_level='logical'
min_wal_size=4MB
max_wal_size=4MB
hot_standby_feedback = on
archive_mode=on
archive_command='cp %p /tmp/archive_dir/%f'
.) Start the server (./pg_ctl -D master start -l logsM -c)
.) Connect to psql and create a physical slot
    ->SELECT * FROM pg_create_physical_replication_slot('decoding_standby');
Standby -
.) Perform pg_basebackup (./pg_basebackup -D standby/ --slot=decoding_standby -R -v)
.) Open the standby's postgresql.conf file and add these 2 parameters at the end of the file:
  port=5555
  primary_slot_name = 'decoding_standby'
.) Start the standby server (./pg_ctl -D standby/ start -l logsS -c)
.) Connect to a psql terminal and create a logical replication slot
  ->SELECT * from pg_create_logical_replication_slot('standby', 'test_decoding');

MISC steps-
.) Connect to the master and create a table / insert rows ( create table t(n int); insert into t values (1); )
.) Connect to the standby and fire the get_changes function ( select * from pg_logical_slot_get_changes('standby',null,null); )
.) Run pgbench ( ./pgbench -i -s 10 postgres)
.) Check the pg_wal directory size of STANDBY
[centos@mail-arts bin]$ du -sch standby/pg_wal/
127M    standby/pg_wal/
127M    total
[centos@mail-arts bin]$
 
.)Connect to standby and fire get_changes function ( select * from pg_logical_slot_get_changes('standby',null,null); )
.)Check the pg_wal directory size of STANDBY
[centos@mail-arts bin]$ du -sch standby/pg_wal/
127M    standby/pg_wal/
127M    total
[centos@mail-arts bin]$


.)Restart both master and standby ( ./pg_ctl -D master restart -l logsM -c) and (./pg_ctl -D standby restart -l logsS -c )

.)Check the pg_wal directory size of STANDBY
[centos@mail-arts bin]$ du -sch standby/pg_wal/
127M    standby/pg_wal/
127M    total
[centos@mail-arts bin]$

and if we look at the pg_wal files, they keep growing and are not being reused.
-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2019-07-12 14:53:21 +0530, tushar wrote:
> On 07/10/2019 05:12 PM, Amit Khandekar wrote:
> > All right. Will do that in the next patch set. For now, I have quickly
> > done the below changes in a single patch again (attached), in order to
> > get early comments if any.
> Thanks Amit for your patch. i am able to see 1 issues  on Standby server -
> (where  logical replication slot created ) ,
> a)size of  pg_wal folder  is NOT decreasing even after firing get_changes
> function

Even after calling pg_logical_slot_get_changes() multiple times? What
does
SELECT * FROM pg_replication_slots; before and after multiple calls return?

Does manually forcing a checkpoint with CHECKPOINT; first on the primary
and then the standby "fix" the issue?


> b)pg_wal files are not recycling  and every time it is creating new files
> after firing get_changes function

I'm not sure what you mean by this. Are you saying that
pg_logical_slot_get_changes() causes WAL to be written?

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Tue, 16 Jul 2019 at 22:56, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-07-12 14:53:21 +0530, tushar wrote:
> > On 07/10/2019 05:12 PM, Amit Khandekar wrote:
> > > All right. Will do that in the next patch set. For now, I have quickly
> > > done the below changes in a single patch again (attached), in order to
> > > get early comments if any.
> > Thanks Amit for your patch. i am able to see 1 issues  on Standby server -
> > (where  logical replication slot created ) ,
> > a)size of  pg_wal folder  is NOT decreasing even after firing get_changes
> > function
>
> Even after calling pg_logical_slot_get_changes() multiple times? What
> does
> SELECT * FROM pg_replication_slots; before and after multiple calls return?
>
> Does manually forcing a checkpoint with CHECKPOINT; first on the primary
> and then the standby "fix" the issue?

I independently tried to reproduce this issue on my machine yesterday.
I observed that sometimes the files get cleaned up after two or more
pg_logical_slot_get_changes() calls, and sometimes I have to restart
the server to see the pg_wal files cleaned up. The behaviour is more or
less the same even for a logical slot on the *primary*.

Will investigate further with Tushar.


>
>
> > b)pg_wal files are not recycling  and every time it is creating new files
> > after firing get_changes function
>
> I'm not sure what you mean by this. Are you saying that
> pg_logical_slot_get_changes() causes WAL to be written?
>
> Greetings,
>
> Andres Freund



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
tushar
Date:
On 07/16/2019 10:56 PM, Andres Freund wrote:
Even after calling pg_logical_slot_get_changes() multiple times? What
does
SELECT * FROM pg_replication_slots; before and after multiple calls return?

Does manually forcing a checkpoint with CHECKPOINT; first on the primary
and then the standby "fix" the issue?

Yes, eventually it gets cleaned up - after firing the get_changes function multiple times, or a checkpoint, or even both.
We are able to see this same behavior on MASTER - with or without the patch.

but is this an old (existing) issue ?
b)pg_wal files are not recycling  and every time it is creating new files
after firing get_changes function
I'm not sure what you mean by this. Are you saying that
pg_logical_slot_get_changes() causes WAL to be written?

No, when I said new WAL files were created, I meant after each pgbench run, NOT after executing get_changes.

-- 
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Wed, 10 Jul 2019 at 17:12, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Wed, 10 Jul 2019 at 08:44, Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > Thanks for the new version! Looks like we're making progress towards
> > something committable here.
> >
> > I think it'd be good to split the patch into a few pieces. I'd maybe do
> > that like:
> > 1) WAL format changes (plus required other changes)
> > 2) Recovery conflicts with slots
> > 3) logical decoding on standby
> > 4) tests
>
> All right. Will do that in the next patch set. For now, I have quickly
> done the below changes in a single patch again (attached), in order to
> get early comments if any.

Attached are the split patches. Included is an additional patch that
has doc changes. Here is what I have added in the docs. Pasting it
here so that all can easily spot how it is supposed to behave, and to
confirm that we are all on the same page :

"A logical replication slot can also be created on a hot standby. To
prevent VACUUM from removing required rows from the system catalogs,
hot_standby_feedback should be set on the standby. In spite of that,
if any required rows get removed on standby, the slot gets dropped.
Existing logical slots on standby also get dropped if wal_level on
primary is reduced to less than 'logical'.

For a logical slot to be created, it builds a historic snapshot, for
which information of all the currently running transactions is
essential. On primary, this information is available, but on standby,
this information has to be obtained from primary. So, slot creation
may wait for some activity to happen on the primary. If the primary is
idle, creating a logical slot on standby may take a noticeable time."
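
To illustrate the last point (just a rough sketch, reusing the slot and
plugin names from tushar's steps earlier in this thread) :

    -- On the standby: this may block while the standby waits for
    -- information about the primary's running transactions.
    SELECT * FROM pg_create_logical_replication_slot('standby', 'test_decoding');

    -- On the primary: if it is otherwise idle, forcing a checkpoint is one
    -- way to get a running-transactions record written and streamed to the
    -- standby, letting the slot creation above finish.
    CHECKPOINT;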

Attachment

Re: Minimal logical decoding on standbys

From
Alvaro Herrera
Date:
On 2019-Jul-19, Amit Khandekar wrote:

> Attached are the split patches. Included is an additional patch that
> has doc changes. Here is what I have added in the docs. Pasting it
> here so that all can easily spot how it is supposed to behave, and to
> confirm that we are all on the same page :

... Apparently, this patch was not added to the commitfest for some
reason; and another patch that *is* in the commitfest has been said to
depend on this one (Petr's https://commitfest.postgresql.org/24/1961/
which hasn't been updated in quite a while and has received no feedback
at all, not even from the listed reviewer Shaun Thomas).  To make
matters worse, Amit's patchset no longer applies.

What I would like to do is add a link to this thread to CF's 1961 entry
and then update all these patches, in order to get things moving.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Tue, 3 Sep 2019 at 23:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2019-Jul-19, Amit Khandekar wrote:
>
> > Attached are the split patches. Included is an additional patch that
> > has doc changes. Here is what I have added in the docs. Pasting it
> > here so that all can easily spot how it is supposed to behave, and to
> > confirm that we are all on the same page :
>
> ... Apparently, this patch was not added to the commitfest for some
> reason; and another patch that *is* in the commitfest has been said to
> depend on this one (Petr's https://commitfest.postgresql.org/24/1961/
> which hasn't been updated in quite a while and has received no feedback
> at all, not even from the listed reviewer Shaun Thomas).  To make
> matters worse, Amit's patchset no longer applies.
>
> What I would like to do is add a link to this thread to CF's 1961 entry
> and then update all these patches, in order to get things moving.

Hi Alvaro,

Thanks for notifying about this. Will work this week on rebasing this
patchset and putting it into the 2019-11 commit fest.



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Mon, 9 Sep 2019 at 16:06, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Tue, 3 Sep 2019 at 23:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >
> > On 2019-Jul-19, Amit Khandekar wrote:
> >
> > > Attached are the split patches. Included is an additional patch that
> > > has doc changes. Here is what I have added in the docs. Pasting it
> > > here so that all can easily spot how it is supposed to behave, and to
> > > confirm that we are all on the same page :
> >
> > ... Apparently, this patch was not added to the commitfest for some
> > reason; and another patch that *is* in the commitfest has been said to
> > depend on this one (Petr's https://commitfest.postgresql.org/24/1961/
> > which hasn't been updated in quite a while and has received no feedback
> > at all, not even from the listed reviewer Shaun Thomas).  To make
> > matters worse, Amit's patchset no longer applies.
> >
> > What I would like to do is add a link to this thread to CF's 1961 entry
> > and then update all these patches, in order to get things moving.
>
> Hi Alvaro,
>
> Thanks for notifying about this. Will work this week on rebasing this
> patchset and putting it into the 2019-11 commit fest.

Rebased patch set attached.

Added in the Nov commitfest : https://commitfest.postgresql.org/25/2283/



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Fri, Sep 13, 2019 at 7:20 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Thanks for notifying about this. Will work this week on rebasing this
> > patchset and putting it into the 2019-11 commit fest.
>
> Rebased patch set attached.
>
> Added in the Nov commitfest : https://commitfest.postgresql.org/25/2283/

I took a bit of a look at
0004-New-TAP-test-for-logical-decoding-on-standby.patch and saw some
things I don't like in terms of general code quality:

- Not many comments. I think each set of tests should have a block
comment at the top explaining clearly what it's trying to test.
- print_phys_xmin and print_logical_xmin don't print anything.
- They are also identical to each other except that they each operate
on a different hard-coded slot name.
- They are also identical to wait_for_xmins except that they don't wait.
- create_logical_slots creates two slots whose names are hard-coded
using code that is cut-and-pasted.
- The same code is also cut-and-pasted into two other places in the file.
- Why does that cut-and-pasted code use BAIL_OUT(), which aborts the
entire test run, instead of die, which just aborts the current test
file?
- cmp_ok() message in check_slots_dropped() has trailing whitespace.
- make_slot_active() and check_slots_dropped(), at least, use global
variables; is that really necessary?
- In particular, $return is used only in one function and doesn't need
to survive across calls; why is it not a local variable?
- Depending on whether $return ends up true or false, the number of
executed tests will differ; so besides any actual test failures,
you'll get complaints about not executing exactly 58 tests.
- $backup_name only ever has one value, but for some reason the
variable is created at the top of the test file and then initialized
later. Just do my $backup_name = 'b1' near where it's first used, or
ditch the variable and write 'b1' in each of the three places it's
used.
- Some of the calls to wait_for_xmins() save the return values into
local variables but then do nothing with those values before they are
overwritten. Either it's wrong that we're saving them into local
variables, or it's wrong that we're not doing anything with them.
- test_oldest_xid_retention() is called only once; it basically acts
as a wrapper for one group of tests. You could argue against that
approach, but I actually think it's a nice style which makes the code
more self-documenting. However, it's not used consistently; all the
other groups of tests are written directly as toplevel code.
- The code in that function verifies that oldestXid is found in
pg_controldata's output, but does not check the same for NextXID.
- Is there a reason the code in that function prints debugging output?
Seems like a leftover.
- I think it might be an idea to move the tests for recovery
conflict/slot drop to a separate test file, so that we have one file
for the xmin-related testing and another for the recovery conflict
testing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Wed, 18 Sep 2019 at 19:34, Robert Haas <robertmhaas@gmail.com> wrote:
> I took a bit of a look at
> 0004-New-TAP-test-for-logical-decoding-on-standby.patch and saw some
> things I don't like in terms of general code quality:
>
> - Not many comments. I think each set of tests should have a block
> comment at the top explaining clearly what it's trying to test.
Done for the initial couple of test groups so that the groups can be
spotted clearly. Please check.

> - print_phys_xmin and print_logical_xmin don't print anything.
> - They are also identical to each other except that they each operate
> on a different hard-coded slot name.
> - They are also identical to wait_for_xmins except that they don't wait.
Re-worked this part of the code. Now a single function
get_slot_xmins(slot_name) is used to return the slot's xmins. It
figures out from the slot name whether the slot belongs to the master
or the standby. Also, avoided the hardcoded 'master_physical' and
'standby_logical' names.
Removed the 'node' parameter of wait_for_xmins(), since now we can
figure out the node name from the slot name.

> - create_logical_slots creates two slots whose names are hard-coded
> using code that is cut-and-pasted.
> - The same code is also cut-and-pasted into two other places in the file.
Didn't remove the hardcoding of slot names, because it's not
convenient to return those from create_logical_slots() and use them in
check_slots_dropped(). As for the code that was cut-and-pasted into
create_logical_slots() and the other two places in the file, I have now
moved that repeated code into create_logical_slots() itself.

> - Why does that cut-and-pasted code use BAIL_OUT(), which aborts the
> entire test run, instead of die, which just aborts the current test
> file?
Oops. Didn't realize that it bails out from the complete test run.
Replaced it with die().

> - cmp_ok() message in check_slots_dropped() has trailing whitespace.
Removed them.

> - make_slot_active() and check_slots_dropped(), at least, use global
> variables; is that really necessary?
I guess you are referring to $handle. make_slot_active() now returns
this handle as its return value, which is then passed to
check_slots_dropped(). Retained the node_replica global variable rather
than passing it as a function param, because these functions always use
node_replica, and never node_master.

> - In particular, $return is used only in one function and doesn't need
> to survive across calls; why is it not a local variable?
> - Depending on whether $return ends up true or false, the number of
> executed tests will differ; so besides any actual test failures,
> you'll get complaints about not executing exactly 58 tests.
Right. Made it local.

> - $backup_name only ever has one value, but for some reason the
> variable is created at the top of the test file and then initialized
> later. Just do my $backup_name = 'b1' near where it's first used, or
> ditch the variable and write 'b1' in each of the three places it's
> used.
Declared $backup_name near it's first usage.

> - Some of the calls to wait_for_xmins() save the return values into
> local variables but then do nothing with those values before they are
> overwritten. Either it's wrong that we're saving them into local
> variables, or it's wrong that we're not doing anything with them.
Yeah, at many places it was redundant to save them into variables, so I
removed the assignment of the function return values at those places.

> - test_oldest_xid_retention() is called only once; it basically acts
> as a wrapper for one group of tests. You could argue against that
> approach, but I actually think it's a nice style which makes the code
> more self-documenting. However, it's not used consistently; all the
> other groups of tests are written directly as toplevel code.
Removed the function and kept its code at the top level. I think the
test group header comments are sufficient for documenting each group
of tests, so there is no need to make a separate function for each
group.

> - The code in that function verifies that oldestXid is found in
> pg_controldata's output, but does not check the same for NextXID.
Actually, there is no need to check NextXID. We want to check just
oldestXid. Removed its usage.

> - Is there a reason the code in that function prints debugging output?
> Seems like a leftover.
Yeah, right. Removed them.

> - I think it might be an idea to move the tests for recovery
> conflict/slot drop to a separate test file, so that we have one file
> for the xmin-related testing and another for the recovery conflict
> testing.
Actually in some of the conflict-recovery testcases, I am still using
wait_for_xmins() so that we could test the xmin values back after we
drop the slots. So xmin-related testing is embedded in these recovery
tests as well. We can move the wait_for_xmins() function to some
common file and then do the split of this file, but then effectively
some of the xmin-testing would go into the recovery-related test file,
which did not sound sensible to me. What do you say ?

Attached patch series has the test changes addressed.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Thu, Sep 26, 2019 at 5:14 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Actually in some of the conflict-recovery testcases, I am still using
> wait_for_xmins() so that we could test the xmin values back after we
> drop the slots. So xmin-related testing is embedded in these recovery
> tests as well. We can move the wait_for_xmins() function to some
> common file and then do the split of this file, but then effectively
> some of the xmin-testing would go into the recovery-related test file,
> which did not sound sensible to me. What do you say ?

I agree we don't want code duplication, but I think we could reduce
the code duplication to a pretty small amount with a few cleanups.

I don't think wait_for_xmins() looks very well-designed. It goes to
trouble of returning a value, but only 2 of the 6 call sites pay
attention to the returned value.  I think we should change the
function so that it doesn't return anything and have the callers that
want a return value call get_slot_xmins() after wait_for_xmins().

And then I think we should turn around and get rid of get_slot_xmins()
altogether. Instead of:

my ($xmin, $catalog_xmin) = get_slot_xmins($master_slot);
is($xmin, '', "xmin null");
is($catalog_xmin, '', "catalog_xmin null");

We can write:

my $slot = $node_master->slot($master_slot);
is($slot->{'xmin'}, '', "xmin null");
is($slot->{'catalog_xmin'}, '', "catalog xmin null");

...which is not really any longer or harder to read, but does
eliminate the need for one function definition.

Then I think we should change wait_for_xmins so that it takes three
arguments rather than two: $node, $slotname, $check_expr.  With that
and the previous change, we can get rid of get_node_from_slotname().

At that point, the body of wait_for_xmins() would consist of a single
call to $node->poll_query_until() or die(), which doesn't seem like
too much code to duplicate into a new file.

Looking at it at a bit more, though, I wonder why the recovery
conflict scenario is even using wait_for_xmins().  It's hard-coded to
check the state of the master_physical slot, which isn't otherwise
manipulated by the recovery conflict tests. What's the point of
testing that a slot which had xmin and catalog_xmin NULL before the
test started (line 414) and which we haven't changed since still has
those values at two different points during the test (lines 432, 452)?
Perhaps I'm missing something here, but it seems like this is just an
inadvertent entangling of these scenarios with the previous scenarios,
rather than anything that necessarily needs to be connected together.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Fri, 27 Sep 2019 at 01:57, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Sep 26, 2019 at 5:14 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Actually in some of the conflict-recovery testcases, I am still using
> > wait_for_xmins() so that we could test the xmin values back after we
> > drop the slots. So xmin-related testing is embedded in these recovery
> > tests as well. We can move the wait_for_xmins() function to some
> > common file and then do the split of this file, but then effectively
> > some of the xmin-testing would go into the recovery-related test file,
> > which did not sound sensible to me. What do you say ?
>
> I agree we don't want code duplication, but I think we could reduce
> the code duplication to a pretty small amount with a few cleanups.
>
> I don't think wait_for_xmins() looks very well-designed. It goes to
> trouble of returning a value, but only 2 of the 6 call sites pay
> attention to the returned value.  I think we should change the
> function so that it doesn't return anything and have the callers that
> want a return value call get_slot_xmins() after wait_for_xmins().
Yeah, that can be done.

>
> And then I think we should turn around and get rid of get_slot_xmins()
> altogether. Instead of:
>
> my ($xmin, $catalog_xmin) = get_slot_xmins($master_slot);
> is($xmin, '', "xmin null");
> is($catalog_xmin, '', "catalog_xmin null");
>
> We can write:
>
> my $slot = $node_master->slot($master_slot);
> is($slot->{'xmin'}, '', "xmin null");
> is($slot->{'catalog_xmin'}, '', "catalog xmin null");
>
> ...which is not really any longer or harder to read, but does
> eliminate the need for one function definition.
Agreed.

>
> Then I think we should change wait_for_xmins so that it takes three
> arguments rather than two: $node, $slotname, $check_expr.  With that
> and the previous change, we can get rid of get_node_from_slotname().
>
> At that point, the body of wait_for_xmins() would consist of a single
> call to $node->poll_query_until() or die(), which doesn't seem like
> too much code to duplicate into a new file.

Earlier it used to have 3 params, the same ones you mentioned. I
removed $node for caller convenience.

>
> Looking at it at a bit more, though, I wonder why the recovery
> conflict scenario is even using wait_for_xmins().  It's hard-coded to
> check the state of the master_physical slot, which isn't otherwise
> manipulated by the recovery conflict tests. What's the point of
> testing that a slot which had xmin and catalog_xmin NULL before the
> test started (line 414) and which we haven't changed since still has
> those values at two different points during the test (lines 432, 452)?
> Perhaps I'm missing something here, but it seems like this is just an
> inadvertent entangling of these scenarios with the previous scenarios,
> rather than anything that necessarily needs to be connected together.

In the "Drop slot" test scenario, we are testing that after we
manually drop the slot on standby, the master catalog_xmin should be
back to NULL. Hence, the call to wait_for_xmins().
And in the "Scenario 1 : hot_standby_feedback off", wait_for_xmins()
is called the first time only as a mechanism to ensure that
"hot_standby_feedback = off" has taken effect. At the end of this
test, wait_for_xmins() again is called only to ensure that
hot_standby_feedback = on has taken effect.
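
Roughly, the check in the "Drop slot" scenario boils down to something
like this (only a sketch; the slot names are the ones used in the test) :

    -- On the standby: drop the logical slot.
    SELECT pg_drop_replication_slot('standby_logical');

    -- On the master: once the next hot standby feedback message is
    -- processed, the physical slot's catalog_xmin should be back to NULL.
    SELECT catalog_xmin
    FROM pg_catalog.pg_replication_slots
    WHERE slot_name = 'master_physical';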

Preferably I want wait_for_xmins() to get rid of the $node parameter,
because we can deduce it using slot name. But that requires having
get_node_from_slotname(). Your suggestion was to remove
get_node_from_slotname() and add back the $node param so as to reduce
duplicate code. Instead, how about keeping  wait_for_xmins() in the
PostgresNode.pm() ? This way, we won't have duplication, and also we
can get rid of param $node. This is just my preference; if you are
quite inclined to not have get_node_from_slotname(), I will go with
your suggestion.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Fri, Sep 27, 2019 at 12:41 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Preferably I want wait_for_xmins() to get rid of the $node parameter,
> because we can deduce it using slot name. But that requires having
> get_node_from_slotname(). Your suggestion was to remove
> get_node_from_slotname() and add back the $node param so as to reduce
> duplicate code. Instead, how about keeping  wait_for_xmins() in the
> PostgresNode.pm() ? This way, we won't have duplication, and also we
> can get rid of param $node. This is just my preference; if you are
> quite inclined to not have get_node_from_slotname(), I will go with
> your suggestion.

I'd be inclined not to have it.  I think having a lookup function to
go from slot name -> node is strange; it doesn't really simplify
things that much for the caller, and it makes the logic harder to
follow. It would break outright if you had the same slot name on
multiple nodes, which is a perfectly reasonable scenario.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Fri, 27 Sep 2019 at 23:21, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Sep 27, 2019 at 12:41 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Preferably I want wait_for_xmins() to get rid of the $node parameter,
> > because we can deduce it using slot name. But that requires having
> > get_node_from_slotname(). Your suggestion was to remove
> > get_node_from_slotname() and add back the $node param so as to reduce
> > duplicate code. Instead, how about keeping  wait_for_xmins() in the
> > PostgresNode.pm() ? This way, we won't have duplication, and also we
> > can get rid of param $node. This is just my preference; if you are
> > quite inclined to not have get_node_from_slotname(), I will go with
> > your suggestion.
>
> I'd be inclined not to have it.  I think having a lookup function to
> go from slot name -> node is strange; it doesn't really simplify
> things that much for the caller, and it makes the logic harder to
> follow. It would break outright if you had the same slot name on
> multiple nodes, which is a perfectly reasonable scenario.

Alright. Attached is the updated patch that splits the file into two
files, one that does only xmin related testing, and the other test
file that tests conflict recovery scenarios, and also one scenario
where drop-database drops the slots on the database on standby.
Removed get_slot_xmins() and get_node_from_slotname().
Renamed 'replica' to 'standby'.
Used node->backup() function instead of pg_basebackup command.
Renamed $master_slot to $master_slotname, similarly for $standby_slot.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Mon, Sep 30, 2019 at 7:35 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Alright. Attached is the updated patch that splits the file into two
> files, one that does only xmin related testing, and the other test
> file that tests conflict recovery scenarios, and also one scenario
> where drop-database drops the slots on the database on standby.
> Removed get_slot_xmins() and get_node_from_slotname().
> Renamed 'replica' to 'standby'.
> Used node->backup() function instead of pg_basebackup command.
> Renamed $master_slot to $master_slotname, similarly for $standby_slot.

In general, I think this code is getting a lot clearer and easier to
understand in these last few revisions.

Why does create_logical_slot_on_standby include sleep(1)? Does the
test fail if you take that out? If so, it's probably going to fail on
the buildfarm even with that included, because some of the buildfarm
machines are really slow (e.g. because they use CLOBBER_CACHE_ALWAYS,
or because they're running on a shared system with low hardware
specifications and an ancient disk).

Similarly for the sleep(1) just after you VACUUM FREEZE all the databases.

I'm not sure what the point of the wait_for_xmins() stuff is in
019_standby_logical_decoding_conflicts.pl. Isn't that just duplicating
stuff we've already tested in 018?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Mon, 30 Sep 2019 at 23:38, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 30, 2019 at 7:35 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Alright. Attached is the updated patch that splits the file into two
> > files, one that does only xmin related testing, and the other test
> > file that tests conflict recovery scenarios, and also one scenario
> > where drop-database drops the slots on the database on standby.
> > Removed get_slot_xmins() and get_node_from_slotname().
> > Renamed 'replica' to 'standby'.
> > Used node->backup() function instead of pg_basebackup command.
> > Renamed $master_slot to $master_slotname, similarly for $standby_slot.
>
> In general, I think this code is getting a lot clearer and easier to
> understand in these last few revisions.
>
> Why does create_logical_slot_on_standby include sleep(1)? Does the
> test fail if you take that out?
It has not failed for me, but I think it may sometimes happen that the
'pg_recvlogical' system command is so slow to start that the subsequent
checkpoint command runs concurrently before it even tries to create the
slot, causing a "running transactions" record to arrive on the standby
*before* pg_recvlogical has decided the starting point from which to
receive records. So effectively pg_recvlogical can miss this record.

> If so, it's probably going to fail on
> the buildfarm even with that included, because some of the buildfarm
> machines are really slow (e.g. because they use CLOBBER_CACHE_ALWAYS,
> or because they're running on a shared system with low hardware
> specifications and an ancient disk).

Yeah right, then it makes sense to explicitly wait for the slot to
calculate the restart_lsn, and only then run the checkpoint command.
Did that now.
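
For reference, the condition being polled for is essentially this (a
sketch; 'standby' is the slot name from the steps earlier in this
thread) :

    -- On the standby: poll until the new logical slot has computed its
    -- restart_lsn, and only then run the checkpoint on the primary.
    SELECT restart_lsn IS NOT NULL
    FROM pg_catalog.pg_replication_slots
    WHERE slot_name = 'standby';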

>
> Similarly for the sleep(1) just after you VACUUM FREEZE all the databases.
I checked that the VACUUM command returns only after updating
pg_database.datfrozenxid. So now I think it's safe to run the
checkpoint command immediately after the vacuum, and I have removed the
sleep().
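
In other words, the sequence can simply be (a sketch; run on the
primary) :

    VACUUM FREEZE;
    -- datfrozenxid is already updated by the time VACUUM returns:
    SELECT datname, datfrozenxid FROM pg_database;
    CHECKPOINT;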

Attached is the updated patch series.

>
> I'm not sure what the point of the wait_for_xmins() stuff is in
> 019_standby_logical_decoding_conflicts.pl. Isn't that just duplicating
> stuff we've already tested in 018?
Actually, in 019, the function call is more to wait for
hot_standby_feedback to take effect.


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
Craig Ringer
Date:
On Sat, 6 Apr 2019 at 07:15, Andres Freund <andres@anarazel.de> wrote:
Hi,

Thanks for the new version of the patch.  Btw, could you add Craig as a
co-author in the commit message of the next version of the patch? Don't
want to forget him.

That's kind, but OTOH you've picked it up and done most of the work by now.

I'm swamped with extension related work and have been able to dedicate frustratingly little time to -hackers and the many patches I'd like to be working on. 

Re: Minimal logical decoding on standbys

From
Craig Ringer
Date:
On Tue, 1 Oct 2019 at 02:08, Robert Haas <robertmhaas@gmail.com> wrote:

Why does create_logical_slot_on_standby include sleep(1)?

Yeah, we really need to avoid sleeps in regression tests.

If you need to wait, use a DO block that polls the required condition, and wrap the sleep in that with a much longer total timeout. In BDR and pglogical's pg_regress tests I've started to use a shared prelude that sets a bunch of psql variables that I use as helpers for this sort of thing, so I can just write :wait_slot_ready instead of repeating the same SQL command a pile of times across the tests.
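
A minimal sketch of that pattern, assuming we want to wait until the
master's physical slot reports a catalog_xmin (the slot name and the
timeout are made up):

    DO $$
    DECLARE
        tries int := 0;
    BEGIN
        LOOP
            -- the condition we are waiting for
            PERFORM 1 FROM pg_catalog.pg_replication_slots
             WHERE slot_name = 'master_physical' AND catalog_xmin IS NOT NULL;
            EXIT WHEN FOUND;
            tries := tries + 1;
            IF tries > 600 THEN
                RAISE 'timed out waiting for catalog_xmin on master_physical';
            END IF;
            PERFORM pg_sleep(0.1);  -- short sleep inside a much longer overall timeout
        END LOOP;
    END
    $$;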

That reminds me: I'm trying to find the time to write a couple of patches to pg_regress to help make life easier too:

- Prelude and postscript .psql files that run before/after every test step to set variables, do cleanup etc

- Test header comment that can be read by pg_regress to set a per-test timeout

- Allow pg_regress to time out individual tests and continue with the next test

- Test result postprocessing by script, where pg_regress writes the raw test results then postprocesses it with a script before diffing the postprocessed output. This would allow us to have things like /* BEGIN_TESTIGNORE */ ... /* END_TESTIGNORE */ blocks for diagnostic output that we want available but don't want to be part of actual test output. Or filter out NOTICEs that vary in output. That sort of thing.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise

Re: Minimal logical decoding on standbys

From
nil socket
Date:
Sorry to intervene in between,

But what about a timeline change?


Thank you.

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Thu, 10 Oct 2019 at 05:49, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> On Tue, 1 Oct 2019 at 02:08, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>>
>> Why does create_logical_slot_on_standby include sleep(1)?
>
>
> Yeah, we really need to avoid sleeps in regression tests.

Yeah, I have already got rid of the sleeps from patch-series version 4
onwards.

By the way, a couple of patches from the patch series had bitrotted.
Attached is the rebased version.

Thanks
-Amit Khandekar

Attachment

Re: Minimal logical decoding on standbys

From
Rahila Syed
Date:
Hi Amit,

I am reading about this feature and reviewing it.
To start with, I reviewed the patch: 0005-Doc-changes-describing-details-about-logical-decodin.patch. 

>prevent VACUUM from removing required rows from the system catalogs,
>hot_standby_feedback should be set on the standby. In spite of that,
>if any required rows get removed on standby, the slot gets dropped.
IIUC, you mean `if any required rows get removed on *the master* the slot gets
dropped`, right?

Thank you,
--
Rahila Syed
Performance Engineer
2ndQuadrant 
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Thu, 7 Nov 2019 at 14:02, Rahila Syed <rahila.syed@2ndquadrant.com> wrote:
>
> Hi Amit,
>
> I am reading about this feature and reviewing it.
> To start with, I reviewed the patch: 0005-Doc-changes-describing-details-about-logical-decodin.patch.

Thanks for picking up the patch review.

Your reply somehow spawned a new mail thread, so I have gone back to
this thread to reply.

>
> >prevent VACUUM from removing required rows from the system catalogs,
> >hot_standby_feedback should be set on the standby. In spite of that,
> >if any required rows get removed on standby, the slot gets dropped.
> IIUC, you mean `if any required rows get removed on *the master* the slot gets
> dropped`, right?

Yes, you are right. In fact, I think it is not necessary to explicitly
mention where the rows get removed. So I have just omitted "on
standby". Will include this change in the next patch versions.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Rahila Syed
Date:
Hi Amit,

Please see following comments: 

1.  0002-Add-info-in-WAL-records-in-preparation-for-logical-s.patch

--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@
 
 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"

The above header inclusion is not necessary as the code compiles fine without it. 
Also, this patch does not apply cleanly on latest master due to the above line.

2. The following test fails with an error.
make -C src/test/recovery/ check PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl
#   Failed test 'physical catalog_xmin not null'
#   at t/018_standby_logical_decoding_xmins.pl line 120.
#          got: ''
#     expected: anything else

#   Failed test 'physical catalog_xmin not null'
#   at t/018_standby_logical_decoding_xmins.pl line 141.
#          got: ''
#     expected: anything else

#   Failed test 'physical catalog_xmin not null'
#   at t/018_standby_logical_decoding_xmins.pl line 159.
#          got: ''
#     expected: anything else
t/018_standby_logical_decoding_xmins.pl .. 20/27 # poll_query_until timed out executing this query:


Physical catalog_xmin is NULL on the master after logical slot creation on the standby.

Due to this, the below command in the test fails with a syntax error, as it constructs the SQL query using the catalog_xmin value
obtained above:

# SELECT catalog_xmin::varchar::int >
# FROM pg_catalog.pg_replication_slots
# WHERE slot_name = 'master_physical';
#
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# ERROR:  syntax error at or near "FROM"
# LINE 3:   FROM pg_catalog.pg_replication_slots

Thank you,
--
Rahila Syed
Performance Engineer
2ndQuadrant 
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Thu, 12 Dec 2019 at 15:28, Rahila Syed <rahila.syed@2ndquadrant.com> wrote:
>
> Hi Amit,
>
>
> 2.  Following test fails with error.
> make -C src/test/recovery/ check PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl
> #   Failed test 'physical catalog_xmin not null'
> #   at t/018_standby_logical_decoding_xmins.pl line 120.
> #          got: ''
> #     expected: anything else
>
> #   Failed test 'physical catalog_xmin not null'
> #   at t/018_standby_logical_decoding_xmins.pl line 141.
> #          got: ''
> #     expected: anything else
>
> #   Failed test 'physical catalog_xmin not null'
> #   at t/018_standby_logical_decoding_xmins.pl line 159.
> #          got: ''
> #     expected: anything else
> t/018_standby_logical_decoding_xmins.pl .. 20/27 # poll_query_until timed out executing this query:
> #
>
> Physical catalog_xmin is NULL on master after logical slot creation on standby .

Hi, do you consistently get this failure on your machine? I am not
able to get this failure, but I am going to analyze when/how this can
fail. Thanks





--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Rahila Syed
Date:
Hi,

Hi, do you consistently get this failure on your machine ? I am not
able to get this failure, but I am going to analyze when/how this can
fail. Thanks

Yes, I am getting it each time I run make -C src/test/recovery/ check PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl
Also, there aren't any errors in logs indicating the cause.

--
Rahila Syed
Performance Engineer
2ndQuadrant 
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Thu, 19 Dec 2019 at 01:02, Rahila Syed <rahila.syed@2ndquadrant.com> wrote:
>
> Hi,
>
>> Hi, do you consistently get this failure on your machine ? I am not
>> able to get this failure, but I am going to analyze when/how this can
>> fail. Thanks
>>
> Yes, I am getting it each time I run make -C src/test/recovery/ check
PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl
> Also, there aren't any errors in logs indicating the cause.

Thanks for the reproduction. Finally I could reproduce the behaviour.
It occurs once in 7-8 runs of the test on my machine. The issue is:
on master, the catalog_xmin does not immediately get updated; it
happens only after the hot standby feedback reaches the master. And I
haven't used wait_for_xmins() for these failing cases. I should use
that. Working on it now ...

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Tue, 24 Dec 2019 at 14:02, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Thu, 19 Dec 2019 at 01:02, Rahila Syed <rahila.syed@2ndquadrant.com> wrote:
> >
> > Hi,
> >
> >> Hi, do you consistently get this failure on your machine ? I am not
> >> able to get this failure, but I am going to analyze when/how this can
> >> fail. Thanks
> >>
> > Yes, I am getting it each time I run make -C src/test/recovery/ check
PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl
> > Also, there aren't any errors in logs indicating the cause.
>
> Thanks for the reproduction. Finally I could reproduce the behaviour.
> It occurs once in 7-8 runs of the test on my machine. The issue is :
> on master, the catalog_xmin does not immediately get updated. It
> happens only after the hot standby feedback reaches on master. And I
> haven't used wait_for_xmins() for these failing cases. I should use
> that. Working on the same ...

As mentioned above, I have used wait_for_xmins() so that we wait
for the xmins to be updated after hot standby feedback is processed.
In one of the 3 scenarios where it failed for you, I removed the check
at the second place because it was redundant. At the 3rd place, I made
some appropriate changes with detailed comments. Please check.
Basically we are checking that the master's physical slot catalog_xmin
has advanced, but not beyond the standby's logical slot catalog_xmin.
To make sure the master's xmins are updated, I call txid_current() and
then wait for the master's xmin to advance after hot standby feedback.
That way we know the xmin/catalog_xmins are up to date because of
hot standby feedback, and we can then check that the master's physical
slot catalog_xmin has reached the value of the standby's catalog_xmin
but not gone past it.
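
To illustrate, the waiting pattern is roughly the following (a simplified
sketch, not the exact test code; the slot names are only examples):

    # Burn an xid on the master so there is something for the xmins to
    # advance to once hot standby feedback is processed.
    $node_master->safe_psql('postgres', 'SELECT txid_current()');

    # Wait until hot standby feedback has propagated the standby's
    # requirements into the master's physical slot.
    wait_for_xmins($node_master, 'master_physical',
        "xmin IS NOT NULL AND catalog_xmin IS NOT NULL");

    # The master's physical slot catalog_xmin must have advanced, but not
    # beyond the standby's logical slot catalog_xmin.
    cmp_ok($node_master->slot('master_physical')->{'catalog_xmin'},
        '<=', $node_standby->slot('standby_logical')->{'catalog_xmin'},
        'master catalog_xmin not past standby catalog_xmin');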

I have also moved the "wal_receiver_status_interval = 1" setting from
the master to the standby; it was wrongly kept on the master. This now
reduces the test time by half on my machine.
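
For reference, that setting now lives in the standby's configuration, along
the lines of (sketch):

    $node_standby->append_conf('postgresql.conf', q{
    wal_receiver_status_interval = 1
    });

wal_receiver_status_interval only has an effect on the node running the WAL
receiver, i.e. the standby, since it controls how often the standby reports
status (and hot standby feedback) back to the master.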

Attached patch set v5 has only the test changes. Please check if now
the test fails for you.

>
> --
> Thanks,
> -Amit Khandekar
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachment

Re: Minimal logical decoding on standbys

From
Rahila Syed
Date:
Hi Amit,

Can you please rebase the patches as they don't apply on latest master?

Thank you,
Rahila Syed


On Thu, 26 Dec 2019 at 16:36, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Tue, 24 Dec 2019 at 14:02, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Thu, 19 Dec 2019 at 01:02, Rahila Syed <rahila.syed@2ndquadrant.com> wrote:
> >
> > Hi,
> >
> >> Hi, do you consistently get this failure on your machine ? I am not
> >> able to get this failure, but I am going to analyze when/how this can
> >> fail. Thanks
> >>
> > Yes, I am getting it each time I run make -C src/test/recovery/ check PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl
> > Also, there aren't any errors in logs indicating the cause.
>
> Thanks for the reproduction. Finally I could reproduce the behaviour.
> It occurs once in 7-8 runs of the test on my machine. The issue is :
> on master, the catalog_xmin does not immediately get updated. It
> happens only after the hot standby feedback reaches on master. And I
> haven't used wait_for_xmins() for these failing cases. I should use
> that. Working on the same ...

As mentioned above, I have used wait_for_xmins() so that we can wait
for the xmins to be updated after hot standby feedback is processed.
In one of the 3 scenarios where it failed for you, I removed the check
at the second place because it was redundant. At the 3rd place, I did
some appropriate changes with detailed comments. Please check.
Basically we are checking that the master's phys catalog_xmin has
advanced but not beyond standby's logical catalog_xmin. And for making
sure the master's xmins are updated, I call txid_current() and then
wait for the master's xmin to advance after hot-standby_feedback, and
in this way I make sure the xmin/catalog_xmins are now up-to-date
because of hot-standby-feedback, so that we can check whether the
master's physical slot catalog_xmin has reached the value of standby's
catalog_xmin but not gone past it.

I have also moved the "wal_receiver_status_interval = 1" setting from
master to standby. It was wrongly kept in master. This now reduces the
test time by half, on my machine.

Attached patch set v5 has only the test changes. Please check if now
the test fails for you.

>
> --
> Thanks,
> -Amit Khandekar
> EnterpriseDB Corporation
> The Postgres Database Company



--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


--
Rahila Syed
Performance Engineer
2ndQuadrant 
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Fri, 10 Jan 2020 at 17:50, Rahila Syed <rahila.syed@2ndquadrant.com> wrote:
>
> Hi Amit,
>
> Can you please rebase the patches as they don't apply on latest master?

Thanks for notifying. Attached is the rebased version.

Attachment

Re: Minimal logical decoding on standbys

From
Andreas Joseph Krogh
Date:
På torsdag 16. januar 2020 kl. 05:42:24, skrev Amit Khandekar <amitdkhan.pg@gmail.com>:
On Fri, 10 Jan 2020 at 17:50, Rahila Syed <rahila.syed@2ndquadrant.com> wrote:
>
> Hi Amit,
>
> Can you please rebase the patches as they don't apply on latest master?

Thanks for notifying. Attached is the rebased version.
 
Will this patch enable logical replication from a standby-server?
 
--
Andreas Joseph Krogh
 

Re: Minimal logical decoding on standbys

From
Amit Khandekar
Date:
On Fri, 17 Jan 2020 at 13:20, Andreas Joseph Krogh <andreas@visena.com> wrote:
>
> På torsdag 16. januar 2020 kl. 05:42:24, skrev Amit Khandekar <amitdkhan.pg@gmail.com>:
>
> On Fri, 10 Jan 2020 at 17:50, Rahila Syed <rahila.syed@2ndquadrant.com> wrote:
> >
> > Hi Amit,
> >
> > Can you please rebase the patches as they don't apply on latest master?
>
> Thanks for notifying. Attached is the rebased version.
>
>
> Will this patch enable logical replication from a standby-server?

Sorry for the late reply.
This patch only supports logical decoding from a standby, so it is just
the infrastructure for supporting logical replication from a standby. We
don't support creating a publication on a standby, but a publication
created on the master is replicated to the standby, so we might be able
to create subscriber nodes that connect to such existing publications on
the standby; however, we haven't tested whether the
publication/subscription model works with a publication on a physical
standby. This patch is focused on providing a way to continue logical
replication *after* the standby is promoted to master.
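
As a rough illustration of what is supported (a sketch only; the slot and
table names are made up, and create_logical_slot_on_standby() is the TAP
helper added by the patch set):

    # Create a logical slot on the standby; internally this waits for a
    # running-xacts record to arrive from the master.
    $node_standby->create_logical_slot_on_standby($node_master,
        'standby_slot', 'postgres');

    # Generate a change on the master and wait for it to be replayed.
    $node_master->safe_psql('postgres',
        "INSERT INTO test_table(blah) VALUES ('decoded on standby')");
    $node_master->wait_for_catchup($node_standby, 'replay',
        $node_master->lsn('insert'));

    # Decode the change on the standby, via the slot created there.
    my $out = $node_standby->safe_psql('postgres',
        qq[SELECT data FROM pg_logical_slot_get_changes('standby_slot',
           NULL, NULL)]);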


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Minimal logical decoding on standbys

From
Andreas Joseph Krogh
Date:
På tirsdag 21. januar 2020 kl. 03:57:42, skrev Amit Khandekar <amitdkhan.pg@gmail.com>:
[...]
Sorry for the late reply.
This patch only supports logical decoding from standby. So it's just
an infrastructure for supporting logical replication from standby. We
don't support creating a publication from standby, but the publication
on master is replicated on standby, so we might be able to create
subscription nodes that connect to existing publications on standby,
but basically we haven't tested whether the publication/subscription
model works with a publication on a physical standby. This patch is
focussed on providing a way to continue logical replication *after*
the standby is promoted as master.
 
 
Thanks for clarifying.
 
--
Andreas Joseph Krogh

Re: Minimal logical decoding on standbys

From
James Sewell
Date:
Hi all,

This is great stuff! My understanding is that this patch guarantees 0 data loss for a logical replication stream if the primary crashes and a standby which was marked as sync at failure time is promoted.

Is this correct?

James 
--
James Sewell,
Chief Architect

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
P (+61) 2 8099 9000  W www.jirotech.com  F (+61) 2 8099 9099



Re: Minimal logical decoding on standbys

From
Alvaro Herrera
Date:
There were conflicts again, so I rebased once more.  Didn't do anything
else.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Minimal logical decoding on standbys

From
Alvaro Herrera
Date:
On 2020-Mar-18, Alvaro Herrera wrote:

> There were conflicts again, so I rebased once more.  Didn't do anything
> else.

This compiles fine, but tests seem to hang forever with no output.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Minimal logical decoding on standbys

From
Alvaro Herrera
Date:
On 2020-Mar-18, Alvaro Herrera wrote:

> On 2020-Mar-18, Alvaro Herrera wrote:
> 
> > There were conflicts again, so I rebased once more.  Didn't do anything
> > else.
> 
> This compiles fine, but tests seem to hang forever with no output.

well, not "forever", but:

$ make check PROVE_TESTS=t/019_standby_logical_decoding_conflicts.pl PROVE_FLAGS=-v
...
cd /pgsql/source/master/src/test/recovery && TESTDIR='/home/alvherre/mnt/crypt/alvherre/Code/pgsql/build/master/src/test/recovery' PATH="/pgsql/build/master/tmp_install/pgsql/install/master/bin:$PATH" LD_LIBRARY_PATH="/pgsql/build/master/tmp_install/pgsql/install/master/lib" PGPORT='655432' PG_REGRESS='/home/alvherre/mnt/crypt/alvherre/Code/pgsql/build/master/src/test/recovery/../../../src/test/regress/pg_regress' REGRESS_SHLIB='/pgsql/build/master/src/test/regress/regress.so' /usr/bin/prove -I /pgsql/source/master/src/test/perl/ -I /pgsql/source/master/src/test/recovery -v t/019_standby_logical_decoding_conflicts.pl
 
t/019_standby_logical_decoding_conflicts.pl .. 
1..24
ok 1 - dropslot on standby created
ok 2 - activeslot on standby created
# poll_query_until timed out executing this query:
# SELECT '0/35C9190' <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name =
'standby';
# expecting this output:
# t
# last actual query output:
# 
# with stderr:
Bailout called.  Further testing stopped:  system pg_ctl failed
Bail out!  system pg_ctl failed
FAILED--Further testing stopped: system pg_ctl failed
make: *** [Makefile:19: check] Error 255


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Minimal logical decoding on standbys

From
Michael Paquier
Date:
On Wed, Mar 18, 2020 at 04:50:38PM -0300, Alvaro Herrera wrote:
> well, not "forever", but:

No updates in the last six months, so I am marking it as returned with
feedback.

PS: the patch fails to apply.
--
Michael

Attachment

Re: Minimal logical decoding on standbys

From
Fabrízio de Royes Mello
Date:

On Wed, Mar 18, 2020 at 4:50 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
>
> well, not "forever", but:
>
> $ make check PROVE_TESTS=t/019_standby_logical_decoding_conflicts.pl PROVE_FLAGS=-v
> ...
> cd /pgsql/source/master/src/test/recovery && TESTDIR='/home/alvherre/mnt/crypt/alvherre/Code/pgsql/build/master/src/test/recovery' PATH="/pgsql/build/master/tmp_install/pgsql/install/master/bin:$PATH" LD_LIBRARY_PATH="/pgsql/build/master/tmp_install/pgsql/install/master/lib"  PGPORT='655432' PG_REGRESS='/home/alvherre/mnt/crypt/alvherre/Code/pgsql/build/master/src/test/recovery/../../../src/test/regress/pg_regress' REGRESS_SHLIB='/pgsql/build/master/src/test/regress/regress.so' /usr/bin/prove -I /pgsql/source/master/src/test/perl/ -I /pgsql/source/master/src/test/recovery -v t/019_standby_logical_decoding_conflicts.pl
> t/019_standby_logical_decoding_conflicts.pl ..
> 1..24
> ok 1 - dropslot on standby created
> ok 2 - activeslot on standby created
> # poll_query_until timed out executing this query:
> # SELECT '0/35C9190' <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'standby';
> # expecting this output:
> # t
> # last actual query output:
> #
> # with stderr:
> Bailout called.  Further testing stopped:  system pg_ctl failed
> Bail out!  system pg_ctl failed
> FAILED--Further testing stopped: system pg_ctl failed
> make: *** [Makefile:19: check] Error 255
>

After rebasing and doing some minimal tweaks (duplicated OID, TAP test numbering), I'm facing a similar problem but in another place:


make -C src/test/recovery check PROVE_TESTS=t/023_standby_logical_decoding_conflicts.pl PROVE_FLAGS=-v
...
/usr/bin/mkdir -p '/data/src/pg/main/src/test/recovery'/tmp_check
cd . && TESTDIR='/data/src/pg/main/src/test/recovery' PATH="/d/src/pg/main/tmp_install/home/fabrizio/pgsql/logical-decoding-standby/bin:$PATH" LD_LIBRARY_PATH="/d/src/pg/main/tmp_install/home/fabrizio/pgsql/logical-decoding-standby/lib"  PGPORT='65432' PG_REGRESS='/data/src/pg/main/src/test/recovery/../../../src/test/regress/pg_regress' REGRESS_SHLIB='/d/src/pg/main/src/test/regress/regress.so' /usr/bin/prove -I ../../../src/test/perl/ -I . -v t/023_standby_logical_decoding_conflicts.pl
t/023_standby_logical_decoding_conflicts.pl ..
1..24
ok 1 - dropslot on standby created
ok 2 - activeslot on standby created
not ok 3 - dropslot on standby dropped

#   Failed test 'dropslot on standby dropped'
#   at t/023_standby_logical_decoding_conflicts.pl line 67.
#          got: 'logical'
#     expected: ''
not ok 4 - activeslot on standby dropped

#   Failed test 'activeslot on standby dropped'
#   at t/023_standby_logical_decoding_conflicts.pl line 68.
#          got: 'logical'
#     expected: ''


TAP tests hang forever in `check_slots_dropped` exactly here:

    # our client should've terminated in response to the walsender error
    eval {
        $slot_user_handle->finish;
    };

Regards,

--
   Fabrízio de Royes Mello
   PostgreSQL Developer at OnGres Inc. - https://ongres.com
Attachment

Re: [UNVERIFIED SENDER] Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

Hi,

On 12/15/20 7:24 PM, Fabrízio de Royes Mello wrote:

On Wed, Mar 18, 2020 at 4:50 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
>
> well, not "forever", but:
>
> $ make check PROVE_TESTS=t/019_standby_logical_decoding_conflicts.pl PROVE_FLAGS=-v
> ...
> cd /pgsql/source/master/src/test/recovery && TESTDIR='/home/alvherre/mnt/crypt/alvherre/Code/pgsql/build/master/src/test/recovery' PATH="/pgsql/build/master/tmp_install/pgsql/install/master/bin:$PATH" LD_LIBRARY_PATH="/pgsql/build/master/tmp_install/pgsql/install/master/lib"  PGPORT='655432' PG_REGRESS='/home/alvherre/mnt/crypt/alvherre/Code/pgsql/build/master/src/test/recovery/../../../src/test/regress/pg_regress' REGRESS_SHLIB='/pgsql/build/master/src/test/regress/regress.so' /usr/bin/prove -I /pgsql/source/master/src/test/perl/ -I /pgsql/source/master/src/test/recovery -v t/019_standby_logical_decoding_conflicts.pl
> t/019_standby_logical_decoding_conflicts.pl ..
> 1..24
> ok 1 - dropslot on standby created
> ok 2 - activeslot on standby created
> # poll_query_until timed out executing this query:
> # SELECT '0/35C9190' <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'standby';
> # expecting this output:
> # t
> # last actual query output:
> #
> # with stderr:
> Bailout called.  Further testing stopped:  system pg_ctl failed
> Bail out!  system pg_ctl failed
> FAILED--Further testing stopped: system pg_ctl failed
> make: *** [Makefile:19: check] Error 255
>

After rebase and did minimal tweaks (duplicated oid, TAP tests numbering) I'm facing similar problem but in other place:


make -C src/test/recovery check PROVE_TESTS=t/023_standby_logical_decoding_conflicts.pl PROVE_FLAGS=-v
...
/usr/bin/mkdir -p '/data/src/pg/main/src/test/recovery'/tmp_check
cd . && TESTDIR='/data/src/pg/main/src/test/recovery' PATH="/d/src/pg/main/tmp_install/home/fabrizio/pgsql/logical-decoding-standby/bin:$PATH" LD_LIBRARY_PATH="/d/src/pg/main/tmp_install/home/fabrizio/pgsql/logical-decoding-standby/lib"  PGPORT='65432' PG_REGRESS='/data/src/pg/main/src/test/recovery/../../../src/test/regress/pg_regress' REGRESS_SHLIB='/d/src/pg/main/src/test/regress/regress.so' /usr/bin/prove -I ../../../src/test/perl/ -I . -v t/023_standby_logical_decoding_conflicts.pl
t/023_standby_logical_decoding_conflicts.pl ..
1..24
ok 1 - dropslot on standby created
ok 2 - activeslot on standby created
not ok 3 - dropslot on standby dropped

#   Failed test 'dropslot on standby dropped'
#   at t/023_standby_logical_decoding_conflicts.pl line 67.
#          got: 'logical'
#     expected: ''
not ok 4 - activeslot on standby dropped

#   Failed test 'activeslot on standby dropped'
#   at t/023_standby_logical_decoding_conflicts.pl line 68.
#          got: 'logical'
#     expected: ''


TAP tests hang forever in `check_slots_dropped` exactly here:

    # our client should've terminated in response to the walsender error
    eval {
        $slot_user_handle->finish;
    };

3 and 4 were failing because the ResolveRecoveryConflictWithLogicalSlots() call was missing in ResolveRecoveryConflictWithSnapshot(): the new version attached adds it.

The new version attached also includes a few changes to make it compile on the current master (it no longer did).

I also had to change 023_standby_logical_decoding_conflicts.pl (had to call $node_standby->create_logical_slot_on_standby($node_master, 'otherslot', 'postgres'); at the very beginning of the "DROP DATABASE should drops it's slots, including active slots" section)

So that now the tests are passing:

t/023_standby_logical_decoding_conflicts.pl ..
1..24
ok 1 - dropslot on standby created
ok 2 - activeslot on standby created
ok 3 - dropslot on standby dropped
ok 4 - activeslot on standby dropped
ok 5 - pg_recvlogical exited non-zero
#
ok 6 - recvlogical recovery conflict
ok 7 - recvlogical error detail
ok 8 - dropslot on standby created
ok 9 - activeslot on standby created
ok 10 - dropslot on standby dropped
ok 11 - activeslot on standby dropped
ok 12 - pg_recvlogical exited non-zero
#
ok 13 - recvlogical recovery conflict
ok 14 - recvlogical error detail
ok 15 - otherslot on standby created
ok 16 - dropslot on standby created
ok 17 - activeslot on standby created
ok 18 - database dropped on standby
ok 19 - dropslot on standby dropped
ok 20 - activeslot on standby dropped
ok 21 - pg_recvlogical exited non-zero
#
ok 22 - recvlogical recovery conflict
ok 23 - recvlogical error detail
ok 24 - otherslot on standby not dropped
ok
All tests successful.
Files=1, Tests=24,  4 wallclock secs ( 0.02 usr  0.00 sys +  1.27 cusr  0.37 csys =  1.66 CPU)
Result: PASS

Attached is the new version.

Bertrand

Attachment

Re: [UNVERIFIED SENDER] Re: Minimal logical decoding on standbys

From
Fabrízio de Royes Mello
Date:


On Mon, Jan 18, 2021 at 8:48 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> 3 and 4 were failing because the ResolveRecoveryConflictWithLogicalSlots() call was missing in ResolveRecoveryConflictWithSnapshot(): the new version attached adds it.
>
> The new version attached also provides a few changes to make it compiling on the current master (it was not the case anymore).
>
> I also had to change 023_standby_logical_decoding_conflicts.pl (had to call $node_standby->create_logical_slot_on_standby($node_master, 'otherslot', 'postgres'); at the very beginning of the "DROP DATABASE should drops it's slots, including active slots" section)
>

Awesome and thanks a lot.

Seems your patch series is broken... can you please `git format-patch` and send again?

Regards,

--
   Fabrízio de Royes Mello
   PostgreSQL Developer at OnGres Inc. - https://ongres.com

Re: [UNVERIFIED SENDER] Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

Hi,

On 1/25/21 8:34 PM, Fabrízio de Royes Mello wrote:

On Mon, Jan 18, 2021 at 8:48 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> 3 and 4 were failing because the ResolveRecoveryConflictWithLogicalSlots() call was missing in ResolveRecoveryConflictWithSnapshot(): the new version attached adds it.
>
> The new version attached also provides a few changes to make it compiling on the current master (it was not the case anymore).
>
> I also had to change 023_standby_logical_decoding_conflicts.pl (had to call $node_standby->create_logical_slot_on_standby($node_master, 'otherslot', 'postgres'); at the very beginning of the "DROP DATABASE should drops it's slots, including active slots" section)
>

Awesome and thanks a lot.

Seems your patch series is broken... can you please `git format-patch` and send again?

Thanks for pointing out!

Enclosed is a new series created with "format-patch" that can be applied with "git am":

$ git am v8-000*.patch
Applying: Allow logical decoding on standby.
Applying: Add info in WAL records in preparation for logical slot conflict handling.
Applying: Handle logical slot conflicts on standby.
Applying: New TAP test for logical decoding on standby.
Applying: Doc changes describing details about logical decoding.

Bertrand

Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

Hi,

On 1/26/21 10:31 AM, Drouvot, Bertrand wrote:

Hi,

On 1/25/21 8:34 PM, Fabrízio de Royes Mello wrote:

On Mon, Jan 18, 2021 at 8:48 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> 3 and 4 were failing because the ResolveRecoveryConflictWithLogicalSlots() call was missing in ResolveRecoveryConflictWithSnapshot(): the new version attached adds it.
>
> The new version attached also provides a few changes to make it compiling on the current master (it was not the case anymore).
>
> I also had to change 023_standby_logical_decoding_conflicts.pl (had to call $node_standby->create_logical_slot_on_standby($node_master, 'otherslot', 'postgres'); at the very beginning of the "DROP DATABASE should drops it's slots, including active slots" section)
>

Awesome and thanks a lot.

Seems your patch series is broken... can you please `git format-patch` and send again?

Thanks for pointing out!

Enclosed a new series created with "format-patch" and that can be applied with "git am":

$ git am v8-000*.patch
Applying: Allow logical decoding on standby.
Applying: Add info in WAL records in preparation for logical slot conflict handling.
Applying: Handle logical slot conflicts on standby.
Applying: New TAP test for logical decoding on standby.
Applying: Doc changes describing details about logical decoding.

Bertrand

Had to do a little change to make it compile again (re-add the heapRel argument to _bt_delitems_delete(), which was removed by commit dc43492e46c7145a476cb8ca6200fc8eefe673ef).

Given that this attached v9 version:

  • compiles successfully on current master
  • passes "make check"
  • passes the 2 associated tap tests "make -C src/test/recovery check PROVE_TESTS=t/022_standby_logical_decoding_xmins.pl PROVE_FLAGS=-v"  and "make -C src/test/recovery check PROVE_TESTS=t/023_standby_logical_decoding_conflicts.pl PROVE_FLAGS=-v"

wouldn't it make sense to (re)add this patch to the commitfest?

Thanks

Bertrand

Attachment

Re: Minimal logical decoding on standbys

From
Fabrízio de Royes Mello
Date:


On Thu, Feb 4, 2021 at 1:49 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> Had to do a little change to make it compiling again (re-add the heapRel argument in _bt_delitems_delete() that was removed by commit dc43492e46c7145a476cb8ca6200fc8eefe673ef).
>
> Given that this attached v9 version:
>
> compiles successfully on current master
> passes "make check"
> passes the 2 associated tap tests "make -C src/test/recovery check PROVE_TESTS=t/022_standby_logical_decoding_xmins.pl PROVE_FLAGS=-v"  and "make -C src/test/recovery check PROVE_TESTS=t/023_standby_logical_decoding_conflicts.pl PROVE_FLAGS=-v"
>

Perfect thanks... will review ASAP!

> wouldn't that make sense to (re)add this patch in the commitfest?
>

Added to next commitfest: https://commitfest.postgresql.org/32/2968/

Regards,

--
   Fabrízio de Royes Mello
   PostgreSQL Developer at OnGres Inc. - https://ongres.com

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

Hi,

On 2/4/21 6:33 PM, Fabrízio de Royes Mello wrote:

On Thu, Feb 4, 2021 at 1:49 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> Had to do a little change to make it compiling again (re-add the heapRel argument in _bt_delitems_delete() that was removed by commit dc43492e46c7145a476cb8ca6200fc8eefe673ef).
>
> Given that this attached v9 version:
>
> compiles successfully on current master
> passes "make check"
> passes the 2 associated tap tests "make -C src/test/recovery check PROVE_TESTS=t/022_standby_logical_decoding_xmins.pl PROVE_FLAGS=-v"  and "make -C src/test/recovery check PROVE_TESTS=t/023_standby_logical_decoding_conflicts.pl PROVE_FLAGS=-v"
>

Perfect thanks... will review ASAP!

Thanks!

Just made minor changes to make it compile again on the current master (mainly had to take care of ResolveRecoveryConflictWithSnapshotFullXid(), which was introduced in e5d8a99903).

Please find enclosed the new patch version that currently passes "make check" and the 2 associated TAP tests.

I'll have a look at the whole thread to check if there is anything else waiting in the pipe regarding this feature, unless some of you know off the top of your heads?

Bertrand

Attachment

Re: Minimal logical decoding on standbys

From
Fabrízio de Royes Mello
Date:

On Thu, Mar 18, 2021 at 5:34 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> Thanks!
>
> Just made minor changes to make it compiling again on current master (mainly had to take care of ResolveRecoveryConflictWithSnapshotFullXid() that has been introduced in e5d8a99903).
>
> Please find enclosed the new patch version that currently passes "make check" and the 2 associated TAP tests.
>

Unfortunately it is still not applying to the current master:

$ git am ~/Downloads/v10-000*.patch                                                    
Applying: Allow logical decoding on standby.
Applying: Add info in WAL records in preparation for logical slot conflict handling.
error: patch failed: src/backend/access/nbtree/nbtpage.c:32
error: src/backend/access/nbtree/nbtpage.c: patch does not apply
Patch failed at 0002 Add info in WAL records in preparation for logical slot conflict handling.
hint: Use 'git am --show-current-patch' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


> I'll have a look to the whole thread to check if there is anything else waiting in the pipe regarding this feature, unless some of you know off the top of their head?
>

Will do the same!!! But as far as I remember, the last time I checked, everything discussed is covered in this patch set.

Regards,

--
   Fabrízio de Royes Mello         Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:


On 3/22/21 3:10 PM, Fabrízio de Royes Mello wrote:

On Thu, Mar 18, 2021 at 5:34 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> Thanks!
>
> Just made minor changes to make it compiling again on current master (mainly had to take care of ResolveRecoveryConflictWithSnapshotFullXid() that has been introduced in e5d8a99903).
>
> Please find enclosed the new patch version that currently passes "make check" and the 2 associated TAP tests.
>

Unfortunately it still not applying to the current master:

$ git am ~/Downloads/v10-000*.patch                                                    
Applying: Allow logical decoding on standby.
Applying: Add info in WAL records in preparation for logical slot conflict handling.
error: patch failed: src/backend/access/nbtree/nbtpage.c:32
error: src/backend/access/nbtree/nbtpage.c: patch does not apply
Patch failed at 0002 Add info in WAL records in preparation for logical slot conflict handling.
hint: Use 'git am --show-current-patch' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Oh indeed, it's moving so fast!

Let me rebase it (it was already my plan to do so, as I observed during the weekend that v7-0003-Handle-logical-slot-conflicts-on-standby.patch introduced incorrect changes (that should not be there at all in ReplicationSlotReserveWal()) that have been kept in v8, v9 and v10).



> I'll have a look to the whole thread to check if there is anything else waiting in the pipe regarding this feature, unless some of you know off the top of their head?
>

Will do the same!!!

Thanks!


But as far I remember last time I checked it everything discussed is covered in this patch set.

That's also what I have observed so far.

Bertrand

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:


On 3/22/21 3:57 PM, Drouvot, Bertrand wrote:


On 3/22/21 3:10 PM, Fabrízio de Royes Mello wrote:

On Thu, Mar 18, 2021 at 5:34 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> Thanks!
>
> Just made minor changes to make it compiling again on current master (mainly had to take care of ResolveRecoveryConflictWithSnapshotFullXid() that has been introduced in e5d8a99903).
>
> Please find enclosed the new patch version that currently passes "make check" and the 2 associated TAP tests.
>

Unfortunately it still not applying to the current master:

$ git am ~/Downloads/v10-000*.patch                                                    
Applying: Allow logical decoding on standby.
Applying: Add info in WAL records in preparation for logical slot conflict handling.
error: patch failed: src/backend/access/nbtree/nbtpage.c:32
error: src/backend/access/nbtree/nbtpage.c: patch does not apply
Patch failed at 0002 Add info in WAL records in preparation for logical slot conflict handling.
hint: Use 'git am --show-current-patch' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

oh Indeed, it's moving so fast!

Let me rebase it (It was already my plan to do so as I have observed during the week end that the v7-0003-Handle-logical-slot-conflicts-on-standby.patch introduced incorrect changes (that should not be there at all in ReplicationSlotReserveWal()) that have been kept in v8, v9 and v10.

Please find enclosed the rebased version, which also contains the fix for ReplicationSlotReserveWal() mentioned above.

Going back to looking at the whole thread.

Bertrand

Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:


On 3/22/21 4:56 PM, Drouvot, Bertrand wrote:


On 3/22/21 3:57 PM, Drouvot, Bertrand wrote:


On 3/22/21 3:10 PM, Fabrízio de Royes Mello wrote:

On Thu, Mar 18, 2021 at 5:34 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> Thanks!
>
> Just made minor changes to make it compiling again on current master (mainly had to take care of ResolveRecoveryConflictWithSnapshotFullXid() that has been introduced in e5d8a99903).
>
> Please find enclosed the new patch version that currently passes "make check" and the 2 associated TAP tests.
>

Unfortunately it still not applying to the current master:

$ git am ~/Downloads/v10-000*.patch                                                    
Applying: Allow logical decoding on standby.
Applying: Add info in WAL records in preparation for logical slot conflict handling.
error: patch failed: src/backend/access/nbtree/nbtpage.c:32
error: src/backend/access/nbtree/nbtpage.c: patch does not apply
Patch failed at 0002 Add info in WAL records in preparation for logical slot conflict handling.
hint: Use 'git am --show-current-patch' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

oh Indeed, it's moving so fast!

Let me rebase it (It was already my plan to do so as I have observed during the week end that the v7-0003-Handle-logical-slot-conflicts-on-standby.patch introduced incorrect changes (that should not be there at all in ReplicationSlotReserveWal()) that have been kept in v8, v9 and v10.

please find enclosed the rebase version, that also contains the fix for ReplicationSlotReserveWal() mentioned above.

Going back to looking at the whole thread.

I have one remark regarding the conflicts:

The logical slots are dropped if a conflict is detected.

But, if the slot is not active before being dropped (say wal_level is changed to < logical on master and a logical slot is not active on the standby), then its corresponding pg_stat_database_conflicts.confl_logicalslot is not incremented (as it would be incremented "only" during the cancel of an "active" backend).

I think it should be incremented in all cases (active or not); what do you think?

I updated the patch to handle this scenario (see the new pgstat_send_droplogicalslot() in v12-0003-Handle-logical-slot-conflicts-on-standby.patch).

I also added more tests in 023_standby_logical_decoding_conflicts.pl to verify that pg_stat_database_conflicts.confl_logicalslot is updated as expected (see check_confl_logicalslot() in v12-0004-New-TAP-test-for-logical-decoding-on-standby.patch).
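
The new check essentially boils down to waiting for the counter to reach the
expected value, as in check_confl_logicalslot() in the updated test:

    # wait until the standby reports the expected number of logical slot
    # conflicts for the test database
    $node_standby->poll_query_until('postgres',
        "select (confl_logicalslot = 2) from pg_stat_database_conflicts where datname = 'testdb'",
        't')
      or die "Timed out waiting for confl_logicalslot to be updated";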

Except this remark and the associated changes, then it looks good to me.

Bertrand

Attachment

Re: Minimal logical decoding on standbys

From
Fabrízio de Royes Mello
Date:

On Tue, Mar 23, 2021 at 8:47 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> I have one remark regarding the conflicts:
>
> The logical slots are dropped if a conflict is detected.
>
> But, if the slot is not active before being dropped (say wal_level is changed to  < logical on master and a logical slot is not active on the standby) then its corresponding pg_stat_database_conflicts.confl_logicalslot is not incremented (as it would be incremented "only" during the cancel of an "active" backend).
>
> I think, it should be incremented in all the cases (active or not), what do you think?
>

Good catch... IMHO it should be incremented as well!!!

> I updated the patch to handle this scenario (see the new pgstat_send_droplogicalslot() in v12-0003-Handle-logical-slot-conflicts-on-standby.patch).
>

Perfect.

> I also added more tests in 023_standby_logical_decoding_conflicts.pl to verify that pg_stat_database_conflicts.confl_logicalslot is updated as expected (see check_confl_logicalslot() in v12-0004-New-TAP-test-for-logical-decoding-on-standby.patch).
>

Perfect.

> Except this remark and the associated changes, then it looks good to me.
>

LGTM too... Reviewing new changes now to move it forward and make this patch set ready for committer review.

Regards,

--
   Fabrízio de Royes Mello
   PostgreSQL Developer at OnGres Inc. - https://ongres.com

Re: Minimal logical decoding on standbys

From
Fabrízio de Royes Mello
Date:


On Tue, Mar 23, 2021 at 10:18 AM Fabrízio de Royes Mello <fabriziomello@gmail.com> wrote:
>
> LGTM too... Reviewing new changes now to move it forward and make this patch set ready for commiter review.
>

Regarding the feature: LGTM and all tests passed. Documentation is also OK. Some minor comments:

+    <para>
+     A logical replication slot can also be created on a hot standby. To prevent
+     <command>VACUUM</command> from removing required rows from the system
+     catalogs, <varname>hot_standby_feedback</varname> should be set on the
+     standby. In spite of that, if any required rows get removed, the slot gets
+     dropped.  Existing logical slots on standby also get dropped if wal_level
+     on primary is reduced to less than 'logical'.
+    </para>

Remove extra space before "Existing logical slots..."

+            pg_stat_get_db_conflict_logicalslot(D.oid) AS confl_logicalslot,

Move it to the end of pg_stat_database_conflicts columns


+            * is being reduced.  Hence this extra check.

Remove extra space before "Hence this..."


+       /* Send the other backend, a conflict recovery signal */
+
+       SetInvalidVirtualTransactionId(vxid);

Remove extra empty line


+               if (restart_lsn % XLOG_BLCKSZ != 0)
+                   elog(ERROR, "invalid replay pointer");

Add an empty line after this "IF" for code readability 


+void
+ResolveRecoveryConflictWithLogicalSlots(Oid dboid, TransactionId xid,
+                                       char *conflict_reason)
+{
+   int         i;
+   bool        found_conflict = false;
+
+   if (max_replication_slots <= 0)
+       return;

What about adding an "Assert(max_replication_slots >= 0);" before the replication slots check?

One last thing is about the names of the TAP tests: we should rename them because there are other TAP tests starting with 022_ and 023_. They should be renamed to:


Regards,

--
   Fabrízio de Royes Mello
   PostgreSQL Developer at OnGres Inc. - https://ongres.com

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

Hi,

On 3/23/21 3:29 PM, Fabrízio de Royes Mello wrote:

On Tue, Mar 23, 2021 at 10:18 AM Fabrízio de Royes Mello <fabriziomello@gmail.com> wrote:
>
> LGTM too... Reviewing new changes now to move it forward and make this patch set ready for commiter review.
>

According to the feature LGTM and all tests passed. Documentation is also OK.

Thanks for the review!

Some minor comments:

+    <para>
+     A logical replication slot can also be created on a hot standby. To prevent
+     <command>VACUUM</command> from removing required rows from the system
+     catalogs, <varname>hot_standby_feedback</varname> should be set on the
+     standby. In spite of that, if any required rows get removed, the slot gets
+     dropped.  Existing logical slots on standby also get dropped if wal_level
+     on primary is reduced to less than 'logical'.
+    </para>

Remove extra space before "Existing logical slots..."

done in v13 attached.


+            pg_stat_get_db_conflict_logicalslot(D.oid) AS confl_logicalslot,

Move it to the end of pg_stat_database_conflicts columns

done in v13 attached.



+            * is being reduced.  Hence this extra check.

Remove extra space before "Hence this..."

done in v13 attached.


+       /* Send the other backend, a conflict recovery signal */
+
+       SetInvalidVirtualTransactionId(vxid);

Remove extra empty line

done in v13 attached.


+               if (restart_lsn % XLOG_BLCKSZ != 0)
+                   elog(ERROR, "invalid replay pointer");

Add an empty line after this "IF" for code readability

done in v13 attached.



+void
+ResolveRecoveryConflictWithLogicalSlots(Oid dboid, TransactionId xid,
+                                       char *conflict_reason)
+{
+   int         i;
+   bool        found_conflict = false;
+
+   if (max_replication_slots <= 0)
+       return;

What about adding an "Assert(max_replication_slots >= 0);" before the replication slots check?


Makes sense; in the attached v13 the Assert is added, and the following "if" check is changed accordingly to "== 0".


One last thing is about the name of TAP tests, we should rename them because there are other TAP tests starting with 022_ and 023_. It should be renamed to:


done in v13 attached.

Bertrand

Attachment

Re: Minimal logical decoding on standbys

From
Fabrízio de Royes Mello
Date:

>
> done in v13 attached.
>

All tests passed and everything looks good to me... just a final minor fix on regression tests:

diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b0e17d4e1d..961ec869a6 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1869,9 +1869,9 @@ pg_stat_database_conflicts| SELECT d.oid AS datid,
     pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace,
     pg_stat_get_db_conflict_lock(d.oid) AS confl_lock,
     pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot,
-    pg_stat_get_db_conflict_logicalslot(d.oid) AS confl_logicalslot,
     pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin,
-    pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock
+    pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock,
+    pg_stat_get_db_conflict_logicalslot(d.oid) AS confl_logicalslot
    FROM pg_database d;
 pg_stat_gssapi| SELECT s.pid,
     s.gss_auth AS gss_authenticated,

We moved "pg_stat_database_conflicts.confl_logicalslot" to the end of the column list but forgot to change the regression test expected result.

Regards,

--
   Fabrízio de Royes Mello
   PostgreSQL Developer at OnGres Inc. - https://ongres.com

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
On 3/23/21 11:05 PM, Fabrízio de Royes Mello wrote:
>
> >
> > done in v13 attached.
> >
>
> All tests passed and everything looks good to me... just a final 
> minor fix on regression tests:
>
> diff --git a/src/test/regress/expected/rules.out 
> b/src/test/regress/expected/rules.out
> index b0e17d4e1d..961ec869a6 100644
> --- a/src/test/regress/expected/rules.out
> +++ b/src/test/regress/expected/rules.out
> @@ -1869,9 +1869,9 @@ pg_stat_database_conflicts| SELECT d.oid AS datid,
>      pg_stat_get_db_conflict_tablespace(d.oid) AS confl_tablespace,
>      pg_stat_get_db_conflict_lock(d.oid) AS confl_lock,
>      pg_stat_get_db_conflict_snapshot(d.oid) AS confl_snapshot,
> -    pg_stat_get_db_conflict_logicalslot(d.oid) AS confl_logicalslot,
>      pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin,
> -    pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock
> +    pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock,
> +    pg_stat_get_db_conflict_logicalslot(d.oid) AS confl_logicalslot
>     FROM pg_database d;
>  pg_stat_gssapi| SELECT s.pid,
>      s.gss_auth AS gss_authenticated,
>
> We moved "pg_stat_database_conflicts.confl_logicalslot" to the end of 
> the column list but forgot to change the regression test expected result.

Thanks for pointing out, fixed in v14 attached.

Bertrand


Attachment

Re: Minimal logical decoding on standbys

From
Fabrízio de Royes Mello
Date:


On Wed, Mar 24, 2021 at 3:57 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> Thanks for pointing out, fixed in v14 attached.
>

Thanks... now everything is working as expected... changed the status to Ready for Committer:

Regards,

--
   Fabrízio de Royes Mello
   PostgreSQL Developer at OnGres Inc. - https://ongres.com

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:


On 3/25/21 12:01 AM, Fabrízio de Royes Mello wrote:


On Wed, Mar 24, 2021 at 3:57 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> Thanks for pointing out, fixed in v14 attached.
>

Thanks... now everything is working as expected... changed the status to Ready for Commiter:

Thanks!

I think this would be a great feature, so I am looking forward to helping with and working on any comments/suggestions the committers may have.

Bertrand

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

Hi,

On 3/25/21 8:51 AM, Drouvot, Bertrand wrote:


On 3/25/21 12:01 AM, Fabrízio de Royes Mello wrote:


On Wed, Mar 24, 2021 at 3:57 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> Thanks for pointing out, fixed in v14 attached.
>

Thanks... now everything is working as expected... changed the status to Ready for Commiter:

Thanks!

I think this would be a great feature, so I am looking forward to help/work on any comments/suggestions that they may have.

Just needed a minor rebase due to 2 new conflicts with:

  • b4af70cb21: in vacuum_log_cleanup_info() (see new v15-0002-Add-info-in-WAL-records-in-preparation-for-logic.patch)
  • 43620e3286: oid conflict with pg_log_backend_memory_contexts() and pg_stat_get_db_conflict_snapshot() (see new v15-0003-Handle-logical-slot-conflicts-on-standby.patch)

The attached v15 passes "make check" and the 2 new associated TAP tests.

Bertrand

Attachment

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2021-04-06 14:30:29 +0200, Drouvot, Bertrand wrote:
> From 827295f74aff9c627ee722f541a6c7cc6d4133cf Mon Sep 17 00:00:00 2001
> From: bdrouvotAWS <bdrouvot@amazon.com>
> Date: Tue, 6 Apr 2021 11:59:23 +0000
> Subject: [PATCH v15 1/5] Allow logical decoding on standby.
> 
> Allow a logical slot to be created on standby. Restrict its usage
> or its creation if wal_level on primary is less than logical.
> During slot creation, it's restart_lsn is set to the last replayed
> LSN. Effectively, a logical slot creation on standby waits for an
> xl_running_xact record to arrive from primary. Conflicting slots
> would be handled in next commits.
>
> Andres Freund and Amit Khandekar.

I think more people have worked on this by now...

Does this strike you as an accurate description?

Author: Andres Freund (in an older version), Amit Khandekar, Bertrand Drouvot
Reviewed-By: Bertrand Drouvot, Andres Freund, Robert Haas

> --- a/src/backend/replication/logical/logical.c
> +++ b/src/backend/replication/logical/logical.c
> @@ -119,23 +119,22 @@ CheckLogicalDecodingRequirements(void)
>                  (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                   errmsg("logical decoding requires a database connection")));
>  
> -    /* ----
> -     * TODO: We got to change that someday soon...
> -     *
> -     * There's basically three things missing to allow this:
> -     * 1) We need to be able to correctly and quickly identify the timeline a
> -     *      LSN belongs to
> -     * 2) We need to force hot_standby_feedback to be enabled at all times so
> -     *      the primary cannot remove rows we need.
> -     * 3) support dropping replication slots referring to a database, in
> -     *      dbase_redo. There can't be any active ones due to HS recovery
> -     *      conflicts, so that should be relatively easy.
> -     * ----
> -     */
>      if (RecoveryInProgress())
> -        ereport(ERROR,
> -                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -                 errmsg("logical decoding cannot be used while in recovery")));

Maybe I am just missing something right now, and maybe I'm being a bit
overly pedantic, but I don't immediately see how 0001 is correct without
0002 and 0003? I think it'd be better to first introduce the conflict
information, then check for conflicts, and only after that allow
decoding on standbys?


> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 6f8810e149..6a21cba362 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -5080,6 +5080,17 @@ LocalProcessControlFile(bool reset)
>      ReadControlFile();
>  }
>  
> +/*
> + * Get the wal_level from the control file. For a standby, this value should be
> + * considered as its active wal_level, because it may be different from what
> + * was originally configured on standby.
> + */
> +WalLevel
> +GetActiveWalLevel(void)
> +{
> +    return ControlFile->wal_level;
> +}
> +

This strikes me as error-prone - there's nothing in the function name
indicating that this should mainly (only?) be used during recovery...


> +        if (SlotIsPhysical(slot))
> +            restart_lsn = GetRedoRecPtr();
> +        else if (RecoveryInProgress())
> +        {
> +            restart_lsn = GetXLogReplayRecPtr(NULL);
> +            /*
> +             * Replay pointer may point one past the end of the record. If that
> +             * is a XLOG page boundary, it will not be a valid LSN for the
> +             * start of a record, so bump it up past the page header.
> +             */
> +            if (!XRecOffIsValid(restart_lsn))
> +            {
> +                if (restart_lsn % XLOG_BLCKSZ != 0)
> +                    elog(ERROR, "invalid replay pointer");
> +
> +                /* For the first page of a segment file, it's a long header */
> +                if (XLogSegmentOffset(restart_lsn, wal_segment_size) == 0)
> +                    restart_lsn += SizeOfXLogLongPHD;
> +                else
> +                    restart_lsn += SizeOfXLogShortPHD;
> +            }
> +        }

This seems like a layering violation to me. I don't think stuff like
this should be outside of xlog[reader].c, and definitely not in
ReplicationSlotReserveWal().

Relevant discussion (which totally escaped my mind):
https://postgr.es/m/CAJ3gD9csOr0LoYoMK9NnfBk0RZmvHXcJAFWFd2EuL%3DNOfz7PVA%40mail.gmail.com


> +        else
> +            restart_lsn = GetXLogInsertRecPtr();
> +
> +        SpinLockAcquire(&slot->mutex);
> +        slot->data.restart_lsn = restart_lsn;
> +        SpinLockRelease(&slot->mutex);
> +
>          if (!RecoveryInProgress() && SlotIsLogical(slot))
>          {
>              XLogRecPtr    flushptr;
>  
> -            /* start at current insert position */
> -            restart_lsn = GetXLogInsertRecPtr();
> -            SpinLockAcquire(&slot->mutex);
> -            slot->data.restart_lsn = restart_lsn;
> -            SpinLockRelease(&slot->mutex);
> -
>              /* make sure we have enough information to start */
>              flushptr = LogStandbySnapshot();
>  
>              /* and make sure it's fsynced to disk */
>              XLogFlush(flushptr);
>          }
> -        else
> -        {
> -            restart_lsn = GetRedoRecPtr();
> -            SpinLockAcquire(&slot->mutex);
> -            slot->data.restart_lsn = restart_lsn;
> -            SpinLockRelease(&slot->mutex);
> -        }
>  
>          /* prevent WAL removal as fast as possible */
>          ReplicationSlotsComputeRequiredLSN();

I think I'd move the LogStandbySnapshot() piece out of the entire
loop. There's no reason for logging multiple ones if we then just end up
failing because of the XLogGetLastRemovedSegno() check.


> diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
> index 178d49710a..6c4c26c2fe 100644
> --- a/src/include/access/heapam_xlog.h
> +++ b/src/include/access/heapam_xlog.h
> @@ -239,6 +239,7 @@ typedef struct xl_heap_update
>   */
>  typedef struct xl_heap_clean
>  {
> +    bool        onCatalogTable;
>      TransactionId latestRemovedXid;
>      uint16        nredirected;
>      uint16        ndead;
> @@ -254,6 +255,7 @@ typedef struct xl_heap_clean
>   */
>  typedef struct xl_heap_cleanup_info
>  {
> +    bool        onCatalogTable;
>      RelFileNode node;
>      TransactionId latestRemovedXid;
>  } xl_heap_cleanup_info;
> @@ -334,6 +336,7 @@ typedef struct xl_heap_freeze_tuple
>   */
>  typedef struct xl_heap_freeze_page
>  {
> +    bool        onCatalogTable;
>      TransactionId cutoff_xid;
>      uint16        ntuples;
>  } xl_heap_freeze_page;
> @@ -348,6 +351,7 @@ typedef struct xl_heap_freeze_page
>   */
>  typedef struct xl_heap_visible
>  {
> +    bool        onCatalogTable;
>      TransactionId cutoff_xid;
>      uint8        flags;
>  } xl_heap_visible;

Reminder to self: This needs a WAL version bump.

> diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
> index 9a3a03e520..3405070d63 100644
> --- a/src/include/utils/rel.h
> +++ b/src/include/utils/rel.h
> @@ -16,6 +16,7 @@
>  
>  #include "access/tupdesc.h"
>  #include "access/xlog.h"
> +#include "catalog/catalog.h"
>  #include "catalog/pg_class.h"
>  #include "catalog/pg_index.h"
>  #include "catalog/pg_publication.h"

Not clear why this is in this patch?



> diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> index 5ba776e789..03c5dbea48 100644
> --- a/src/backend/postmaster/pgstat.c
> +++ b/src/backend/postmaster/pgstat.c
> @@ -2928,6 +2928,24 @@ pgstat_send_archiver(const char *xlog, bool failed)
>      pgstat_send(&msg, sizeof(msg));
>  }
>  
> +/* ----------
> + * pgstat_send_droplogicalslot() -
> + *
> + *    Tell the collector about a logical slot being dropped
> + *    due to conflict.
> + * ----------
> + */
> +void
> +pgstat_send_droplogicalslot(Oid dbOid)
> +{
> +    PgStat_MsgRecoveryConflict msg;
> +
> +    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
> +    msg.m_databaseid = dbOid;
> +    msg.m_reason = PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT;
> +    pgstat_send(&msg, sizeof(msg));
> +}

Why do we have this in addition to pgstat_report_replslot_drop()? ISTM
that we should instead add a reason parameter to
pgstat_report_replslot_drop()?


> +/*
> + * Resolve recovery conflicts with logical slots.
> + *
> + * When xid is valid, it means that rows older than xid might have been
> + * removed.

I don't think the past tense is correct - the rows better not be removed
yet on the standby, otherwise we'd potentially do something random in
decoding.


> diff --git a/src/test/recovery/t/024_standby_logical_decoding_xmins.pl
b/src/test/recovery/t/024_standby_logical_decoding_xmins.pl
> new file mode 100644
> index 0000000000..d654d79526
> --- /dev/null
> +++ b/src/test/recovery/t/024_standby_logical_decoding_xmins.pl
> @@ -0,0 +1,272 @@
> +# logical decoding on a standby : ensure xmins are appropriately updated
> +
> +use strict;
> +use warnings;
> +
> +use PostgresNode;
> +use TestLib;
> +use Test::More tests => 23;
> +use RecursiveCopy;
> +use File::Copy;
> +use Time::HiRes qw(usleep);

Several of these don't actually seem to be used?


> +########################
> +# Initialize master node
> +########################

(I'll rename these to primary/replica)


> +$node_master->init(allows_streaming => 1, has_archiving => 1);
> +$node_master->append_conf('postgresql.conf', q{
> +wal_level = 'logical'
> +max_replication_slots = 4
> +max_wal_senders = 4
> +log_min_messages = 'debug2'
> +log_error_verbosity = verbose
> +# very promptly terminate conflicting backends
> +max_standby_streaming_delay = '2s'
> +});

Why is this done on the primary, rather than on the standby?


> +################################
> +# Catalog xmins should advance after standby logical slot fetches the changes.
> +################################
> +
> +# Ideally we'd just hold catalog_xmin, but since hs_feedback currently uses the slot,
> +# we hold down xmin.

I don't know what that means.


> +$node_master->safe_psql('postgres', qq[CREATE TABLE catalog_increase_1();]);
> +$node_master->safe_psql('postgres', 'CREATE TABLE test_table(id serial primary key, blah text)');
> +for my $i (0 .. 2000)
> +{
> +    $node_master->safe_psql('postgres', qq[INSERT INTO test_table(blah) VALUES ('entry $i')]);
> +}

Forking 2000 psql processes is pretty expensive, especially on slower
machines. What is this supposed to test?


> +($ret, $stdout, $stderr) = $node_standby->psql('postgres',
> +    qq[SELECT data FROM pg_logical_slot_get_changes('$standby_slotname', NULL, NULL, 'include-xids', '0',
>        'skip-empty-xacts', '1', 'include-timestamp', '0')]);

> +is($ret, 0, 'replay of big series succeeded');
> +isnt($stdout, '', 'replayed some rows');

Nothing is being replayed...



> +######################
> +# Upstream oldestXid should not go past downstream catalog_xmin
> +######################
> +
> +# First burn some xids on the master in another DB, so we push the master's
> +# nextXid ahead.
> +foreach my $i (1 .. 100)
> +{
> +    $node_master->safe_psql('postgres', 'SELECT txid_current()');
> +}
> +
> +# Force vacuum freeze on the master and ensure its oldestXmin doesn't advance
> +# past our needed xmin. The only way we have visibility into that is to force
> +# a checkpoint.
> +$node_master->safe_psql('postgres', "UPDATE pg_database SET datallowconn = true WHERE datname = 'template0'");
> +foreach my $dbname ('template1', 'postgres', 'postgres', 'template0')
> +{
> +    $node_master->safe_psql($dbname, 'VACUUM FREEZE');
> +}
> +$node_master->safe_psql('postgres', 'CHECKPOINT');
> +IPC::Run::run(['pg_controldata', $node_master->data_dir()], '>', \$stdout)
> +    or die "pg_controldata failed with $?";
> +my @checkpoint = split('\n', $stdout);
> +my $oldestXid = '';
> +foreach my $line (@checkpoint)
> +{
> +    if ($line =~ qr/^Latest checkpoint's oldestXID:\s+(\d+)/)
> +    {
> +        $oldestXid = $1;
> +    }
> +}
> +die 'no oldestXID found in checkpoint' unless $oldestXid;
> +
> +cmp_ok($oldestXid, "<=", $node_standby->slot($standby_slotname)->{'catalog_xmin'},
> +       'upstream oldestXid not past downstream catalog_xmin with hs_feedback on');
> +
> +$node_master->safe_psql('postgres',
> +    "UPDATE pg_database SET datallowconn = false WHERE datname = 'template0'");
> +

I am thinking of removing this test. It doesn't seem to test anything
really related to the issue at hand, and seems complicated (needing to
update datallowconn, manually triggering checkpoints, parsing
pg_controldata output).


> +# Fetch xmin columns from slot's pg_replication_slots row, after waiting for
> +# given boolean condition to be true to ensure we've reached a quiescent state
> +sub wait_for_xmins
> +{
> +    my ($node, $slotname, $check_expr) = @_;
> +
> +    $node->poll_query_until(
> +        'postgres', qq[
> +        SELECT $check_expr
> +        FROM pg_catalog.pg_replication_slots
> +        WHERE slot_name = '$slotname';
> +    ]) or die "Timed out waiting for slot xmins to advance";
> +}
> +
> +# Verify that pg_stat_database_conflicts.confl_logicalslot has been updated
> +sub check_confl_logicalslot
> +{
> +    ok( $node_standby->poll_query_until(
> +        'postgres',
> +        "select (confl_logicalslot = 2) from pg_stat_database_conflicts where datname = 'testdb'", 't'),
> +        'confl_logicalslot updated') or die "Timed out waiting confl_logicalslot to be updated";
> +}
> +

Given that this hardcodes a specific number of conflicting slots etc,
there doesn't seem much point in making this a function...


> +# Acquire one of the standby logical slots created by create_logical_slots()
> +sub make_slot_active
> +{
> +    my $slot_user_handle;
> +
> +    # make sure activeslot is in use
> +    print "starting pg_recvlogical\n";
> +    $slot_user_handle = IPC::Run::start(['pg_recvlogical', '-d', $node_standby->connstr('testdb'), '-S',
>        'activeslot', '-f', '-', '--no-loop', '--start'], '>', \$stdout, '2>', \$stderr);

> +
> +    while (!$node_standby->slot('activeslot')->{'active_pid'})
> +    {
> +        usleep(100_000);
> +        print "waiting for slot to become active\n";
> +    }
> +    return $slot_user_handle;
> +}

It's a bad idea to not have timeouts in things like this - if there's a
problem, it'll lead to the test never returning. Things like
poll_query_until() have timeouts to deal with this, but this doesn't.


> +# Check if all the slots on standby are dropped. These include the 'activeslot'
> +# that was acquired by make_slot_active(), and the non-active 'dropslot'.
> +sub check_slots_dropped
> +{
> +    my ($slot_user_handle) = @_;
> +    my $return;
> +
> +    is($node_standby->slot('dropslot')->{'slot_type'}, '', 'dropslot on standby dropped');
> +    is($node_standby->slot('activeslot')->{'slot_type'}, '', 'activeslot on standby dropped');
> +
> +    # our client should've terminated in response to the walsender error
> +    eval {
> +        $slot_user_handle->finish;
> +    };
> +    $return = $?;
> +    cmp_ok($return, "!=", 0, "pg_recvlogical exited non-zero\n");
> +    if ($return) {
> +        like($stderr, qr/conflict with recovery/, 'recvlogical recovery conflict');
> +        like($stderr, qr/must be dropped/, 'recvlogical error detail');
> +    }

Why do we need to use eval{} for things like checking if a program
finished?


> @@ -297,6 +297,24 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
>       may consume changes from a slot at any given time.
>      </para>
>  
> +    <para>
> +     A logical replication slot can also be created on a hot standby. To prevent
> +     <command>VACUUM</command> from removing required rows from the system
> +     catalogs, <varname>hot_standby_feedback</varname> should be set on the
> +     standby. In spite of that, if any required rows get removed, the slot gets
> +     dropped. Existing logical slots on standby also get dropped if wal_level
> +     on primary is reduced to less than 'logical'.
> +    </para>

I think this should add that it's very advisable to use a physical slot
between primary and standby. Otherwise hot_standby_feedback will work,
but only while the connection is alive - as soon as it breaks, a node
gets restarted, ...

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi Andres,


Thanks for your feedback! I'll look at it.

But prior to that, I am sharing v16 (a rebase of v15 needed due to 
8523492d4e).

Bertrand


Attachment

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

I think I'll remove the "Catalog xmins should advance after standby
logical slot fetches the changes." test. For one, it takes a long time
(due to the 2000 psqls). But more importantly, it's simply not testing
anything that's reliable:

1) There's no guarantee that I can see that catalog_xmin needs to
increase, just because we called pg_logical_slot_get_changes() once.

2) $node_master->wait_for_catchup($node_standby, 'replay',
  $node_master->lsn('flush')); is definitely not OK. It only happens to
work by accident, thanks to the 2000 iterations. There might not be any logical
changes associated with that LSN, so there might not be anything to
replay. That's especially true for the second wait_for_catchup - there
haven't been any logical changes since the last wait.

The test hangs reliably for me if I replace the 2000 with 2. Kinda looks
like somebody just tried to add more and more inserts to make the test
not fail, without addressing the reliability issues. That kind of thing
rarely works out well, because it tends to be very machine specific to
get the timing right. And it makes the test take forever.


TBH, most of 024_standby_logical_decoding_xmins.pl looks like minimally
hacked-up versions of the tests from Craig's quite different patch, without
them having been adjusted. There's stuff like:

# Create new slots on the standby, ignoring the ones on the master completely.
#
# This must succeed since we know we have a catalog_xmin reservation. We
# might've already sent hot standby feedback to advance our physical slot's
# catalog_xmin but not received the corresponding xlog for the catalog xmin
# advance, in which case we'll create a slot that isn't usable. The calling
# application can prevent this by creating a temporary slot on the master to
# lock in its catalog_xmin. For a truly race-free solution we'd need
# master-to-standby hot_standby_feedback replies.
#
# In this case it won't race because there's no concurrent activity on the
# master.
#

This issue doesn't exist in the patch.

There's also no test for a recovery conflict due to row removal. Despite
that being a substantial part of the patchset.


I'm tempted to throw out 024 - all of its tests seem fragile and prove
little. And then add a few more tests to 025 (and rename it).

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2021-04-07 10:09:54 -0700, Andres Freund wrote:
> There's also no test for a recovery conflict due to row removal. Despite
> that being a substantial part of the patchset.

Another aspect that wasn't tested *at all*: Whether logical decoding
actually produces useful and correct results.


> I'm tempted to throw out 024 - all of its tests seem fragile and prove
> little. And then add a few more tests to 025 (and renaming it).

While working on this I found a somewhat substantial issue:

When the primary is idle, logical decoding via walsender on the standby
will typically not process the records until further WAL writes come in
from the primary, or until 10s have lapsed.

The problem is that WalSndWaitForWal() waits for the *replay* LSN to
increase, but gets woken up by walreceiver when new WAL has been
flushed. Which means that typically walsenders will get woken up at the
same time that the startup process will be - which means that by the
time the logical walsender checks GetXLogReplayRecPtr() it's unlikely
that the startup process already replayed the record and updated
XLogCtl->lastReplayedEndRecPtr.
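
To illustrate the shape of the problem - this is not the actual walsender
code, just a minimal sketch of the wait loop, with targetPtr standing in for
the LSN the client asked us to decode up to:

    for (;;)
    {
        XLogRecPtr  replayPtr = GetXLogReplayRecPtr(NULL);

        if (targetPtr <= replayPtr)
            break;              /* record has been applied, we can decode it */

        /*
         * Walreceiver sets our latch when new WAL has been *flushed*,
         * usually before the startup process has *applied* it.  So the
         * re-check above still sees the old replay LSN and we go back to
         * sleep, until the next wakeup or until the timeout expires.
         */
        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT,
                         10000 /* ms */,
                         WAIT_EVENT_WAL_SENDER_WAIT_WAL);
        ResetLatch(MyLatch);
    }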

I think fixing this would require too invasive changes at this point. I
think we might be able to live with the 10s delay issue for one release, but
it sure is ugly :(.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2021-04-07 13:32:18 -0700, Andres Freund wrote:
> While working on this I found a, somewhat substantial, issue:
>
> When the primary is idle, on the standby logical decoding via walsender
> will typically not process the records until further WAL writes come in
> from the primary, or until a 10s lapsed.
>
> The problem is that WalSndWaitForWal() waits for the *replay* LSN to
> increase, but gets woken up by walreceiver when new WAL has been
> flushed. Which means that typically walsenders will get woken up at the
> same time that the startup process will be - which means that by the
> time the logical walsender checks GetXLogReplayRecPtr() it's unlikely
> that the startup process already replayed the record and updated
> XLogCtl->lastReplayedEndRecPtr.
>
> I think fixing this would require too invasive changes at this point. I
> think we might be able to live with 10s delay issue for one release, but
> it sure is ugly :(.

This is indeed pretty painful. It occurs a lot more regularly if you
either have a slow disk, or you switch around the order of
WakeupRecovery() and WalSndWakeup() in XLogWalRcvFlush().


- There's an issue about which timeline to use. If you use pg_recvlogical and
  you restart the server, you'll see errors like:

  pg_recvlogical: error: unexpected termination of replication stream: ERROR:  requested WAL segment
  000000000000000000000003 has already been removed
 

  the real filename is 000000010000000000000003 - i.e. the timeline used in
  the error is 0.

  This isn't too hard to fix, but definitely needs fixing.

- ResolveRecoveryConflictWithLogicalSlots() is racy - potentially
  leading us to drop a slot that has been created since we signalled a
  recovery conflict.  See
  https://www.postgresql.org/message-id/20210408020913.zzprrlvqyvlt5cyy%40alap3.anarazel.de
  for some very similar issues.

- Given the precedent of max_slot_wal_keep_size, I think it's wrong to
  just drop the logical slots. Instead we should just mark them as
  invalid, like InvalidateObsoleteReplicationSlots().

- There are no tests covering timeline switches, i.e. what happens if there's
  a promotion while logical decoding is ongoing.

- The way ResolveRecoveryConflictWithLogicalSlots() builds the error
  message is not good (and I've complained about it before...).

Unfortunately I think the things I have found are too many for me to
address within the given time. I'll send a version with a somewhat
polished set of the changes I made in the next few days...

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

Thanks for the feedback!

On 4/6/21 8:02 PM, Andres Freund wrote:
>
>
>
> Hi,
>
> On 2021-04-06 14:30:29 +0200, Drouvot, Bertrand wrote:
>>  From 827295f74aff9c627ee722f541a6c7cc6d4133cf Mon Sep 17 00:00:00 2001
>> From: bdrouvotAWS <bdrouvot@amazon.com>
>> Date: Tue, 6 Apr 2021 11:59:23 +0000
>> Subject: [PATCH v15 1/5] Allow logical decoding on standby.
>>
>> Allow a logical slot to be created on standby. Restrict its usage
>> or its creation if wal_level on primary is less than logical.
>> During slot creation, it's restart_lsn is set to the last replayed
>> LSN. Effectively, a logical slot creation on standby waits for an
>> xl_running_xact record to arrive from primary. Conflicting slots
>> would be handled in next commits.
>>
>> Andres Freund and Amit Khandekar.
> I think more people have worked on this by now...
>
> Does this strike you as an accurate description?
>
> Author: Andres Freund (in an older version), Amit Khandekar, Bertrand Drouvot
> Reviewed-By: Bertrand Drouvot, Andres Freund, Robert Haas

Yes, it looks accurate; adding Fabrizio as a reviewer as well.

>> --- a/src/backend/replication/logical/logical.c
>> +++ b/src/backend/replication/logical/logical.c
>> @@ -119,23 +119,22 @@ CheckLogicalDecodingRequirements(void)
>>                                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>>                                 errmsg("logical decoding requires a database connection")));
>>
>> -     /* ----
>> -      * TODO: We got to change that someday soon...
>> -      *
>> -      * There's basically three things missing to allow this:
>> -      * 1) We need to be able to correctly and quickly identify the timeline a
>> -      *        LSN belongs to
>> -      * 2) We need to force hot_standby_feedback to be enabled at all times so
>> -      *        the primary cannot remove rows we need.
>> -      * 3) support dropping replication slots referring to a database, in
>> -      *        dbase_redo. There can't be any active ones due to HS recovery
>> -      *        conflicts, so that should be relatively easy.
>> -      * ----
>> -      */
>>        if (RecoveryInProgress())
>> -             ereport(ERROR,
>> -                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> -                              errmsg("logical decoding cannot be used while in recovery")));
> Maybe I am just missing something right now, and maybe I'm being a bit
> overly pedantic, but I don't immediately see how 0001 is correct without
> 0002 and 0003? I think it'd be better to first introduce the conflict
> information, then check for conflicts, and only after that allow
> decoding on standbys?

Right, changing the order in the attached v17.

>
>> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
>> index 6f8810e149..6a21cba362 100644
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -5080,6 +5080,17 @@ LocalProcessControlFile(bool reset)
>>        ReadControlFile();
>>   }
>>
>> +/*
>> + * Get the wal_level from the control file. For a standby, this value should be
>> + * considered as its active wal_level, because it may be different from what
>> + * was originally configured on standby.
>> + */
>> +WalLevel
>> +GetActiveWalLevel(void)
>> +{
>> +     return ControlFile->wal_level;
>> +}
>> +
> This strikes me as error-prone - there's nothing in the function name
> that this should mainly (only?) be used during recovery...
>
Renamed to GetActiveWalLevelOnStandby().

>> +             if (SlotIsPhysical(slot))
>> +                     restart_lsn = GetRedoRecPtr();
>> +             else if (RecoveryInProgress())
>> +             {
>> +                     restart_lsn = GetXLogReplayRecPtr(NULL);
>> +                     /*
>> +                      * Replay pointer may point one past the end of the record. If that
>> +                      * is a XLOG page boundary, it will not be a valid LSN for the
>> +                      * start of a record, so bump it up past the page header.
>> +                      */
>> +                     if (!XRecOffIsValid(restart_lsn))
>> +                     {
>> +                             if (restart_lsn % XLOG_BLCKSZ != 0)
>> +                                     elog(ERROR, "invalid replay pointer");
>> +
>> +                             /* For the first page of a segment file, it's a long header */
>> +                             if (XLogSegmentOffset(restart_lsn, wal_segment_size) == 0)
>> +                                     restart_lsn += SizeOfXLogLongPHD;
>> +                             else
>> +                                     restart_lsn += SizeOfXLogShortPHD;
>> +                     }
>> +             }
> This seems like a layering violation to me. I don't think stuff like
> this should be outside of xlog[reader].c, and definitely not in
> ReplicationSlotReserveWal().

Moved the bump to GetXLogReplayRecPtr(); does that make more sense, or
did you have something else in mind?

> Relevant discussion (which totally escaped my mind):
> https://postgr.es/m/CAJ3gD9csOr0LoYoMK9NnfBk0RZmvHXcJAFWFd2EuL%3DNOfz7PVA%40mail.gmail.com
>
>
>> +             else
>> +                     restart_lsn = GetXLogInsertRecPtr();
>> +
>> +             SpinLockAcquire(&slot->mutex);
>> +             slot->data.restart_lsn = restart_lsn;
>> +             SpinLockRelease(&slot->mutex);
>> +
>>                if (!RecoveryInProgress() && SlotIsLogical(slot))
>>                {
>>                        XLogRecPtr      flushptr;
>>
>> -                     /* start at current insert position */
>> -                     restart_lsn = GetXLogInsertRecPtr();
>> -                     SpinLockAcquire(&slot->mutex);
>> -                     slot->data.restart_lsn = restart_lsn;
>> -                     SpinLockRelease(&slot->mutex);
>> -
>>                        /* make sure we have enough information to start */
>>                        flushptr = LogStandbySnapshot();
>>
>>                        /* and make sure it's fsynced to disk */
>>                        XLogFlush(flushptr);
>>                }
>> -             else
>> -             {
>> -                     restart_lsn = GetRedoRecPtr();
>> -                     SpinLockAcquire(&slot->mutex);
>> -                     slot->data.restart_lsn = restart_lsn;
>> -                     SpinLockRelease(&slot->mutex);
>> -             }
>>
>>                /* prevent WAL removal as fast as possible */
>>                ReplicationSlotsComputeRequiredLSN();
> I think I'd move the LogStandbySnapshot() piece out of the entire
> loop. There's no reason for logging multiple ones if we then just end up
> failing because of the XLogGetLastRemovedSegno() check.

Right, moved it outside of the loop.

>
>> diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
>> index 178d49710a..6c4c26c2fe 100644
>> --- a/src/include/access/heapam_xlog.h
>> +++ b/src/include/access/heapam_xlog.h
>> @@ -239,6 +239,7 @@ typedef struct xl_heap_update
>>    */
>>   typedef struct xl_heap_clean
>>   {
>> +     bool            onCatalogTable;
>>        TransactionId latestRemovedXid;
>>        uint16          nredirected;
>>        uint16          ndead;
>> @@ -254,6 +255,7 @@ typedef struct xl_heap_clean
>>    */
>>   typedef struct xl_heap_cleanup_info
>>   {
>> +     bool            onCatalogTable;
>>        RelFileNode node;
>>        TransactionId latestRemovedXid;
>>   } xl_heap_cleanup_info;
>> @@ -334,6 +336,7 @@ typedef struct xl_heap_freeze_tuple
>>    */
>>   typedef struct xl_heap_freeze_page
>>   {
>> +     bool            onCatalogTable;
>>        TransactionId cutoff_xid;
>>        uint16          ntuples;
>>   } xl_heap_freeze_page;
>> @@ -348,6 +351,7 @@ typedef struct xl_heap_freeze_page
>>    */
>>   typedef struct xl_heap_visible
>>   {
>> +     bool            onCatalogTable;
>>        TransactionId cutoff_xid;
>>        uint8           flags;
>>   } xl_heap_visible;
> Reminder to self: This needs a WAL version bump.
>
>> diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
>> index 9a3a03e520..3405070d63 100644
>> --- a/src/include/utils/rel.h
>> +++ b/src/include/utils/rel.h
>> @@ -16,6 +16,7 @@
>>
>>   #include "access/tupdesc.h"
>>   #include "access/xlog.h"
>> +#include "catalog/catalog.h"
>>   #include "catalog/pg_class.h"
>>   #include "catalog/pg_index.h"
>>   #include "catalog/pg_publication.h"
> Not clear why this is in this patch?

It's needed for the IsCatalogRelation() calls in
RelationIsAccessibleInLogicalDecoding() and RelationIsLogicallyLogged().

So instead, in the attached v17 I removed the new includes of catalog.h
elsewhere, as it makes more sense to me to keep this new one in rel.h.

>> diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
>> index 5ba776e789..03c5dbea48 100644
>> --- a/src/backend/postmaster/pgstat.c
>> +++ b/src/backend/postmaster/pgstat.c
>> @@ -2928,6 +2928,24 @@ pgstat_send_archiver(const char *xlog, bool failed)
>>        pgstat_send(&msg, sizeof(msg));
>>   }
>>
>> +/* ----------
>> + * pgstat_send_droplogicalslot() -
>> + *
>> + *   Tell the collector about a logical slot being dropped
>> + *   due to conflict.
>> + * ----------
>> + */
>> +void
>> +pgstat_send_droplogicalslot(Oid dbOid)
>> +{
>> +     PgStat_MsgRecoveryConflict msg;
>> +
>> +     pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
>> +     msg.m_databaseid = dbOid;
>> +     msg.m_reason = PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT;
>> +     pgstat_send(&msg, sizeof(msg));
>> +}
> Why do we have this in adition to pgstat_report_replslot_drop()? ISTM
> that we should instead add a reason parameter to
> pgstat_report_replslot_drop()?

Added a reason parameter in pgstat_report_replslot_drop() and dropped 
pgstat_send_droplogicalslot().

>
>> +/*
>> + * Resolve recovery conflicts with logical slots.
>> + *
>> + * When xid is valid, it means that rows older than xid might have been
>> + * removed.
> I don't think the past tense is correct - the rows better not be removed
> yet on the standby, otherwise we'd potentially do something random in
> decoding.
>
Right, wording changed.
>
>> @@ -297,6 +297,24 @@ postgres=# select * from pg_logical_slot_get_changes('regression_slot', NULL, NU
>>        may consume changes from a slot at any given time.
>>       </para>
>>
>> +    <para>
>> +     A logical replication slot can also be created on a hot standby. To prevent
>> +     <command>VACUUM</command> from removing required rows from the system
>> +     catalogs, <varname>hot_standby_feedback</varname> should be set on the
>> +     standby. In spite of that, if any required rows get removed, the slot gets
>> +     dropped. Existing logical slots on standby also get dropped if wal_level
>> +     on primary is reduced to less than 'logical'.
>> +    </para>
> I think this should add that it's very advisable to use a physical slot
> between primary and standby. Otherwise hot_standby_feedback will work,
> but only while the connection is alive - as soon as it breaks, a node
> gets restarted, ...

Good point, wording added.

The attached v17 contains those changes.

Remarks related to the TAP tests have not been addressed in v17; I will
look at them now.

Bertrand


Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi Andres,

On 4/8/21 5:47 AM, Andres Freund wrote:
> Hi,
>
> On 2021-04-07 13:32:18 -0700, Andres Freund wrote:
>> While working on this I found a, somewhat substantial, issue:
>>
>> When the primary is idle, on the standby logical decoding via walsender
>> will typically not process the records until further WAL writes come in
>> from the primary, or until a 10s lapsed.
>>
>> The problem is that WalSndWaitForWal() waits for the *replay* LSN to
>> increase, but gets woken up by walreceiver when new WAL has been
>> flushed. Which means that typically walsenders will get woken up at the
>> same time that the startup process will be - which means that by the
>> time the logical walsender checks GetXLogReplayRecPtr() it's unlikely
>> that the startup process already replayed the record and updated
>> XLogCtl->lastReplayedEndRecPtr.
>>
>> I think fixing this would require too invasive changes at this point. I
>> think we might be able to live with 10s delay issue for one release, but
>> it sure is ugly :(.
> This is indeed pretty painful. It's a lot more regularly occuring if you
> either have a slot disk, or you switch around the order of
> WakeupRecovery() and WalSndWakeup() XLogWalRcvFlush().
>
> - There's about which timeline to use. If you use pg_recvlogical and you
>    restart the server, you'll see errors like:
>
>    pg_recvlogical: error: unexpected termination of replication stream: ERROR:  requested WAL segment
>    000000000000000000000003 has already been removed

>
>    the real filename is 000000010000000000000003 - i.e. the timeline is
>    0.
>
>    This isn't too hard to fix, but definitely needs fixing.

Thanks, nice catch!

From what I have seen, we are not going through InitXLOGAccess() on a
standby, and in some cases (like the one you mentioned)
StartLogicalReplication() is called without IdentifySystem() having been
called previously: this leads to ThisTimeLineID still being set to 0.

I am proposing a fix in the attached v18 by adding a check in 
StartLogicalReplication() and ensuring that ThisTimeLineID is retrieved.
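
Conceptually the check is something like this (simplified sketch only, the
attached v18 has the actual code):

    /* in StartLogicalReplication(), before using ThisTimeLineID */
    if (RecoveryInProgress() && ThisTimeLineID == 0)
        (void) GetXLogReplayRecPtr(&ThisTimeLineID);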

>
> - ResolveRecoveryConflictWithLogicalSlots() is racy - potentially
>    leading us to drop a slot that has been created since we signalled a
>    recovery conflict.  See
>    https://www.postgresql.org/message-id/20210408020913.zzprrlvqyvlt5cyy%40alap3.anarazel.de
>    for some very similar issues.

I have rewritten this part by following the same logic as the one used 
in 96540f80f8 (the commit linked to the thread you mentioned).

>
> - Given the precedent of max_slot_wal_keep_size, I think it's wrong to
>    just drop the logical slots. Instead we should just mark them as
>    invalid, like InvalidateObsoleteReplicationSlots().

Makes full sense; done that way in the attached patch.

I am setting the slot's data.xmin and data.catalog_xmin to
InvalidTransactionId to mark the slot(s) as invalid in case of conflict.
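
Per conflicting slot that boils down to something like the following (sketch
only, ignoring how the slot is locked/acquired and how the change is made
durable):

    /* mark the conflicting slot invalid instead of dropping it */
    SpinLockAcquire(&s->mutex);
    s->data.xmin = InvalidTransactionId;
    s->data.catalog_xmin = InvalidTransactionId;
    SpinLockRelease(&s->mutex);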

> - There's no tests covering timeline switches, what happens if there's a
>    promotion if logical decoding is currently ongoing.

I'll now work on the tests.

>
> - The way ResolveRecoveryConflictWithLogicalSlots() builds the error
>    message is not good (and I've complained about it before...).

I changed it and made it simpler.

I also removed the details around mentioning xmin or catalog xmin (as I 
am not sure of the added value and they are currently also not mentioned 
during standby recovery snapshot conflict).

>
> Unfortunately I think the things I have found are too many for me to
> address within the given time. I'll send a version with a somewhat
> polished set of the changes I made in the next few days...

Thanks for the review and feedback.

Please find enclosed v18 with the changes I worked on.

I still need to have a look at the tests.

There is also the 10s delay to work on; do you already have an idea of
how we should handle it?

Thanks

Bertrand


Attachment

Re: [UNVERIFIED SENDER] Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi Andres,


Please find enclosed v19, which also contains the changes related to your
TAP test remarks, mainly:

- get rid of 024 and add more tests in 026 (025 has been used in the 
meantime)

- test that logical decoding actually produces useful and correct results

- test standby promotion and logical decoding behavior once done

- useless "use" removal

- check_confl_logicalslot() function removal

- rewrote make_slot_active() to make use of poll_query_until() and timeout

- remove the useless eval()

- remove the "Catalog xmins should advance after standby logical slot 
fetches the changes" test

One thing that's not clear to me is your remark "There's also no test
for a recovery conflict due to row removal": don't you think that the
"vacuum full" conflict test is enough? If not, what kind of additional
tests would you like to see?

Thanks

Bertrand



Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi Andres,

Please find enclosed v20, a needed rebase of v19 (nothing serious worth
mentioning).

FWIW, just to sum up, v19 (and so v20):

- contained the changes (see details above) related to your TAP test
remarks

- contained the changes (see details above) related to your code remarks

There is still the 10s delay thing that needs work: do you already have
an idea of how we should handle it?

And one thing that's still not clear to me is your remark "There's also
no test for a recovery conflict due to row removal": don't you think
that the "vacuum full" conflict test is enough? If not, what kind of
additional tests would you like to see?

Thanks

Bertrand





Attachment

Re: Minimal logical decoding on standbys

From
Ibrar Ahmed
Date:


On Fri, Jul 16, 2021 at 1:07 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
Hi Andres,

On 6/22/21 12:38 PM, Drouvot, Bertrand wrote:
> Hi Andres,
>
> On 6/14/21 7:41 AM, Drouvot, Bertrand wrote:
>> Hi Andres,
>>
>> On 4/8/21 5:47 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2021-04-07 13:32:18 -0700, Andres Freund wrote:
>>>> While working on this I found a, somewhat substantial, issue:
>>>>
>>>> When the primary is idle, on the standby logical decoding via
>>>> walsender
>>>> will typically not process the records until further WAL writes
>>>> come in
>>>> from the primary, or until a 10s lapsed.
>>>>
>>>> The problem is that WalSndWaitForWal() waits for the *replay* LSN to
>>>> increase, but gets woken up by walreceiver when new WAL has been
>>>> flushed. Which means that typically walsenders will get woken up at
>>>> the
>>>> same time that the startup process will be - which means that by the
>>>> time the logical walsender checks GetXLogReplayRecPtr() it's unlikely
>>>> that the startup process already replayed the record and updated
>>>> XLogCtl->lastReplayedEndRecPtr.
>>>>
>>>> I think fixing this would require too invasive changes at this
>>>> point. I
>>>> think we might be able to live with 10s delay issue for one
>>>> release, but
>>>> it sure is ugly :(.
>>> This is indeed pretty painful. It's a lot more regularly occuring if
>>> you
>>> either have a slot disk, or you switch around the order of
>>> WakeupRecovery() and WalSndWakeup() XLogWalRcvFlush().
>>>
>>> - There's about which timeline to use. If you use pg_recvlogical and
>>> you
>>>    restart the server, you'll see errors like:
>>>
>>>    pg_recvlogical: error: unexpected termination of replication
>>> stream: ERROR:  requested WAL segment 000000000000000000000003 has
>>> already been removed
>>>
>>>    the real filename is 000000010000000000000003 - i.e. the timeline is
>>>    0.
>>>
>>>    This isn't too hard to fix, but definitely needs fixing.
>>
>> Thanks, nice catch!
>>
>> From what I have seen, we are not going through InitXLOGAccess() on a
>> Standby and in some cases (like the one you mentioned)
>> StartLogicalReplication() is called without IdentifySystem() being
>> called previously: this lead to ThisTimeLineID still set to 0.
>>
>> I am proposing a fix in the attached v18 by adding a check in
>> StartLogicalReplication() and ensuring that ThisTimeLineID is retrieved.
>>
>>>
>>> - ResolveRecoveryConflictWithLogicalSlots() is racy - potentially
>>>    leading us to drop a slot that has been created since we signalled a
>>>    recovery conflict.  See
>>> https://www.postgresql.org/message-id/20210408020913.zzprrlvqyvlt5cyy%40alap3.anarazel.de
>>>
>>>    for some very similar issues.
>>
>> I have rewritten this part by following the same logic as the one
>> used in 96540f80f8 (the commit linked to the thread you mentioned).
>>
>>>
>>> - Given the precedent of max_slot_wal_keep_size, I think it's wrong to
>>>    just drop the logical slots. Instead we should just mark them as
>>>    invalid, like InvalidateObsoleteReplicationSlots().
>>
>> Makes fully sense and done that way in the attached patch.
>>
>> I am setting the slot's data.xmin and data.catalog_xmin as
>> InvalidTransactionId to mark the slot(s) as invalid in case of conflict.
>>
>>> - There's no tests covering timeline switches, what happens if
>>> there's a
>>>    promotion if logical decoding is currently ongoing.
>>
>> I'll now work on the tests.
>>
>>>
>>> - The way ResolveRecoveryConflictWithLogicalSlots() builds the error
>>>    message is not good (and I've complained about it before...).
>>
>> I changed it and made it more simple.
>>
>> I also removed the details around mentioning xmin or catalog xmin (as
>> I am not sure of the added value and they are currently also not
>> mentioned during standby recovery snapshot conflict).
>>
>>>
>>> Unfortunately I think the things I have found are too many for me to
>>> address within the given time. I'll send a version with a somewhat
>>> polished set of the changes I made in the next few days...
>>
>> Thanks for the review and feedback.
>>
>> Please find enclosed v18 with the changes I worked on.
>>
>> I still need to have a look on the tests.
>
> Please find enclosed v19 that also contains the changes related to
> your TAP tests remarks, mainly:
>
> - get rid of 024 and add more tests in 026 (025 has been used in the
> meantime)
>
> - test that logical decoding actually produces useful and correct results
>
> - test standby promotion and logical decoding behavior once done
>
> - useless "use" removal
>
> - check_confl_logicalslot() function removal
>
> - rewrote make_slot_active() to make use of poll_query_until() and
> timeout
>
> - remove the useless eval()
>
> - remove the "Catalog xmins should advance after standby logical slot
> fetches the changes" test
>
> One thing that's not clear to me is your remark "There's also no test
> for a recovery conflict due to row removal": Don't you think that the
> "vacuum full" conflict test is enough? if not, what kind of additional
> tests would you like to see?
>
>>
>> There is also the 10s delay to work on, do you already have an idea
>> on how we should handle it?
>>
>> Thanks
>>
>> Bertrand
>>
> Thanks
>
> Bertrand
>
Please find enclosed v20 a needed rebase (nothing serious worth
mentioning) of v19.

FWIW, just to sum up that v19 (and so v20):

- contained the changes (see details above) related to your TAP tests
remarks

- contained the changes (see details above) related to your code remarks

There is still the 10s delay thing that need work: do you already have
an idea on how we should handle it?

And still one thing that's not clear to me is your remark "There's also
no test for a recovery conflict due to row removal": Don't you think
that the "vacuum full" conflict test is enough? if not, what kind of
additional tests would you like to see?

Thanks

Bertrand





The patch does not apply and an updated patch is required.
patching file src/include/replication/slot.h
Hunk #1 FAILED at 214.
1 out of 2 hunks FAILED -- saving rejects to file src/include/replication/slot.h.rej



--
Ibrar Ahmed

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

Hi,

On 7/19/21 12:13 PM, Ibrar Ahmed wrote:

On Fri, Jul 16, 2021 at 1:07 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
Hi Andres,

On 6/22/21 12:38 PM, Drouvot, Bertrand wrote:
> Hi Andres,
>
> On 6/14/21 7:41 AM, Drouvot, Bertrand wrote:
>> Hi Andres,
>>
>> On 4/8/21 5:47 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2021-04-07 13:32:18 -0700, Andres Freund wrote:
>>>> While working on this I found a, somewhat substantial, issue:
>>>>
>>>> When the primary is idle, on the standby logical decoding via
>>>> walsender
>>>> will typically not process the records until further WAL writes
>>>> come in
>>>> from the primary, or until a 10s lapsed.
>>>>
>>>> The problem is that WalSndWaitForWal() waits for the *replay* LSN to
>>>> increase, but gets woken up by walreceiver when new WAL has been
>>>> flushed. Which means that typically walsenders will get woken up at
>>>> the
>>>> same time that the startup process will be - which means that by the
>>>> time the logical walsender checks GetXLogReplayRecPtr() it's unlikely
>>>> that the startup process already replayed the record and updated
>>>> XLogCtl->lastReplayedEndRecPtr.
>>>>
>>>> I think fixing this would require too invasive changes at this
>>>> point. I
>>>> think we might be able to live with 10s delay issue for one
>>>> release, but
>>>> it sure is ugly :(.
>>> This is indeed pretty painful. It's a lot more regularly occuring if
>>> you
>>> either have a slot disk, or you switch around the order of
>>> WakeupRecovery() and WalSndWakeup() XLogWalRcvFlush().
>>>
>>> - There's about which timeline to use. If you use pg_recvlogical and
>>> you
>>>    restart the server, you'll see errors like:
>>>
>>>    pg_recvlogical: error: unexpected termination of replication
>>> stream: ERROR:  requested WAL segment 000000000000000000000003 has
>>> already been removed
>>>
>>>    the real filename is 000000010000000000000003 - i.e. the timeline is
>>>    0.
>>>
>>>    This isn't too hard to fix, but definitely needs fixing.
>>
>> Thanks, nice catch!
>>
>> From what I have seen, we are not going through InitXLOGAccess() on a
>> Standby and in some cases (like the one you mentioned)
>> StartLogicalReplication() is called without IdentifySystem() being
>> called previously: this lead to ThisTimeLineID still set to 0.
>>
>> I am proposing a fix in the attached v18 by adding a check in
>> StartLogicalReplication() and ensuring that ThisTimeLineID is retrieved.
>>
>>>
>>> - ResolveRecoveryConflictWithLogicalSlots() is racy - potentially
>>>    leading us to drop a slot that has been created since we signalled a
>>>    recovery conflict.  See
>>> https://www.postgresql.org/message-id/20210408020913.zzprrlvqyvlt5cyy%40alap3.anarazel.de
>>>
>>>    for some very similar issues.
>>
>> I have rewritten this part by following the same logic as the one
>> used in 96540f80f8 (the commit linked to the thread you mentioned).
>>
>>>
>>> - Given the precedent of max_slot_wal_keep_size, I think it's wrong to
>>>    just drop the logical slots. Instead we should just mark them as
>>>    invalid, like InvalidateObsoleteReplicationSlots().
>>
>> Makes fully sense and done that way in the attached patch.
>>
>> I am setting the slot's data.xmin and data.catalog_xmin as
>> InvalidTransactionId to mark the slot(s) as invalid in case of conflict.
>>
>>> - There's no tests covering timeline switches, what happens if
>>> there's a
>>>    promotion if logical decoding is currently ongoing.
>>
>> I'll now work on the tests.
>>
>>>
>>> - The way ResolveRecoveryConflictWithLogicalSlots() builds the error
>>>    message is not good (and I've complained about it before...).
>>
>> I changed it and made it more simple.
>>
>> I also removed the details around mentioning xmin or catalog xmin (as
>> I am not sure of the added value and they are currently also not
>> mentioned during standby recovery snapshot conflict).
>>
>>>
>>> Unfortunately I think the things I have found are too many for me to
>>> address within the given time. I'll send a version with a somewhat
>>> polished set of the changes I made in the next few days...
>>
>> Thanks for the review and feedback.
>>
>> Please find enclosed v18 with the changes I worked on.
>>
>> I still need to have a look on the tests.
>
> Please find enclosed v19 that also contains the changes related to
> your TAP tests remarks, mainly:
>
> - get rid of 024 and add more tests in 026 (025 has been used in the
> meantime)
>
> - test that logical decoding actually produces useful and correct results
>
> - test standby promotion and logical decoding behavior once done
>
> - useless "use" removal
>
> - check_confl_logicalslot() function removal
>
> - rewrote make_slot_active() to make use of poll_query_until() and
> timeout
>
> - remove the useless eval()
>
> - remove the "Catalog xmins should advance after standby logical slot
> fetches the changes" test
>
> One thing that's not clear to me is your remark "There's also no test
> for a recovery conflict due to row removal": Don't you think that the
> "vacuum full" conflict test is enough? if not, what kind of additional
> tests would you like to see?
>
>>
>> There is also the 10s delay to work on, do you already have an idea
>> on how we should handle it?
>>
>> Thanks
>>
>> Bertrand
>>
> Thanks
>
> Bertrand
>
Please find enclosed v20 a needed rebase (nothing serious worth
mentioning) of v19.

FWIW, just to sum up that v19 (and so v20):

- contained the changes (see details above) related to your TAP tests
remarks

- contained the changes (see details above) related to your code remarks

There is still the 10s delay thing that need work: do you already have
an idea on how we should handle it?

And still one thing that's not clear to me is your remark "There's also
no test for a recovery conflict due to row removal": Don't you think
that the "vacuum full" conflict test is enough? if not, what kind of
additional tests would you like to see?

Thanks

Bertrand





The patch does not apply and an updated patch is required.
patching file src/include/replication/slot.h
Hunk #1 FAILED at 214.
1 out of 2 hunks FAILED -- saving rejects to file src/include/replication/slot.h.rej

Thanks for the warning, rebase done and new v21 version attached.

Bertrand

Attachment

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2021-07-27 09:23:48 +0200, Drouvot, Bertrand wrote:
> Thanks for the warning, rebase done and new v21 version attached.

Did you have a go at fixing the walsender race conditions I
(re-)discovered? Without fixing those I don't see this patch going in...

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Alvaro Herrera
Date:
On 2021-Jul-27, Drouvot, Bertrand wrote:

> diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c

> +bool
> +get_rel_logical_catalog(Oid relid)
> +{
> +    bool    res;
> +    Relation rel;
> +
> +    /* assume previously locked */
> +    rel = table_open(relid, NoLock);
> +    res = RelationIsAccessibleInLogicalDecoding(rel);
> +    table_close(rel, NoLock);
> +
> +    return res;
> +}

So RelationIsAccessibleInLogicalDecoding() does a cheap check for
wal_level which can be done without opening the table; I think this
function should be rearranged to avoid doing that when not needed.
Also, putting this function in lsyscache.c seems somewhat wrong since
it's not merely accessing the system caches ...

I think it would be better to move this elsewhere (relcache.c, proto in
relcache.h, perhaps call it RelationIdIsAccessibleInLogicalDecoding) and
short-circuit for the check that can be done before opening the table.
At least the GiST code appears to be able to call this several times per
vacuum run, so it makes sense to short-circuit it for the fast case.
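
For illustration, a rough (untested) sketch of the shape I have in mind,
using the function name suggested above:

    bool
    RelationIdIsAccessibleInLogicalDecoding(Oid relid)
    {
        bool        res;
        Relation    rel;

        /* cheap check first: nothing to do below wal_level = logical */
        if (!XLogLogicalInfoActive())
            return false;

        /* assume previously locked, as in the original function */
        rel = table_open(relid, NoLock);
        res = RelationIsAccessibleInLogicalDecoding(rel);
        table_close(rel, NoLock);

        return res;
    }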

... though looking at the GiST code again I wonder if it would be more
sensible to just stash the table's Relation pointer somewhere in the
context structs instead of opening and closing it time and again.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Investigación es lo que hago cuando no sé lo que estoy haciendo"
(Wernher von Braun)



Re: Minimal logical decoding on standbys

From
Ronan Dunklau
Date:
Le mardi 27 juillet 2021, 09:23:48 CEST Drouvot, Bertrand a écrit :
> Thanks for the warning, rebase done and new v21 version attached.
>
> Bertrand

Hello,

I've taken a look at this patch, and it looks like you addressed every prior
remark, including the race condition Andres was worried about.

As for the basics: make check-world and make installcheck-world pass.

I think the behaviour when dropping a database on the primary should be
documented, and proper procedures for handling it correctly should be
suggested.

Something along the lines of:

"If a database is dropped on the primary server, the logical replication slot
on the standby will be dropped as well. This means that you should ensure that
the client usually connected to this slot has had the opportunity to stream
the latest changes before the database is dropped."

As for the patches themselves, I only have two small comments to make.

In patch 0002, in InvalidateConflictingLogicalReplicationSlots, I don't see the
need to check for InvalidOid since we already check SlotIsLogical:

+        /* We are only dealing with *logical* slot conflicts. */
+        if (!SlotIsLogical(s))
+            continue;
+
+        /* not our database and we don't want all the database, skip */
+        if ((s->data.database != InvalidOid && s->data.database != dboid) &&
+            TransactionIdIsValid(xid))
+            continue;
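
Just to illustrate (an untested sketch), the check could then be reduced to:

    if (s->data.database != dboid && TransactionIdIsValid(xid))
        continue;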

In patch 0004, small typo in the test file:
+##################################################
+# Test standby promotion and logical decoding bheavior
+# after the standby gets promoted.
+##################################################

Thank you for working on this !

Regards,

--
Ronan Dunklau





Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi Andres,

On 7/27/21 7:22 PM, Andres Freund wrote:
> Hi,
>
> On 2021-07-27 09:23:48 +0200, Drouvot, Bertrand wrote:
>> Thanks for the warning, rebase done and new v21 version attached.
> Did you have a go at fixing the walsender race conditions I
> (re-)discovered? Without fixing those I don't see this patch going in...

These new patches should address all your previous code and TAP test 
remarks, except these 2 for which I would need your input:

1. The first one is linked to your remarks:
"

 > While working on this I found a, somewhat substantial, issue:
 >
 > When the primary is idle, on the standby logical decoding via walsender
 > will typically not process the records until further WAL writes come in
 > from the primary, or until a 10s lapsed.
 >
 > The problem is that WalSndWaitForWal() waits for the *replay* LSN to
 > increase, but gets woken up by walreceiver when new WAL has been
 > flushed. Which means that typically walsenders will get woken up at the
 > same time that the startup process will be - which means that by the
 > time the logical walsender checks GetXLogReplayRecPtr() it's unlikely
 > that the startup process already replayed the record and updated
 > XLogCtl->lastReplayedEndRecPtr.
 >
 > I think fixing this would require too invasive changes at this point. I
 > think we might be able to live with 10s delay issue for one release, but
 > it sure is ugly :(.

This is indeed pretty painful. It's a lot more regularly occuring if you
either have a slot disk, or you switch around the order of
WakeupRecovery() and WalSndWakeup() XLogWalRcvFlush().

"

Is that what you are referring to as the “walsender race conditions”?
If so, do you already have in mind a way to handle this? (I thought you 
already had a way in mind, hence the question.)

2. The second one is linked to your remark:

"There's also no test 
for a recovery conflict due to row removal"

Don't you think that the 
"vacuum full" conflict test is enough?

if not, what kind of additional 
tests would you like to see?


At the same time, I am attaching a new v22, as a rebase was needed since v21.

Thanks

Bertrand


Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi Alvaro,

On 7/28/21 5:26 PM, Alvaro Herrera wrote:
> On 2021-Jul-27, Drouvot, Bertrand wrote:
>
>> diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
>> +bool
>> +get_rel_logical_catalog(Oid relid)
>> +{
>> +     bool    res;
>> +     Relation rel;
>> +
>> +     /* assume previously locked */
>> +     rel = table_open(relid, NoLock);
>> +     res = RelationIsAccessibleInLogicalDecoding(rel);
>> +     table_close(rel, NoLock);
>> +
>> +     return res;
>> +}
> So RelationIsAccessibleInLogicalDecoding() does a cheap check for
> wal_level which can be done without opening the table; I think this
> function should be rearranged to avoid doing that when not needed.

Thanks for looking at it.


> Also, putting this function in lsyscache.c seems somewhat wrong since
> it's not merely accessing the system caches ...
>
> I think it would be better to move this elsewhere (relcache.c, proto in
> relcache.h, perhaps call it RelationIdIsAccessibleInLogicalDecoding) and
> short-circuit for the check that can be done before opening the table.
> At least the GiST code appears to be able to call this several times per
> vacuum run, so it makes sense to short-circuit it for the fast case.
>
> ... though looking at the GiST code again I wonder if it would be more
> sensible to just stash the table's Relation pointer somewhere in the
> context structs instead of opening and closing it time and again.

That does make sense, I'll look at it.

Bertrand




Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi Ronan,

Thanks for looking at it.

On 8/2/21 1:57 PM, Ronan Dunklau wrote:
> Le mardi 27 juillet 2021, 09:23:48 CEST Drouvot, Bertrand a écrit :
>> Thanks for the warning, rebase done and new v21 version attached.
>>
>> Bertrand
> Hello,
>
> I've taken a look at this patch, and it looks like you adressed every prior
> remark, including the race condition Andres was worried about.

I think there are still 2 points that need to be addressed (see [1])

>
> As for the basics: make check-world and make installcheck-world pass.
>
> I think the beahviour when dropping a database on the primary should be
> documented, and proper procedures for handling it correctly should be
> suggested.
>
> Something along the lines of:
>
> "If a database is dropped on the primary server, the logical replication slot
> on the standby will be dropped as well. This means that you should ensure that
> the client usually connected to this slot has had the opportunity to stream
> the latest changes before the database is dropped."

I am not sure we should highlight this as part of this patch.

I mean, the same would currently occur on a non-standby if you drop a 
database that has a replication slot linked to it.

>
> As for the patches themselves, I only have two small comments to make.
>
> In patch 0002, in InvalidateConflictingLogicalReplicationSlots, I don't see the
> need to check for an InvalidOid since we already check the SlotIsLogical:
>
> +               /* We are only dealing with *logical* slot conflicts. */
> +               if (!SlotIsLogical(s))
> +                       continue;
> +
> +               /* not our database and we don't want all the database,
> skip */
> +               if ((s->data.database != InvalidOid && s->data.database
> != dboid) && TransactionIdIsValid(xid))
> +                       continue;

Agree, v22 attached in [1] does remove the useless s->data.database != 
InvalidOid check, thanks!

>
> In patch 0004, small typo in the test file:
> +##################################################
> +# Test standby promotion and logical decoding bheavior
> +# after the standby gets promoted.
> +##################################################

Typo also fixed in v22, thanks!

Bertrand

[1]: 
https://www.postgresql.org/message-id/69aad0bf-697a-04e1-df6c-0920ec8fa528%40amazon.com




Re: Minimal logical decoding on standbys

From
Ronan Dunklau
Date:
Le lundi 2 août 2021, 17:31:46 CEST Drouvot, Bertrand a écrit :
> > I think the beahviour when dropping a database on the primary should be
> > documented, and proper procedures for handling it correctly should be
> > suggested.
> >
> > Something along the lines of:
> >
> > "If a database is dropped on the primary server, the logical replication
> > slot on the standby will be dropped as well. This means that you should
> > ensure that the client usually connected to this slot has had the
> > opportunity to stream the latest changes before the database is dropped."
>
> I am not sure we should highlight this as part of this patch.
>
> I mean, the same would currently occur on a non standby if you drop a
> database that has a replication slot linked to it.

The way I see it, the main difference is that you drop an additional object on
the standby, one that doesn't exist on the primary and that you don't
necessarily have any knowledge of there. As such, I thought it would be better
to be explicit about it and warn users of that possible case.


Regards,

--
Ronan Dunklau











Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2021-08-02 16:45:23 +0200, Drouvot, Bertrand wrote:
> On 7/27/21 7:22 PM, Andres Freund wrote:
> > On 2021-07-27 09:23:48 +0200, Drouvot, Bertrand wrote:
> > > Thanks for the warning, rebase done and new v21 version attached.
> > Did you have a go at fixing the walsender race conditions I
> > (re-)discovered? Without fixing those I don't see this patch going in...
>
> Those new patches should be addressing all your previous code and TAP tests
> remarks, except those 2 for which I would need your input:
>
> 1. The first one is linked to your remarks:
> "
>
> > While working on this I found a, somewhat substantial, issue:
> >
> > When the primary is idle, on the standby logical decoding via walsender
> > will typically not process the records until further WAL writes come in
> > from the primary, or until a 10s lapsed.
> >
> > The problem is that WalSndWaitForWal() waits for the *replay* LSN to
> > increase, but gets woken up by walreceiver when new WAL has been
> > flushed. Which means that typically walsenders will get woken up at the
> > same time that the startup process will be - which means that by the
> > time the logical walsender checks GetXLogReplayRecPtr() it's unlikely
> > that the startup process already replayed the record and updated
> > XLogCtl->lastReplayedEndRecPtr.
> >
> > I think fixing this would require too invasive changes at this point. I
> > think we might be able to live with 10s delay issue for one release, but
> > it sure is ugly :(.
>
> This is indeed pretty painful. It's a lot more regularly occuring if you
> either have a slot disk, or you switch around the order of
> WakeupRecovery() and WalSndWakeup() XLogWalRcvFlush().
>
> "
>
> Is that what you are referring to as the “walsender race conditions”?

Yes.


> If so, do you already have in mind a way to handle this? (I thought you
> already had a way in mind, hence the question.)

Yes. I think we need to add a condition variable to be able to wait for
WAL positions to change. Either multiple condition variables (one for
the flush position, one for the replay position), or one that just
changes more often. That way one can wait for apply without a race
condition.
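
Very roughly, something like this (names are made up; "replayedCV" would be
a new condition variable in shared memory that the startup process broadcasts
on after advancing lastReplayedEndRecPtr, and the wait event is just
illustrative):

    /* startup process, after updating XLogCtl->lastReplayedEndRecPtr */
    ConditionVariableBroadcast(&XLogCtl->replayedCV);

    /* logical walsender on a standby, waiting for replay rather than flush */
    ConditionVariablePrepareToSleep(&XLogCtl->replayedCV);
    while (GetXLogReplayRecPtr(NULL) < loc)
        ConditionVariableSleep(&XLogCtl->replayedCV,
                               WAIT_EVENT_WAL_SENDER_WAIT_WAL);
    ConditionVariableCancelSleep();

That way the walsender is woken once the record it needs has actually been
replayed, instead of when walreceiver has merely flushed it.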


> 2. The second one is linked to your remark:
>
> "There's also no test for a recovery conflict due to row removal"
>
> Don't you think that the "vacuum full" conflict test is enough?

It's not. It'll cause conflicts due to exclusive locks etc.


> If not, what kind of additional tests would you like to see?

A few catalog rows being removed (e.g. due to DELETE and then VACUUM
*without* full) and a standby without hot_standby_feedback catching
that.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 8/2/21 6:01 PM, Andres Freund wrote:
> While working on this I found a, somewhat substantial, issue:
>> If so, do you already have in mind a way to handle this? (I thought you
>> already had in mind a way to handle it so the question)
> Yes. I think we need to add a condition variable to be able to wait for
> WAL positions to change. Either multiple condition variables (one for
> the flush position, one for the replay position), or one that just
> changes more often. That way one can wait for apply without a race
> condition.
>
Thanks for the feedback.

Wouldn't a condition variable on the replay position be enough? I don't 
get why the proposed one on the flush position is needed.

>> If not, what kind of additional tests would you like to see?
> A few catalog rows being removed (e.g. due to DELETE and then VACUUM
> *without* full) and a standby without hot_standby_feedback catching
> that.

Test added in v23 attached.

Thanks

Bertrand


Attachment

Re: Minimal logical decoding on standbys

From
Peter Eisentraut
Date:
I noticed the tests added in this patch set are very slow.  Here are 
some of the timings:

...
[13:26:59] t/018_wal_optimize.pl ................ ok    13976 ms
[13:27:13] t/019_replslot_limit.pl .............. ok    10976 ms
[13:27:24] t/020_archive_status.pl .............. ok     6190 ms
[13:27:30] t/021_row_visibility.pl .............. ok     3227 ms
[13:27:33] t/022_crash_temp_files.pl ............ ok     2296 ms
[13:27:36] t/023_pitr_prepared_xact.pl .......... ok     3601 ms
[13:27:39] t/024_archive_recovery.pl ............ ok     3937 ms
[13:27:43] t/025_stuck_on_old_timeline.pl ....... ok     4348 ms
[13:27:47] t/026_standby_logical_decoding.pl .... ok   117730 ms  <<<

Is it possible to improve this?



Re: Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi Peter,

On 8/26/21 1:35 PM, Peter Eisentraut wrote:
>
> I noticed the tests added in this patch set are very slow.  Here are
> some of the timings:
>
> ...
> [13:26:59] t/018_wal_optimize.pl ................ ok    13976 ms
> [13:27:13] t/019_replslot_limit.pl .............. ok    10976 ms
> [13:27:24] t/020_archive_status.pl .............. ok     6190 ms
> [13:27:30] t/021_row_visibility.pl .............. ok     3227 ms
> [13:27:33] t/022_crash_temp_files.pl ............ ok     2296 ms
> [13:27:36] t/023_pitr_prepared_xact.pl .......... ok     3601 ms
> [13:27:39] t/024_archive_recovery.pl ............ ok     3937 ms
> [13:27:43] t/025_stuck_on_old_timeline.pl ....... ok     4348 ms
> [13:27:47] t/026_standby_logical_decoding.pl .... ok   117730 ms <<<
>
> Is it possible to improve this?

Thanks for looking at it.

Once the walsender race conditions mentioned by Andres in [1] are 
addressed, I think the tests should be much faster.

I'll try to have a look soon and come up with a proposal to address those 
race conditions.

Thanks

Bertrand

[1]: 
https://www.postgresql.org/message-id/20210802160133.uugcce5ql4m5mv5m%40alap3.anarazel.de





Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

Hi Andres,

On 8/6/21 1:27 PM, Drouvot, Bertrand wrote:
Hi,

On 8/2/21 6:01 PM, Andres Freund wrote:
While working on this I found a, somewhat substantial, issue:
If so, do you already have in mind a way to handle this? (I thought you
already had in mind a way to handle it so the question)
Yes. I think we need to add a condition variable to be able to wait for
WAL positions to change. Either multiple condition variables (one for
the flush position, one for the replay position), or one that just
changes more often. That way one can wait for apply without a race
condition.

Thanks for the feedback.

Wouldn't a condition variable on the replay position be enough? I don't get why the proposed one on the flush position is needed.

Please find enclosed a patch proposal to address those corner cases.

I think (but may be wrong) that a condition variable on the flush position would be needed only for the walsender(s) on a non-standby node, which is why:

  • I made use of a condition variable on the replay position only.
  • The walsender waits on it in WalSndWaitForWal() only if recovery is in progress.

To keep the discussion of those corner cases simple, this is a dedicated patch that can be applied on top of the v23 patches shared previously.

Thanks

Bertrand


if not, what kind of additional 
tests would you like to see?
A few catalog rows being removed (e.g. due to DELETE and then VACUUM
*without* full) and a standby without hot_standby_feedback catching
that.

Test added in v23 attached.

Thanks

Bertrand

Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

Hi Alvaro,

On 8/2/21 4:56 PM, Drouvot, Bertrand wrote:
Hi Alvaro,

On 7/28/21 5:26 PM, Alvaro Herrera wrote:
On 2021-Jul-27, Drouvot, Bertrand wrote:

diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
+bool
+get_rel_logical_catalog(Oid relid)
+{
+     bool    res;
+     Relation rel;
+
+     /* assume previously locked */
+     rel = table_open(relid, NoLock);
+     res = RelationIsAccessibleInLogicalDecoding(rel);
+     table_close(rel, NoLock);
+
+     return res;
+}
So RelationIsAccessibleInLogicalDecoding() does a cheap check for
wal_level which can be done without opening the table; I think this
function should be rearranged to avoid doing that when not needed.

Thanks for looking at it.


Also, putting this function in lsyscache.c seems somewhat wrong since
it's not merely accessing the system caches ...

I think it would be better to move this elsewhere (relcache.c, proto in
relcache.h, perhaps call it RelationIdIsAccessibleInLogicalDecoding) and
short-circuit for the check that can be done before opening the table.

So you have in mind to check for XLogLogicalInfoActive() first, and if true, then open the relation and call
RelationIsAccessibleInLogicalDecoding()?

If so, then what about also creating a new RelationIsAccessibleWhileLogicalWalLevel() or something like this doing the same as RelationIsAccessibleInLogicalDecoding() but without the XLogLogicalInfoActive() check?
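
I.e. something like this (a rough sketch only, using the name proposed above):

    /* like RelationIsAccessibleInLogicalDecoding(), minus the wal_level test,
     * which the caller would have already done before opening the relation */
    #define RelationIsAccessibleWhileLogicalWalLevel(relation) \
        (RelationNeedsWAL(relation) && \
         (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))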

At least the GiST code appears to be able to call this several times per
vacuum run, so it makes sense to short-circuit it for the fast case.

... though looking at the GiST code again I wonder if it would be more
sensible to just stash the table's Relation pointer somewhere in the
context structs

Do you have a "good place" in mind?

Thanks

Bertrand

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

Hi,

On 9/9/21 9:17 AM, Drouvot, Bertrand wrote:

Hi Alvaro,

On 8/2/21 4:56 PM, Drouvot, Bertrand wrote:
Hi Alvaro,

On 7/28/21 5:26 PM, Alvaro Herrera wrote:
On 2021-Jul-27, Drouvot, Bertrand wrote:

diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
+bool
+get_rel_logical_catalog(Oid relid)
+{
+     bool    res;
+     Relation rel;
+
+     /* assume previously locked */
+     rel = table_open(relid, NoLock);
+     res = RelationIsAccessibleInLogicalDecoding(rel);
+     table_close(rel, NoLock);
+
+     return res;
+}
So RelationIsAccessibleInLogicalDecoding() does a cheap check for
wal_level which can be done without opening the table; I think this
function should be rearranged to avoid doing that when not needed.

Thanks for looking at it.


Also, putting this function in lsyscache.c seems somewhat wrong since
it's not merely accessing the system caches ...

I think it would be better to move this elsewhere (relcache.c, proto in
relcache.h, perhaps call it RelationIdIsAccessibleInLogicalDecoding) and
short-circuit for the check that can be done before opening the table.

So you have in mind to check for XLogLogicalInfoActive() first, and if true, then open the relation and call
RelationIsAccessibleInLogicalDecoding()?

If so, then what about also creating a new RelationIsAccessibleWhileLogicalWalLevel() or something like this doing the same as RelationIsAccessibleInLogicalDecoding() but without the XLogLogicalInfoActive() check?

At least the GiST code appears to be able to call this several times per
vacuum run, so it makes sense to short-circuit it for the fast case.

... though looking at the GiST code again I wonder if it would be more
sensible to just stash the table's Relation pointer somewhere in the
context structs

Do you have a "good place" in mind?

Another rebase attached.

The patch proposal to address Andres's walsender corner cases is still a dedicated commit (as I think it may be easier to discuss that way).

Thanks

Bertrand

Attachment

Re: Minimal logical decoding on standbys

From
Fabrízio de Royes Mello
Date:


On Wed, Sep 15, 2021 at 8:36 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> Another rebase attached.
>
> The patch proposal to address Andre's walsender corner cases is still a dedicated commit (as i think it may be easier to discuss).
>

Did one more battery of tests and everything went well...

But doing some manually tests:

1. Setup master/replica (wal_level=logical, hot_standby_feedback=on, etc)
2. Initialize the master instance: "pgbench -i -s10 on master"
3. Terminal1: execute "pgbench -c20 -T 2000"
4. Terminal2: create the logical replication slot:

271480 (replica) fabrizio=# select * from pg_create_logical_replication_slot('test_logical', 'test_decoding');
-[ RECORD 1 ]-----------
slot_name | test_logical
lsn       | 1/C7C59E0

Time: 37658.725 ms (00:37.659)

5. Terminal3: start the pg_recvlogical

~/pgsql
➜ pg_recvlogical -p 5433 -S test_logical -d fabrizio -f - --start
pg_recvlogical: error: could not send replication command "START_REPLICATION SLOT "test_logical" LOGICAL 0/0": ERROR:  replication slot "test_logical" is active for PID 271480
pg_recvlogical: disconnected; waiting 5 seconds to try again
pg_recvlogical: error: could not send replication command "START_REPLICATION SLOT "test_logical" LOGICAL 0/0": ERROR:  replication slot "test_logical" is active for PID 271480
pg_recvlogical: disconnected; waiting 5 seconds to try again
pg_recvlogical: error: could not send replication command "START_REPLICATION SLOT "test_logical" LOGICAL 0/0": ERROR:  replication slot "test_logical" is active for PID 271480
pg_recvlogical: disconnected; waiting 5 seconds to try again
pg_recvlogical: error: could not send replication command "START_REPLICATION SLOT "test_logical" LOGICAL 0/0": ERROR:  replication slot "test_logical" is active for PID 271480
pg_recvlogical: disconnected; waiting 5 seconds to try again
pg_recvlogical: error: could not send replication command "START_REPLICATION SLOT "test_logical" LOGICAL 0/0": ERROR:  replication slot "test_logical" is active for PID 271480
pg_recvlogical: disconnected; waiting 5 seconds to try again
pg_recvlogical: error: could not send replication command "START_REPLICATION SLOT "test_logical" LOGICAL 0/0": ERROR:  replication slot "test_logical" is active for PID 271480
pg_recvlogical: disconnected; waiting 5 seconds to try again
pg_recvlogical: error: could not send replication command "START_REPLICATION SLOT "test_logical" LOGICAL 0/0": ERROR:  replication slot "test_logical" is active for PID 271480
pg_recvlogical: disconnected; waiting 5 seconds to try again
BEGIN 3767318
COMMIT 3767318
BEGIN 3767319
COMMIT 3767319
BEGIN 3767320
table public.pgbench_history: TRUNCATE: (no-flags)
COMMIT 3767320
BEGIN 3767323
table public.pgbench_accounts: UPDATE: aid[integer]:398507 bid[integer]:4 abalance[integer]:-1307 filler[character]:'                                                                                    '
table public.pgbench_tellers: UPDATE: tid[integer]:17 bid[integer]:2 tbalance[integer]:-775356 filler[character]:null
table public.pgbench_branches: UPDATE: bid[integer]:4 bbalance[integer]:1862180 filler[character]:null
table public.pgbench_history: INSERT: tid[integer]:17 bid[integer]:4 aid[integer]:398507 delta[integer]:182 mtime[timestamp without time zone]:'2021-09-17 17:25:19.811239' filler[character]:null
COMMIT 3767323
BEGIN 3767322
table public.pgbench_accounts: UPDATE: aid[integer]:989789 bid[integer]:10 abalance[integer]:1224 filler[character]:'                                                                                    '
table public.pgbench_tellers: UPDATE: tid[integer]:86 bid[integer]:9 tbalance[integer]:-283737 filler[character]:null
table public.pgbench_branches: UPDATE: bid[integer]:9 bbalance[integer]:1277609 filler[character]:null
table public.pgbench_history: INSERT: tid[integer]:86 bid[integer]:9 aid[integer]:989789 delta[integer]:-2934 mtime[timestamp without time zone]:'2021-09-17 17:25:19.811244' filler[character]:null
COMMIT 3767322

Even with activity on the primary, the creation of the logical replication slot took ~38s. Can we do something about it, or should we clarify the documentation even more?

Regards,

--
   Fabrízio de Royes Mello         Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

Hi,

On 9/17/21 10:32 PM, Fabrízio de Royes Mello wrote:

On Wed, Sep 15, 2021 at 8:36 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> Another rebase attached.
>
> The patch proposal to address Andre's walsender corner cases is still a dedicated commit (as i think it may be easier to discuss).
>

Did one more battery of tests and everything went well...

Thanks for looking at it!


But doing some manually tests:

1. Setup master/replica (wal_level=logical, hot_standby_feedback=on, etc)
2. Initialize the master instance: "pgbench -i -s10 on master"
3. Terminal1: execute "pgbench -c20 -T 2000"
4. Terminal2: create the logical replication slot:

271480 (replica) fabrizio=# select * from pg_create_logical_replication_slot('test_logical', 'test_decoding');
-[ RECORD 1 ]-----------
slot_name | test_logical
lsn       | 1/C7C59E0

Time: 37658.725 ms (00:37.659)


Even with activity on primary the creation of the logical replication slot took ~38s. Can we do something related to it or should we need to clarify even more the documentation?

For the logical slot creation on the standby, as we cannot do WAL writes, we have to wait for xl_running_xact to be logged on the primary and be replayed on the standby.

So we are somehow dependent on the checkpoints on the primary and LOG_SNAPSHOT_INTERVAL_MS.

If we want to get rid of this, what I could think of is the standby having to ask the primary to log a standby snapshot (until we get one we are happy with).

Or, we may just want to mention in the doc:

+     For a logical slot to be created, it builds a historic snapshot, for which
+     information of all the currently running transactions is essential. On
+     primary, this information is available, but on standby, this information
+     has to be obtained from primary. So, creating a logical slot on standby
+     may take a noticeable time.

Instead of:

+     For a logical slot to be created, it builds a historic snapshot, for which
+     information of all the currently running transactions is essential. On
+     primary, this information is available, but on standby, this information
+     has to be obtained from primary. So, slot creation may wait for some
+     activity to happen on the primary. If the primary is idle, creating a
+     logical slot on standby may take a noticeable time.

What do you think?

Thanks

Bertrand

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

Hi,

On 9/15/21 1:36 PM, Drouvot, Bertrand wrote:

Hi,

On 9/9/21 9:17 AM, Drouvot, Bertrand wrote:

Hi Alvaro,

On 8/2/21 4:56 PM, Drouvot, Bertrand wrote:
Hi Alvaro,

On 7/28/21 5:26 PM, Alvaro Herrera wrote:
On 2021-Jul-27, Drouvot, Bertrand wrote:

diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
+bool
+get_rel_logical_catalog(Oid relid)
+{
+     bool    res;
+     Relation rel;
+
+     /* assume previously locked */
+     rel = table_open(relid, NoLock);
+     res = RelationIsAccessibleInLogicalDecoding(rel);
+     table_close(rel, NoLock);
+
+     return res;
+}
So RelationIsAccessibleInLogicalDecoding() does a cheap check for
wal_level which can be done without opening the table; I think this
function should be rearranged to avoid doing that when not needed.

Thanks for looking at it.


Also, putting this function in lsyscache.c seems somewhat wrong since
it's not merely accessing the system caches ...

I think it would be better to move this elsewhere (relcache.c, proto in
relcache.h, perhaps call it RelationIdIsAccessibleInLogicalDecoding) and
short-circuit for the check that can be done before opening the table.

So you have in mind to check for XLogLogicalInfoActive() first, and if true, then open the relation and call
RelationIsAccessibleInLogicalDecoding()?

If so, then what about also creating a new RelationIsAccessibleWhileLogicalWalLevel() or something like this doing the same as RelationIsAccessibleInLogicalDecoding() but without the XLogLogicalInfoActive() check?

At least the GiST code appears to be able to call this several times per
vacuum run, so it makes sense to short-circuit it for the fast case.

... though looking at the GiST code again I wonder if it would be more
sensible to just stash the table's Relation pointer somewhere in the
context structs

Do you have a "good place" in mind?

Another rebase attached.

The patch proposal to address Andre's walsender corner cases is still a dedicated commit (as i think it may be easier to discuss).

Another rebase attached (mainly to fix TAP tests failing due to b3b4d8e68a).

@Andres, patch file number 6 contains an attempt to fix the walsender corner case you pointed out.

@Alvaro, I did not look at your remark yet. Do you have a "good place" in mind? (related to "just stash the table's Relation pointer somewhere in the context structs")

Given the size of this patch series, I'm wondering if we could start committing it piece by piece (while still working on the corner cases in parallel).

That would maximize the amount of coverage it gets in the v15 development cycle.

What do you think?

Thanks

Bertrand

Attachment

Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Wed, Oct 27, 2021 at 2:56 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
> So you have in mind to check for XLogLogicalInfoActive() first, and if true, then open the relation and call
> RelationIsAccessibleInLogicalDecoding()?

I think 0001 is utterly unacceptable. We cannot add calls to
table_open() in low-level functions like this. Suppose for example
that _bt_getbuf() calls _bt_log_reuse_page() which with 0001 applied
would call get_rel_logical_catalog(). _bt_getbuf() will have acquired
a buffer lock on the page. The idea that it's safe to call
table_open() while holding a buffer lock cannot be taken seriously.
That could do arbitrary amounts of work taking any number of other
buffer locks, which could easily deadlock (and the deadlock detector
wouldn't help, since these are lwlocks). Even if that were no issue,
we really, really do not want to write code that could result in large
numbers of additional calls to table_open() -- and _bt_getbuf() is
certainly a frequently-used function. I think that, in order to have
any chance of being acceptable, this would need to be restructured so
that it pulls data from an existing relcache entry that is known to be
valid, without attempting to create a new one. That is,
get_rel_logical_decoding() would need to take a Relation argument, not
an OID.

I also think it's super-weird that the value being logged is computed
using RelationIsAccessibleInLogicalDecoding(). That means that if
wal_level < logical, we'll set onCatalogTable = false in the xlog
record, regardless of whether that's true or not. Now I suppose it
won't matter, because presumably this field is only going to be
consulted for whatever purpose when logical replication is active, but
I object on principle to the idea of a field whose name suggests that
it means one thing and whose value is inconsistent with that
interpretation.

Regarding 0003, I notice that GetXLogReplayRecPtr() gets an extra
argument that is set to false everywhere except one place that is
inside the new code. That suggests to me that putting logic that the
other 15 callers don't need is not the right approach here. It also
looks like, in the one place where that argument does get passed as
true, LogStandbySnapshot() moves outside the retry loop. I think
that's unlikely to be correct.

I also notice that 0003 deletes a comment that says "We need to force
hot_standby_feedback to be enabled at all times so the primary cannot
remove rows we need," but also that this is the only mention of
hot_standby_feedback in the entire patch set. If the existing comment
that we need to do something about that is incorrect, we should update
it independently of this patch set to be correct. But if the existing
comment is correct then there ought to be something in the patch that
deals with it.

Another part of that same deleted comment says "We need to be able to
correctly and quickly identify the timeline LSN belongs to," but I
don't see what the patch does about that, either. I'm actually not
sure exactly what that's talking about, but today for unrelated
reasons I happened to be looking at logical_read_xlog_page(), which is
actually what caused me to look at this thread. In that function we
have, as the first two lines of executable code:

     XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
     sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);

The second line of code depends on the value of ThisTimeLineID. The
first line of code does too, because XLogReadDetermineTimeline() uses
that variable internally. If logical decoding is only allowed on a
primary, then there can't really be an issue here, because we will
have checked RecoveryInProgress() in
CheckLogicalDecodingRequirements() and ThisTimeLineID will have its
final value. But on a standby, I'm not sure that ThisTimeLineID even
has to be initialized here, and I really can't see any reason at all
why the value it contains is necessarily still current. This
function's sister, read_local_xlog_page(), contains a bunch of logic
that tries to make sure that we're always reading every record from
the right timeline, but there's nothing similar here. I think that
would likely have to be fixed in order for decoding to work on
standbys, but maybe I'm missing something.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2021-10-28 16:24:22 -0400, Robert Haas wrote:
> On Wed, Oct 27, 2021 at 2:56 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
> > So you have in mind to check for XLogLogicalInfoActive() first, and if true, then open the relation and call
> > RelationIsAccessibleInLogicalDecoding()?
> 
> I think 0001 is utterly unacceptable. We cannot add calls to
> table_open() in low-level functions like this. Suppose for example
> that _bt_getbuf() calls _bt_log_reuse_page() which with 0001 applied
> would call get_rel_logical_catalog(). _bt_getbuf() will have acquired
> a buffer lock on the page. The idea that it's safe to call
> table_open() while holding a buffer lock cannot be taken seriously.

Yes - that's pretty clearly a deadlock hazard. It shouldn't be too hard to fix, I
think. Possibly a bit more verbose than nice, but...

Alternatively we could propagate the information whether a relcache entry is
for a catalog from the table to the index. Then we'd not need to change the
btree code to pass the table down.
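
Something along these lines perhaps (just a sketch, the field name is made up,
and it glosses over when the table's relcache entry is actually at hand):

    /* when building the index's relcache entry, remember whether its table
     * is a (user) catalog, so that e.g. _bt_log_reuse_page() doesn't need to
     * table_open() the heap while holding a buffer lock */
    indexrel->rd_table_is_catalog =
        IsCatalogRelation(heaprel) || RelationIsUsedAsCatalogTable(heaprel);

    /* and in the btree WAL-logging code */
    xlrec_reuse.onCatalogTable =
        XLogLogicalInfoActive() && rel->rd_table_is_catalog;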


> That could do arbitrary amounts of work taking any number of other
> buffer locks, which could easily deadlock (and the deadlock detector
> wouldn't help, since these are lwlocks). Even if that were no issue,
> we really, really do not want to write code that could result in large
> numbers of additional calls to table_open() -- and _bt_getbuf() is
> certainly a frequently-used function.

The BTPageIsRecyclable() path hopefully less so. Not that that makes it OK.


> I think that, in order to have
> any chance of being acceptable, this would need to be restructured so
> that it pulls data from an existing relcache entry that is known to be
> valid, without attempting to create a new one. That is,
> get_rel_logical_decoding() would need to take a Relation argument, not
> an OID.

Hm? Once we have a relation we don't really need the helper function anymore.


> I also think it's super-weird that the value being logged is computed
> using RelationIsAccessibleInLogicalDecoding(). That means that if
> wal_level < logical, we'll set onCatalogTable = false in the xlog
> record, regardless of whether that's true or not. Now I suppose it
> won't matter, because presumably this field is only going to be
> consulted for whatever purpose when logical replication is active, but
> I object on principle to the idea of a field whose name suggests that
> it means one thing and whose value is inconsistent with that
> interpretation.

Hm. Not sure what a good solution for this is. I don't think we should make
the field independent of wal_level - it doesn't really mean anything with a
lower wal_level. And it increases, a bit, the illusion that the table is
guaranteed to be a system table or something. Perhaps the field name should
hint at this being logical-decoding related?


> I also notice that 0003 deletes a comment that says "We need to force
> hot_standby_feedback to be enabled at all times so the primary cannot
> remove rows we need," but also that this is the only mention of
> hot_standby_feedback in the entire patch set. If the existing comment
> that we need to do something about that is incorrect, we should update
> it independently of this patch set to be correct. But if the existing
> comment is correct then there ought to be something in the patch that
> deals with it.

The patch deals with this - we'll detect the removal of row versions that
aren't needed anymore and stop decoding. Of course you'll most of the time
want to use hs_feedback, but sometimes it'll also just be a companion slot on
the primary or such (think slots for failover or such).


> Another part of that same deleted comment says "We need to be able to
> correctly and quickly identify the timeline LSN belongs to," but I
> don't see what the patch does about that, either. I'm actually not
> sure exactly what that's talking about

Hm - could you expand on what you're unclear about re LSN->timeline? It's just
that we need to read a WAL page for a certain LSN, and for that we need the
timeline?


> , but today for unrelated
> reasons I happened to be looking at logical_read_xlog_page(), which is
> actually what caused me to look at this thread. In that function we
> have, as the first two lines of executable code:
> 
>      XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
>      sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
>
> The second line of code depends on the value of ThisTimeLineID. The
> first line of code does too, because XLogReadDetermineTimeline() uses
> that variable internally. If logical decoding is only allowed on a
> primary, then there can't really be an issue here, because we will
> have checked RecoveryInProgress() in
> CheckLogicalDecodingRequirements() and ThisTimeLineID will have its
> final value. But on a standby, I'm not sure that ThisTimeLineID even
> has to be initialized here, and I really can't see any reason at all
> why the value it contains is necessarily still current.

I think the code tries to deal with this via XLogReadDetermineTimeline(),
which limits up to where WAL is valid on the current timeline, based on the
timeline history file. But as you say, it does rely on ThisTimeLineID for
that, and it's not obvious why it's likely current, let alone guaranteed to be
current.


> This function's sister, read_local_xlog_page(), contains a bunch of logic
> that tries to make sure that we're always reading every record from the
> right timeline, but there's nothing similar here. I think that would likely
> have to be fixed in order for decoding to work on standbys, but maybe I'm
> missing something.

I think that part actually works - afaict they both rely on the same
XLogReadDetermineTimeline() for that job. What might be missing is
logic to update the target timeline.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Thu, Oct 28, 2021 at 5:07 PM Andres Freund <andres@anarazel.de> wrote:
> > I think that, in order to have
> > any chance of being acceptable, this would need to be restructured so
> > that it pulls data from an existing relcache entry that is known to be
> > valid, without attempting to create a new one. That is,
> > get_rel_logical_decoding() would need to take a Relation argument, not
> > an OID.
>
> Hm? Once we have a relation we don't really need the helper function anymore.

Well, that's fine, too.

> > I also think it's super-weird that the value being logged is computed
> > using RelationIsAccessibleInLogicalDecoding(). That means that if
> > wal_level < logical, we'll set onCatalogTable = false in the xlog
> > record, regardless of whether that's true or not. Now I suppose it
> > won't matter, because presumably this field is only going to be
> > consulted for whatever purpose when logical replication is active, but
> > I object on principle to the idea of a field whose name suggests that
> > it means one thing and whose value is inconsistent with that
> > interpretation.
>
> Hm. Not sure what a good solution for this is. I don't think we should make
> the field independent of wal_level - it doesn't really mean anything with a
> lower wal_level. And it increases the illusion that the table is guaranteed to
> be a system table or something a bit. Perhaps the field name should hint at
> this being logically decoding related?

Not sure - I don't know what this is for. I did wonder if maybe it
should be testing IsCatalogRelation(relation) ||
RelationIsUsedAsCatalogTable(relation) i.e.
RelationIsAccessibleInLogicalDecoding() with the removal of the
XLogLogicalInfoActive() and RelationNeedsWAL() tests. But since I
don't know what I'm talking about, all I can say for sure right now is
that the field name and the field contents don't seem to align.
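
For reference, the macro being discussed is defined in src/include/utils/rel.h
roughly as follows (as of the releases discussed here):

    #define RelationIsAccessibleInLogicalDecoding(relation) \
        (XLogLogicalInfoActive() && \
         RelationNeedsWAL(relation) && \
         (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))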

> > I also notice that 0003 deletes a comment that says "We need to force
> > hot_standby_feedback to be enabled at all times so the primary cannot
> > remove rows we need," but also that this is the only mention of
> > hot_standby_feedback in the entire patch set. If the existing comment
> > that we need to do something about that is incorrect, we should update
> > it independently of this patch set to be correct. But if the existing
> > comment is correct then there ought to be something in the patch that
> > deals with it.
>
> The patch deals with this - we'll detect the removal of row versions that
> decoding still needs and stop decoding. Of course you'll most of the time
> want to use hs_feedback, but sometimes it'll also just be a companion slot on
> the primary (think slots for failover or such).

Where and how does this happen?

> > Another part of that same deleted comment says "We need to be able to
> > correctly and quickly identify the timeline LSN belongs to," but I
> > don't see what the patch does about that, either. I'm actually not
> > sure exactly what that's talking about
>
> Hm - could you expand on what you're unclear about re LSN->timeline? It's just
> that we need to read a WAL page for a certain LSN, and for that we need the
> timeline?

I don't know - I'm trying to understand the meaning of a comment that
I think you wrote originally.

> > This function's sister, read_local_xlog_page(), contains a bunch of logic
> > that tries to make sure that we're always reading every record from the
> > right timeline, but there's nothing similar here. I think that would likely
> > have to be fixed in order for decoding to work on standbys, but maybe I'm
> > missing something.
>
> I think that part actually works - afaict they both rely on the same
> XLogReadDetermineTimeline() for that job. What might be missing is
> logic to update the target timeline.

Hmm, OK, perhaps I mis-spoke, but I think we're talking about the same
thing. read_local_xlog_page() has this:

          * RecoveryInProgress() will update ThisTimeLineID when it first
          * notices recovery finishes, so we only have to maintain it for the
          * local process until recovery ends.
          */
         if (!RecoveryInProgress())
             read_upto = GetFlushRecPtr();
         else
             read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
         tli = ThisTimeLineID;

That's a bulletproof guarantee that "tli" and "ThisTimeLineID" are up
to date. The other function has nothing similar.
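
To make the contrast concrete, the kind of refresh being talked about for
logical_read_xlog_page() could look roughly like this - just an illustration of
applying the same pattern there, using the pre-v15 calls quoted above, not the
actual patch:

    /*
     * Sketch only: refresh the backend-local ThisTimeLineID before relying
     * on it.  RecoveryInProgress() updates it once recovery has ended; while
     * still in recovery, GetXLogReplayRecPtr() reports the timeline
     * currently being replayed.
     */
    if (RecoveryInProgress())
        (void) GetXLogReplayRecPtr(&ThisTimeLineID);

    XLogReadDetermineTimeline(state, targetPagePtr, reqLen);
    sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);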

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 10/28/21 11:07 PM, Andres Freund wrote:
> Hi,
>
> On 2021-10-28 16:24:22 -0400, Robert Haas wrote:
>> On Wed, Oct 27, 2021 at 2:56 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>>> So you have in mind to check for XLogLogicalInfoActive() first, and if true, then open the relation and call
>>> RelationIsAccessibleInLogicalDecoding()?
>> I think 0001 is utterly unacceptable. We cannot add calls to
>> table_open() in low-level functions like this. Suppose for example
>> that _bt_getbuf() calls _bt_log_reuse_page() which with 0001 applied
>> would call get_rel_logical_catalog(). _bt_getbuf() will have acquired
>> a buffer lock on the page. The idea that it's safe to call
>> table_open() while holding a buffer lock cannot be taken seriously.
> Yes - that's pretty clearly a deadlock hazard. It shouldn't be too hard to fix, I
> think. Possibly a bit more verbose than nice, but...
>
> Alternatively we could propagate the information whether a relcache entry is
> for a catalog from the table to the index. Then we'd not need to change the
> btree code to pass the table down.

+1 for the idea of propagating to the index. If that sounds good to you 
too, I can try to have a look at it.

Thanks Robert and Andres for the feedback you have provided on the various 
sub-patches.

I now have in mind to work sub-patch by sub-patch (starting with 0001) 
and move to the next one once we agree that the current one is 
"ready".

I think that could help us get this new feature moving forward more 
"easily" - what do you think?

Thanks

Bertrand




Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 2/25/22 10:34 AM, Drouvot, Bertrand wrote:
> Hi,
>
> On 10/28/21 11:07 PM, Andres Freund wrote:
>> Hi,
>>
>> On 2021-10-28 16:24:22 -0400, Robert Haas wrote:
>>> On Wed, Oct 27, 2021 at 2:56 AM Drouvot, Bertrand 
>>> <bdrouvot@amazon.com> wrote:
>>>> So you have in mind to check for XLogLogicalInfoActive() first, and 
>>>> if true, then open the relation and call
>>>> RelationIsAccessibleInLogicalDecoding()?
>>> I think 0001 is utterly unacceptable. We cannot add calls to
>>> table_open() in low-level functions like this. Suppose for example
>>> that _bt_getbuf() calls _bt_log_reuse_page() which with 0001 applied
>>> would call get_rel_logical_catalog(). _bt_getbuf() will have acquired
>>> a buffer lock on the page. The idea that it's safe to call
>>> table_open() while holding a buffer lock cannot be taken seriously.
>> Yes - that's pretty clearly a deadlock hazard. It shouldn't be too hard 
>> to fix, I
>> think. Possibly a bit more verbose than nice, but...
>>
>> Alternatively we could propagate the information whether a relcache 
>> entry is
>> for a catalog from the table to the index. Then we'd not need to 
>> change the
>> btree code to pass the table down.
>
> +1 for the idea of propagating to the index. If that sounds good to 
> you too, I can try to have a look at it.
>
> Thanks Robert and Andres for the feedback you have provided on the 
> various sub-patches.
>
> I now have in mind to work sub-patch by sub-patch (starting with 0001) 
> and move to the next one once we agree that the current one is 
> "ready".
>
> I think that could help us get this new feature moving forward more 
> "easily" - what do you think?
>
> Thanks
>
> Bertrand
>
I'm going to re-create a CF entry for it, as:

- It seems there is a clear interest in the feature (given the time 
already spent on it and the number of people that have worked on it)

- I have in mind to resume working on it

- It would give more visibility in case others want to jump in

Hope that makes sense,

Thanks,

Bertrand




Re: Minimal logical decoding on standbys

From
Ibrar Ahmed
Date:


On Thu, Jun 30, 2022 at 1:49 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
Hi,

On 2/25/22 10:34 AM, Drouvot, Bertrand wrote:
> Hi,
>
> On 10/28/21 11:07 PM, Andres Freund wrote:
>> Hi,
>>
>> On 2021-10-28 16:24:22 -0400, Robert Haas wrote:
>>> On Wed, Oct 27, 2021 at 2:56 AM Drouvot, Bertrand
>>> <bdrouvot@amazon.com> wrote:
>>>> So you have in mind to check for XLogLogicalInfoActive() first, and
>>>> if true, then open the relation and call
>>>> RelationIsAccessibleInLogicalDecoding()?
>>> I think 0001 is utterly unacceptable. We cannot add calls to
>>> table_open() in low-level functions like this. Suppose for example
>>> that _bt_getbuf() calls _bt_log_reuse_page() which with 0001 applied
>>> would call get_rel_logical_catalog(). _bt_getbuf() will have acquired
>>> a buffer lock on the page. The idea that it's safe to call
>>> table_open() while holding a buffer lock cannot be taken seriously.
>> Yes - that's pretty clearly a deadlock hazard. It shouldn't be too hard
>> to fix, I
>> think. Possibly a bit more verbose than nice, but...
>>
>> Alternatively we could propagate the information whether a relcache
>> entry is
>> for a catalog from the table to the index. Then we'd not need to
>> change the
>> btree code to pass the table down.
>
> +1 for the idea of propagating to the index. If that sounds good to
> you too, I can try to have a look at it.
>
> Thanks Robert and Andres for the feedback you have provided on the
> various sub-patches.
>
> I now have in mind to work sub-patch by sub-patch (starting with 0001)
> and move to the next one once we agree that the current one is
> "ready".
>
> I think that could help us get this new feature moving forward more
> "easily" - what do you think?
>
> Thanks
>
> Bertrand
>
I'm going to re-create a CF entry for it, as:

- It seems there is a clear interest in the feature (given the time
already spent on it and the number of people that have worked on it)

- I have in mind to resume working on it

I have already done some research on that, I can definitely look at it.
 
- It would give more visibility in case others want to jump in

Hope that makes sense,

Thanks,

Bertrand



--
Ibrar Ahmed

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

Hi,

On 7/1/22 10:03 PM, Ibrar Ahmed wrote:

On Thu, Jun 30, 2022 at 1:49 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
I'm going to re-create a CF entry for it, as:

- It seems there is a clear interest in the feature (given the time
already spent on it and the number of people that have worked on it)

- I have in mind to resume working on it

I have already done some research on that, I can definitely look at it.

Thanks!

This feature proposal is currently made of 5 sub-patches:

0001: Add info in WAL records in preparation for logical slot conflict handling
0002: Handle logical slot conflicts on standby
0003: Allow logical decoding on standby.
0004: New TAP test for logical decoding on standby
0005: Doc changes describing details about logical decoding

I suggest that we focus on one sub-patch at a time.

I'll start with 0001 and come back with a rebase addressing Andres and Robert's previous comments.

Sounds good to you?

Thanks

-- 
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com

Re: Minimal logical decoding on standbys

From
Ibrar Ahmed
Date:


On Mon, Jul 4, 2022 at 6:12 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:

Hi,

On 7/1/22 10:03 PM, Ibrar Ahmed wrote:

On Thu, Jun 30, 2022 at 1:49 PM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
I'm going to re-create a CF entry for it, as:

- It seems there is a clear interest in the feature (given the time
already spent on it and the number of people that have worked on it)

- I have in mind to resume working on it

I have already done some research on that, I can definitely look at it.

Thanks!

This feature proposal is currently made of 5 sub-patches:

0001: Add info in WAL records in preparation for logical slot conflict handling
0002: Handle logical slot conflicts on standby
0003: Allow logical decoding on standby.
0004: New TAP test for logical decoding on standby
0005: Doc changes describing details about logical decoding

I suggest that we focus on one sub-patch at a time.

I'll start with 0001 and come back with a rebase addressing Andres and Robert's previous comments.

Sounds good to you?

Thanks

-- 
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
That's great I am looking at "0002: Handle logical slot conflicts on standby".


--
Ibrar Ahmed

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 10/28/21 11:07 PM, Andres Freund wrote:
> Hi,
>
> On 2021-10-28 16:24:22 -0400, Robert Haas wrote:
>> On Wed, Oct 27, 2021 at 2:56 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>>> So you have in mind to check for XLogLogicalInfoActive() first, and if true, then open the relation and call
>>> RelationIsAccessibleInLogicalDecoding()?
>> I think 0001 is utterly unacceptable. We cannot add calls to
>> table_open() in low-level functions like this. Suppose for example
>> that _bt_getbuf() calls _bt_log_reuse_page() which with 0001 applied
>> would call get_rel_logical_catalog(). _bt_getbuf() will have acquired
>> a buffer lock on the page. The idea that it's safe to call
>> table_open() while holding a buffer lock cannot be taken seriously.
> Yes - that's pretty clearly a deadlock hazard. It shouldn't be too hard to fix, I
> think. Possibly a bit more verbose than nice, but...
>
> Alternatively we could propagate the information whether a relcache entry is
> for a catalog from the table to the index. Then we'd not need to change the
> btree code to pass the table down.

Looking closer at RelationIsAccessibleInLogicalDecoding(), it seems to me 
that the missing part to be able to tell whether or not an index is for 
a catalog is the rd_options->user_catalog_table value of its related 
heap relation.

Then, a way to achieve that could be to:

- Add to Relation a new "heap_rd_options" representing the rd_options of 
the related heap relation when appropriate

- Trigger the related indexes relcache invalidations when an 
ATExecSetRelOptions() is triggered on a heap relation

- Write an equivalent of RelationIsUsedAsCatalogTable() for indexes that 
would make use of the heap_rd_options instead

Does that sound like a valid option to you or do you have another idea 
in mind to propagate the information whether a relcache entry is for a 
catalog from the table to the index?
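
To illustrate the last point, the index-level equivalent could look roughly
like this (heap_rd_options and the macro name below are assumptions made for
the sketch, not existing fields):

    /*
     * Sketch only: an index-level counterpart of RelationIsUsedAsCatalogTable(),
     * reading the owning table's reloptions as cached in the index's relcache
     * entry via the hypothetical new "heap_rd_options" field described above.
     */
    #define IndexIsUsedAsCatalogIndex(relation) \
        ((relation)->rd_rel->relkind == RELKIND_INDEX && \
         (relation)->heap_rd_options ? \
         ((StdRdOptions *) (relation)->heap_rd_options)->user_catalog_table : false)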

Regards,

-- 
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com




Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 7/6/22 3:30 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 10/28/21 11:07 PM, Andres Freund wrote:
>> Hi,
>>
>> On 2021-10-28 16:24:22 -0400, Robert Haas wrote:
>>> On Wed, Oct 27, 2021 at 2:56 AM Drouvot, Bertrand 
>>> <bdrouvot@amazon.com> wrote:
>>>> So you have in mind to check for XLogLogicalInfoActive() first, and 
>>>> if true, then open the relation and call
>>>> RelationIsAccessibleInLogicalDecoding()?
>>> I think 0001 is utterly unacceptable. We cannot add calls to
>>> table_open() in low-level functions like this. Suppose for example
>>> that _bt_getbuf() calls _bt_log_reuse_page() which with 0001 applied
>>> would call get_rel_logical_catalog(). _bt_getbuf() will have acquired
>>> a buffer lock on the page. The idea that it's safe to call
>>> table_open() while holding a buffer lock cannot be taken seriously.
>> Yes - that's pretty clearly a deadlock hazard. It shouldn't be too hard 
>> to fix, I
>> think. Possibly a bit more verbose than nice, but...
>>
>> Alternatively we could propagate the information whether a relcache 
>> entry is
>> for a catalog from the table to the index. Then we'd not need to 
>> change the
>> btree code to pass the table down.
> 
> Looking closer at RelationIsAccessibleInLogicalDecoding(), it seems to me 
> that the missing part to be able to tell whether or not an index is for 
> a catalog is the rd_options->user_catalog_table value of its related 
> heap relation.
> 
> Then, a way to achieve that could be to:
> 
> - Add to Relation a new "heap_rd_options" representing the rd_options of 
> the related heap relation when appropriate
> 
> - Trigger the related indexes relcache invalidations when an 
> ATExecSetRelOptions() is triggered on a heap relation
> 
> - Write an equivalent of RelationIsUsedAsCatalogTable() for indexes that 
> would make use of the heap_rd_options instead
> 
> Does that sound like a valid option to you or do you have another idea 
> in mind to propagate the information whether a relcache entry is for a 
> catalog from the table to the index?
> 

I ended up with the attached proposal to propagate the catalog 
information to the indexes.

The attached adds a new field "isusercatalog" in pg_index to indicate 
whether or not the index is linked to a table that has the storage 
parameter user_catalog_table set to true.

Then it defines new macros, including 
"IndexIsAccessibleInLogicalDecoding" making use of this new field.

This new macro replaces get_rel_logical_catalog() that was part of the 
previous patch version.
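
To give an idea of the intended shape, the new macros are roughly along these
lines (a sketch of the idea only; the attached patch is authoritative and its
macro bodies may differ):

    /* indisusercatalog is the new pg_index field described above */
    #define IndexIsUserCatalog(relation) \
        ((relation)->rd_index->indisusercatalog)

    #define IndexIsAccessibleInLogicalDecoding(relation) \
        (XLogLogicalInfoActive() && \
         RelationNeedsWAL(relation) && \
         (IsCatalogRelation(relation) || IndexIsUserCatalog(relation)))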

What do you think about this approach and the attached?

If that sounds reasonable, then I'll add tap tests for it and try to 
improve the way isusercatalog is propagated to the index(es) in case a 
reset is done on user_catalog_table on the table (currently in this POC 
patch, it's hardcoded to "false" which is the default value for 
user_catalog_table in boolRelOpts[]) (A better approach would probably 
be to retrieve the value from the table once the reset is done and 
then propagate it to the index(es).)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 9/30/22 2:11 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 7/6/22 3:30 PM, Drouvot, Bertrand wrote:
>> Hi,
>>
>> On 10/28/21 11:07 PM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2021-10-28 16:24:22 -0400, Robert Haas wrote:
>>>> On Wed, Oct 27, 2021 at 2:56 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>>>>> So you have in mind to check for XLogLogicalInfoActive() first, and if true, then open the relation and call
>>>>> RelationIsAccessibleInLogicalDecoding()?
>>>> I think 0001 is utterly unacceptable. We cannot add calls to
>>>> table_open() in low-level functions like this. Suppose for example
>>>> that _bt_getbuf() calls _bt_log_reuse_page() which with 0001 applied
>>>> would call get_rel_logical_catalog(). _bt_getbuf() will have acquired
>>>> a buffer lock on the page. The idea that it's safe to call
>>>> table_open() while holding a buffer lock cannot be taken seriously.
>>> Yes - that's pretty clearly a deadlock hazard. It shouldn't be too hard to fix, I
>>> think. Possibly a bit more verbose than nice, but...
>>>
>>> Alternatively we could propagate the information whether a relcache entry is
>>> for a catalog from the table to the index. Then we'd not need to change the
>>> btree code to pass the table down.
>>
>> Looking closer at RelationIsAccessibleInLogicalDecoding(), it seems to me that the missing part to be able to tell
>> whether or not an index is for a catalog is the rd_options->user_catalog_table value of its related heap relation.
 
>>
>> Then, a way to achieve that could be to:
>>
>> - Add to Relation a new "heap_rd_options" representing the rd_options of the related heap relation when appropriate
>>
>> - Trigger the related indexes relcache invalidations when an ATExecSetRelOptions() is triggered on a heap relation
>>
>> - Write an equivalent of RelationIsUsedAsCatalogTable() for indexes that would make use of the heap_rd_options
instead
>>
>> Does that sound like a valid option to you or do you have another idea in mind to propagate the information whether
>> a relcache entry is for a catalog from the table to the index?
 
>>
> 
> I ended up with the attached proposal to propagate the catalog information to the indexes.
> 
> The attached adds a new field "isusercatalog" in pg_index to indicate whether or not the index is linked to a table
> that has the storage parameter user_catalog_table set to true.
 
> 
> Then it defines new macros, including "IndexIsAccessibleInLogicalDecoding" making use of this new field.
> 
> This new macro replaces get_rel_logical_catalog() that was part of the previous patch version.
> 
> What do you think about this approach and the attached?
> 
> If that sounds reasonable, then I'll add tap tests for it and try to improve the way isusercatalog is propagated to
> the index(es) in case a reset is done on user_catalog_table on the table (currently in this POC patch, it's hardcoded to
> "false" which is the default value for user_catalog_table in boolRelOpts[]) (A better approach would probably be to
> retrieve the value from the table once the reset is done and then propagate it to the index(es).)
 

Please find attached a rebase to propagate the catalog information to the indexes.
It also takes care of the RESET on user_catalog_table (adding a new macro "HEAP_DEFAULT_USER_CATALOG_TABLE") and adds a
few tests in contrib/test_decoding/sql/ddl.sql.
 

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 11/25/22 11:26 AM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 9/30/22 2:11 PM, Drouvot, Bertrand wrote:
>> Hi,
>>
>> On 7/6/22 3:30 PM, Drouvot, Bertrand wrote:
>>> Hi,
>>>
>>> On 10/28/21 11:07 PM, Andres Freund wrote:
>>>> Hi,
>>>>
>>>> On 2021-10-28 16:24:22 -0400, Robert Haas wrote:
>>>>> On Wed, Oct 27, 2021 at 2:56 AM Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>>>>>> So you have in mind to check for XLogLogicalInfoActive() first, and if true, then open the relation and call
>>>>>> RelationIsAccessibleInLogicalDecoding()?
>>>>> I think 0001 is utterly unacceptable. We cannot add calls to
>>>>> table_open() in low-level functions like this. Suppose for example
>>>>> that _bt_getbuf() calls _bt_log_reuse_page() which with 0001 applied
>>>>> would call get_rel_logical_catalog(). _bt_getbuf() will have acquired
>>>>> a buffer lock on the page. The idea that it's safe to call
>>>>> table_open() while holding a buffer lock cannot be taken seriously.
>>>> Yes - that's pretty clearly a deadlock hazard. It shouldn't be too hard to fix, I
>>>> think. Possibly a bit more verbose than nice, but...
>>>>
>>>> Alternatively we could propagate the information whether a relcache entry is
>>>> for a catalog from the table to the index. Then we'd not need to change the
>>>> btree code to pass the table down.
>>>
>>> Looking closer at RelationIsAccessibleInLogicalDecoding(), it seems to me that the missing part to be able to tell
>>> whether or not an index is for a catalog is the rd_options->user_catalog_table value of its related heap relation.
 
>>>
>>> Then, a way to achieve that could be to:
>>>
>>> - Add to Relation a new "heap_rd_options" representing the rd_options of the related heap relation when
appropriate
>>>
>>> - Trigger the related indexes relcache invalidations when an ATExecSetRelOptions() is triggered on a heap relation
>>>
>>> - Write an equivalent of RelationIsUsedAsCatalogTable() for indexes that would make use of the heap_rd_options
instead
>>>
>>> Does that sound like a valid option to you or do you have another idea in mind to propagate the information whether
>>> a relcache entry is for a catalog from the table to the index?
 
>>>
>>
>> I ended up with the attached proposal to propagate the catalog information to the indexes.
>>
>> The attached adds a new field "isusercatalog" in pg_index to indicate whether or not the index is linked to a table
>> that has the storage parameter user_catalog_table set to true.
 
>>
>> Then it defines new macros, including "IndexIsAccessibleInLogicalDecoding" making use of this new field.
>>
>> This new macro replaces get_rel_logical_catalog() that was part of the previous patch version.
>>
>> What do you think about this approach and the attached?
>>
>> If that sounds reasonable, then I'll add tap tests for it and try to improve the way isusercatalog is propagated to
>> the index(es) in case a reset is done on user_catalog_table on the table (currently in this POC patch, it's hardcoded to
>> "false" which is the default value for user_catalog_table in boolRelOpts[]) (A better approach would probably be to
>> retrieve the value from the table once the reset is done and then propagate it to the index(es).)
 
> 
> Please find attached a rebase to propagate the catalog information to the indexes.
> It also takes care of the RESET on user_catalog_table (adding a new macro "HEAP_DEFAULT_USER_CATALOG_TABLE") and adds
> a few tests in contrib/test_decoding/sql/ddl.sql.
 

Please find attached a new patch series:

v27-0001-Add-info-in-WAL-records-in-preparation-for-logic.patch
v27-0002-Handle-logical-slot-conflicts-on-standby.patch
v27-0003-Allow-logical-decoding-on-standby.patch
v27-0004-New-TAP-test-for-logical-decoding-on-standby.patch
v27-0005-Doc-changes-describing-details-about-logical-dec.patch
v27-0006-Fixing-Walsender-corner-case-with-logical-decodi.patch

with the previous comments addressed, means mainly:

1/ don't call table_open() in low-level functions in 0001: this is done with a new field "isusercatalog" in pg_index to
indicate whether or not the index is linked to a table that has the storage parameter user_catalog_table set to true (we
may want to make this field "invisible" though). This new field is then used in the new
IndexIsAccessibleInLogicalDecoding macro (through IndexIsUserCatalog).
 

2/ Renaming the new field generated in the xlog record (to arrange conflict handling) from "onCatalogTable" to
"onCatalogAccessibleInLogicalDecoding" to avoid any confusion (see 0001).
 

3/ Making sure that "currTLI" is the current one in logical_read_xlog_page() (see 0003).

4/ Fixing the Walsender/startup process corner case: it's done in 0006 (I thought it is better to keep the other patches
purely "feature" related and to address this corner case separately to ease the review). The fix is making use of a new
condition variable "replayedCV" so that the startup process can broadcast to the walsender(s) once a replay is done.

Remarks:

- The new confl_active_logicalslot field added in pg_stat_database_conflicts (see 0002) is incremented only if the slot
being invalidated is active (I think it makes more sense in regard to the other fields too). In all the cases
(active/not active) the slot invalidation is reported in the logfile. The documentation update mentions this behavior
(see 0002).
 

- LogStandbySnapshot() being moved outside of the loop in ReplicationSlotReserveWal() (see 0003) is a proposal made by
Andres in [1] and I think it makes sense.
 

- TAP tests (see 0004) cover: logical decoding on standby behaving correctly, conflicts, slot
invalidations, standby promotion.
 


Looking forward to your feedback,


[1]: https://www.postgresql.org/message-id/20210406180231.qsnkyrgrm7gtxb73%40alap3.anarazel.de

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

> Hi,
> Please find attached a new patch series:
> 
> v27-0001-Add-info-in-WAL-records-in-preparation-for-logic.patch
> v27-0002-Handle-logical-slot-conflicts-on-standby.patch
> v27-0003-Allow-logical-decoding-on-standby.patch
> v27-0004-New-TAP-test-for-logical-decoding-on-standby.patch
> v27-0005-Doc-changes-describing-details-about-logical-dec.patch
> v27-0006-Fixing-Walsender-corner-case-with-logical-decodi.patch
> 
> with the previous comments addressed, means mainly:
> 
> 1/ don't call table_open() in low-level functions in 0001: this is done with a new field "isusercatalog" in pg_index
> to indicate whether or not the index is linked to a table that has the storage parameter user_catalog_table set to true
> (we may want to make this field "invisible" though). This new field is then used in the new
> IndexIsAccessibleInLogicalDecoding macro (through IndexIsUserCatalog).
 
> 
> 2/ Renaming the new field generated in the xlog record (to arrange conflict handling) from "onCatalogTable" to
> "onCatalogAccessibleInLogicalDecoding" to avoid any confusion (see 0001).
 
> 
> 3/ Making sure that "currTLI" is the current one in logical_read_xlog_page() (see 0003).
> 
> 4/ Fixing the Walsender/startup process corner case: it's done in 0006 (I thought it is better to keep the other patches
> purely "feature" related and to address this corner case separately to ease the review). The fix is making use of a new
> condition variable "replayedCV" so that the startup process can broadcast to the walsender(s) once a replay is done.
> 
> Remarks:
> 
> - The new confl_active_logicalslot field added in pg_stat_database_conflicts (see 0002) is incremented only if the
> slot being invalidated is active (I think it makes more sense in regard to the other fields too). In all the cases
> (active/not active) the slot invalidation is reported in the logfile. The documentation update mentions this behavior
> (see 0002).
 
> 
> - LogStandbySnapshot() being moved outside of the loop in ReplicationSlotReserveWal() (see 0003) is a proposal made
> by Andres in [1] and I think it makes sense.
 
> 
> - TAP tests (see 0004) cover: logical decoding on standby behaving correctly, conflicts, slot
> invalidations, standby promotion.
 
> 
> 
> Looking forward to your feedback,
> 
> 
> [1]: https://www.postgresql.org/message-id/20210406180231.qsnkyrgrm7gtxb73%40alap3.anarazel.de
> 

Please find attached v28 (mandatory rebase due to 8018ffbf58).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2022-12-07 10:00:25 +0100, Drouvot, Bertrand wrote:
> > Please find attached a new patch series:
> > 
> > v27-0001-Add-info-in-WAL-records-in-preparation-for-logic.patch
> > v27-0002-Handle-logical-slot-conflicts-on-standby.patch
> > v27-0003-Allow-logical-decoding-on-standby.patch
> > v27-0004-New-TAP-test-for-logical-decoding-on-standby.patch
> > v27-0005-Doc-changes-describing-details-about-logical-dec.patch
> > v27-0006-Fixing-Walsender-corner-case-with-logical-decodi.patch

This failed on cfbot [1]. The tap output [2] has the following bit:

[09:48:56.216](5.979s) not ok 26 - cannot read from logical replication slot
[09:48:56.223](0.007s) #   Failed test 'cannot read from logical replication slot'
#   at C:/cirrus/src/test/recovery/t/034_standby_logical_decoding.pl line 422.
...
Warning: unable to close filehandle GEN150 properly: Bad file descriptor during global destruction.
Warning: unable to close filehandle GEN155 properly: Bad file descriptor during global destruction.

The "unable to close filehandle" stuff in my experience indicates an IPC::Run
process that wasn't ended before the tap test ended.

Greetings,

Andres Freund

[1] https://cirrus-ci.com/task/5092676671373312
[2]
https://api.cirrus-ci.com/v1/artifact/task/5092676671373312/testrun/build/testrun/recovery/034_standby_logical_decoding/log/regress_log_034_standby_logical_decoding



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/7/22 6:58 PM, Andres Freund wrote:
> Hi,
> 
> On 2022-12-07 10:00:25 +0100, Drouvot, Bertrand wrote:
>>> Please find attached a new patch series:
>>>
>>> v27-0001-Add-info-in-WAL-records-in-preparation-for-logic.patch
>>> v27-0002-Handle-logical-slot-conflicts-on-standby.patch
>>> v27-0003-Allow-logical-decoding-on-standby.patch
>>> v27-0004-New-TAP-test-for-logical-decoding-on-standby.patch
>>> v27-0005-Doc-changes-describing-details-about-logical-dec.patch
>>> v27-0006-Fixing-Walsender-corner-case-with-logical-decodi.patch
> 
> This failed on cfbot [1]. The tap output [2] has the following bit:
> 
> [09:48:56.216](5.979s) not ok 26 - cannot read from logical replication slot
> [09:48:56.223](0.007s) #   Failed test 'cannot read from logical replication slot'
> #   at C:/cirrus/src/test/recovery/t/034_standby_logical_decoding.pl line 422.
> ...
> Warning: unable to close filehandle GEN150 properly: Bad file descriptor during global destruction.
> Warning: unable to close filehandle GEN155 properly: Bad file descriptor during global destruction.
> 
> The "unable to close filehandle" stuff in my experience indicates an IPC::Run
> process that wasn't ended before the tap test ended.
> 
> Greetings,
> 
> Andres Freund
> 
> [1] https://cirrus-ci.com/task/5092676671373312
> [2]
https://api.cirrus-ci.com/v1/artifact/task/5092676671373312/testrun/build/testrun/recovery/034_standby_logical_decoding/log/regress_log_034_standby_logical_decoding

Thanks for pointing out!

Please find attached V29 addressing this "Windows perl" issue: V29 changes the way the slot invalidation is tested and
adds a "handle->finish". That looks ok now (I launched several successful consecutive tests on my enabled cirrus-ci
repository).

V29 differs from V28 only in 0004 to workaround the above "Windows perl" issue.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/8/22 12:07 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 12/7/22 6:58 PM, Andres Freund wrote:
>> Hi,
>>
>> On 2022-12-07 10:00:25 +0100, Drouvot, Bertrand wrote:
>>>> Please find attached a new patch series:
>>>>
>>>> v27-0001-Add-info-in-WAL-records-in-preparation-for-logic.patch
>>>> v27-0002-Handle-logical-slot-conflicts-on-standby.patch
>>>> v27-0003-Allow-logical-decoding-on-standby.patch
>>>> v27-0004-New-TAP-test-for-logical-decoding-on-standby.patch
>>>> v27-0005-Doc-changes-describing-details-about-logical-dec.patch
>>>> v27-0006-Fixing-Walsender-corner-case-with-logical-decodi.patch
>>
>> This failed on cfbot [1]. The tap output [2] has the following bit:
>>
>> [09:48:56.216](5.979s) not ok 26 - cannot read from logical replication slot
>> [09:48:56.223](0.007s) #   Failed test 'cannot read from logical replication slot'
>> #   at C:/cirrus/src/test/recovery/t/034_standby_logical_decoding.pl line 422.
>> ...
>> Warning: unable to close filehandle GEN150 properly: Bad file descriptor during global destruction.
>> Warning: unable to close filehandle GEN155 properly: Bad file descriptor during global destruction.
>>
>> The "unable to close filehandle" stuff in my experience indicates an IPC::Run
>> process that wasn't ended before the tap test ended.
>>
>> Greetings,
>>
>> Andres Freund
>>
>> [1] https://cirrus-ci.com/task/5092676671373312
>> [2]
https://api.cirrus-ci.com/v1/artifact/task/5092676671373312/testrun/build/testrun/recovery/034_standby_logical_decoding/log/regress_log_034_standby_logical_decoding
> 
> Thanks for pointing out!
> 
> Please find attached V29 addressing this "Windows perl" issue: V29 changes the way the slot invalidation is tested
> and adds a "handle->finish". That looks ok now (I launched several successful consecutive tests on my enabled cirrus-ci
> repository).
> 
> V29 differs from V28 only in 0004 to workaround the above "Windows perl" issue.
> 
> Regards,
> 

Attaching V30, mandatory rebase due to 66dcb09246.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Sat, Dec 10, 2022 at 3:09 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
> Attaching V30, mandatory rebase due to 66dcb09246.

It's a shame that this hasn't gotten more attention, because the topic
is important, but I'm as guilty of being too busy to spend a lot of
time on it as everyone else.

Anyway, while I'm not an expert on this topic, I did spend a little
time looking at it today, especially 0001. Here are a few comments:

I think that it's not good for IndexIsAccessibleInLogicalDecoding and
RelationIsAccessibleInLogicalDecoding to both exist. Indexes and
tables are types of relations, so this invites confusion: when the
object in question is an index, it would seem that either one can be
applied, based on the names. I think the real problem here is that
RelationIsAccessibleInLogicalDecoding is returning *the wrong answer*
when the relation is a user-catalog table. It does so because it
relies on RelationIsUsedAsCatalogTable, and that macro relies on
checking whether the reloptions include user_catalog_table.
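
For reference, that macro is currently defined in src/include/utils/rel.h
roughly as:

    #define RelationIsUsedAsCatalogTable(relation) \
        ((relation)->rd_options && \
         ((relation)->rd_rel->relkind == RELKIND_RELATION || \
          (relation)->rd_rel->relkind == RELKIND_MATVIEW) ? \
         ((StdRdOptions *) (relation)->rd_options)->user_catalog_table : false)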

But here we can see where past thinking of this topic has been,
perhaps, a bit fuzzy. If that option were called user_catalog_relation
and had to be set on both tables and indexes as appropriate, then
RelationIsAccessibleInLogicalDecoding would already be doing the right
thing, and consequently there would be no need to add
IndexIsAccessibleInLogicalDecoding. I think we should explore the idea
of making the existing macro return the correct answer rather than
adding a new one. It's probably too late to redefine the semantics of
user_catalog_table, although if anyone wants to argue that we could
require logical decoding plugins to set this for both indexes and
tables, and/or rename to say relation instead of table, and/or add a
parallel reloption called user_catalog_index, then let's talk about
that.

Otherwise, I think we can consider adjusting the definition of
RelationIsUsedAsCatalogTable. The simplest way to do that would be to
make it check indisusercatalog for indexes and do what it does already
for tables. Then IndexIsUserCatalog and
IndexIsAccessibleInLogicalDecoding go away and
RelationIsAccessibleInLogicalDecoding returns the right answer in all
cases.
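
Concretely, that adjustment could look roughly like this (a sketch only,
assuming the proposed pg_index.indisusercatalog column; not committed code):

    #define RelationIsUsedAsCatalogTable(relation) \
        ((relation)->rd_rel->relkind == RELKIND_INDEX ? \
         ((relation)->rd_index != NULL && (relation)->rd_index->indisusercatalog) : \
         ((relation)->rd_options && \
          ((relation)->rd_rel->relkind == RELKIND_RELATION || \
           (relation)->rd_rel->relkind == RELKIND_MATVIEW) ? \
          ((StdRdOptions *) (relation)->rd_options)->user_catalog_table : false))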

But I also wonder if a new pg_index column is really the right
approach here. One fairly obvious alternative is to try to use the
user_catalog_table reloption in both places. We could try to propagate
that reloption from the table to its indexes; whenever it's set or
unset on the table, push that down to each index. We'd have to take
care not to let the property be changed independently on indexes,
though. This feels a little grotty to me, but it does have the
advantage of symmetry. Another way to get symmetry is to go the other
way and add a new column pg_class.relusercatalog which gets used
instead of putting user_catalog_table in the reloptions, and
propagated down to indexes. But I have a feeling that the reloptions
code is not very well-structured to allow reloptions to be stored any
place but in pg_class.reloptions, so this may be difficult to
implement. Yet a third way is to have the index fetch the flag from
the associated table, perhaps when the relcache entry is built. But I
see no existing precedent for that in RelationInitIndexAccessInfo,
which I think is where it would be if we had it -- and that makes me
suspect that there might be good reasons why this isn't actually safe.
So while I do not really like the approach of storing the same
property in different ways for tables and for indexes, it's also not
really obvious to me how to do better.

Regarding the new flags that have been added to various WAL records, I
am a bit curious as to whether there's some way that we can avoid the
need to carry this information through the WAL, but I don't understand
why we don't need that now and do need that with this patch so it's
hard for me to think about that question in an intelligent way. If we
do need it, I think there might be cases where we should do something
smarter than just adding bool onCatalogAccessibleInLogicalDecoding to
the beginning of a whole bunch of WAL structs. In most cases we try to
avoid having padding bytes in the WAL struct. If we can, we try to lay
out the struct to avoid padding bytes. If we can't, we put the fields
requiring less alignment at the end of the struct and then have a
SizeOf<whatever> macro that is defined to not include the length of
any trailing padding which the compiler would insert. See, for
example, SizeOfHeapDelete. This patch doesn't do any of that, and it
should. It should also consider whether there's a way to avoid adding
any new bytes at all, e.g. it adds
onCatalogAccessibleInLogicalDecoding to xl_heap_visible, but that
struct has unused bits in 'flags'.
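
For illustration, the trailing-field-plus-SizeOf pattern referred to above
looks like this on a made-up record (xl_example_prune is hypothetical, purely
to show the idea; it is not an actual WAL record):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;     /* stand-in for the PostgreSQL typedef */

    /*
     * The single-byte field goes last, so it needs no alignment padding
     * before it, and the SizeOf macro stops right after it so the trailing
     * padding the compiler adds to the struct is never written to WAL.
     */
    typedef struct xl_example_prune
    {
        TransactionId latestRemovedXid;
        uint16_t    nredirected;
        uint16_t    ndead;
        bool        onCatalogAccessibleInLogicalDecoding;   /* least aligned, last */
    } xl_example_prune;

    #define SizeOfExamplePrune \
        (offsetof(xl_example_prune, onCatalogAccessibleInLogicalDecoding) + sizeof(bool))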

It would be very helpful if there were some place to refer to that
explained the design decisions here, like why the feature we're trying
to get requires this infrastructure around indexes to be added. It
could be in the commit messages, an email message, a README, or
whatever, but right now I don't see it anywhere in here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/12/22 6:41 PM, Robert Haas wrote:
> On Sat, Dec 10, 2022 at 3:09 AM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>> Attaching V30, mandatory rebase due to 66dcb09246.
> 
> It's a shame that this hasn't gotten more attention, because the topic
> is important, but I'm as guilty of being too busy to spend a lot of
> time on it as everyone else.

Thanks for looking at it! Yeah, I think this is an important feature too.

> 
> Anyway, while I'm not an expert on this topic, 

Then, that makes two of us ;-)
I "just" resurrected this very old thread and am doing the best I can to keep it moving forward.

> I did spend a little
> time looking at it today, especially 0001. Here are a few comments:
> 
> I think that it's not good for IndexIsAccessibleInLogicalDecoding and
> RelationIsAccessibleInLogicalDecoding to both exist. Indexes and
> tables are types of relations, so this invites confusion: when the
> object in question is an index, it would seem that either one can be
> applied, based on the names.

Agree.

> I think the real problem here is that
> RelationIsAccessibleInLogicalDecoding is returning *the wrong answer*
> when the relation is a user-catalog table. It does so because it
> relies on RelationIsUsedAsCatalogTable, and that macro relies on
> checking whether the reloptions include user_catalog_table.
> 

I think the macro is returning the right answer when the relation is a user-catalog table.
I think the purpose is to identify relations that are permitted READ only access during logical decoding.
Those are the ones that have been created by initdb in the pg_catalog schema, or have been marked as user provided
catalog tables (that's what is documented in [1]).
 

Or did you mean when the relation is "NOT" a user-catalog table?

> But here we can see where past thinking of this topic has been,
> perhaps, a bit fuzzy. If that option were called user_catalog_relation
> and had to be set on both tables and indexes as appropriate, then
> RelationIsAccessibleInLogicalDecoding would already be doing the right
> thing, and consequently there would be no need to add
> IndexIsAccessibleInLogicalDecoding. 

Yeah, agree.

> I think we should explore the idea
> of making the existing macro return the correct answer rather than
> adding a new one. It's probably too late to redefine the semantics of
> user_catalog_table, although if anyone wants to argue that we could
> require logical decoding plugins to set this for both indexes and
> tables, and/or rename to say relation instead of table, and/or add a
> parallel reloption called user_catalog_index, then let's talk about
> that.
> 
> Otherwise, I think we can consider adjusting the definition of
> RelationIsUsedAsCatalogTable. The simplest way to do that would be to
> make it check indisusercatalog for indexes and do what it does already
> for tables. Then IndexIsUserCatalog and
> IndexIsAccessibleInLogicalDecoding go away and
> RelationIsAccessibleInLogicalDecoding returns the right answer in all
> cases.
> 

That does sound like a valid option to me too, I'll look at it.

> But I also wonder if a new pg_index column is really the right
> approach here. One fairly obvious alternative is to try to use the
> user_catalog_table reloption in both places. We could try to propagate
> that reloption from the table to its indexes; whenever it's set or
> unset on the table, push that down to each index. We'd have to take
> care not to let the property be changed independently on indexes,
> though. This feels a little grotty to me, but it does have the
> advantage of symmetry. 

I thought about this approach too when working on it. But I thought it would be "weird" to start to propagate option(s)
from table(s) to index(es). I mean, if that's an "option" why should it be propagated?

Furthermore, it seems to me that this option behaves more like a property (that affects logical decoding), more
like logged/unlogged (being reflected in pg_class.relpersistence, not in reloptions).
 

> Another way to get symmetry is to go the other
> way and add a new column pg_class.relusercatalog which gets used
> instead of putting user_catalog_table in the reloptions, and
> propagated down to indexes. 

Yeah, agree (see my previous point above).

> But I have a feeling that the reloptions
> code is not very well-structured to allow reloptions to be stored any
> place but in pg_class.reloptions, so this may be difficult to
> implement. 

Why don't remove this "property" from reloptions? (would probably need much more changes that the current approach and
probablytake care of upgrade scenario too).
 
I did not look in details but logged/unlogged is also propagated to the indexes, so maybe we could use the same
approachhere. But is it worth the probably added complexity (as compare to the current approach)?
 

> Yet a third way is to have the index fetch the flag from
> the associated table, perhaps when the relcache entry is built. But I
> see no existing precedent for that in RelationInitIndexAccessInfo,
> which I think is where it would be if we had it -- and that makes me
> suspect that there might be good reasons why this isn't actually safe.
> So while I do not really like the approach of storing the same
> property in different ways for tables and for indexes, it's also not
> really obvious to me how to do better.
> 

I share the same thought and that's why I ended up doing it that way.

> Regarding the new flags that have been added to various WAL records, I
> am a bit curious as to whether there's some way that we can avoid the
> need to carry this information through the WAL, but I don't understand
> why we don't need that now and do need that with this patch so it's
> hard for me to think about that question in an intelligent way. If we
> do need it, I think there might be cases where we should do something
> smarter than just adding bool onCatalogAccessibleInLogicalDecoding to
> the beginning of a whole bunch of WAL structs. In most cases we try to
> avoid having padding bytes in the WAL struct. If we can, we try to lay
> out the struct to avoid padding bytes. If we can't, we put the fields
> requiring less alignment at the end of the struct and then have a
> SizeOf<whatever> macro that is defined to not include the length of
> any trailing padding which the compiler would insert. See, for
> example, SizeOfHeapDelete. This patch doesn't do any of that, and it
> should. It should also consider whether there's a way to avoid adding
> any new bytes at all, e.g. it adds
> onCatalogAccessibleInLogicalDecoding to xl_heap_visible, but that
> struct has unused bits in 'flags'.

Thanks for the hints! I'll look at it.

> 
> It would be very helpful if there were some place to refer to that
> explained the design decisions here, like why the feature we're trying
> to get requires this infrastructure around indexes to be added. It
> could be in the commit messages, an email message, a README, or
> whatever, but right now I don't see it anywhere in here.
> 

Like adding something along those lines in the commit message?

"
On a primary database, any catalog rows that may be needed by a logical decoding replication slot are not removed.
This is done thanks to the catalog_xmin associated with the logical replication slot.

With logical decoding on standby, in the following cases:

- hot_standby_feedback is off
- hot_standby_feedback is on but there is no physical slot between the primary and the standby. Then,
hot_standby_feedback will work, but only while the connection is alive (for example a node restart would break it)

Then the primary may delete system catalog rows that could be needed by the logical decoding on the standby, so it's
mandatory to identify those rows and invalidate the slots that may need them, if any.
 
"

[1]: https://www.postgresql.org/docs/current/logicaldecoding-output-plugin.html (Section 49.6.2. Capabilities)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Tue, Dec 13, 2022 at 5:49 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
> > I think the real problem here is that
> > RelationIsAccessibleInLogicalDecoding is returning *the wrong answer*
> > when the relation is a user-catalog table. It does so because it
> > relies on RelationIsUsedAsCatalogTable, and that macro relies on
> > checking whether the reloptions include user_catalog_table.
>
> [ confusion ]

Sorry, I meant: RelationIsAccessibleInLogicalDecoding is returning
*the wrong answer* when the relation is an *INDEX*.

> > But I have a feeling that the reloptions
> > code is not very well-structured to allow reloptions to be stored any
> > place but in pg_class.reloptions, so this may be difficult to
> > implement.
>
> Why don't remove this "property" from reloptions? (would probably need much more changes that the current approach
andprobably take care of upgrade scenario too). 
> I did not look in details but logged/unlogged is also propagated to the indexes, so maybe we could use the same
approachhere. But is it worth the probably added complexity (as compare to the current approach)? 

I feel like changing the user-facing syntax is probably not a great
idea, as it inflicts upgrade pain that I don't see how we can really
fix.

> > It would be very helpful if there were some place to refer to that
> > explained the design decisions here, like why the feature we're trying
> > to get requires this infrastructure around indexes to be added. It
> > could be in the commit messages, an email message, a README, or
> > whatever, but right now I don't see it anywhere in here.
>
> Like adding something around those lines in the commit message?
>
> "
> On a primary database, any catalog rows that may be needed by a logical decoding replication slot are not removed.
> This is done thanks to the catalog_xmin associated with the logical replication slot.
>
> With logical decoding on standby, in the following cases:
>
> - hot_standby_feedback is off
> - hot_standby_feedback is on but there is no physical slot between the primary and the standby. Then,
> hot_standby_feedback will work, but only while the connection is alive (for example a node restart would break it)
>
> Then the primary may delete system catalog rows that could be needed by the logical decoding on the standby, so it's
> mandatory to identify those rows and invalidate the slots that may need them, if any.
> "

This is very helpful, yes. I think perhaps we need to work some of
this into the code comments someplace, but getting it into the commit
message would be a good first step.

What I infer from the above is that the overall design looks like this:

- We want to enable logical decoding on standbys, but replay of WAL
from the primary might remove data that is needed by logical decoding,
causing replication conflicts much as hot standby does.
- Our chosen strategy for dealing with this type of replication
conflict is to invalidate logical slots for which needed data has been removed.
- To do this we need the latestRemovedXid for each change, just as we
do for physical replication conflicts, but we also need to know
whether any particular change was to data that logical replication
might access.
- We can't rely on the standby's relcache entries for this purpose in
any way, because the WAL record that causes the problem might be
replayed before the standby even reaches consistency. (Is this true? I
think so.)
- Therefore every WAL record that potentially removes data from the
index or heap must carry a flag indicating whether or not it is one
that might be accessed during logical decoding.

Does that sound right?

It seems kind of unfortunate to have to add payload to a whole bevy of
record types for this feature. I think it's worth it, both because the
feature is extremely important, and also because there aren't any
record types that fall into this category that are going to be emitted
so frequently as to make it a performance problem. But it's certainly
more complicated than one might wish.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/13/22 2:50 PM, Robert Haas wrote:
> On Tue, Dec 13, 2022 at 5:49 AM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>> I think the real problem here is that
>>> RelationIsAccessibleInLogicalDecoding is returning *the wrong answer*
>>> when the relation is a user-catalog table. It does so because it
>>> relies on RelationIsUsedAsCatalogTable, and that macro relies on
>>> checking whether the reloptions include user_catalog_table.
>>
>> [ confusion ]
> 
> Sorry, I meant: RelationIsAccessibleInLogicalDecoding is returning
> *the wrong answer* when the relation is an *INDEX*.
> 

Yeah, agree. Will fix it in the next patch proposal (adding the index case in it as you proposed up-thread).

>>> It would be very helpful if there were some place to refer to that
>>> explained the design decisions here, like why the feature we're trying
>>> to get requires this infrastructure around indexes to be added. It
>>> could be in the commit messages, an email message, a README, or
>>> whatever, but right now I don't see it anywhere in here.
>>
>> Like adding something around those lines in the commit message?
>>
>> "
>> On a primary database, any catalog rows that may be needed by a logical decoding replication slot are not removed.
>> This is done thanks to the catalog_xmin associated with the logical replication slot.
>>
>> With logical decoding on standby, in the following cases:
>>
>> - hot_standby_feedback is off
>> - hot_standby_feedback is on but there is no physical slot between the primary and the standby. Then,
>> hot_standby_feedback will work, but only while the connection is alive (for example a node restart would break it)
>>
>> Then the primary may delete system catalog rows that could be needed by the logical decoding on the standby, so it's
>> mandatory to identify those rows and invalidate the slots that may need them, if any.
 
>> "
> 
> This is very helpful, yes. I think perhaps we need to work some of
> this into the code comments someplace, but getting it into the commit
> message would be a good first step.

Thanks, will do.

> 
> What I infer from the above is that the overall design looks like this:
> 
> - We want to enable logical decoding on standbys, but replay of WAL
> from the primary might remove data that is needed by logical decoding,
> causing replication conflicts much as hot standby does.
> - Our chosen strategy for dealing with this type of replication
> conflict is to invalidate logical slots for which needed data has been removed.
> - To do this we need the latestRemovedXid for each change, just as we
> do for physical replication conflicts, but we also need to know
> whether any particular change was to data that logical replication
> might access.
> - We can't rely on the standby's relcache entries for this purpose in
> any way, because the WAL record that causes the problem might be
> replayed before the standby even reaches consistency. (Is this true? I
> think so.)
> - Therefore every WAL record that potentially removes data from the
> index or heap must carry a flag indicating whether or not it is one
> that might be accessed during logical decoding.
> 
> Does that sound right?
> 

Yeah, that sounds all right to me.
One option could be to add my proposed wording in the commit message and put your wording above in a README.

> It seems kind of unfortunate to have to add payload to a whole bevy of
> record types for this feature. I think it's worth it, both because the
> feature is extremely important,

Agree, and I don't think that there is any other option than adding some payload in some WAL records (at the very beginning
the proposal was to periodically log a new record
that announces the current catalog xmin horizon).

> and also because there aren't any
> record types that fall into this category that are going to be emitted
> so frequently as to make it a performance problem. 

+1

If there are no objections from your side, I'll submit a patch proposal by tomorrow, which:

- gets rid of IndexIsAccessibleInLogicalDecoding
- lets RelationIsAccessibleInLogicalDecoding deal with the index case
- takes care of the padding where the new bool is added
- converts this new bool to a flag for the xl_heap_visible case (adding a new bit to the already existing flags)
- adds my proposed wording above to the commit message
- adds your proposed wording above in a README

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Tue, Dec 13, 2022 at 11:37 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
> > It seems kind of unfortunate to have to add payload to a whole bevy of
> > record types for this feature. I think it's worth it, both because the
> > feature is extremely important,
>
> Agree, and I don't think that there is any other option than adding some payload in some WAL records (at the very
> beginning the proposal was to periodically log a new record
> that announces the current catalog xmin horizon).

Hmm, why did we abandon that approach? It sounds like it has some promise.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

On 12/13/22 5:42 PM, Robert Haas wrote:
> On Tue, Dec 13, 2022 at 11:37 AM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>> It seems kind of unfortunate to have to add payload to a whole bevy of
>>> record types for this feature. I think it's worth it, both because the
>>> feature is extremely important,
>>
>> Agree, and I don't think there is any other option than adding some payload in some WAL records (at the very beginning the proposal was to periodically log a new record that announces the current catalog xmin horizon).
> 
> Hmm, why did we abandon that approach? It sounds like it has some promise.
> 

I should have put the reference to the discussion up-thread, it's in [1].

[1]: https://www.postgresql.org/message-id/flat/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Tue, Dec 13, 2022 at 11:46 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
> >> Agree, and I don't think there is any other option than adding some payload in some WAL records (at the very beginning the proposal was to periodically log a new record that announces the current catalog xmin horizon).
> >
> > Hmm, why did we abandon that approach? It sounds like it has some promise.
>
> I should have put the reference to the discussion up-thread, it's in [1].
>
> [1]: https://www.postgresql.org/message-id/flat/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de

Gotcha, thanks.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:

On 12/13/22 5:49 PM, Robert Haas wrote:
> On Tue, Dec 13, 2022 at 11:46 AM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>>> Agree, and I don't think there is any other option than adding some payload in some WAL records (at the very beginning the proposal was to periodically log a new record that announces the current catalog xmin horizon).
>>>
>>> Hmm, why did we abandon that approach? It sounds like it has some promise.
>>
>> I should have put the reference to the discussion up-thread, it's in [1].
>>
>> [1]: https://www.postgresql.org/message-id/flat/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de
> 
> Gotcha, thanks.
> 

You're welcome!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/13/22 5:37 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 12/13/22 2:50 PM, Robert Haas wrote:
>> On Tue, Dec 13, 2022 at 5:49 AM Drouvot, Bertrand
> 
>> It seems kind of unfortunate to have to add payload to a whole bevy of
>> record types for this feature. I think it's worth it, both because the
>> feature is extremely important,
> 
> Agree, and I don't think there is any other option than adding some payload in some WAL records (at the very beginning the proposal was to periodically log a new record that announces the current catalog xmin horizon).
> 
>> and also because there aren't any
>> record types that fall into this category that are going to be emitted
>> so frequently as to make it a performance problem. 
> 
> +1
> 
> If no objections from your side, I'll submit a patch proposal by tomorrow, which:
> 
> - get rid of IndexIsAccessibleInLogicalDecoding
> - let RelationIsAccessibleInLogicalDecoding deal with the index case
> - takes care of the padding where the new bool is added
> - convert this new bool to a flag for the xl_heap_visible case (adding a new bit to the already existing flag)
> - Add my proposed wording above to the commit message
> - Add your proposed wording above in a README

Please find attached v31 with the changes mentioned above (except that I put your wording into the commit message instead of a README: I think it helps to make clear what the "design" for the patch series is).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Wed, Dec 14, 2022 at 8:06 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
> Please find attached v31 with the changes mentioned above (except that I put your wording into the commit message instead of a README: I think it helps to make clear what the "design" for the patch series is).

Thanks, I think that's good clarification.

I read through 0001 again and I noticed this:

 typedef struct xl_heap_prune
 {
     TransactionId snapshotConflictHorizon;
     uint16      nredirected;
     uint16      ndead;
+    bool        onCatalogAccessibleInLogicalDecoding;
     /* OFFSET NUMBERS are in the block reference 0 */
 } xl_heap_prune;

I think this is unsafe on alignment-picky machines. I think it will
cause the offset numbers to be aligned at an odd address.
heap_xlog_prune() doesn't copy the data into aligned memory, so I
think this will result in a misaligned pointer being passed down to
heap_page_prune_execute.

I wonder what the best fix is here. We could (1) have
heap_page_prune_execute copy the data into a newly-palloc'd chunk,
which seems kind of sad on performance grounds, or we could (2) just
make the field here two bytes, or add a second byte as padding, but
that bloats the WAL slightly, or we could (3) try to steal a bit from
nredirected or ndead, if we think that we don't need all the bits. It
seems like the maximum block size is 32kB right now, which means
MaxOffsetNumber can't, I think, be more than 16kB. So maybe we could
think of replacing nredirected and ndead with uint32 flags and then
have accessor macros.
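
For illustration, option (3) might look roughly like the sketch below (the macro and accessor names are invented here, not taken from any patch): the flag is folded into the high bit of one of the existing uint16 counters, so the struct size stays a multiple of sizeof(OffsetNumber).

    typedef struct xl_heap_prune
    {
        TransactionId snapshotConflictHorizon;
        uint16      nredirected;
        uint16      ndead;      /* high bit doubles as the catalog-rel flag */
        /* OFFSET NUMBERS are in the block reference 0 */
    } xl_heap_prune;

    #define XLHP_IS_CATALOG_REL             0x8000
    #define XLHP_NDEAD_MASK                 0x7FFF
    #define XLHeapPruneNDead(xlrec)         ((xlrec)->ndead & XLHP_NDEAD_MASK)
    #define XLHeapPruneIsCatalogRel(xlrec)  (((xlrec)->ndead & XLHP_IS_CATALOG_REL) != 0)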

But it looks like we also have a bunch of similar issues elsewhere.
gistxlogDelete looks like it has the same problem. gistxlogPageReuse
is OK because there's no data afterwards. xl_hash_vacuum_one_page
looks like it has the same problem. So does xl_heap_prune.
xl_heap_freeze_page also has the issue: heap_xlog_freeze_page does
memcpy the plans, but not the offsets, and even for the plans, I think
for correctness we would need to treat the "plans" pointer as a void *
or char * because the pointer might be unaligned and the compiler, not
knowing that, could do bad things.

xl_btree_reuse_page is OK because no data follows the main record.
xl_btree_delete appears to have this problem if you just look at the
comments, because it says that offset numbers follow, and thus are
probably presumed aligned. However, they don't really follow, because
commit d2e5e20e57111cca9e14f6e5a99a186d4c66a5b7 moved the data from
the main data to the registered buffer data. However, AIUI, you're not
really supposed to assume that the registered buffer data is aligned.
I think this just happens to work because the length of the main
record is a multiple of the relevant small integer, and the size of
the block header is 4, so the buffer data ends up being accidentally
aligned. That might be worth fixing somehow independently of this
issue.

spgxlogVacuumRedirect is OK because the offsets array is part of the
struct, using FLEXIBLE_ARRAY_MEMBER, which will cause the offsets
field to be aligned properly. It means inserting a padding byte, but
it's not broken. If we don't mind adding padding bytes in some of the
other cases, we could potentially make use of this technique
elsewhere, I think.
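
For illustration, the reason that pattern is safe is that offsetof()/sizeof() then account for any padding the compiler inserts before the array, so redo code that computes the array start from the struct gets an aligned pointer. A minimal sketch (a hypothetical record type, not spgxlog.h verbatim):

    typedef struct xl_example_delete
    {
        TransactionId snapshotConflictHorizon;
        uint16      ntodelete;
        bool        isCatalogRel;
        /* target OFFSET NUMBERS follow; the compiler aligns the array start */
        OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
    } xl_example_delete;

    #define SizeOfExampleDelete offsetof(xl_example_delete, offsets)

    /*
     * Redo code can then safely compute the array start as
     *     (OffsetNumber *) ((char *) xlrec + SizeOfExampleDelete);
     */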

Other comments:

+    if RelationIsAccessibleInLogicalDecoding(rel)
+        xlrec.flags |= VISIBILITYMAP_ON_CATALOG_ACCESSIBLE_IN_LOGICAL_DECODING;

This is a few parentheses short of where it should be. Hilariously it
still compiles because there are parentheses in the macro definition.
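
That is, presumably the intent was:

    if (RelationIsAccessibleInLogicalDecoding(rel))
        xlrec.flags |= VISIBILITYMAP_ON_CATALOG_ACCESSIBLE_IN_LOGICAL_DECODING;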

+            xlrec.onCatalogAccessibleInLogicalDecoding =
RelationIsAccessibleInLogicalDecoding(relation);

These lines are quite long. I think we should consider (1) picking a
shorter name for the xlrec field and, if such lines are still going to
routinely exceed 80 characters, (2) splitting them into two
lines, with the second one indented to match pgindent's preferences in
such cases, which I think is something like this:

xlrec.onCatalogAccessibleInLogicalDecoding =
    RelationIsAccessibleInLogicalDecoding(relation);

As far as renaming, I think we could at least remove onCatalog part
from the identifier, as that doesn't seem to be adding much. And maybe
we could even think of changing it to something like
logicalDecodingConflict or even decodingConflict, which would shave
off a bunch more characters.

+    if (heapRelation->rd_options)
+        isusercatalog = ((StdRdOptions *)
(heapRelation)->rd_options)->user_catalog_table;

Couldn't you get rid of the if statement here and also the
initialization at the top of the function and just write isusercatalog
= RelationIsUsedAsCatalogTable(heapRelation)? Or even just get rid of
the variable entirely and pass
RelationIsUsedAsCatalogTable(heapRelation) as the argument to
UpdateIndexRelation directly?

I think this could use some test cases demonstrating that
indisusercatalog gets set correctly in all the relevant cases: table
is created with user_catalog_table = true/false, reloption is changed,
reloptions are reset, new index is added later, etc.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2022-12-14 10:55:31 -0500, Robert Haas wrote:
> I read through 0001 again and I noticed this:
> 
>  typedef struct xl_heap_prune
>  {
>      TransactionId snapshotConflictHorizon;
>      uint16      nredirected;
>      uint16      ndead;
> +    bool        onCatalogAccessibleInLogicalDecoding;
>      /* OFFSET NUMBERS are in the block reference 0 */
>  } xl_heap_prune;
> 
> I think this is unsafe on alignment-picky machines. I think it will
> cause the offset numbers to be aligned at an odd address.
> heap_xlog_prune() doesn't copy the data into aligned memory, so I
> think this will result in a misaligned pointer being passed down to
> heap_page_prune_execute.

I think the offset numbers are stored separately from the record, even
though it doesn't quite look like that in the above due to the way the
'OFFSET NUMBERS' is embedded in the struct. As they're stored with the
block reference 0, the added boolean shouldn't make a difference
alignment wise?

Or am I misunderstanding your point?

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Wed, Dec 14, 2022 at 12:35 PM Andres Freund <andres@anarazel.de> wrote:
> >  typedef struct xl_heap_prune
> >
> > I think this is unsafe on alignment-picky machines. I think it will
> > cause the offset numbers to be aligned at an odd address.
> > heap_xlog_prune() doesn't copy the data into aligned memory, so I
> > think this will result in a misaligned pointer being passed down to
> > heap_page_prune_execute.
>
> I think the offset numbers are stored separately from the record, even
> though it doesn't quite look like that in the above due to the way the
> 'OFFSET NUMBERS' is embedded in the struct. As they're stored with the
> block reference 0, the added boolean shouldn't make a difference
> alignment wise?
>
> Or am I misunderstanding your point?

Oh, you're right. So this is another case similar to
xl_btree_reuse_page. In heap_xlog_prune(), we access the offset number
data like this:

                redirected = (OffsetNumber *)
                    XLogRecGetBlockData(record, 0, &datalen);
                end = (OffsetNumber *) ((char *) redirected + datalen);
                nowdead = redirected + (nredirected * 2);
                nowunused = nowdead + ndead;
                nunused = (end - nowunused);
                heap_page_prune_execute(buffer,
                                        redirected, nredirected,
                                        nowdead, ndead,
                                        nowunused, nunused);

This is only safe if the return value of XLogRecGetBlockData is
guaranteed to be properly aligned, and I think that there is no such
guarantee in general. I think that it happens to come out properly
aligned because both the main body of the record (xl_heap_prune) and
the length of a block header (XLogRecordBlockHeader) happen to be
sufficiently aligned. But we just recently had discussion about trying
to make WAL records smaller by various means, and some of those
proposals involved changing the length of XLogRecordBlockHeader. And
the patch proposed here increases SizeOfHeapPrune by 1. So I think
with the patch, the offset number array would become misaligned.

No?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/14/22 6:48 PM, Robert Haas wrote:
> On Wed, Dec 14, 2022 at 12:35 PM Andres Freund <andres@anarazel.de> wrote:
>>>   typedef struct xl_heap_prune
>>>
>>> I think this is unsafe on alignment-picky machines. I think it will
>>> cause the offset numbers to be aligned at an odd address.
>>> heap_xlog_prune() doesn't copy the data into aligned memory, so I
>>> think this will result in a misaligned pointer being passed down to
>>> heap_page_prune_execute.
>>
>> I think the offset numbers are stored separately from the record, even
>> though it doesn't quite look like that in the above due to the way the
>> 'OFFSET NUMBERS' is embedded in the struct. As they're stored with the
>> block reference 0, the added boolean shouldn't make a difference
>> alignment wise?
>>
>> Or am I misunderstanding your point?
> 
> Oh, you're right. So this is another case similar to
> xl_btree_reuse_page. In heap_xlog_prune(), we access the offset number
> data like this:
> 
>                 redirected = (OffsetNumber *)
>                     XLogRecGetBlockData(record, 0, &datalen);
>                 end = (OffsetNumber *) ((char *) redirected + datalen);
>                 nowdead = redirected + (nredirected * 2);
>                 nowunused = nowdead + ndead;
>                 nunused = (end - nowunused);
>                 heap_page_prune_execute(buffer,
>                                         redirected, nredirected,
>                                         nowdead, ndead,
>                                         nowunused, nunused);
> 
> This is only safe if the return value of XLogRecGetBlockData is
> guaranteed to be properly aligned,

Why, could you please elaborate?

It looks to me that here we are "just" accessing the
members of the xl_heap_prune struct to get the numbers.

Then, the actual data will be read later in heap_page_prune_execute() from the buffer/page based on the numbers we got from xl_heap_prune.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/14/22 4:55 PM, Robert Haas wrote:
> On Wed, Dec 14, 2022 at 8:06 AM Drouvot, Bertrand wrote:
> Other comments:
> 
> +    if RelationIsAccessibleInLogicalDecoding(rel)
> +        xlrec.flags |= VISIBILITYMAP_ON_CATALOG_ACCESSIBLE_IN_LOGICAL_DECODING;
> 
> This is a few parentheses short of where it should be. Hilariously it
> still compiles because there are parentheses in the macro definition.

Oops, thanks will fix.

> 
> +            xlrec.onCatalogAccessibleInLogicalDecoding =
> RelationIsAccessibleInLogicalDecoding(relation);
> 
> These lines are quite long. I think we should consider (1) picking a
> shorter name for the xlrec field and, if such lines are still going to
> routinely exceed 80 characters, (2) splitting them into two
> lines, with the second one indented to match pgindent's preferences in
> such cases, which I think is something like this:
> 
> xlrec.onCatalogAccessibleInLogicalDecoding =
>      RelationIsAccessibleInLogicalDecoding(relation);
> 
> As far as renaming, I think we could at least remove onCatalog part
> from the identifier, as that doesn't seem to be adding much. And maybe
> we could even think of changing it to something like
> logicalDecodingConflict or even decodingConflict, which would shave
> off a bunch more characters.


I'm not sure I like the decodingConflict proposal. Indeed, it might be that there is no conflict (depending on the xids comparison).

What about "checkForConflict"?

> 
> +    if (heapRelation->rd_options)
> +        isusercatalog = ((StdRdOptions *)
> (heapRelation)->rd_options)->user_catalog_table;
>
> Couldn't you get rid of the if statement here and also the
> initialization at the top of the function and just write isusercatalog
> = RelationIsUsedAsCatalogTable(heapRelation)? Or even just get rid of
> the variable entirely and pass
> RelationIsUsedAsCatalogTable(heapRelation) as the argument to
> UpdateIndexRelation directly?
> 

Yeah, that's better, will do, thanks!

While at it, I'm not sure that isusercatalog should be visible in pg_index.
I mean, this information could be retrieved with a join on pg_class (on the table the index is linked to), hence the weirdness of having it visible.
I did not check how difficult it would be to make it "invisible" though.
What do you think?

> I think this could use some test cases demonstrating that
> indisusercatalog gets set correctly in all the relevant cases: table
> is created with user_catalog_table = true/false, reloption is changed,
> reloptions are reset, new index is added later, etc.
> 

v31 already provides a few checks:

- After index creation on relation with user_catalog_table = true
- Propagation is done correctly after a user_catalog_table RESET
- Propagation is done correctly after an ALTER SET user_catalog_table = true
- Propagation is done correctly after an ALTER SET user_catalog_table = false

In v32, I can add a check for index creation after each of the last 3 mentioned above and one when a table is created with user_catalog_table = false.

Having said that, we would need a function to retrieve the isusercatalog value should we make it invisible.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/15/22 11:31 AM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 12/14/22 4:55 PM, Robert Haas wrote:
>> On Wed, Dec 14, 2022 at 8:06 AM Drouvot, Bertrand wrote:
>> Other comments:
>>
>> +    if RelationIsAccessibleInLogicalDecoding(rel)
>> +        xlrec.flags |= VISIBILITYMAP_ON_CATALOG_ACCESSIBLE_IN_LOGICAL_DECODING;
>>
>> This is a few parentheses short of where it should be. Hilariously it
>> still compiles because there are parentheses in the macro definition.
> 
> Oops, thanks will fix.

Fixed in v32 attached.

> 
>>
>> +            xlrec.onCatalogAccessibleInLogicalDecoding =
>> RelationIsAccessibleInLogicalDecoding(relation);
>>
>> These lines are quite long. I think we should consider (1) picking a
>> shorter name for the xlrec field and, if such lines are still going to
>> routinely exceed 80 characters, (2) splitting them into two
>> lines, with the second one indented to match pgindent's preferences in
>> such cases, which I think is something like this:
>>
>> xlrec.onCatalogAccessibleInLogicalDecoding =
>>      RelationIsAccessibleInLogicalDecoding(relation);
>>
>> As far as renaming, I think we could at least remove onCatalog part
>> from the identifier, as that doesn't seem to be adding much. And maybe
>> we could even think of changing it to something like
>> logicalDecodingConflict or even decodingConflict, which would shave
>> off a bunch more characters.
> 
> 
> I'm not sure I like the decodingConflict proposal. Indeed, it might be that there is no conflict (depending on the xids comparison).
> 
> What about "checkForConflict"?

In v32 attached, it is renamed to mayConflictInLogicalDecoding (I think it's important that it reflects
that it is linked to the logical decoding and the "uncertainty" of the conflict). What do you think?

> 
>>
>> +    if (heapRelation->rd_options)
>> +        isusercatalog = ((StdRdOptions *)
>> (heapRelation)->rd_options)->user_catalog_table;
>>
>> Couldn't you get rid of the if statement here and also the
>> initialization at the top of the function and just write isusercatalog
>> = RelationIsUsedAsCatalogTable(heapRelation)? Or even just get rid of
>> the variable entirely and pass
>> RelationIsUsedAsCatalogTable(heapRelation) as the argument to
>> UpdateIndexRelation directly?
>>
> 
> Yeah, that's better, will do, thanks!

Fixed in v32 attached.

> 
> While at it, I'm not sure that isusercatalog should be visible in pg_index.
> I mean, this information could be retrieved with a join on pg_class (on the table the index is linked to), hence the weirdness of having it visible.
> I did not check how difficult it would be to make it "invisible" though.
> What do you think?
> 

It's still visible in v32 attached.
I had a second thought on it and it does not seem like a "real" concern to me.


>> I think this could use some test cases demonstrating that
>> indisusercatalog gets set correctly in all the relevant cases: table
>> is created with user_catalog_table = true/false, reloption is changed,
>> reloptions are reset, new index is added later, etc.
>>
> 
> v31 already provides a few checks:
> 
> - After index creation on relation with user_catalog_table = true
> - Propagation is done correctly after a user_catalog_table RESET
> - Propagation is done correctly after an ALTER SET user_catalog_table = true
> - Propagation is done correctly after an ALTER SET user_catalog_table = false
> 
> In v32, I can add a check for index creation after each of the last 3 mentioned above and one when a table is created with user_catalog_table = false.
> 

v32 attached is adding the checks mentioned above.

v32 does not change anything linked to the alignment discussion, as I think this will depend on its outcome.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2022-12-16 11:33:50 +0100, Drouvot, Bertrand wrote:
> @@ -235,13 +236,14 @@ typedef struct xl_btree_delete
>      TransactionId snapshotConflictHorizon;
>      uint16        ndeleted;
>      uint16        nupdated;
> +    bool        mayConflictInLogicalDecoding;

After 1489b1ce728 the name mayConflictInLogicalDecoding seems odd. Seems
it should be a riff on snapshotConflictHorizon?

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/16/22 2:51 PM, Andres Freund wrote:
> Hi,
> 
> On 2022-12-16 11:33:50 +0100, Drouvot, Bertrand wrote:
>> @@ -235,13 +236,14 @@ typedef struct xl_btree_delete
>>       TransactionId snapshotConflictHorizon;
>>       uint16        ndeleted;
>>       uint16        nupdated;
>> +    bool        mayConflictInLogicalDecoding;
> 
> After 1489b1ce728 the name mayConflictInLogicalDecoding seems odd. Seems
> it should be a riff on snapshotConflictHorizon?
> 

Gotcha, what about logicalSnapshotConflictThreat?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Fri, Dec 16, 2022 at 10:08 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
> > After 1489b1ce728 the name mayConflictInLogicalDecoding seems odd. Seems
> > it should be a riff on snapshotConflictHorizon?
>
> Gotcha, what about logicalSnapshotConflictThreat?

logicalConflictPossible? checkDecodingConflict?

I think we should try to keep this to three words if we can. There's
not likely to be enough value in a fourth word to make up for the
downside of being more verbose.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/16/22 5:38 PM, Robert Haas wrote:
> On Fri, Dec 16, 2022 at 10:08 AM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>> After 1489b1ce728 the name mayConflictInLogicalDecoding seems odd. Seems
>>> it should be a riff on snapshotConflictHorizon?
>>
>> Gotcha, what about logicalSnapshotConflictThreat?
> 
> logicalConflictPossible? checkDecodingConflict?
> 
> I think we should try to keep this to three words if we can. There's
> not likely to be enough value in a fourth word to make up for the
> downside of being more verbose.
> 


Yeah agree, I'd vote for logicalConflictPossible then.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/16/22 6:24 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 12/16/22 5:38 PM, Robert Haas wrote:
>> On Fri, Dec 16, 2022 at 10:08 AM Drouvot, Bertrand
>> <bertranddrouvot.pg@gmail.com> wrote:
>>>> After 1489b1ce728 the name mayConflictInLogicalDecoding seems odd. Seems
>>>> it should be a riff on snapshotConflictHorizon?
>>>
>>> Gotcha, what about logicalSnapshotConflictThreat?
>>
>> logicalConflictPossible? checkDecodingConflict?
>>
>> I think we should try to keep this to three words if we can. There's
>> not likely to be enough value in a fourth word to make up for the
>> downside of being more verbose.
>>
> 
> 
> Yeah agree, I'd vote for logicalConflictPossible then.
> 

Please find attached v33 using logicalConflictPossible as the new field name instead of mayConflictInLogicalDecoding.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
On 2022-12-16 11:38:33 -0500, Robert Haas wrote:
> On Fri, Dec 16, 2022 at 10:08 AM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> > > After 1489b1ce728 the name mayConflictInLogicalDecoding seems odd. Seems
> > > it should be a riff on snapshotConflictHorizon?
> >
> > Gotcha, what about logicalSnapshotConflictThreat?
> 
> logicalConflictPossible? checkDecodingConflict?
> 
> I think we should try to keep this to three words if we can. There's
> not likely to be enough value in a fourth word to make up for the
> downside of being more verbose.

I don't understand what the "may*" or "*Possible" really are
about. snapshotConflictHorizon is a conflict with a certain xid - there
commonly won't be anything to conflict with. If there's a conflict in
the logical-decoding-on-standby case, we won't be able to apply it only
sometimes or such.

How about "affectsLogicalDecoding", "conflictsWithSlots" or
"isCatalogRel" or such?

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Wed, Dec 14, 2022 at 12:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
> No?

Nope, I was wrong. The block reference data is stored in the WAL
record *before* the main data, so it was wrong to imagine (as I did)
that the alignment of the main data would affect the alignment of the
block data. If anything, it's the other way around. That means that
the only records where this patch could conceivably cause a problem
are those where something is stored in the main data after the main
struct. And there aren't many of those, because an awful lot of record
types have moved to using the block data.

I'm going to go through all the record types one by one before
commenting further.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Tue, Dec 20, 2022 at 1:19 PM Andres Freund <andres@anarazel.de> wrote:
> I don't understand what the "may*" or "*Possible" really are
> about. snapshotConflictHorizon is a conflict with a certain xid - there
> commonly won't be anything to conflict with. If there's a conflict in
> the logical-decoding-on-standby case, we won't be able to apply it only
> sometimes or such.

The way I was imagining it is that snapshotConflictHorizon tells us
whether there is a conflict with this record and then, if there is,
this new Boolean tells us whether it's relevant to logical decoding as
well.

> How about "affectsLogicalDecoding", "conflictsWithSlots" or
> "isCatalogRel" or such?

isCatalogRel seems fine to me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/20/22 7:31 PM, Robert Haas wrote:
> On Tue, Dec 20, 2022 at 1:19 PM Andres Freund <andres@anarazel.de> wrote:
>> I don't understand what the "may*" or "*Possible" really are
>> about. snapshotConflictHorizon is a conflict with a certain xid - there
>> commonly won't be anything to conflict with. If there's a conflict in
>> the logical-decoding-on-standby case, we won't be able to apply it only
>> sometimes or such.
> 
> The way I was imagining it is that snapshotConflictHorizon tells us
> whether there is a conflict with this record and then, if there is,
> this new Boolean tells us whether it's relevant to logical decoding as
> well.
> 

the "may*" or "*Possible" was, most probably, because I preferred to have the uncertainty of the conflict mentioned in
thename.
 
But, somehow, I was forgetting about the relationship with snapshotConflictHorizon.
So, agree with both of you that mentioning about the uncertainty of the conflict is useless.

>> How about "affectsLogicalDecoding", "conflictsWithSlots" or
>> "isCatalogRel" or such?
> 
> isCatalogRel seems fine to me.

For me too, please find attached v34 using it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Tue, Dec 20, 2022 at 1:25 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I'm going to go through all the record types one by one before
> commenting further.

OK, so xl_hash_vacuum_one_page, at least, is a live issue. To reproduce:

./configure <whatever your usual options are>
echo 'COPT=-fsanitize=alignment -fno-sanitize-recover=all' > src/Makefile.custom
make -j8
make install
initdb
postgres

Then in another window:

pg_basebackup -D pgstandby -R
# edit postgresql.conf, change port number
postgres -D pgstandby

Then in a third window, using the attached files:

pgbench -i -s 10
psql -f kpt_hash_setup.sql
pgbench -T 10 -c 4 -j 4 -f kpt_hash_pgbench.sql

With the patch, the standby falls over:

bufpage.c:1194:31: runtime error: load of misaligned address
0x7fa62f05d119 for type 'OffsetNumber' (aka 'unsigned short'), which
requires 2 byte alignment
0x7fa62f05d119: note: pointer points here
 00 00 00  00 e5 00 8f 00 00 00 00  87 00 ab 20 20 20 20 20  20 20 20
20 20 20 20 20  20 20 20 20 20
              ^

Without the patch, this doesn't occur.

I think this might be the only WAL record type where there's a
problem, but I haven't fully confirmed that yet.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Tue, Dec 20, 2022 at 3:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think this might be the only WAL record type where there's a
> problem, but I haven't fully confirmed that yet.

It's not. GIST has the same issue. The same test case demonstrates the
problem there, if you substitute this test script for
kpt_hash_setup.sql and possibly also run it for somewhat longer. One
might think that this wouldn't be a problem, because the comments for
gistxlogDelete say this:

     /*
      * In payload of blk 0 : todelete OffsetNumbers
      */

But it's not in the payload of blk 0. It follows the main payload.

This is the reverse of xl_heap_freeze_page, which claims that freeze
plans and offset numbers follow, but they don't: they're in the data
for block 0. xl_btree_delete is also wrong, claiming that the data
follows when it's really attached to block 0. I guess whatever else we
do here, we should fix the comments.

Bottom line is that I think the two cases that have alignment issues
as coded are xl_hash_vacuum_one_page and gistxlogDelete. Everything
else is OK, as far as I can tell right now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/20/22 10:41 PM, Robert Haas wrote:
> On Tue, Dec 20, 2022 at 3:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> I think this might be the only WAL record type where there's a
>> problem, but I haven't fully confirmed that yet.
> 
> It's not. GIST has the same issue. The same test case demonstrates the
> problem there, if you substitute this test script for
> kpt_hash_setup.sql and possibly also run it for somewhat longer. One
> might think that this wouldn't be a problem, because the comments for
> gistxlogDelete say this:
> 
>       /*
>        * In payload of blk 0 : todelete OffsetNumbers
>        */
> 
> But it's not in the payload of blk 0. It follows the main payload.

Oh right, nice catch!

Indeed, we can see in gistRedoDeleteRecord():

"
todelete = (OffsetNumber *) ((char *) xldata + SizeOfGistxlogDelete);
"

> 
> This is the reverse of xl_heap_freeze_page, which claims that freeze
> plans and offset numbers follow, but they don't: they're in the data
> for block 0. 

oh right, we can see in heap_xlog_freeze_page():

"
        plans = (xl_heap_freeze_plan *) XLogRecGetBlockData(record, 0, NULL);
        offsets = (OffsetNumber *) ((char *) plans +
                                    (xlrec->nplans *
                                     sizeof(xl_heap_freeze_plan)));
"


> xl_btree_delete is also wrong, claiming that the data
> follows when it's really attached to block 0. 


oh right, we can see in btree_xlog_delete():

"
        char       *ptr = XLogRecGetBlockData(record, 0, NULL);

        page = (Page) BufferGetPage(buffer);

        if (xlrec->nupdated > 0)
        {
            OffsetNumber *updatedoffsets;
            xl_btree_update *updates;

            updatedoffsets = (OffsetNumber *)
                (ptr + xlrec->ndeleted * sizeof(OffsetNumber));
            updates = (xl_btree_update *) ((char *) updatedoffsets +
                                           xlrec->nupdated *
                                           sizeof(OffsetNumber));
"



> I guess whatever else we
> do here, we should fix the comments.
> 

Agree, please find attached a patch proposal doing so.


> Bottom line is that I think the two cases that have alignment issues
> as coded are xl_hash_vacuum_one_page and gistxlogDelete. Everything
> else is OK, as far as I can tell right now.
> 

Thanks a lot for the repro(s) and explanations! That's very useful/helpful.

Based on your discovery about the wrong comments above, I'm now tempted to fix those 2 alignment issues
by using a FLEXIBLE_ARRAY_MEMBER within those structs (as you proposed in [1]) (as that should also prevent
any possible wrong comments about where the array is located).
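
For gistxlogDelete, for instance, the fix would look roughly like the sketch below (field comments and the exact SizeOf* definition are just an illustration, not the final patch):

    typedef struct gistxlogDelete
    {
        TransactionId snapshotConflictHorizon;
        uint16      ntodelete;      /* number of deleted offsets */
        bool        isCatalogRel;   /* catalog rel, for logical decoding on standby */

        /* TODELETE OFFSET NUMBERS */
        OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
    } gistxlogDelete;

    #define SizeOfGistxlogDelete    (offsetof(gistxlogDelete, offsets))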

What do you think?

[1]: https://www.postgresql.org/message-id/CA%2BTgmoaVcu_mbxbH%3DEccvKG6u8%2BMdQf9zx98uAL9zsStFwrYsQ%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/21/22 10:06 AM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 12/20/22 10:41 PM, Robert Haas wrote:
>> On Tue, Dec 20, 2022 at 3:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> I guess whatever else we
>> do here, we should fix the comments.
>>
> 
> Agree, please find attached a patch proposal doing so.
> 
> 
>> Bottom line is that I think the two cases that have alignment issues
>> as coded are xl_hash_vacuum_one_page and gistxlogDelete. Everything
>> else is OK, as far as I can tell right now.
>>
> 
> Thanks a lot for the repro(s) and explanations! That's very useful/helpful.
> 
> Based on your discovery about the wrong comments above, I'm now tempted to fix those 2 alignment issues
> by using a FLEXIBLE_ARRAY_MEMBER within those structs (as you proposed in [1]) (as that should also prevent
> any possible wrong comments about where the array is located).
> 
> What do you think?

As mentioned above, it looks to me that making use of a FLEXIBLE_ARRAY_MEMBER is a good choice.
So, please find attached v35 making use of a FLEXIBLE_ARRAY_MEMBER in xl_hash_vacuum_one_page and gistxlogDelete (your 2 repros are not failing anymore).
I've also added a few words in the commit message in 0001 about it.

So, we end up with:

(gdb) ptype /o struct xl_hash_vacuum_one_page
/* offset      |    size */  type = struct xl_hash_vacuum_one_page {
/*      0      |       4 */    TransactionId snapshotConflictHorizon;
/*      4      |       4 */    int ntuples;
/*      8      |       1 */    _Bool isCatalogRel;
/* XXX  1-byte hole      */
/*     10      |       0 */    OffsetNumber offsets[];
/* XXX  2-byte padding   */

                                /* total size (bytes):   12 */
                              }

(gdb) ptype /o struct gistxlogDelete
/* offset      |    size */  type = struct gistxlogDelete {
/*      0      |       4 */    TransactionId snapshotConflictHorizon;
/*      4      |       2 */    uint16 ntodelete;
/*      6      |       1 */    _Bool isCatalogRel;
/* XXX  1-byte hole      */
/*      8      |       0 */    OffsetNumber offsets[];

                                /* total size (bytes):    8 */
                              }

While looking at it, I've a question: xl_hash_vacuum_one_page.ntuples is an int, do you see any reason why it is not an uint16? (We would get rid of 4 bytes in the struct.)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 12/22/22 8:50 AM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 12/21/22 10:06 AM, Drouvot, Bertrand wrote:
>> Hi,
>>
>> On 12/20/22 10:41 PM, Robert Haas wrote:
>>> On Tue, Dec 20, 2022 at 3:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
>>> I guess whatever else we
>>> do here, we should fix the comments.
>>>
>>
>> Agree, please find attached a patch proposal doing so.
>>
>>
>>> Bottom line is that I think the two cases that have alignment issues
>>> as coded are xl_hash_vacuum_one_page and gistxlogDelete. Everything
>>> else is OK, as far as I can tell right now.
>>>
>>
>> Thanks a lot for the repro(s) and explanations! That's very useful/helpful.
>>
>> Based on your discovery about the wrong comments above, I'm now tempted to fix those 2 alignment issues
>> by using a FLEXIBLE_ARRAY_MEMBER within those structs (as you proposed in [1]) (as that should also prevent
>> any possible wrong comments about where the array is located).
>>
>> What do you think?
> 
> As mentioned above, it looks to me that making use of a FLEXIBLE_ARRAY_MEMBER is a good choice.
> So, please find attached v35 making use of a FLEXIBLE_ARRAY_MEMBER in xl_hash_vacuum_one_page and gistxlogDelete (your 2 repros are not failing anymore).
> I've also added a few words in the commit message in 0001 about it.
> 
> So, we end up with:
> 
> (gdb) ptype /o struct xl_hash_vacuum_one_page
> /* offset      |    size */  type = struct xl_hash_vacuum_one_page {
> /*      0      |       4 */    TransactionId snapshotConflictHorizon;
> /*      4      |       4 */    int ntuples;
> /*      8      |       1 */    _Bool isCatalogRel;
> /* XXX  1-byte hole      */
> /*     10      |       0 */    OffsetNumber offsets[];
> /* XXX  2-byte padding   */
> 
>                                 /* total size (bytes):   12 */
>                               }
> 
> (gdb) ptype /o struct gistxlogDelete
> /* offset      |    size */  type = struct gistxlogDelete {
> /*      0      |       4 */    TransactionId snapshotConflictHorizon;
> /*      4      |       2 */    uint16 ntodelete;
> /*      6      |       1 */    _Bool isCatalogRel;
> /* XXX  1-byte hole      */
> /*      8      |       0 */    OffsetNumber offsets[];
> 
>                                 /* total size (bytes):    8 */
>                               }
> 
> While looking at it, I've a question: xl_hash_vacuum_one_page.ntuples is an int, do you see any reason why it is not an uint16? (We would get rid of 4 bytes in the struct.)
> 

Please find attached v36, tiny rebase due to 1de58df4fe.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Tue, Jan 3, 2023 at 2:42 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
> Please find attached v36, tiny rebase due to 1de58df4fe.

0001 looks committable to me now, though we probably shouldn't do that
unless we're pretty confident about shipping enough of the rest of
this to accomplish something useful.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

Thomas, there's one point at the bottom wrt ConditionVariables that'd be
interesting for you to comment on.


On 2023-01-05 16:15:39 -0500, Robert Haas wrote:
> On Tue, Jan 3, 2023 at 2:42 AM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> > Please find attached v36, tiny rebase due to 1de58df4fe.
>
> 0001 looks committable to me now, though we probably shouldn't do that
> unless we're pretty confident about shipping enough of the rest of
> this to accomplish something useful.

Cool!

ISTM that the ordering of patches isn't quite right later on. ISTM that it
doesn't make sense to introduce working logical decoding without first fixing
WalSndWaitForWal() (i.e. patch 0006). What made you order the patches that
way?

0001:
> 4. We can't rely on the standby's relcache entries for this purpose in
> any way, because the WAL record that causes the problem might be
> replayed before the standby even reaches consistency.

The startup process can't access catalog contents in the first place, so the
consistency issue is secondary.


ISTM that the commit message omits a fairly significant portion of the change:
The introduction of indisusercatalog / the reason for its introduction.

Why is indisusercatalog stored as a "full" column, whereas we store the fact of
a table being used as a catalog table in a reloption? I'm not averse to moving
to a full column, but then I think we should do the same for tables.

Earlier versions of the patches IIRC sourced the "catalogness" from the
relation. What led you to change that? I'm not saying it's wrong, just not
sure it's right either.

It'd be good to introduce cross-checks that indisusercatalog is set
correctly. RelationGetIndexList() seems like a good candidate.
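
To illustrate the kind of cross-check meant here, something like the hypothetical assertion below (variable names are placeholders; where exactly it would live is up to the patch):

    /*
     * Hypothetical sanity check: an index's new indisusercatalog flag
     * should always agree with the user_catalog_table reloption of the
     * table it belongs to.
     */
    Assert(indexForm->indisusercatalog ==
           RelationIsUsedAsCatalogTable(heapRelation));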

I'd probably split the introduction of indisusercatalog into a separate patch.

Why was HEAP_DEFAULT_USER_CATALOG_TABLE introduced in this patch?


I wonder if we instead should compute a relation's "catalogness" in the
relcache. That'd have the advantage of making
RelationIsUsedAsCatalogTable() cheaper and working for all kinds of
relations.


VISIBILITYMAP_ON_CATALOG_ACCESSIBLE_IN_LOGICAL_DECODING is a very long
identifier. Given that the field in the xlog records is just named
isCatalogRel, any reason to not just name it correspondingly?


0002:


> +/*
> + * Helper for InvalidateConflictingLogicalReplicationSlot -- acquires the given slot
> + * and mark it invalid, if necessary and possible.
> + *
> + * Returns whether ReplicationSlotControlLock was released in the interim (and
> + * in that case we're not holding the lock at return, otherwise we are).
> + *
> + * This is inherently racy, because we release the LWLock
> + * for syscalls, so caller must restart if we return true.
> + */
> +static bool
> +InvalidatePossiblyConflictingLogicalReplicationSlot(ReplicationSlot *s, TransactionId xid)

This appears to be a near complete copy of InvalidatePossiblyObsoleteSlot(). I
don't think we should have two versions of that non-trivial code. Seems we
could just have an additional argument for InvalidatePossiblyObsoleteSlot()?


> +            ereport(LOG,
> +                    (errmsg("invalidating slot \"%s\" because it conflicts with recovery", NameStr(slotname))));
> +

I think this should report more details, similar to what
InvalidateObsoleteReplicationSlots() does.


> --- a/src/backend/replication/logical/logicalfuncs.c
> +++ b/src/backend/replication/logical/logicalfuncs.c
> @@ -216,11 +216,14 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
>
>          /*
>           * After the sanity checks in CreateDecodingContext, make sure the
> -         * restart_lsn is valid.  Avoid "cannot get changes" wording in this
> +         * restart_lsn is valid or both xmin and catalog_xmin are valid.
> +         * Avoid "cannot get changes" wording in this
>           * errmsg because that'd be confusingly ambiguous about no changes
>           * being available.
>           */
> -        if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
> +        if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn)
> +            || (!TransactionIdIsValid(MyReplicationSlot->data.xmin)
> +                && !TransactionIdIsValid(MyReplicationSlot->data.catalog_xmin)))
>              ereport(ERROR,
>                      (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                       errmsg("can no longer get changes from replication slot \"%s\"",

Hm. Feels like we should introduce a helper like SlotIsInvalidated() instead
of having this condition in a bunch of places.
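
Something like the sketch below, perhaps (the helper name is just the one suggested above, and the condition simply combines the two quoted tests):

    static inline bool
    SlotIsInvalidated(ReplicationSlot *s)
    {
        return XLogRecPtrIsInvalid(s->data.restart_lsn) ||
            (!TransactionIdIsValid(s->data.xmin) &&
             !TransactionIdIsValid(s->data.catalog_xmin));
    }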


> +    if (!TransactionIdIsValid(MyReplicationSlot->data.xmin)
> +         && !TransactionIdIsValid(MyReplicationSlot->data.catalog_xmin))
> +        ereport(ERROR,
> +                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                 errmsg("cannot read from logical replication slot \"%s\"",
> +                        cmd->slotname),
> +                 errdetail("This slot has been invalidated because it was conflicting with recovery.")));
> +

This is a more precise error than the one in
pg_logical_slot_get_changes_guts().

I think both places should output the same error. ISTM that the relevant code
should be in CreateDecodingContext(). Imo the code to deal with the WAL
version of this has been misplaced...


> --- a/src/backend/storage/ipc/procarray.c
> +++ b/src/backend/storage/ipc/procarray.c
> @@ -3477,6 +3477,10 @@ SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
>
>          GET_VXID_FROM_PGPROC(procvxid, *proc);
>
> +        /*
> +         * Note: vxid.localTransactionId can be invalid, which means the
> +         * request is to signal the pid that is not running a transaction.
> +         */
>          if (procvxid.backendId == vxid.backendId &&
>              procvxid.localTransactionId == vxid.localTransactionId)
>          {

I can't really parse the comment.


> @@ -500,6 +502,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
>                                             PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
>                                             WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
>                                             true);
> +
> +    if (isCatalogRel)
> +        InvalidateConflictingLogicalReplicationSlots(locator.dbOid, snapshotConflictHorizon);
>  }

Might be worth checking if wal_level >= logical before the somewhat expensive
InvalidateConflictingLogicalReplicationSlots().
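
E.g., roughly (a sketch; whether the standby's wal_level should be tested here or earlier in the redo path is up to the patch):

    if (isCatalogRel && wal_level >= WAL_LEVEL_LOGICAL)
        InvalidateConflictingLogicalReplicationSlots(locator.dbOid,
                                                     snapshotConflictHorizon);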


> @@ -3051,6 +3054,25 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
>              case PROCSIG_RECOVERY_CONFLICT_LOCK:
>              case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
>              case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
> +            case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
> +                /*
> +                 * For conflicts that require a logical slot to be invalidated, the
> +                 * requirement is for the signal receiver to release the slot,
> +                 * so that it could be invalidated by the signal sender. So for
> +                 * normal backends, the transaction should be aborted, just
> +                 * like for other recovery conflicts. But if it's walsender on
> +                 * standby, then it has to be killed so as to release an
> +                 * acquired logical slot.
> +                 */
> +                if (am_cascading_walsender &&
> +                    reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
> +                    MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
> +                {
> +                    RecoveryConflictPending = true;
> +                    QueryCancelPending = true;
> +                    InterruptPending = true;
> +                    break;
> +                }
>
>                  /*
>                   * If we aren't in a transaction any longer then ignore.

Why does the walsender need to be killed? I think it might just be that
IsTransactionOrTransactionBlock() might return false, even though we want to
cancel. The code actually seems to only cancel (QueryCancelPending is set
rather than ProcDiePending), but the comment talks about killing?


0003:

> Allow a logical slot to be created on standby. Restrict its usage
> or its creation if wal_level on primary is less than logical.
> During slot creation, it's restart_lsn is set to the last replayed
> LSN. Effectively, a logical slot creation on standby waits for an
> xl_running_xact record to arrive from primary. Conflicting slots
> would be handled in next commits.

I think the commit message might be outdated, the next commit is a test.


> +            /*
> +             * Replay pointer may point one past the end of the record. If that
> +             * is a XLOG page boundary, it will not be a valid LSN for the
> +             * start of a record, so bump it up past the page header.
> +             */
> +            if (!XRecOffIsValid(restart_lsn))
> +            {
> +                if (restart_lsn % XLOG_BLCKSZ != 0)
> +                    elog(ERROR, "invalid replay pointer");
> +
> +                /* For the first page of a segment file, it's a long header */
> +                if (XLogSegmentOffset(restart_lsn, wal_segment_size) == 0)
> +                    restart_lsn += SizeOfXLogLongPHD;
> +                else
> +                    restart_lsn += SizeOfXLogShortPHD;
> +            }

Is this actually needed? Supposedly xlogreader can work just fine with an
address at the start of a page?

        /*
         * Caller supplied a position to start at.
         *
         * In this case, NextRecPtr should already be pointing either to a
         * valid record starting position or alternatively to the beginning of
         * a page. See the header comments for XLogBeginRead.
         */
        Assert(RecPtr % XLOG_BLCKSZ == 0 || XRecOffIsValid(RecPtr));


>      /*
> -     * Since logical decoding is only permitted on a primary server, we know
> -     * that the current timeline ID can't be changing any more. If we did this
> -     * on a standby, we'd have to worry about the values we compute here
> -     * becoming invalid due to a promotion or timeline change.
> +     * Since logical decoding is also permitted on a standby server, we need
> +     * to check if the server is in recovery to decide how to get the current
> +     * timeline ID (so that it also cover the promotion or timeline change cases).
>       */
> +    if (!RecoveryInProgress())
> +        currTLI = GetWALInsertionTimeLine();
> +    else
> +        GetXLogReplayRecPtr(&currTLI);
> +

This seems to remove some content from the !recovery case.

It's a bit odd that here RecoveryInProgress() is used, whereas further down
am_cascading_walsender is used.


> @@ -3074,10 +3078,12 @@ XLogSendLogical(void)
>       * If first time through in this session, initialize flushPtr.  Otherwise,
>       * we only need to update flushPtr if EndRecPtr is past it.
>       */
> -    if (flushPtr == InvalidXLogRecPtr)
> -        flushPtr = GetFlushRecPtr(NULL);
> -    else if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
> -        flushPtr = GetFlushRecPtr(NULL);
> +    if (flushPtr == InvalidXLogRecPtr ||
> +        logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
> +    {
> +        flushPtr = (am_cascading_walsender ?
> +                    GetStandbyFlushRecPtr(NULL) : GetFlushRecPtr(NULL));
> +    }
>
>      /* If EndRecPtr is still past our flushPtr, it means we caught up. */
>      if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)

A short if inside a normal if seems ugly to me.


0004:

> @@ -3037,6 +3037,43 @@ $SIG{TERM} = $SIG{INT} = sub {
>
>  =pod
>
> +=item $node->create_logical_slot_on_standby(self, master, slot_name, dbname)
> +
> +Create logical replication slot on given standby
> +
> +=cut
> +
> +sub create_logical_slot_on_standby
> +{

Any reason this has to be standby specific?


> +    # Now arrange for the xl_running_xacts record for which pg_recvlogical
> +    # is waiting.
> +    $master->safe_psql('postgres', 'CHECKPOINT');
> +

Hm, that's quite expensive. Perhaps worth adding a C helper that can do that
for us instead? This will likely also be needed in real applications after all.


> +    print "starting pg_recvlogical\n";

I don't think tests should just print somewhere. Either diag() or note()
should be used.


> +    if ($wait)
> +    # make sure activeslot is in use
> +    {
> +        $node_standby->poll_query_until('testdb',
> +            "SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'activeslot' AND active_pid IS NOT
NULL)"
> +        ) or die "slot never became active";
> +    }

That comment placement imo is quite odd.


> +# test if basic decoding works
> +is(scalar(my @foobar = split /^/m, $result),
> +    14, 'Decoding produced 14 rows');

Maybe mention that it's 2 transactions + 10 rows?


> +$node_primary->wait_for_catchup($node_standby, 'replay', $node_primary->lsn('flush'));

There's enough copies of this that I wonder if we shouldn't introduce a
Cluster.pm level helper for this.


> +print "waiting to replay $endpos\n";

See above.


> +my $stdout_recv = $node_standby->pg_recvlogical_upto(
> +    'testdb', 'activeslot', $endpos, 180,
> +    'include-xids'     => '0',
> +    'skip-empty-xacts' => '1');

I don't think this should use a hardcoded 180 but
$PostgreSQL::Test::Utils::timeout_default.


> +# One way to reproduce recovery conflict is to run VACUUM FULL with
> +# hot_standby_feedback turned off on the standby.
> +$node_standby->append_conf('postgresql.conf',q[
> +hot_standby_feedback = off
> +]);
> +$node_standby->restart;

IIRC a reload should suffice.


> +# This should trigger the conflict
> +$node_primary->safe_psql('testdb', 'VACUUM FULL');

Can we do something cheaper than rewriting the entire database? Seems
rewriting a single table ought to be sufficient?

I think it'd also be good to test that rewriting a non-catalog table doesn't
trigger an issue.


> +##################################################
> +# Recovery conflict: Invalidate conflicting slots, including in-use slots
> +# Scenario 2: conflict due to row removal with hot_standby_feedback off.
> +##################################################
> +
> +# get the position to search from in the standby logfile
> +my $logstart = -s $node_standby->logfile;
> +
> +# drop the logical slots
> +$node_standby->psql('postgres', q[SELECT pg_drop_replication_slot('inactiveslot')]);
> +$node_standby->psql('postgres', q[SELECT pg_drop_replication_slot('activeslot')]);
> +
> +create_logical_slots();
> +
> +# One way to produce recovery conflict is to create/drop a relation and launch a vacuum
> +# with hot_standby_feedback turned off on the standby.
> +$node_standby->append_conf('postgresql.conf',q[
> +hot_standby_feedback = off
> +]);
> +$node_standby->restart;
> +# ensure walreceiver feedback off by waiting for expected xmin and
> +# catalog_xmin on primary. Both should be NULL since hs_feedback is off
> +wait_for_xmins($node_primary, $primary_slotname,
> +               "xmin IS NULL AND catalog_xmin IS NULL");
> +
> +$handle = make_slot_active(1);

This is a fair bit of repeated setup, maybe put it into a function?


I think it'd be good to test the ongoing decoding via the SQL interface also
gets correctly handled. But it might be too hard to do reliably.


> +##################################################
> +# Test standby promotion and logical decoding behavior
> +# after the standby gets promoted.
> +##################################################
> +

I think this also should test the streaming / walsender case.



0006:

> diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
> index bc3c3eb3e7..98c96eb864 100644
> --- a/src/backend/access/transam/xlogrecovery.c
> +++ b/src/backend/access/transam/xlogrecovery.c
> @@ -358,6 +358,9 @@ typedef struct XLogRecoveryCtlData
>      RecoveryPauseState recoveryPauseState;
>      ConditionVariable recoveryNotPausedCV;
>
> +    /* Replay state (see getReplayedCV() for more explanation) */
> +    ConditionVariable replayedCV;
> +
>      slock_t        info_lck;        /* locks shared variables shown above */
>  } XLogRecoveryCtlData;
>

getReplayedCV() doesn't seem to fit into any of the naming schemes in use for
xlogrecovery.h.


> -         * Sleep until something happens or we time out.  Also wait for the
> -         * socket becoming writable, if there's still pending output.
> +         * When not in recovery, sleep until something happens or we time out.
> +         * Also wait for the socket becoming writable, if there's still pending output.

Hm. Is there a problem with not handling the becoming-writable case in the
in-recovery case?


> +        else
> +        /*
> +         * We are in the logical decoding on standby case.
> +         * We are waiting for the startup process to replay wal record(s) using
> +         * a timeout in case we are requested to stop.
> +         */
> +        {

I don't think pgindent will like that formatting....


> +            ConditionVariablePrepareToSleep(replayedCV);
> +            ConditionVariableTimedSleep(replayedCV, 1000,
> +                                        WAIT_EVENT_WAL_SENDER_WAIT_REPLAY);
> +        }

I think this is racy, see ConditionVariablePrepareToSleep()'s comment:

 * Caution: "before entering the loop" means you *must* test the exit
 * condition between calling ConditionVariablePrepareToSleep and calling
 * ConditionVariableSleep.  If that is inconvenient, omit calling
 * ConditionVariablePrepareToSleep.

Basically, the ConditionVariablePrepareToSleep() should be before the loop
body.


I don't think the fixed timeout here makes sense. For one, we need to wake up
based on WalSndComputeSleeptime(), otherwise we're ignoring wal_sender_timeout
(which can be quite small).  It's also just way too frequent - we're trying to
avoid constantly waking up unnecessarily.
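
Putting those two points together, I'd expect the shape to be roughly the
following (untested sketch; the exit-condition details are just illustrative):

    ConditionVariablePrepareToSleep(replayedCV);
    while (!got_STOPPING)
    {
        /* re-check the exit condition after preparing and after each wakeup */
        RecentFlushPtr = GetStandbyFlushRecPtr(NULL);
        if (loc <= RecentFlushPtr)
            break;

        /* derive the timeout from WalSndComputeSleeptime() instead of 1000ms */
        ConditionVariableTimedSleep(replayedCV,
                                    WalSndComputeSleeptime(GetCurrentTimestamp()),
                                    WAIT_EVENT_WAL_SENDER_WAIT_REPLAY);
    }
    ConditionVariableCancelSleep();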


Perhaps we could deal with the pq_is_send_pending() issue by having a version
of ConditionVariableTimedSleep() that accepts a WaitEventSet?

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/6/23 4:40 AM, Andres Freund wrote:
> Hi,
> 
> Thomas, there's one point at the bottom wrt ConditionVariables that'd be
> interesting for you to comment on.
> 
> 
> On 2023-01-05 16:15:39 -0500, Robert Haas wrote:
>> On Tue, Jan 3, 2023 at 2:42 AM Drouvot, Bertrand
>> <bertranddrouvot.pg@gmail.com> wrote:
>>> Please find attached v36, tiny rebase due to 1de58df4fe.
>>
>> 0001 looks committable to me now, though we probably shouldn't do that
>> unless we're pretty confident about shipping enough of the rest of
>> this to accomplish something useful.
> 

Thanks for your precious help reaching this state!

> Cool!
> 
> ISTM that the ordering of patches isn't quite right later on. ISTM that it
> doesn't make sense to introduce working logic decoding without first fixing
> WalSndWaitForWal() (i.e. patch 0006). What made you order the patches that
> way?
> 

Idea was to ease the review: 0001 to 0005 to introduce the feature and 0006 to deal
with this race condition.

I thought it would be easier to review that way (given the complexity of "just" adding the
feature itself).

> 0001:
>> 4. We can't rely on the standby's relcache entries for this purpose in
>> any way, because the WAL record that causes the problem might be
>> replayed before the standby even reaches consistency.
> 
> The startup process can't access catalog contents in the first place, so the
> consistency issue is secondary.
> 

Thanks for pointing out, I'll update the commit message.

> 
> ISTM that the commit message omits a fairly significant portion of the change:
> The introduction of indisusercatalog / the reason for its introduction.

Agree, will do (or create a dedicated patch as you are suggesting below).

> 
> Why is indisusercatalog stored as "full" column, whereas we store the fact of
> table being used as a catalog table in a reloption? I'm not adverse to moving
> to a full column, but then I think we should do the same for tables.
> 
> Earlier version of the patches IIRC sourced the "catalogness" from the
> relation. What lead you to changing that? I'm not saying it's wrong, just not
> sure it's right either.

That's right, it started by retrieving this information from the relation.

Then, Robert made a comment in [1] saying it's not safe to call
table_open() while holding a buffer lock.

Then, I worked on other options and submitted the current one.

While reviewing 0001, Robert's also thought of it (see [2])) and finished with:

"
So while I do not really like the approach of storing the same
property in different ways for tables and for indexes, it's also not
really obvious to me how to do better.
"

That's also my thought.

> 
> It'd be good to introduce cross-checks that indisusercatalog is set
> correctly. RelationGetIndexList() seems like a good candidate.
> 

Good point, will look at it.

> I'd probably split the introduction of indisusercatalog into a separate patch.

You mean, completely outside of this patch series or a sub-patch in this series?
If the former, I'm not sure it would make sense outside of the current context.

> 
> Why was HEAP_DEFAULT_USER_CATALOG_TABLE introduced in this patch?
> 
> 

To handle the case of a reset on the table (ensuring the default also gets propagated to the indexes).

> I wonder if we instead should compute a relation's "catalogness" in the
> relcache. That'd would have the advantage of making
> RelationIsUsedAsCatalogTable() cheaper and working for all kinds of
> relations.
> 

Any idea on where and how you'd do that? (that's one option I explored in vain before
submitting the current proposal).

It does look like that's also an option explored by Robert in [2]:

"
Yet a third way is to have the index fetch the flag from
the associated table, perhaps when the relcache entry is built. But I
see no existing precedent for that in RelationInitIndexAccessInfo,
which I think is where it would be if we had it -- and that makes me
suspect that there might be good reasons why this isn't actually safe.
"

> 
> VISIBILITYMAP_ON_CATALOG_ACCESSIBLE_IN_LOGICAL_DECODING is a very long
> identifier. Given that the field in the xlog records is just named
> isCatalogRel, any reason to not just name it correspondingly?
> 

Agree, VISIBILITYMAP_IS_CATALOG_REL maybe?

I'll look at the other comments too and work on/reply later on.

[1]: https://www.postgresql.org/message-id/CA%2BTgmobgOLH-JpBoBSdu4i%2BsjRdgwmDEZGECkmowXqQgQL6WhQ%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/CA%2BTgmoY0df9X%2B5ENg8P0BGj0odhM45sdQ7kB4JMo4NKaoFy-Vg%40mail.gmail.com

Thanks for your help,
Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/6/23 4:40 AM, Andres Freund wrote:
> Hi,
> 0002:
> 
> 
>> +/*
>> + * Helper for InvalidateConflictingLogicalReplicationSlot -- acquires the given slot
>> + * and mark it invalid, if necessary and possible.
>> + *
>> + * Returns whether ReplicationSlotControlLock was released in the interim (and
>> + * in that case we're not holding the lock at return, otherwise we are).
>> + *
>> + * This is inherently racy, because we release the LWLock
>> + * for syscalls, so caller must restart if we return true.
>> + */
>> +static bool
>> +InvalidatePossiblyConflictingLogicalReplicationSlot(ReplicationSlot *s, TransactionId xid)
> 
> This appears to be a near complete copy of InvalidatePossiblyObsoleteSlot(). I
> don't think we should have two versions of that non-trivial code. Seems we
> could just have an additional argument for InvalidatePossiblyObsoleteSlot()?


Good point, done in V37 attached.
The new logical slot invalidation handling has been "merged" with the obsolete LSN case into 2 new
functions (InvalidateObsoleteOrConflictingLogicalReplicationSlots() and
InvalidatePossiblyObsoleteOrConflictingLogicalSlot()),
removing InvalidateObsoleteReplicationSlots() and InvalidatePossiblyObsoleteSlot().

> 
> 
>> +            ereport(LOG,
>> +                    (errmsg("invalidating slot \"%s\" because it conflicts with recovery", NameStr(slotname))));
>> +
> 
> I think this should report more details, similar to what
> InvalidateObsoleteReplicationSlots() does.
> 

Agree, done in V37 (adding more details about the xid horizon and wal_level < logical on master cases).

> 
>> --- a/src/backend/replication/logical/logicalfuncs.c
>> +++ b/src/backend/replication/logical/logicalfuncs.c
>> @@ -216,11 +216,14 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
>>
>>           /*
>>            * After the sanity checks in CreateDecodingContext, make sure the
>> -         * restart_lsn is valid.  Avoid "cannot get changes" wording in this
>> +         * restart_lsn is valid or both xmin and catalog_xmin are valid.
>> +         * Avoid "cannot get changes" wording in this
>>            * errmsg because that'd be confusingly ambiguous about no changes
>>            * being available.
>>            */
>> -        if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
>> +        if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn)
>> +            || (!TransactionIdIsValid(MyReplicationSlot->data.xmin)
>> +                && !TransactionIdIsValid(MyReplicationSlot->data.catalog_xmin)))
>>               ereport(ERROR,
>>                       (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>>                        errmsg("can no longer get changes from replication slot \"%s\"",
> 
> Hm. Feels like we should introduce a helper like SlotIsInvalidated() instead
> of having this condition in a bunch of places.
> 

Agree, LogicalReplicationSlotIsInvalid() has been added in V37.
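
Roughly (the exact V37 shape may differ a bit), based on the condition quoted above:

    /*
     * A logical slot is considered invalidated once both its xmin and
     * catalog_xmin horizons have been cleared.
     */
    static inline bool
    LogicalReplicationSlotIsInvalid(ReplicationSlot *s)
    {
        Assert(SlotIsLogical(s));

        return !TransactionIdIsValid(s->data.xmin) &&
            !TransactionIdIsValid(s->data.catalog_xmin);
    }
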

>> +    if (!TransactionIdIsValid(MyReplicationSlot->data.xmin)
>> +         && !TransactionIdIsValid(MyReplicationSlot->data.catalog_xmin))
>> +        ereport(ERROR,
>> +                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> +                 errmsg("cannot read from logical replication slot \"%s\"",
>> +                        cmd->slotname),
>> +                 errdetail("This slot has been invalidated because it was conflicting with recovery.")));
>> +
> 
> This is a more precise error than the one in
> pg_logical_slot_get_changes_guts().
> 
> I think both places should output the same error.

Agree, done in V37 attached.

> ISTM that the relevant code
> should be in CreateDecodingContext(). Imo the code to deal with the WAL
> version of this has been misplaced...
> 

Looks like a good idea. I'll start a dedicated thread to move the already existing
error reporting code part of pg_logical_slot_get_changes_guts() and StartLogicalReplication() into
CreateDecodingContext().

>> --- a/src/backend/storage/ipc/procarray.c
>> +++ b/src/backend/storage/ipc/procarray.c
>> @@ -3477,6 +3477,10 @@ SignalVirtualTransaction(VirtualTransactionId vxid, ProcSignalReason sigmode,
>>
>>           GET_VXID_FROM_PGPROC(procvxid, *proc);
>>
>> +        /*
>> +         * Note: vxid.localTransactionId can be invalid, which means the
>> +         * request is to signal the pid that is not running a transaction.
>> +         */
>>           if (procvxid.backendId == vxid.backendId &&
>>               procvxid.localTransactionId == vxid.localTransactionId)
>>           {
> 
> I can't really parse the comment.
> 

Looks like it has been there since long before I started working on this thread.
I did not pay that much attention to it, but now that you mention it, I'm not
sure why it was added in the first place.

Given that 1) I don't get it either and 2) this comment is the only modification in this file, V37 just
removes it.

>> @@ -500,6 +502,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
>>                                              PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
>>                                              WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
>>                                              true);
>> +
>> +    if (isCatalogRel)
>> +        InvalidateConflictingLogicalReplicationSlots(locator.dbOid, snapshotConflictHorizon);
>>   }
> 
> Might be worth checking if wal_level >= logical before the somewhat expensive
> InvalidateConflictingLogicalReplicationSlots().
> 

Good point, done in V37.
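
In ResolveRecoveryConflictWithSnapshot() that boils down to something like
(sketch only, names taken from the hunk above):

    /* only scan the slots if logical decoding is possible on this standby */
    if (isCatalogRel && wal_level >= WAL_LEVEL_LOGICAL)
        InvalidateConflictingLogicalReplicationSlots(locator.dbOid,
                                                     snapshotConflictHorizon);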

> 
>> @@ -3051,6 +3054,25 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
>>               case PROCSIG_RECOVERY_CONFLICT_LOCK:
>>               case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
>>               case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
>> +            case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
>> +                /*
>> +                 * For conflicts that require a logical slot to be invalidated, the
>> +                 * requirement is for the signal receiver to release the slot,
>> +                 * so that it could be invalidated by the signal sender. So for
>> +                 * normal backends, the transaction should be aborted, just
>> +                 * like for other recovery conflicts. But if it's walsender on
>> +                 * standby, then it has to be killed so as to release an
>> +                 * acquired logical slot.
>> +                 */
>> +                if (am_cascading_walsender &&
>> +                    reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
>> +                    MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
>> +                {
>> +                    RecoveryConflictPending = true;
>> +                    QueryCancelPending = true;
>> +                    InterruptPending = true;
>> +                    break;
>> +                }
>>
>>                   /*
>>                    * If we aren't in a transaction any longer then ignore.
> 
> Why does the walsender need to be killed? I think it might just be that
> IsTransactionOrTransactionBlock() might return false, even though we want to
> cancel. The code actually seems to only cancel (QueryCancelPending is set
> rather than ProcDiePending), but the comment talks about killing?
> 

Oh right, this comment is also there since a long time ago. I think the code
is OK (as we break in that case and so we don't go through IsTransactionOrTransactionBlock()).

So, V37 just modifies the comment.

Please find attached, V37 taking care of:

0001: commit message modifications and renaming VISIBILITYMAP_ON_CATALOG_ACCESSIBLE_IN_LOGICAL_DECODING
to VISIBILITYMAP_IS_CATALOG_REL (It does not touch the other remarks as they are still discussed in [1]).

0002: All your remarks mentioned above.

I'll look at the ones you've done in [2] on 0003, 0004 and 0006.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

[1]: https://www.postgresql.org/message-id/5c5151a6-a1a3-6c38-7d68-543c9faa22f4%40gmail.com
[2]: https://www.postgresql.org/message-id/20230106034036.2m4qnn7ep7b5ipet%40awork3.anarazel.de

Re: Minimal logical decoding on standbys

From
Bharath Rupireddy
Date:
On Tue, Jan 10, 2023 at 2:03 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> Please find attached, V37 taking care of:

Thanks. I started to digest the design specified in the commit message
and these patches. Here are some quick comments:

1. Does logical decoding on standby work without any issues if the
standby is set for cascading replication?

2. Do logical decoding output plugins work without any issues on the
standby with decoding enabled, say, when the slot is invalidated?

3. Is this feature still a 'minimal logical decoding on standby'?
Firstly, why is it 'minimal'?

4. What happens in case of failover to the standby that's already
decoding for its clients? Will the decoding work seamlessly? If not,
what are recommended things that users need to take care of
during/after failovers?

0002:
1.
-    if (InvalidateObsoleteReplicationSlots(_logSegNo))
+    InvalidateObsoleteOrConflictingLogicalReplicationSlots(_logSegNo,
&invalidated, InvalidOid, NULL);

Isn't the function name too long and verbose? How about just
InvalidateLogicalReplicationSlots() and let the function comment talk
about what sorts of replication slots it invalidates?

2.
+                                errdetail("Logical decoding on
standby requires wal_level to be at least logical on master"));
+ *     master wal_level is set back to replica, so existing logical
slots need to
invalidate such slots. Also do the same thing if wal_level on master

Can we use 'primary server' instead of 'master' like elsewhere? This
comment also applies for other patches too, if any.

3. Can we show a new status in pg_get_replication_slots's wal_status
for invalidated due to the conflict so that the user can monitor for
the new status and take necessary actions?

4. How will the user be notified when logical replication slots are
invalidated due to conflict with the primary server? How can they
(firstly, is there a way?) repair/restore such replication slots? Or
is recreating/reestablishing logical replication only the way out for
them? What users can do to avoid their logical replication slots
getting invalidated and run into these situations? Because
recreating/reestablishing logical replication comes with cost
(sometimes huge) as it involves building another instance, syncing
tables etc. Isn't it a good idea to touch up on all these aspects in
the documentation at least as to why we're doing this and why we can't
do this?

5.
@@ -1253,6 +1253,14 @@ StartLogicalReplication(StartReplicationCmd *cmd)

     ReplicationSlotAcquire(cmd->slotname, true);

+    if (!TransactionIdIsValid(MyReplicationSlot->data.xmin)
+         && !TransactionIdIsValid(MyReplicationSlot->data.catalog_xmin))
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("cannot read from logical replication slot \"%s\"",
+                        cmd->slotname),
+                 errdetail("This slot has been invalidated because it
was conflicting with recovery.")));

Having the invalidation check in StartLogicalReplication() looks fine,
however, what happens if the slot gets invalidated when the
replication is in-progress? Do we need to error out in WalSndLoop() or
XLogSendLogical() too? Or is it already taken care of somewhere?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/11/23 8:32 AM, Bharath Rupireddy wrote:
> On Tue, Jan 10, 2023 at 2:03 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> Please find attached, V37 taking care of:
> 
> Thanks. I started to digest the design specified in the commit message
> and these patches. 

Thanks for looking at it!

> Here are some quick comments:
> 
> 1. Does logical decoding on standby work without any issues if the
> standby is set for cascading replication?
> 

Without "any issues" is hard to guarantee ;-) But according to my tests:

Primary -> Standby1 with or without logical replication slot -> Standby2 with or without logical replication slot

works as expected (and also with cascading promotion).
We can add some TAP tests in 0004 though.

> 2. Do logical decoding output plugins work without any issues on the
> standby with decoding enabled, say, when the slot is invalidated?
> 

Not sure I got the question.
If the slot is invalidated then it's expected to get errors like:

pg_recvlogical: error: unexpected termination of replication stream: ERROR:  canceling statement due to conflict with recovery
DETAIL:  User was using the logical slot that must be dropped.

or

pg_recvlogical: error: could not send replication command "START_REPLICATION SLOT "bdt_slot" LOGICAL 0/0": ERROR:  cannot read from logical replication slot "bdt_slot"
DETAIL:  This slot has been invalidated because it was conflicting with recovery.


> 3. Is this feature still a 'minimal logical decoding on standby'?
> Firstly, why is it 'minimal'?
> 

Good question and I don't have the answer.
That's how it has been named when this thread started back in 2018.

> 4. What happens in case of failover to the standby that's already
> decoding for its clients? Will the decoding work seamlessly? If not,
> what are recommended things that users need to take care of
> during/after failovers?

Yes, it's expected to work seamlessly. There is a TAP test in
0004 for this scenario.

> 
> 0002:
> 1.
> -    if (InvalidateObsoleteReplicationSlots(_logSegNo))
> +    InvalidateObsoleteOrConflictingLogicalReplicationSlots(_logSegNo,
> &invalidated, InvalidOid, NULL);
> 
> Isn't the function name too long and verbose? How about just
> InvalidateLogicalReplicationSlots() 

The function also takes care of invalidating physical replication slots
that are obsolete (the LSN case).

InvalidateObsoleteOrConflictingReplicationSlots() maybe?


> and let the function comment talk
> about what sorts of replication slots it invalidates?

Agree to make the comment more clear.

> 
> 2.
> +                                errdetail("Logical decoding on
> standby requires wal_level to be at least logical on master"));
> + *     master wal_level is set back to replica, so existing logical
> slots need to
> invalidate such slots. Also do the same thing if wal_level on master
> 
> Can we use 'primary server' instead of 'master' like elsewhere? This
> comment also applies for other patches too, if any.
> 

Sure.

> 3. Can we show a new status in pg_get_replication_slots's wal_status
> for invalidated due to the conflict so that the user can monitor for
> the new status and take necessary actions?
> 

Not sure if you've seen it, but the patch series adds a new field (confl_active_logicalslot) to
pg_stat_database_conflicts.

That said, I like your idea about adding a new status in pg_replication_slots too.
Do you think it's mandatory for this patch series? (I mean it could be added once this patch series is committed).

I'm asking because this patch series already looks like a "big" one, is more than 4 years old,
and I'm afraid of adding more "reporting" stuff to it (unless we feel a strong need for it, of course).

> 4. How will the user be notified when logical replication slots are
> invalidated due to conflict with the primary server? 

Emitting messages, like the ones mentioned above introduced in 0002.

> How can they
> (firstly, is there a way?) repair/restore such replication slots? Or
> is recreating/reestablishing logical replication only the way out for
> them? 

Drop/recreate is part of the current design, as discussed up-thread IIRC.

> What users can do to avoid their logical replication slots
> getting invalidated and run into these situations? Because
> recreating/reestablishing logical replication comes with cost
> (sometimes huge) as it involves building another instance, syncing
> tables etc. Isn't it a good idea to touch up on all these aspects in
> the documentation at least as to why we're doing this and why we can't
> do this?
> 

0005 adds a few words about it:

+    <para>
+     A logical replication slot can also be created on a hot standby. To prevent
+     <command>VACUUM</command> from removing required rows from the system
+     catalogs, <varname>hot_standby_feedback</varname> should be set on the
+     standby. In spite of that, if any required rows get removed, the slot gets
+     invalidated. It's highly recommended to use a physical slot between the primary
+     and the standby. Otherwise, hot_standby_feedback will work, but only while the
+     connection is alive (for example a node restart would break it). Existing
+     logical slots on standby also get invalidated if wal_level on primary is reduced to
+     less than 'logical'.
+    </para>

> 5.
> @@ -1253,6 +1253,14 @@ StartLogicalReplication(StartReplicationCmd *cmd)
> 
>       ReplicationSlotAcquire(cmd->slotname, true);
> 
> +    if (!TransactionIdIsValid(MyReplicationSlot->data.xmin)
> +         && !TransactionIdIsValid(MyReplicationSlot->data.catalog_xmin))
> +        ereport(ERROR,
> +                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                 errmsg("cannot read from logical replication slot \"%s\"",
> +                        cmd->slotname),
> +                 errdetail("This slot has been invalidated because it
> was conflicting with recovery.")));
> 
> Having the invalidation check in StartLogicalReplication() looks fine,
> however, what happens if the slot gets invalidated when the
> replication is in-progress? Do we need to error out in WalSndLoop() or
> XLogSendLogical() too? Or is it already taken care of somewhere?
> 

Yes, it's already taken care of in pg_logical_slot_get_changes_guts().

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-01-11 13:02:13 +0530, Bharath Rupireddy wrote:
> 3. Is this feature still a 'minimal logical decoding on standby'?
> Firstly, why is it 'minimal'?

It's minimal in comparison to other proposals at the time that did explicit /
active coordination between primary and standby to allow logical decoding.



> 0002:
> 1.
> -    if (InvalidateObsoleteReplicationSlots(_logSegNo))
> +    InvalidateObsoleteOrConflictingLogicalReplicationSlots(_logSegNo,
> &invalidated, InvalidOid, NULL);
> 
> Isn't the function name too long and verbose?

+1


> How about just InvalidateLogicalReplicationSlots() and let the function comment
> talk about what sorts of replication slots it invalidates?

I'd just leave the name unmodified at InvalidateObsoleteReplicationSlots().


> 2.
> +                                errdetail("Logical decoding on
> standby requires wal_level to be at least logical on master"));
> + *     master wal_level is set back to replica, so existing logical
> slots need to
> invalidate such slots. Also do the same thing if wal_level on master
> 
> Can we use 'primary server' instead of 'master' like elsewhere? This
> comment also applies for other patches too, if any.

+1


> 3. Can we show a new status in pg_get_replication_slots's wal_status
> for invalidated due to the conflict so that the user can monitor for
> the new status and take necessary actions?

Invalidated slots are not a new concept introduced in this patchset, so I'd
say we can introduce such a field separately.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/6/23 4:40 AM, Andres Freund wrote:
> Hi,
> On 2023-01-05 16:15:39 -0500, Robert Haas wrote:
>> On Tue, Jan 3, 2023 at 2:42 AM Drouvot, Bertrand
>> <bertranddrouvot.pg@gmail.com> wrote:

> 0003:
> 
>> Allow a logical slot to be created on standby. Restrict its usage
>> or its creation if wal_level on primary is less than logical.
>> During slot creation, it's restart_lsn is set to the last replayed
>> LSN. Effectively, a logical slot creation on standby waits for an
>> xl_running_xact record to arrive from primary. Conflicting slots
>> would be handled in next commits.
> 
> I think the commit message might be outdated, the next commit is a test.

Oops, thanks, fixed in V38 attached.

> 
>> +            /*
>> +             * Replay pointer may point one past the end of the record. If that
>> +             * is a XLOG page boundary, it will not be a valid LSN for the
>> +             * start of a record, so bump it up past the page header.
>> +             */
>> +            if (!XRecOffIsValid(restart_lsn))
>> +            {
>> +                if (restart_lsn % XLOG_BLCKSZ != 0)
>> +                    elog(ERROR, "invalid replay pointer");
>> +
>> +                /* For the first page of a segment file, it's a long header */
>> +                if (XLogSegmentOffset(restart_lsn, wal_segment_size) == 0)
>> +                    restart_lsn += SizeOfXLogLongPHD;
>> +                else
>> +                    restart_lsn += SizeOfXLogShortPHD;
>> +            }
> 
> Is this actually needed? Supposedly xlogreader can work just fine with an
> address at the start of a page?
> 
>         /*
>          * Caller supplied a position to start at.
>          *
>          * In this case, NextRecPtr should already be pointing either to a
>          * valid record starting position or alternatively to the beginning of
>          * a page. See the header comments for XLogBeginRead.
>          */
>         Assert(RecPtr % XLOG_BLCKSZ == 0 || XRecOffIsValid(RecPtr));
> 


Oh you're right, thanks, indeed that's not needed anymore now that XLogDecodeNextRecord() exists.
Removed in V38.

> 
>>       /*
>> -     * Since logical decoding is only permitted on a primary server, we know
>> -     * that the current timeline ID can't be changing any more. If we did this
>> -     * on a standby, we'd have to worry about the values we compute here
>> -     * becoming invalid due to a promotion or timeline change.
>> +     * Since logical decoding is also permitted on a standby server, we need
>> +     * to check if the server is in recovery to decide how to get the current
>> +     * timeline ID (so that it also cover the promotion or timeline change cases).
>>        */
>> +    if (!RecoveryInProgress())
>> +        currTLI = GetWALInsertionTimeLine();
>> +    else
>> +        GetXLogReplayRecPtr(&currTLI);
>> +
> 
> This seems to remove some content from the !recovery case.
> 
> It's a bit odd that here RecoveryInProgress() is used, whereas further down
> am_cascading_walsender is used.
> 
> 

Agree, using am_cascading_walsender instead in V38.
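
I.e. in V38 it now reads roughly:

    if (am_cascading_walsender)
        GetXLogReplayRecPtr(&currTLI);
    else
        currTLI = GetWALInsertionTimeLine();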


>> @@ -3074,10 +3078,12 @@ XLogSendLogical(void)
>>        * If first time through in this session, initialize flushPtr.  Otherwise,
>>        * we only need to update flushPtr if EndRecPtr is past it.
>>        */
>> -    if (flushPtr == InvalidXLogRecPtr)
>> -        flushPtr = GetFlushRecPtr(NULL);
>> -    else if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
>> -        flushPtr = GetFlushRecPtr(NULL);
>> +    if (flushPtr == InvalidXLogRecPtr ||
>> +        logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
>> +    {
>> +        flushPtr = (am_cascading_walsender ?
>> +                    GetStandbyFlushRecPtr(NULL) : GetFlushRecPtr(NULL));
>> +    }
>>
>>       /* If EndRecPtr is still past our flushPtr, it means we caught up. */
>>       if (logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
> 
> A short if inside a normal if seems ugly to me.
> 

Using 2 normal if in v38.
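
I.e. roughly:

    if (flushPtr == InvalidXLogRecPtr ||
        logical_decoding_ctx->reader->EndRecPtr >= flushPtr)
    {
        if (am_cascading_walsender)
            flushPtr = GetStandbyFlushRecPtr(NULL);
        else
            flushPtr = GetFlushRecPtr(NULL);
    }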

Please find V38 attached, I'll look at the other comments you've done in [1] on 0004 and 0006.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

[1]: https://www.postgresql.org/message-id/20230106034036.2m4qnn7ep7b5ipet%40awork3.anarazel.de

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-01-06 10:52:06 +0100, Drouvot, Bertrand wrote:
> On 1/6/23 4:40 AM, Andres Freund wrote:
> > ISTM that the ordering of patches isn't quite right later on. ISTM that it
> > doesn't make sense to introduce working logic decoding without first fixing
> > WalSndWaitForWal() (i.e. patch 0006). What made you order the patches that
> > way?
> > 
> 
> Idea was to ease the review: 0001 to 0005 to introduce the feature and 0006 to deal
> with this race condition.

> I thought it would be easier to review that way (given the complexity of "just" adding the
> feature itself).

The problem I have with that is that I saw a lot of flakiness in the tests due
to the race condition. So introducing them in that order just doesn't make a
whole lot of sense to me. It's also something that can be committed
independently, I think.


> > Why is indisusercatalog stored as "full" column, whereas we store the fact of
> > table being used as a catalog table in a reloption? I'm not adverse to moving
> > to a full column, but then I think we should do the same for tables.
> > 
> > Earlier version of the patches IIRC sourced the "catalogness" from the
> > relation. What lead you to changing that? I'm not saying it's wrong, just not
> > sure it's right either.
> 
> That's right, it started by retrieving this information from the relation.
> 
> Then, Robert made a comment in [1] saying it's not safe to call
> table_open() while holding a buffer lock.

The suggested path in earlier versions to avoid doing so was to make sure that
we pass down the Relation for the table into the necessary functions. Did you
explore that any further?


> Then, I worked on other options and submitted the current one.
> 
> While reviewing 0001, Robert's also thought of it (see [2])) and finished with:
> 
> "
> So while I do not really like the approach of storing the same
> property in different ways for tables and for indexes, it's also not
> really obvious to me how to do better.
> "
> 
> That's also my thought.

I still dislike this approach. The code for cascading the change to the index
attributes is complicated. RelationIsAccessibleInLogicalDecoding() is getting
slower. We add unnecessary storage space to all pg_index rows.

Now I even wonder if this doesn't break the pg_index.indcheckxmin logic (which
I really dislike, but that's a separate discussion). I think updating pg_index
to set indisusercatalog might cause the index to be considered unusable, if
indcheckxmin = true. See

            /*
             * If the index is valid, but cannot yet be used, ignore it; but
             * mark the plan we are generating as transient. See
             * src/backend/access/heap/README.HOT for discussion.
             */
            if (index->indcheckxmin &&
                !TransactionIdPrecedes(HeapTupleHeaderGetXmin(indexRelation->rd_indextuple->t_data),
                                       TransactionXmin))
            {
                root->glob->transientPlan = true;
                index_close(indexRelation, NoLock);
                continue;
            }


The reason we went with the indcheckxmin approach, instead of directly storing the xmin
after which an index is usable, was that that way we don't need
special logic around vacuum to reset the stored xid to prevent the index from
becoming unusable after xid wraparound. But these days we could just store a
64bit xid...

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/11/23 7:04 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> Please find V38 attached, I'll look at the other comments you've done in [1] on 0004 and 0006.
> 
> 

Please find attached V39, tiny rebase due to 50767705ed.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Minimal logical decoding on standbys

From
Ashutosh Sharma
Date:
Hi,

On Thu, Jan 12, 2023 at 5:29 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On 1/11/23 7:04 PM, Drouvot, Bertrand wrote:
> > Hi,
> >
> > Please find V38 attached, I'll look at the other comments you've done in [1] on 0004 and 0006.
> >
> >

Sorry for joining late. I totally missed it. AFAICU, with this patch
users can now do LR from a standby; previously they could only do it
from the primary server.

To start with, I have one small question:

I previously participated in the discussion on "Synchronizing the
logical replication slots from Primary to Standby" and one of the
purposes of that project was to synchronize logical slots from primary
to standby so that if failover occurs, it will not affect the logical
subscribers of the old primary much. Can someone help me understand
how we are going to solve this problem with this patch? Are we going
to encourage users to do LR from standby instead of primary to get rid
of such problems during failover?

Also, one small observation:

I just played around with the latest (v38) patch a bit and found that
when a new logical subscriber of standby is created, it actually
creates two logical replication slots for it on the standby server.
May I know the reason for creating an extra replication slot other
than the one created by create subscription command? See below:

Subscriber:
=========
create subscription t1_sub connection 'host=127.0.0.1 port=38500
dbname=postgres user=ashu' publication t1_pub;

Standby:
=======
postgres=# select * from pg_replication_slots;
                slot_name                |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
-----------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 pg_16399_sync_16392_7187728548042694423 | pgoutput | logical   |      5 | postgres | f         | t      |     112595 |      |          760 | 0/3082528   |                     | reserved   |               | f
 t1_sub                                  | pgoutput | logical   |      5 | postgres | f         | t      |     111940 |      |          760 | 0/30824F0   | 0/3082528           | reserved   |               | f

May I know the reason for creating pg_16399_sync_16392_7187728548042694423?

--
With Regards,
Ashutosh Sharma.



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-01-12 20:08:55 +0530, Ashutosh Sharma wrote:
> I previously participated in the discussion on "Synchronizing the
> logical replication slots from Primary to Standby" and one of the
> purposes of that project was to synchronize logical slots from primary
> to standby so that if failover occurs, it will not affect the logical
> subscribers of the old primary much. Can someone help me understand
> how we are going to solve this problem with this patch? Are we going
> to encourage users to do LR from standby instead of primary to get rid
> of such problems during failover?

It only provides a building block towards that. The "Synchronizing the logical
replication slots from Primary to Standby" project IMO needs all of the
infrastructure in this patch. With the patch, a logical rep solution can
e.g. maintain one slot on the primary and one on the standby, and occasionally
forward the slot on the standby to the position of the slot on the primary. In
case of a failover it can just start consuming changes from the former
standby, all the necessary changes are guaranteed to be present.


> Also, one small observation:
> 
> I just played around with the latest (v38) patch a bit and found that
> when a new logical subscriber of standby is created, it actually
> creates two logical replication slots for it on the standby server.
> May I know the reason for creating an extra replication slot other
> than the one created by create subscription command? See below:

That's unrelated to this patch. There's no changes to the "higher level"
logical replication code dealing with pubs and subs, it's all on the "logical
decoding" level.

I think this is because logical rep wants to be able to concurrently perform
ongoing replication, and synchronize tables added to the replication set. The
pg_16399_sync_16392_7187728548042694423 slot should vanish after the initial
synchronization.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/11/23 9:27 PM, Andres Freund wrote:
> Hi,
> 
> On 2023-01-06 10:52:06 +0100, Drouvot, Bertrand wrote:
> 
> The problem I have with that is that I saw a lot of flakiness in the tests due
> to the race condition. So introducing them in that order just doesn't make a
> whole lot of sense to me. 

You are right it does not make sense to introduce fixing the race condition after the TAP tests
and after introducing the decoding logic. I'll reorder the sub-patches.

> It's also something that can be committed
> independently, I think.

Right but could this race condition occur outside of the context of this new feature?
  
>> That's right it's started retrieving this information from the relation.
>>
>> Then, Robert made a comment in [1] saying it's not safe to call
>> table_open() while holding a buffer lock.
> 
> The suggested path in earlier versions to avoid doing so was to make sure that
> we pass down the Relation for the table into the necessary functions. Did you
> explore that any further?

So, for gistXLogPageReuse() and _bt_delitems_delete() this is "easy" to pass the Heap Relation.
This is what was done in earlier versions of this patch series.

But we would need to define a way to propagate the Heap Relation for those 2 functions:

_bt_log_reuse_page()
vacuumRedirectAndPlaceholder()

When I first looked at it and saw the number of places where _bt_getbuf() is called,
I preferred to have a look at the current proposal instead.

I will give it another look, also because I just realized that it could be beneficial
for vacuumRedirectAndPlaceholder() too, as per this comment:

"
    /* XXX: providing heap relation would allow more pruning */
    vistest = GlobalVisTestFor(NULL);
"

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/13/23 10:17 AM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 1/11/23 9:27 PM, Andres Freund wrote:
>> Hi,
>>
>> On 2023-01-06 10:52:06 +0100, Drouvot, Bertrand wrote:
>>
>> The problem I have with that is that I saw a lot of flakiness in the tests due
>> to the race condition. So introducing them in that order just doesn't make a
>> whole lot of sense to me. 
> 
> You are right it does not make sense to introduce fixing the race condition after the TAP tests
> and after introducing the decoding logic. I'll reorder the sub-patches.
> 

V40 attached is changing the sub-patches ordering.

>> The suggested path in earlier versions to avoid doing so was to make sure that
>> we pass down the Relation for the table into the necessary functions. Did you
>> explore that any further?
> 
> So, for gistXLogPageReuse() and _bt_delitems_delete() this is "easy" to pass the Heap Relation.
> This is what was done in earlier versions of this patch series.
> 
> But we would need to define a way to propagate the Heap Relation for those 2 functions:
> 
> _bt_log_reuse_page()
> vacuumRedirectAndPlaceholder()
> 

V40 is getting rid of the new indisusercatalog field in pg_index and is passing the
heap relation all the way down to _bt_log_reuse_page() and vacuumRedirectAndPlaceholder() instead
(and obviously to gistXLogPageReuse() and _bt_delitems_delete() too).

Remarks:

1) V40 adds the heap relation to the IndexVacuumInfo and ParallelVacuumState structs. It is used
for the _bt_log_reuse_page() and vacuumRedirectAndPlaceholder() cases, where I did not find any place
to get the heap relation from in the existing code paths.

2) V40 adds a "real" heap relation to all the _bt_getbuf() calls. Another option could have been
to add it only for the code paths leading to _bt_log_reuse_page() but I thought it is cleaner to
do it for all of them.

> I will give it another look, also because I just realized that it could be beneficial
> for vacuumRedirectAndPlaceholder() too, as per this comment:
> 
> "
>      /* XXX: providing heap relation would allow more pruning */
>      vistest = GlobalVisTestFor(NULL);
> "

Now, we could also pass the heap relation to GlobalVisTestFor() in vacuumRedirectAndPlaceholder().
Could be done in or independently of this patch series once committed (it's not part of V40).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/6/23 4:40 AM, Andres Freund wrote:
> Hi,
> 0004:
> 
>> @@ -3037,6 +3037,43 @@ $SIG{TERM} = $SIG{INT} = sub {
>>
>>   =pod
>>
>> +=item $node->create_logical_slot_on_standby(self, master, slot_name, dbname)
>> +
>> +Create logical replication slot on given standby
>> +
>> +=cut
>> +
>> +sub create_logical_slot_on_standby
>> +{
> 
> Any reason this has to be standby specific?
> 

Due to the extra work to be done for this case (aka wait for restart_lsn
and trigger a checkpoint on the primary).

> 
>> +    # Now arrange for the xl_running_xacts record for which pg_recvlogical
>> +    # is waiting.
>> +    $master->safe_psql('postgres', 'CHECKPOINT');
>> +
> 
> Hm, that's quite expensive. Perhaps worth adding a C helper that can do that
> for us instead? This will likely also be needed in real applications after all.
> 

Not sure I got it. What would the C helper be supposed to do?

> 
>> +    print "starting pg_recvlogical\n";
> 
> I don't think tests should just print somewhere. Either diag() or note()
> should be used.
> 
> 

Will be done.

>> +    if ($wait)
>> +    # make sure activeslot is in use
>> +    {
>> +        $node_standby->poll_query_until('testdb',
>> +            "SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'activeslot' AND active_pid IS NOT
NULL)"
>> +        ) or die "slot never became active";
>> +    }
> 
> That comment placement imo is quite odd.
> 
> 

Agree, will be done.

>> +# test if basic decoding works
>> +is(scalar(my @foobar = split /^/m, $result),
>> +    14, 'Decoding produced 14 rows');
> 
> Maybe mention that it's 2 transactions + 10 rows?
> 
> 

Agree, will be done.

>> +$node_primary->wait_for_catchup($node_standby, 'replay', $node_primary->lsn('flush'));
> 
> There's enough copies of this that I wonder if we shouldn't introduce a
> Cluster.pm level helper for this.
> 

Done in [1].

> 
>> +print "waiting to replay $endpos\n";
> 
> See above.
> 
> 

Will be done.

>> +my $stdout_recv = $node_standby->pg_recvlogical_upto(
>> +    'testdb', 'activeslot', $endpos, 180,
>> +    'include-xids'     => '0',
>> +    'skip-empty-xacts' => '1');
> 
> I don't think this should use a hardcoded 180 but
> $PostgreSQL::Test::Utils::timeout_default.
> 
> 

Agree, will be done.

>> +# One way to reproduce recovery conflict is to run VACUUM FULL with
>> +# hot_standby_feedback turned off on the standby.
>> +$node_standby->append_conf('postgresql.conf',q[
>> +hot_standby_feedback = off
>> +]);
>> +$node_standby->restart;
> 
> IIRC a reload should suffice.
> 
> 

Right.

With a reload in place in my testing, now I notice that the catalog_xmin
is updated on the primary physical slot after logical slots invalidation
when reloading hot_standby_feedback from "off" to "on".

This is not the case after a re-start (aka catalog_xmin is NULL).

I think a re-start and reload should produce identical behavior on
the primary physical slot. If so, I'm tempted to think that the catalog_xmin
should be updated in case of a re-start too (even if all the logical slots are invalidated)
because the slots are not dropped yet. What do you think?

>> +# This should trigger the conflict
>> +$node_primary->safe_psql('testdb', 'VACUUM FULL');
> 
> Can we do something cheaper than rewriting the entire database? Seems
> rewriting a single table ought to be sufficient?
> 

While implementing the test at the table level, I discovered that it looks like there is no
guarantee that, say, a "vacuum full pg_class;" would produce a conflict.

Indeed, from what I can see in my testing it could generate a XLOG_HEAP2_PRUNE with snapshotConflictHorizon set to 0:

"rmgr: Heap2       len (rec/tot):    107/   107, tx:        848, lsn: 0/03B98B30, prev 0/03B98AF0, desc: PRUNE snapshotConflictHorizon 0"


Having a snapshotConflictHorizon of zero leads to ResolveRecoveryConflictWithSnapshot() simply returning
without any conflict handling.

It does look like that's not the right behavior in the standby decoding case (and that the xid that
generated the pruning should be used instead); what do you think?

> I think it'd also be good to test that rewriting a non-catalog table doesn't
> trigger an issue.
> 
> 

Good point, but I need to understand the above first.


>> +##################################################
>> +# Recovery conflict: Invalidate conflicting slots, including in-use slots
>> +# Scenario 2: conflict due to row removal with hot_standby_feedback off.
>> +##################################################
>> +
>> +# get the position to search from in the standby logfile
>> +my $logstart = -s $node_standby->logfile;
>> +
>> +# drop the logical slots
>> +$node_standby->psql('postgres', q[SELECT pg_drop_replication_slot('inactiveslot')]);
>> +$node_standby->psql('postgres', q[SELECT pg_drop_replication_slot('activeslot')]);
>> +
>> +create_logical_slots();
>> +
>> +# One way to produce recovery conflict is to create/drop a relation and launch a vacuum
>> +# with hot_standby_feedback turned off on the standby.
>> +$node_standby->append_conf('postgresql.conf',q[
>> +hot_standby_feedback = off
>> +]);
>> +$node_standby->restart;
>> +# ensure walreceiver feedback off by waiting for expected xmin and
>> +# catalog_xmin on primary. Both should be NULL since hs_feedback is off
>> +wait_for_xmins($node_primary, $primary_slotname,
>> +               "xmin IS NULL AND catalog_xmin IS NULL");
>> +
>> +$handle = make_slot_active(1);
> 
> This is a fair bit of repeated setup, maybe put it into a function?
> 
> 

Yeah, good point: will be done.

> I think it'd be good to test the ongoing decoding via the SQL interface also
> gets correctly handled. But it might be too hard to do reliably.
> 
> 
>> +##################################################
>> +# Test standby promotion and logical decoding behavior
>> +# after the standby gets promoted.
>> +##################################################
>> +
> 
> I think this also should test the streaming / walsender case.
> 

Do you mean cascading standby?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

[1]: https://www.postgresql.org/message-id/flat/846724b5-0723-f4c2-8b13-75301ec7509e%40gmail.com



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-01-18 11:24:19 +0100, Drouvot, Bertrand wrote:
> On 1/6/23 4:40 AM, Andres Freund wrote:
> > Hm, that's quite expensive. Perhaps worth adding a C helper that can do that
> > for us instead? This will likely also be needed in real applications after all.
> > 
> 
> Not sure I got it. What the C helper would be supposed to do?

Call LogStandbySnapshot().
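
I.e. a trivial SQL-callable wrapper along these lines (function name, error
wording and placement are just placeholders):

    #include "postgres.h"

    #include "access/xlog.h"
    #include "fmgr.h"
    #include "storage/standby.h"
    #include "utils/pg_lsn.h"

    PG_FUNCTION_INFO_V1(pg_log_standby_snapshot);

    /* emit an xl_running_xacts record on demand, instead of forcing a CHECKPOINT */
    Datum
    pg_log_standby_snapshot(PG_FUNCTION_ARGS)
    {
        XLogRecPtr  recptr;

        if (RecoveryInProgress())
            ereport(ERROR,
                    (errmsg("cannot log a standby snapshot during recovery")));

        recptr = LogStandbySnapshot();

        PG_RETURN_LSN(recptr);
    }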


> With a reload in place in my testing, now I notice that the catalog_xmin
> is updated on the primary physical slot after logical slots invalidation
> when reloading hot_standby_feedback from "off" to "on".
> 
> This is not the case after a re-start (aka catalog_xmin is NULL).
> 
> I think a re-start and reload should produce identical behavior on
> the primary physical slot. If so, I'm tempted to think that the catalog_xmin
> should be updated in case of a re-start too (even if all the logical slots are invalidated)
> because the slots are not dropped yet. What do you think?

I can't quite follow the steps leading up to the difference. Could you list
them in a bit more detail?



> > Can we do something cheaper than rewriting the entire database? Seems
> > rewriting a single table ought to be sufficient?
> > 
> 
> While implementing the test at the table level, I discovered that it looks like there is no
> guarantee that, say, a "vacuum full pg_class;" would produce a conflict.

I assume that's mostly when there weren't any removals.


> Indeed, from what I can see in my testing it could generate a XLOG_HEAP2_PRUNE with snapshotConflictHorizon set to 0:
> 
> "rmgr: Heap2       len (rec/tot):    107/   107, tx:        848, lsn: 0/03B98B30, prev 0/03B98AF0, desc: PRUNE snapshotConflictHorizon 0"
> 
> 
> Having a snapshotConflictHorizon to zero leads to ResolveRecoveryConflictWithSnapshot() simply returning
> without any conflict handling.

That doesn't have to mean anything bad. Some row versions can be removed without
creating a conflict. See HeapTupleHeaderAdvanceConflictHorizon(), specifically

     * Ignore tuples inserted by an aborted transaction or if the tuple was
     * updated/deleted by the inserting transaction.



> It does look like that's not the right behavior in the standby decoding case (and that the xid that
> generated the pruning should be used instead); what do you think?

That'd not work, because that'll be typically newer than the catalog_xmin. So
we'd start invalidating things left and right, despite not needing to.


Did you see anything else around this making you suspicious?


> > > +##################################################
> > > +# Test standby promotion and logical decoding behavior
> > > +# after the standby gets promoted.
> > > +##################################################
> > > +
> > 
> > I think this also should test the streaming / walsender case.
> > 
> 
> Do you mean cascading standby?

I mean a logical walsender that starts on a standby and continues across
promotion of the standby.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/16/23 3:28 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 1/13/23 10:17 AM, Drouvot, Bertrand wrote:
>> Hi,
>>
>> On 1/11/23 9:27 PM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2023-01-06 10:52:06 +0100, Drouvot, Bertrand wrote:
>>>
>>> The problem I have with that is that I saw a lot of flakiness in the tests due
>>> to the race condition. So introducing them in that order just doesn't make a
>>> whole lot of sense to me. 
>>
>> You are right it does not make sense to introduce fixing the race condition after the TAP tests
>> and after introducing the decoding logic. I'll reorder the sub-patches.
>>
> 
> V40 attached is changing the sub-patches ordering.
> 
>>> The suggested path in earlier versions to avoid doing so was to make sure that
>>> we pass down the Relation for the table into the necessary functions. Did you
>>> explore that any further?
>>
>> So, for gistXLogPageReuse() and _bt_delitems_delete() this is "easy" to pass the Heap Relation.
>> This is what was done in earlier versions of this patch series.
>>
>> But we would need to define a way to propagate the Heap Relation for those 2 functions:
>>
>> _bt_log_reuse_page()
>> vacuumRedirectAndPlaceholder()
>>
> 
> V40 is getting rid of the new indisusercatalog field in pg_index and is passing the
> heap relation all the way down to _bt_log_reuse_page() and vacuumRedirectAndPlaceholder() instead
> (and obviously to gistXLogPageReuse() and _bt_delitems_delete() too).
> 
> Remarks:
> 
> 1) V40 adds the heap relation in the IndexVacuumInfo and ParallelVacuumState structs. It is used
> for the _bt_log_reuse_page() and vacuumRedirectAndPlaceholder() cases where I did not find any place
> where to get the heap relation from in the existing code path.
> 
> 2) V40 adds a "real" heap relation to all the _bt_getbuf() calls. Another option could have been
> to add it only for the code paths leading to _bt_log_reuse_page() but I thought it is cleaner to
> do it for all of them.
> 
>> I will give it another look, also because I just realized that it could be beneficial
>> for vacuumRedirectAndPlaceholder() too, as per this comment:
>>
>> "
>>      /* XXX: providing heap relation would allow more pruning */
>>      vistest = GlobalVisTestFor(NULL);
>> "
> 
> Now, we could also pass the heap relation to GlobalVisTestFor() in vacuumRedirectAndPlaceholder().
> Could be done in or independently of this patch series once committed (it's not part of V40).


Please find attached V41, tiny rebase due to 47bb9db759.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/19/23 3:46 AM, Andres Freund wrote:
> Hi,
> 
> On 2023-01-18 11:24:19 +0100, Drouvot, Bertrand wrote:
>> On 1/6/23 4:40 AM, Andres Freund wrote:
>>> Hm, that's quite expensive. Perhaps worth adding a C helper that can do that
>>> for us instead? This will likely also be needed in real applications after all.
>>>
>>
>> Not sure I got it. What the C helper would be supposed to do?
> 
> Call LogStandbySnapshot().
> 

Got it, I like the idea, will do.

> 
>> With a reload in place in my testing, now I notice that the catalog_xmin
>> is updated on the primary physical slot after logical slots invalidation
>> when reloading hot_standby_feedback from "off" to "on".
>>
>> This is not the case after a re-start (aka catalog_xmin is NULL).
>>
>> I think a re-start and reload should produce identical behavior on
>> the primary physical slot. If so, I'm tempted to think that the catalog_xmin
>> should be updated in case of a re-start too (even if all the logical slots are invalidated)
>> because the slots are not dropped yet. What do you think?
> 
> I can't quite follow the steps leading up to the difference. Could you list
> them in a bit more detail?
> 
> 

Sure, so with:

1) hot_standby_feedback set to off on the standby
2) create 2 logical replication slots on the standby and activate one
3) Invalidate the logical slots on the standby with VACUUM FULL on the primary
4) change hot_standby_feedback to on on the standby

If:

5) pg_reload_conf() on the standby, then on the primary we get a catalog_xmin
for the physical slot that the standby is attached to:

postgres=# select slot_type,xmin,catalog_xmin  from pg_replication_slots ;
  slot_type | xmin | catalog_xmin
-----------+------+--------------
  physical  |  822 |          748
(1 row)

But if:

5) re-start the standby, then on the primary we get an empty catalog_xmin
for the physical slot that the standby is attached to:

postgres=# select slot_type,xmin,catalog_xmin  from pg_replication_slots ;
  slot_type | xmin | catalog_xmin
-----------+------+--------------
  physical  |  816 |
(1 row)

> 
>>> Can we do something cheaper than rewriting the entire database? Seems
>>> rewriting a single table ought to be sufficient?
>>>
>>
>> While implementing the test at the table level I discovered that it looks like there is no guarantee that, say, a
>> "vacuum full pg_class;" would produce a conflict.
> 
> I assume that's mostly when there weren't any removal
> 
> 
>> Indeed, from what I can see in my testing it could generate an XLOG_HEAP2_PRUNE with snapshotConflictHorizon set to 0:
>>
>> "rmgr: Heap2       len (rec/tot):    107/   107, tx:        848, lsn: 0/03B98B30, prev 0/03B98AF0, desc: PRUNE
snapshotConflictHorizon0"
 
>>
>>
>> Having a snapshotConflictHorizon of zero leads to ResolveRecoveryConflictWithSnapshot() simply returning
>> without any conflict handling.
> 
> That doesn't have to mean anything bad. Some row versions can be removed without
> creating a conflict. See HeapTupleHeaderAdvanceConflictHorizon(), specifically
> 
>      * Ignore tuples inserted by an aborted transaction or if the tuple was
>      * updated/deleted by the inserting transaction.
> 
> 
> 
>> It does look like that in the standby decoding case that's not the right behavior (and that the xid that generated
>> the PRUNING should be used instead), what do you think?
> 
> That'd not work, because that'll be typically newer than the catalog_xmin. So
> we'd start invalidating things left and right, despite not needing to.
> 
> 

Okay, thanks for the explanations that makes sense.

> Did you see anything else around this making you suspicious?
> 

No, but a question still remains to me:

Given the fact that the row removal case is already done
in the next test (aka Scenario 2), if we want to replace the "vacuum full" test
on the database (done in Scenario 1) with a cheaper one at the table level,
what could it be to guarantee an invalidation?

Same as scenario 2 but with "vacuum full pg_class" would not really add value
to the tests, right?

>>>> +##################################################
>>>> +# Test standby promotion and logical decoding behavior
>>>> +# after the standby gets promoted.
>>>> +##################################################
>>>> +
>>>
>>> I think this also should test the streaming / walsender case.
>>>
>>
>> Do you mean cascading standby?
> 
> I mean a logical walsender that starts on a standby and continues across
> promotion of the standby.
> 

Got it, thanks, will do.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/19/23 10:43 AM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 1/19/23 3:46 AM, Andres Freund wrote:
>> Hi,
>>
>> I mean a logical walsender that starts on a standby and continues across
>> promotion of the standby.
>>
> 
> Got it, thanks, will do.
> 

While working on it, I noticed that with V41 a:

pg_recvlogical -S active_slot -P test_decoding -d postgres -f - --start

on the standby is getting:

pg_recvlogical: error: unexpected termination of replication stream: ERROR:  could not find record while sending
logically-decoded data: invalid record length at 0/311C438: wanted 24, got 0
pg_recvlogical: disconnected; waiting 5 seconds to try again

when the standby gets promoted (the logical decoding is able to resume correctly after the error though).

This is fixed in V42 attached (no error anymore and logical decoding through the walsender works correctly after the
promotion).

The fix is in 0003, in logical_read_xlog_page() (as compared to V41):

- We now check RecoveryInProgress() (instead of relying on am_cascading_walsender) to detect whether the standby got
promoted.
- Based on this, the currTLI is retrieved with GetXLogReplayRecPtr() or GetWALInsertionTimeLine() (so, with
GetWALInsertionTimeLine() after promotion).
- This currTLI is then used as an argument to WALRead() (instead of state->seg.ws_tli, which anyhow sounds weird as
being compared with itself that way, "tli != state->seg.ws_tli", in WALRead()). That way WALRead() discovers that the
timeline changed and then opens the right WAL file.
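Roughly, that boils down to the following sketch (illustrative only, simplified from the actual 0003 patch; state, targetPagePtr and cur_page are the logical_read_xlog_page() parameters):

    TimeLineID      currTLI;
    WALReadError    errinfo;

    if (RecoveryInProgress())
    {
        /* Still in recovery: use the timeline currently being replayed. */
        (void) GetXLogReplayRecPtr(&currTLI);
    }
    else
    {
        /* Standby got promoted: use the timeline WAL is now inserted into. */
        currTLI = GetWALInsertionTimeLine();
    }

    /*
     * Pass currTLI rather than state->seg.ws_tli, so that WALRead() notices
     * the timeline switch and opens the segment on the new timeline.
     */
    if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, currTLI, &errinfo))
        WALReadRaiseError(&errinfo);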
 

Please find V42 attached.

I'll resume working on the TAP tests comments.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Melanie Plageman
Date:
I'm new to this thread and subject, but I had a few basic thoughts about
the first patch in the set.

On Mon, Jan 23, 2023 at 12:03:35PM +0100, Drouvot, Bertrand wrote:
> 
> Please find V42 attached.
> 
> From 3c206bd77831d507f4f95e1942eb26855524571a Mon Sep 17 00:00:00 2001
> From: bdrouvotAWS <bdrouvot@amazon.com>
> Date: Mon, 23 Jan 2023 10:07:51 +0000
> Subject: [PATCH v42 1/6] Add info in WAL records in preparation for logical
>  slot conflict handling.
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> Overall design:
> 
> 1. We want to enable logical decoding on standbys, but replay of WAL
> from the primary might remove data that is needed by logical decoding,
> causing replication conflicts much as hot standby does.

It is a little confusing to mention replication conflicts in point 1. It
makes it sound like it already logs a recovery conflict. Without the
recovery conflict handling in this patchset, logical decoding of
statements using data that has been removed will fail with some error
like :
    ERROR:  could not map filenumber "xxx" to relation OID
Part of what this patchset does is introduce the concept of a new kind
of recovery conflict and a resolution process.
 
> 2. Our chosen strategy for dealing with this type of replication slot
> is to invalidate logical slots for which needed data has been removed.
> 
> 3. To do this we need the latestRemovedXid for each change, just as we
> do for physical replication conflicts, but we also need to know
> whether any particular change was to data that logical replication
> might access.

It isn't clear from the above sentence why you would need both. I think
it has something to do with what is below (hot_standby_feedback being
off), but I'm not sure, so the order is confusing.
 
> 4. We can't rely on the standby's relcache entries for this purpose in
> any way, because the startup process can't access catalog contents.
> 
> 5. Therefore every WAL record that potentially removes data from the
> index or heap must carry a flag indicating whether or not it is one
> that might be accessed during logical decoding.
> 
> Why do we need this for logical decoding on standby?
> 
> First, let's forget about logical decoding on standby and recall that
> on a primary database, any catalog rows that may be needed by a logical
> decoding replication slot are not removed.
> 
> This is done thanks to the catalog_xmin associated with the logical
> replication slot.
> 
> But, with logical decoding on standby, in the following cases:
> 
> - hot_standby_feedback is off
> - hot_standby_feedback is on but there is no physical slot between
>   the primary and the standby. Then, hot_standby_feedback will work,
>   but only while the connection is alive (for example a node restart
>   would break it)
> 
> Then, the primary may delete system catalog rows that could be needed
> by the logical decoding on the standby (as it does not know about the
> catalog_xmin on the standby).
> 
> So, it’s mandatory to identify those rows and invalidate the slots
> that may need them if any. Identifying those rows is the purpose of
> this commit.

I would like a few more specifics about this commit (patch 1 in the set)
itself in the commit message.

I think it would be good to have the commit message mention what kinds
of operations require WAL to contain information about whether or not it
is operating on a catalog table and why this is.

For example, I think the explanation of the feature makes it clear why
vacuum and pruning operations would require isCatalogRel, however it
isn't immediately obvious why page reuse would.

Also, because the diff has so many function signatures changed, it is
hard to tell which xlog record types are actually being changed to
include isCatalogRel. It might be too detailed/repetitive for the final
commit message to have a list of the xlog types requiring this info
(gistxlogPageReuse, spgxlogVacuumRedirect, xl_hash_vacuum_one_page,
xl_btree_reuse_page, xl_btree_delete, xl_heap_visible, xl_heap_prune,
xl_heap_freeze_page) but perhaps you could enumerate all the general
operations (freeze page, vacuum, prune, etc).

> Implementation:
> 
> When a WAL replay on standby indicates that a catalog table tuple is
> to be deleted by an xid that is greater than a logical slot's
> catalog_xmin, then that means the slot's catalog_xmin conflicts with
> the xid, and we need to handle the conflict. While subsequent commits
> will do the actual conflict handling, this commit adds a new field
> isCatalogRel in such WAL records (and a new bit set in the
> xl_heap_visible flags field), that is true for catalog tables, so as to
> arrange for conflict handling.

You do mention it a bit here, but I think it could be more clear and
specific.

> diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
> index ba394f08f6..235f1a1843 100644
> --- a/src/backend/access/gist/gist.c
> +++ b/src/backend/access/gist/gist.c
> @@ -348,7 +348,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
>          for (; ptr; ptr = ptr->next)
>          {
>              /* Allocate new page */
> -            ptr->buffer = gistNewBuffer(rel);
> +            ptr->buffer = gistNewBuffer(heapRel, rel);
>              GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
>              ptr->page = BufferGetPage(ptr->buffer);
>              ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
> diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
> index d21a308d41..a87890b965 100644
> --- a/src/backend/access/gist/gistbuild.c
> +++ b/src/backend/access/gist/gistbuild.c
> @@ -298,7 +298,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
>          Page        page;
>  
>          /* initialize the root page */
> -        buffer = gistNewBuffer(index);
> +        buffer = gistNewBuffer(heap, index);
>          Assert(BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO);
>          page = BufferGetPage(buffer);
>  
> diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
> index 56451fede1..119e34ce0f 100644
> --- a/src/backend/access/gist/gistutil.c
> +++ b/src/backend/access/gist/gistutil.c
> @@ -821,7 +821,7 @@ gistcheckpage(Relation rel, Buffer buf)
>   * Caller is responsible for initializing the page by calling GISTInitBuffer
>   */
>  Buffer
> -gistNewBuffer(Relation r)
> +gistNewBuffer(Relation heaprel, Relation r)
>  {

It is not very important, but I noticed you made "heaprel" the last
parameter to all of the btree-related functions but the first parameter
to the gist functions. I thought it might be nice to make the order
consistent. I also was wondering why you made it the last argument to
all the btree functions to begin with (i.e. instead of directly after
the first rel argument).

> diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
> index 8af33d7b40..9bdac12baf 100644
> --- a/src/include/access/gist_private.h
> +++ b/src/include/access/gist_private.h
> @@ -440,7 +440,7 @@ extern XLogRecPtr gistXLogPageDelete(Buffer buffer,
>                                       FullTransactionId xid, Buffer parentBuffer,
>                                       OffsetNumber downlinkOffset);
>  
> -extern void gistXLogPageReuse(Relation rel, BlockNumber blkno,
> +extern void gistXLogPageReuse(Relation heaprel, Relation rel, BlockNumber blkno,
>                                FullTransactionId deleteXid);
>  
>  extern XLogRecPtr gistXLogUpdate(Buffer buffer,
> @@ -485,7 +485,7 @@ extern bool gistproperty(Oid index_oid, int attno,
>  extern bool gistfitpage(IndexTuple *itvec, int len);
>  extern bool gistnospace(Page page, IndexTuple *itvec, int len, OffsetNumber todelete, Size freespace);
>  extern void gistcheckpage(Relation rel, Buffer buf);
> -extern Buffer gistNewBuffer(Relation r);
> +extern Buffer gistNewBuffer(Relation heaprel, Relation r);
>  extern bool gistPageRecyclable(Page page);
>  extern void gistfillbuffer(Page page, IndexTuple *itup, int len,
>                             OffsetNumber off);
> diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
> index 09f9b0f8c6..191f0e5808 100644
> --- a/src/include/access/gistxlog.h
> +++ b/src/include/access/gistxlog.h
> @@ -51,13 +51,13 @@ typedef struct gistxlogDelete
>  {
>      TransactionId snapshotConflictHorizon;
>      uint16        ntodelete;        /* number of deleted offsets */
> +    bool        isCatalogRel;

In some of these struct definitions, I think it would help comprehension
to have a comment explaining the purpose of this member.

>  
> -    /*
> -     * In payload of blk 0 : todelete OffsetNumbers
> -     */
> +    /* TODELETE OFFSET NUMBERS */
> +    OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];

Thanks for all your hard work on this feature!

Best,
Melanie



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-01-19 10:43:27 +0100, Drouvot, Bertrand wrote:
> > > With a reload in place in my testing, now I notice that the catalog_xmin
> > > is updated on the primary physical slot after logical slots invalidation
> > > when reloading hot_standby_feedback from "off" to "on".
> > > 
> > > This is not the case after a re-start (aka catalog_xmin is NULL).
> > > 
> > > I think a re-start and reload should produce identical behavior on
> > > the primary physical slot. If so, I'm tempted to think that the catalog_xmin
> > > should be updated in case of a re-start too (even if all the logical slots are invalidated)
> > > because the slots are not dropped yet. What do you think?
> > 
> > I can't quite follow the steps leading up to the difference. Could you list
> > them in a bit more detail?
> > 
> > 
> 
> Sure, so with:
> 
> 1) hot_standby_feedback set to off on the standby
> 2) create 2 logical replication slots on the standby and activate one
> 3) Invalidate the logical slots on the standby with VACUUM FULL on the primary
> 4) change hot_standby_feedback to on on the standby
> 
> If:
> 
> 5) pg_reload_conf() on the standby, then on the primary we get a catalog_xmin
> for the physical slot that the standby is attached to:
> 
> postgres=# select slot_type,xmin,catalog_xmin  from pg_replication_slots ;
>  slot_type | xmin | catalog_xmin
> -----------+------+--------------
>  physical  |  822 |          748
> (1 row)

How long did you wait for this to change? I don't think there's anything right
now that'd force a new hot-standby-feedback message to be sent to the primary,
after slots got invalidated.

I suspect that if you terminated the walsender connection on the primary,
you'd not see it anymore either?

If that isn't it, something is broken in InvalidateObsolete...


> No, but a question still remains to me:
> 
> Given the fact that the row removal case is already done
> in the next test (aka Scenario 2), if we want to replace the "vacuum full" test
> on the database (done in Scenario 1) with a cheaper one at the table level,
> what could it be to guarantee an invalidation?
> 
> Same as scenario 2 but with "vacuum full pg_class" would not really add value
> to the tests, right?

A database wide VACUUM FULL is also just a row removal test, no? I think it
makes sense to test that both VACUUM and VACUUM FULL both trigger conflicts,
because they internally use *very* different mechanisms.  It'd probably be
good to test at least conflicts triggered due to row removal via on-access
pruning as well. And perhaps also for btree killtuples.  I think those are the
common cases for catalog tables.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/24/23 1:46 AM, Andres Freund wrote:
> Hi,
> 
> On 2023-01-19 10:43:27 +0100, Drouvot, Bertrand wrote:
>>>> With a reload in place in my testing, now I notice that the catalog_xmin
>>>> is updated on the primary physical slot after logical slots invalidation
>>>> when reloading hot_standby_feedback from "off" to "on".
>>>>
>>>> This is not the case after a re-start (aka catalog_xmin is NULL).
>>>>
>>>> I think a re-start and reload should produce identical behavior on
>>>> the primary physical slot. If so, I'm tempted to think that the catalog_xmin
>>>> should be updated in case of a re-start too (even if all the logical slots are invalidated)
>>>> because the slots are not dropped yet. What do you think?
>>>
>>> I can't quite follow the steps leading up to the difference. Could you list
>>> them in a bit more detail?
>>>
>>>
>>
>> Sure, so with:
>>
>> 1) hot_standby_feedback set to off on the standby
>> 2) create 2 logical replication slots on the standby and activate one
>> 3) Invalidate the logical slots on the standby with VACUUM FULL on the primary
>> 4) change hot_standby_feedback to on on the standby
>>
>> If:
>>
>> 5) pg_reload_conf() on the standby, then on the primary we get a catalog_xmin
>> for the physical slot that the standby is attached to:
>>
>> postgres=# select slot_type,xmin,catalog_xmin  from pg_replication_slots ;
>>   slot_type | xmin | catalog_xmin
>> -----------+------+--------------
>>   physical  |  822 |          748
>> (1 row)
> 
> How long did you wait for this to change? 

Almost instantaneous after pg_reload_conf() on the standby.

> I don't think there's anything right
> now that'd force a new hot-standby-feedback message to be sent to the primary,
> after slots got invalidated.
> 
> I suspect that if you terminated the walsender connection on the primary,
> you'd not see it anymore either?
> 

Still there after the standby is shutdown but disappears when the standby is re-started.

> If that isn't it, something is broken in InvalidateObsolete...
> 

Will look at what's going on and ensure catalog_xmin is not sent to the primary after pg_reload_conf() (if the slots
are invalidated).

> 
>> No, but a question still remains to me:
>>
>> Given the fact that the row removal case is already done
>> in the next test (aka Scenario 2), if we want to replace the "vacuum full" test
>> on the database (done in Scenario 1) with a cheaper one at the table level,
>> what could it be to guarantee an invalidation?
>>
>> Same as scenario 2 but with "vacuum full pg_class" would not really add value
>> to the tests, right?
> 
> A database wide VACUUM FULL is also just a row removal test, no? 

Yeah, so I was wondering whether Scenario 1 wasn't simply useless.

> I think it
> makes sense to test that both VACUUM and VACUUM FULL both trigger conflicts,
> because they internally use *very* different mechanisms.  

Got it, will do and replace Scenario 1 as you suggested initially.

> It'd probably be
> good to test at least conflicts triggered due to row removal via on-access
> pruning as well. And perhaps also for btree killtuples.  I think those are the
> common cases for catalog tables.
> 

Thanks for the proposal, will look at it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/24/23 6:20 AM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 1/24/23 1:46 AM, Andres Freund wrote:
>> Hi,
>>
>> On 2023-01-19 10:43:27 +0100, Drouvot, Bertrand wrote:
>>> Sure, so with:
>>>
>>> 1) hot_standby_feedback set to off on the standby
>>> 2) create 2 logical replication slots on the standby and activate one
>>> 3) Invalidate the logical slots on the standby with VACUUM FULL on the primary
>>> 4) change hot_standby_feedback to on on the standby
>>>
>>> If:
>>>
>>> 5) pg_reload_conf() on the standby, then on the primary we get a catalog_xmin
>>> for the physical slot that the standby is attached to:
>>>
>>> postgres=# select slot_type,xmin,catalog_xmin  from pg_replication_slots ;
>>>   slot_type | xmin | catalog_xmin
>>> -----------+------+--------------
>>>   physical  |  822 |          748
>>> (1 row)
>>
>> How long did you wait for this to change? 
> 
> Almost instantaneous after pg_reload_conf() on the standby.
> 
>> I don't think there's anything right
>> now that'd force a new hot-standby-feedback message to be sent to the primary,
>> after slots got invalidated.
>>
>> I suspect that if you terminated the walsender connection on the primary,
>> you'd not see it anymore either?
>>
> 
> Still there after the standby is shutdown but disappears when the standby is re-started.
> 
>> If that isn't it, something is broken in InvalidateObsolete...
>>

Yeah, you are right: ReplicationSlotsComputeRequiredXmin() is missing for the
logical slot invalidation case (and ReplicationSlotsComputeRequiredXmin() also
needs to take care of it).

I'll provide a fix in the next revision along with the TAP tests comments addressed.
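For reference, the kind of fix being discussed boils down to something like this sketch (illustrative only; the exact placement inside the invalidation code may differ, and "invalidated" stands for "at least one slot was just invalidated"):

    /*
     * After invalidating conflicting logical slots, recompute the slot
     * horizons so that an invalidated slot's catalog_xmin no longer holds
     * back what hot_standby_feedback reports to the primary.  For this to
     * help, the recomputation must also skip invalidated slots.
     */
    if (invalidated)
    {
        ReplicationSlotsComputeRequiredXmin(false);
        ReplicationSlotsComputeRequiredLSN();
    }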

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/24/23 12:21 AM, Melanie Plageman wrote:
> I'm new to this thread and subject, but I had a few basic thoughts about
> the first patch in the set.
> 

Thanks for looking at it!

> On Mon, Jan 23, 2023 at 12:03:35PM +0100, Drouvot, Bertrand wrote:
>>
>> Please find V42 attached.
>>
>>  From 3c206bd77831d507f4f95e1942eb26855524571a Mon Sep 17 00:00:00 2001
>> From: bdrouvotAWS <bdrouvot@amazon.com>
>> Date: Mon, 23 Jan 2023 10:07:51 +0000
>> Subject: [PATCH v42 1/6] Add info in WAL records in preparation for logical
>>   slot conflict handling.
>> MIME-Version: 1.0
>> Content-Type: text/plain; charset=UTF-8
>> Content-Transfer-Encoding: 8bit
>>
>> Overall design:
>>
>> 1. We want to enable logical decoding on standbys, but replay of WAL
>> from the primary might remove data that is needed by logical decoding,
>> causing replication conflicts much as hot standby does.
> 
> It is a little confusing to mention replication conflicts in point 1. It
> makes it sound like it already logs a recovery conflict. Without the
> recovery conflict handling in this patchset, logical decoding of
> statements using data that has been removed will fail with some error
> like :
>     ERROR:  could not map filenumber "xxx" to relation OID
> Part of what this patchset does is introduce the concept of a new kind
> of recovery conflict and a resolution process.
>   

I think I understand what you mean, what about the following?

1. We want to enable logical decoding on standbys, but replay of WAL
from the primary might remove data that is needed by logical decoding,
causing error(s) on the standby.

To prevent those errors, a new replication conflict scenario
needs to be addressed (much as hot standby does).

>> 2. Our chosen strategy for dealing with this type of replication slot
>> is to invalidate logical slots for which needed data has been removed.
>>
>> 3. To do this we need the latestRemovedXid for each change, just as we
>> do for physical replication conflicts, but we also need to know
>> whether any particular change was to data that logical replication
>> might access.

> It isn't clear from the above sentence why you would need both. I think
> it has something to do with what is below (hot_standby_feedback being
> off), but I'm not sure, so the order is confusing.
>   

Right, it has to deal with the xid horizons too. So the idea is to check
1) whether there is a risk of conflict and 2) if there is a risk, whether
there actually is a conflict (based on the xid). I'll reword this part.
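To illustrate the two steps during WAL replay on the standby (purely a sketch; InvalidateConflictingLogicalSlots is a hypothetical name used only here, not a function from the patch):

    if (InHotStandby &&
        xlrec->isCatalogRel &&        /* 1) risk: catalog rows were removed */
        TransactionIdIsValid(xlrec->snapshotConflictHorizon))
    {
        /*
         * 2) actual conflict: any logical slot whose catalog_xmin is older
         * than (or equal to) the removal horizon may still need the removed
         * catalog rows, so it has to be invalidated.
         */
        InvalidateConflictingLogicalSlots(xlrec->snapshotConflictHorizon);
    }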

>> 4. We can't rely on the standby's relcache entries for this purpose in
>> any way, because the startup process can't access catalog contents.
>>
>> 5. Therefore every WAL record that potentially removes data from the
>> index or heap must carry a flag indicating whether or not it is one
>> that might be accessed during logical decoding.
>>
>> Why do we need this for logical decoding on standby?
>>
>> First, let's forget about logical decoding on standby and recall that
>> on a primary database, any catalog rows that may be needed by a logical
>> decoding replication slot are not removed.
>>
>> This is done thanks to the catalog_xmin associated with the logical
>> replication slot.
>>
>> But, with logical decoding on standby, in the following cases:
>>
>> - hot_standby_feedback is off
>> - hot_standby_feedback is on but there is no physical slot between
>>    the primary and the standby. Then, hot_standby_feedback will work,
>>    but only while the connection is alive (for example a node restart
>>    would break it)
>>
>> Then, the primary may delete system catalog rows that could be needed
>> by the logical decoding on the standby (as it does not know about the
>> catalog_xmin on the standby).
>>
>> So, it’s mandatory to identify those rows and invalidate the slots
>> that may need them if any. Identifying those rows is the purpose of
>> this commit.
> 
> I would like a few more specifics about this commit (patch 1 in the set)
> itself in the commit message.
> 
> I think it would be good to have the commit message mention what kinds
> of operations require WAL to contain information about whether or not it
> is operating on a catalog table and why this is.
> 
> For example, I think the explanation of the feature makes it clear why
> vacuum and pruning operations would require isCatalogRel, however it
> isn't immediately obvious why page reuse would.
> 

What do you think about putting those extra explanations in the code instead?

> Also, because the diff has so many function signatures changed, it is
> hard to tell which xlog record types are actually being changed to
> include isCatalogRel. It might be too detailed/repetitive for the final
> commit message to have a list of the xlog types requiring this info
> (gistxlogPageReuse, spgxlogVacuumRedirect, xl_hash_vacuum_one_page,
> xl_btree_reuse_page, xl_btree_delete, xl_heap_visible, xl_heap_prune,
> xl_heap_freeze_page) but perhaps you could enumerate all the general
> operations (freeze page, vacuum, prune, etc).
> 

Right, in the end there are only a few making "real" use of it: they can be
enumerated in the commit message. Will do.

>> Implementation:
>>
>> When a WAL replay on standby indicates that a catalog table tuple is
>> to be deleted by an xid that is greater than a logical slot's
>> catalog_xmin, then that means the slot's catalog_xmin conflicts with
>> the xid, and we need to handle the conflict. While subsequent commits
>> will do the actual conflict handling, this commit adds a new field
>> isCatalogRel in such WAL records (and a new bit set in the
>> xl_heap_visible flags field), that is true for catalog tables, so as to
>> arrange for conflict handling.
> 
> You do mention it a bit here, but I think it could be more clear and
> specific.

Ok, will try to be more clear.

> 
>> diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
>> index ba394f08f6..235f1a1843 100644
>> --- a/src/backend/access/gist/gist.c
>> +++ b/src/backend/access/gist/gist.c
>> @@ -348,7 +348,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
>>           for (; ptr; ptr = ptr->next)
>>           {
>>               /* Allocate new page */
>> -            ptr->buffer = gistNewBuffer(rel);
>> +            ptr->buffer = gistNewBuffer(heapRel, rel);
>>               GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
>>               ptr->page = BufferGetPage(ptr->buffer);
>>               ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
>> diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
>> index d21a308d41..a87890b965 100644
>> --- a/src/backend/access/gist/gistbuild.c
>> +++ b/src/backend/access/gist/gistbuild.c
>> @@ -298,7 +298,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
>>           Page        page;
>>   
>>           /* initialize the root page */
>> -        buffer = gistNewBuffer(index);
>> +        buffer = gistNewBuffer(heap, index);
>>           Assert(BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO);
>>           page = BufferGetPage(buffer);
>>   
>> diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
>> index 56451fede1..119e34ce0f 100644
>> --- a/src/backend/access/gist/gistutil.c
>> +++ b/src/backend/access/gist/gistutil.c
>> @@ -821,7 +821,7 @@ gistcheckpage(Relation rel, Buffer buf)
>>    * Caller is responsible for initializing the page by calling GISTInitBuffer
>>    */
>>   Buffer
>> -gistNewBuffer(Relation r)
>> +gistNewBuffer(Relation heaprel, Relation r)
>>   {
> 
> It is not very important, but I noticed you made "heaprel" the last
> parameter to all of the btree-related functions but the first parameter
> to the gist functions. I thought it might be nice to make the order
> consistent.

Agree, will do.

> I also was wondering why you made it the last argument to
> all the btree functions to begin with (i.e. instead of directly after
> the first rel argument).
> 

No real reasons, will put all of them after the first rel argument (that seems a better place).

>> diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
>> index 8af33d7b40..9bdac12baf 100644
>> --- a/src/include/access/gist_private.h
>> +++ b/src/include/access/gist_private.h
>> @@ -440,7 +440,7 @@ extern XLogRecPtr gistXLogPageDelete(Buffer buffer,
>>                                        FullTransactionId xid, Buffer parentBuffer,
>>                                        OffsetNumber downlinkOffset);
>>   
>> -extern void gistXLogPageReuse(Relation rel, BlockNumber blkno,
>> +extern void gistXLogPageReuse(Relation heaprel, Relation rel, BlockNumber blkno,
>>                                 FullTransactionId deleteXid);
>>   
>>   extern XLogRecPtr gistXLogUpdate(Buffer buffer,
>> @@ -485,7 +485,7 @@ extern bool gistproperty(Oid index_oid, int attno,
>>   extern bool gistfitpage(IndexTuple *itvec, int len);
>>   extern bool gistnospace(Page page, IndexTuple *itvec, int len, OffsetNumber todelete, Size freespace);
>>   extern void gistcheckpage(Relation rel, Buffer buf);
>> -extern Buffer gistNewBuffer(Relation r);
>> +extern Buffer gistNewBuffer(Relation heaprel, Relation r);
>>   extern bool gistPageRecyclable(Page page);
>>   extern void gistfillbuffer(Page page, IndexTuple *itup, int len,
>>                              OffsetNumber off);
>> diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
>> index 09f9b0f8c6..191f0e5808 100644
>> --- a/src/include/access/gistxlog.h
>> +++ b/src/include/access/gistxlog.h
>> @@ -51,13 +51,13 @@ typedef struct gistxlogDelete
>>   {
>>       TransactionId snapshotConflictHorizon;
>>       uint16        ntodelete;        /* number of deleted offsets */
>> +    bool        isCatalogRel;
> 
> In some of these struct definitions, I think it would help comprehension
> to have a comment explaining the purpose of this member.
> 

Yeah, agree but it could be done in another patch (outside of this feature), agree?

> Thanks for all your hard work on this feature!

Thanks for the review and the feedback!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/24/23 3:31 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 1/24/23 6:20 AM, Drouvot, Bertrand wrote:
>> Hi,
>>
>> On 1/24/23 1:46 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2023-01-19 10:43:27 +0100, Drouvot, Bertrand wrote:
>>>> Sure, so with:
>>>>
>>>> 1) hot_standby_feedback set to off on the standby
>>>> 2) create 2 logical replication slots on the standby and activate one
>>>> 3) Invalidate the logical slots on the standby with VACUUM FULL on the primary
>>>> 4) change hot_standby_feedback to on on the standby
>>>>
>>>> If:
>>>>
>>>> 5) pg_reload_conf() on the standby, then on the primary we get a catalog_xmin
>>>> for the physical slot that the standby is attached to:
>>>>
>>>> postgres=# select slot_type,xmin,catalog_xmin  from pg_replication_slots ;
>>>>   slot_type | xmin | catalog_xmin
>>>> -----------+------+--------------
>>>>   physical  |  822 |          748
>>>> (1 row)
>>>
>>> How long did you wait for this to change? 
>>
>> Almost instantaneous after pg_reload_conf() on the standby.
>>
>>> I don't think there's anything right
>>> now that'd force a new hot-standby-feedback message to be sent to the primary,
>>> after slots got invalidated.
>>>
>>> I suspect that if you terminated the walsender connection on the primary,
>>> you'd not see it anymore either?
>>>
>>
>> Still there after the standby is shutdown but disappears when the standby is re-started.
>>
>>> If that isn't it, something is broken in InvalidateObsolete...
>>>
> 
> Yeah, you are right: ReplicationSlotsComputeRequiredXmin() is missing for the
> logical slot invalidation case (and ReplicationSlotsComputeRequiredXmin() also
> needs to take care of it).
> 
> I'll provide a fix in the next revision along with the TAP tests comments addressed.
> 

Please find attached V43 addressing the comments related to the TAP tests (in 0004 at that time) that have been done in
[1].
 

Remarks:

- The C helper function to call LogStandbySnapshot() is not done yet.
- While working on it, I discovered that the new isCatalogRel field was not populated in gistXLogDelete(): fixed in
V43.
- The issue described above is also fixed so that a standby restart or a reload would produce the same behavior
on the primary physical slot (aka catalog_xmin is empty if logical slots are invalidated).
- A test with pg_recvlogical started before the standby promotion has been added.
- A test for conflict due to row removal via on-access pruning has been added.
- I'm struggling to create a test for btree killtuples as there is a need for rows removal on the table (that could
produce a conflict too):

Do you have a scenario in mind for this one? (and btw, in what kind of WAL record should the conflict be detected in
such a case? xl_btree_delete?)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

[1]: https://www.postgresql.org/message-id/20230106034036.2m4qnn7ep7b5ipet%40awork3.anarazel.de
Attachment

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-01-26 18:56:10 +0100, Drouvot, Bertrand wrote:
> - I'm struggling to create a test for btree killtuples as there is a need for rows removal on the table (that could
> produce a conflict too):

> Do you have a scenario in mind for this one? (and btw, in what kind of WAL record should the conflict be detected in
> such a case? xl_btree_delete?)

Hm, it might indeed be hard in "modern" postgres.  I think you'd need at least
two concurrent sessions, to prevent on-access pruning on the table.


DROP TABLE IF EXISTS indexdel;
CREATE TABLE indexdel(id int8 primary key);
INSERT INTO indexdel SELECT generate_series(1, 10000);
VACUUM indexdel; -- ensure hint bits are set etc

DELETE FROM indexdel;

SELECT pg_current_wal_insert_lsn();

SET enable_indexonlyscan = false;
-- This scan finds that the index items are dead - but doesn't yet issue a
-- btree delete WAL record, that only happens when needing space on the page
-- again.
EXPLAIN (COSTS OFF, SUMMARY OFF) SELECT id FROM indexdel WHERE id < 10 ORDER BY id ASC;
SELECT id FROM indexdel WHERE id < 100 ORDER BY id ASC;

-- The insertions into the range of values prev
INSERT INTO indexdel SELECT generate_series(1, 100);


Does generate the btree deletion record, but it also does emit a PRUNE (from
heapam_index_fetch_tuple() -> heap_page_prune_opt()).

While the session could create a cursor to prevent later HOT cleanup, the
query would also trigger hot pruning (or prevent the rows from being dead, if
you declare the cursor before the DELETE). So you'd need overlapping cursors
in a concurrent session...

Too complicated.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/26/23 9:13 PM, Andres Freund wrote:
> Hi,
> 
> On 2023-01-26 18:56:10 +0100, Drouvot, Bertrand wrote:
>> - I'm struggling to create a test for btree killtuples as there is a need for rows removal on the table (that could
>> produce a conflict too):

>> Do you have a scenario in mind for this one? (and btw, in what kind of WAL record should the conflict be detected in
>> such a case? xl_btree_delete?)
> 
> Hm, it might indeed be hard in "modern" postgres.  I think you'd need at least
> two concurrent sessions, to prevent on-access pruning on the table.
> 
> 
> DROP TABLE IF EXISTS indexdel;
> CREATE TABLE indexdel(id int8 primary key);
> INSERT INTO indexdel SELECT generate_series(1, 10000);
> VACUUM indexdel; -- ensure hint bits are set etc
> 
> DELETE FROM indexdel;
> 
> SELECT pg_current_wal_insert_lsn();
> 
> SET enable_indexonlyscan = false;
> -- This scan finds that the index items are dead - but doesn't yet issue a
> -- btree delete WAL record, that only happens when needing space on the page
> -- again.
> EXPLAIN (COSTS OFF, SUMMARY OFF) SELECT id FROM indexdel WHERE id < 10 ORDER BY id ASC;
> SELECT id FROM indexdel WHERE id < 100 ORDER BY id ASC;
> 
> -- The insertions into the range of values prev
> INSERT INTO indexdel SELECT generate_series(1, 100);
> 
> 
> Does generate the btree deletion record, but it also does emit a PRUNE (from
> heapam_index_fetch_tuple() -> heap_page_prune_opt()).
> 
> While the session could create a cursor to prevent later HOT cleanup, the
> query would also trigger hot pruning (or prevent the rows from being dead, if
> you declare the cursor before the DELETE). So you'd need overlapping cursors
> in a concurrent session...
> 

Thanks for the scenario and explanation!

I agree that a second session would be needed (and so I understand why I was
struggling when trying with a single session ;-) )

> Too complicated.
> 

Yeah.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/11/23 5:52 PM, Andres Freund wrote:
> Hi,
> 
> On 2023-01-11 13:02:13 +0530, Bharath Rupireddy wrote:
>> 3. Is this feature still a 'minimal logical decoding on standby'?
>> Firstly, why is it 'minimal'?
> 
> It's minimal in comparison to other proposals at the time that did explicit /
> active coordination between primary and standby to allow logical decoding.
> 
> 
> 
>> 0002:
>> 1.
>> -    if (InvalidateObsoleteReplicationSlots(_logSegNo))
>> +    InvalidateObsoleteOrConflictingLogicalReplicationSlots(_logSegNo,
>> &invalidated, InvalidOid, NULL);
>>
>> Isn't the function name too long and verbose?
> 
> +1
> 
> 
>> How about just InvalidateLogicalReplicationSlots() let the function comment
>> talk about what sorts of replication slots it invalides?
> 
> I'd just leave the name unmodified at InvalidateObsoleteReplicationSlots().
> 
> 

Done in V44 attached.


>> 2.
>> +                                errdetail("Logical decoding on
>> standby requires wal_level to be at least logical on master"));
>> + *     master wal_level is set back to replica, so existing logical
>> slots need to
>> invalidate such slots. Also do the same thing if wal_level on master
>>
>> Can we use 'primary server' instead of 'master' like elsewhere? This
>> comment also applies for other patches too, if any.
> 
> +1
> 

Done in V44.

> 
>> 3. Can we show a new status in pg_get_replication_slots's wal_status
>> for invalidated due to the conflict so that the user can monitor for
>> the new status and take necessary actions?
> 
> Invalidated slots are not a new concept introduced in this patchset, so I'd
> say we can introduce such a field separately.
> 

In V44, adding a new field "conflicting" in pg_replication_slots which is:

- NULL for physical slots
- True if the slot is a logical one and has been invalidated due to recovery conflict
- False if the slot is a logical one and has not been invalidated due to recovery conflict

I'm not checking if recovery is in progress while displaying the "conflicting". The reason is to still display
the right status after a promotion.

TAP tests are also updated to test that this new field behaves as expected (for both physical and logical slots).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

Just realized that Melanie was missing in the up-thread reply to
her feedback (not sure what happened, sorry about that).... So, adding her here.

Please find attached V45 addressing Melanie's feedback.

On 1/24/23 3:59 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 1/24/23 12:21 AM, Melanie Plageman wrote:
>> I'm new to this thread and subject, but I had a few basic thoughts about
>> the first patch in the set.
>>
> 
> Thanks for looking at it!
> 
>> On Mon, Jan 23, 2023 at 12:03:35PM +0100, Drouvot, Bertrand wrote:
>>> 1. We want to enable logical decoding on standbys, but replay of WAL
>>> from the primary might remove data that is needed by logical decoding,
>>> causing replication conflicts much as hot standby does.
>>
>> It is a little confusing to mention replication conflicts in point 1. It
>> makes it sound like it already logs a recovery conflict. Without the
>> recovery conflict handling in this patchset, logical decoding of
>> statements using data that has been removed will fail with some error
>> like :
>>     ERROR:  could not map filenumber "xxx" to relation OID
>> Part of what this patchset does is introduce the concept of a new kind
>> of recovery conflict and a resolution process.

Changed the wording in V45's commit message.

>>> 3. To do this we need the latestRemovedXid for each change, just as we
>>> do for physical replication conflicts, but we also need to know
>>> whether any particular change was to data that logical replication
>>> might access.
> 
>> It isn't clear from the above sentence why you would need both. I think
>> it has something to do with what is below (hot_standby_feedback being
>> off), but I'm not sure, so the order is confusing.
> 

Trying to be more clear in the new commit message.

> 
>>> Implementation:
>>>
>>> When a WAL replay on standby indicates that a catalog table tuple is
>>> to be deleted by an xid that is greater than a logical slot's
>>> catalog_xmin, then that means the slot's catalog_xmin conflicts with
>>> the xid, and we need to handle the conflict. While subsequent commits
>>> will do the actual conflict handling, this commit adds a new field
>>> isCatalogRel in such WAL records (and a new bit set in the
>>> xl_heap_visible flags field), that is true for catalog tables, so as to
>>> arrange for conflict handling.
>>
>> You do mention it a bit here, but I think it could be more clear and
>> specific.
> 

Added the WAL record types impacted by the change in the new commit message.


>>
>> It is not very important, but I noticed you made "heaprel" the last
>> parameter to all of the btree-related functions but the first parameter
>> to the gist functions. I thought it might be nice to make the order
>> consistent.
> 
> Agree, will do.

Done.

> 
>> I also was wondering why you made it the last argument to
>> all the btree functions to begin with (i.e. instead of directly after
>> the first rel argument).
>>
> 
> No real reasons, will put all of them after the first rel argument (that seems a better place).

Done.

> 
>>> diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
>>> index 8af33d7b40..9bdac12baf 100644
>>> --- a/src/include/access/gist_private.h
>>> +++ b/src/include/access/gist_private.h
>>> @@ -440,7 +440,7 @@ extern XLogRecPtr gistXLogPageDelete(Buffer buffer,
>>>                                        FullTransactionId xid, Buffer parentBuffer,
>>>                                        OffsetNumber downlinkOffset);
>>> -extern void gistXLogPageReuse(Relation rel, BlockNumber blkno,
>>> +extern void gistXLogPageReuse(Relation heaprel, Relation rel, BlockNumber blkno,
>>>                                 FullTransactionId deleteXid);
>>>   extern XLogRecPtr gistXLogUpdate(Buffer buffer,
>>> @@ -485,7 +485,7 @@ extern bool gistproperty(Oid index_oid, int attno,
>>>   extern bool gistfitpage(IndexTuple *itvec, int len);
>>>   extern bool gistnospace(Page page, IndexTuple *itvec, int len, OffsetNumber todelete, Size freespace);
>>>   extern void gistcheckpage(Relation rel, Buffer buf);
>>> -extern Buffer gistNewBuffer(Relation r);
>>> +extern Buffer gistNewBuffer(Relation heaprel, Relation r);
>>>   extern bool gistPageRecyclable(Page page);
>>>   extern void gistfillbuffer(Page page, IndexTuple *itup, int len,
>>>                              OffsetNumber off);
>>> diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
>>> index 09f9b0f8c6..191f0e5808 100644
>>> --- a/src/include/access/gistxlog.h
>>> +++ b/src/include/access/gistxlog.h
>>> @@ -51,13 +51,13 @@ typedef struct gistxlogDelete
>>>   {
>>>       TransactionId snapshotConflictHorizon;
>>>       uint16        ntodelete;        /* number of deleted offsets */
>>> +    bool        isCatalogRel;
>>
>> In some of these struct definitions, I think it would help comprehension
>> to have a comment explaining the purpose of this member.
>>
> 
> Yeah, agree but it could be done in another patch (outside of this feature), agree?

Please forget about my previous reply (I misunderstood and thought you were mentioning the offset's Array).

Added comments about isCatalogRel in V45 attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/6/23 4:40 AM, Andres Freund wrote:
> Hi,
> On 2023-01-05 16:15:39 -0500, Robert Haas wrote:
>> On Tue, Jan 3, 2023 at 2:42 AM Drouvot, Bertrand
>> <bertranddrouvot.pg@gmail.com> wrote:
> 0006:
> 
>> diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
>> index bc3c3eb3e7..98c96eb864 100644
>> --- a/src/backend/access/transam/xlogrecovery.c
>> +++ b/src/backend/access/transam/xlogrecovery.c
>> @@ -358,6 +358,9 @@ typedef struct XLogRecoveryCtlData
>>       RecoveryPauseState recoveryPauseState;
>>       ConditionVariable recoveryNotPausedCV;
>>
>> +    /* Replay state (see getReplayedCV() for more explanation) */
>> +    ConditionVariable replayedCV;
>> +
>>       slock_t        info_lck;        /* locks shared variables shown above */
>>   } XLogRecoveryCtlData;
>>
> 
> getReplayedCV() doesn't seem to fit into any of the naming scheems in use for
> xlogrecovery.h.

Changed to check_for_replay() in V46 attached.

>> -         * Sleep until something happens or we time out.  Also wait for the
>> -         * socket becoming writable, if there's still pending output.
>> +         * When not in recovery, sleep until something happens or we time out.
>> +         * Also wait for the socket becoming writable, if there's still pending output.
> 
> Hm. Is there a problem with not handling the becoming-writable case in the
> in-recovery case?
> 
> 

Yes, when not in recovery we'd wait for the timeout to occur in  ConditionVariableTimedSleep()
(as the CV is broadcasted only in ApplyWalRecord()).

>> +        else
>> +        /*
>> +         * We are in the logical decoding on standby case.
>> +         * We are waiting for the startup process to replay wal record(s) using
>> +         * a timeout in case we are requested to stop.
>> +         */
>> +        {
> 
> I don't think pgindent will like that formatting....

Oops, fixed.

> 
> 
>> +            ConditionVariablePrepareToSleep(replayedCV);
>> +            ConditionVariableTimedSleep(replayedCV, 1000,
>> +                                        WAIT_EVENT_WAL_SENDER_WAIT_REPLAY);
>> +        }
> 
> I think this is racy, see ConditionVariablePrepareToSleep()'s comment:
> 
>   * Caution: "before entering the loop" means you *must* test the exit
>   * condition between calling ConditionVariablePrepareToSleep and calling
>   * ConditionVariableSleep.  If that is inconvenient, omit calling
>   * ConditionVariablePrepareToSleep.
> 
> Basically, the ConditionVariablePrepareToSleep() should be before the loop
> body.
> 

I missed it, thanks! Moved it before the loop body.

> 
> I don't think the fixed timeout here makes sense. For one, we need to wake up
> based on WalSndComputeSleeptime(), otherwise we're ignoring wal_sender_timeout
> (which can be quite small).  

Good point. Making use of WalSndComputeSleeptime() instead in V46.
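So the wait now roughly follows this pattern (a simplified sketch of the approach, not the exact V46 code; replayedCV and loc come from the surrounding walsender/recovery context):

    ConditionVariablePrepareToSleep(replayedCV);

    for (;;)
    {
        long        sleeptime;

        /* Test the exit condition between PrepareToSleep and Sleep. */
        if (!RecoveryInProgress() || loc <= GetXLogReplayRecPtr(NULL))
            break;

        /* Honor wal_sender_timeout etc. instead of a fixed 1s timeout. */
        sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());

        (void) ConditionVariableTimedSleep(replayedCV, sleeptime,
                                           WAIT_EVENT_WAL_SENDER_WAIT_REPLAY);
    }
    ConditionVariableCancelSleep();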

> It's also just way too frequent - we're trying to
> avoid constantly waking up unnecessarily.
> 
> 
> Perhaps we could deal with the pq_is_send_pending() issue by having a version
> of ConditionVariableTimedSleep() that accepts a WaitEventSet?
> 

What issue do you see?
The one that I see with V46 (keeping the in/not recovery branches) is that one may need to wait
for wal_sender_timeout to see changes that occurred right after the promotion.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/31/23 12:50 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 1/6/23 4:40 AM, Andres Freund wrote:
>> Hi,
>> On 2023-01-05 16:15:39 -0500, Robert Haas wrote:
>>> On Tue, Jan 3, 2023 at 2:42 AM Drouvot, Bertrand
>>> <bertranddrouvot.pg@gmail.com> wrote:
>> 0006:
>>
>>> diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
>>> index bc3c3eb3e7..98c96eb864 100644
>>> --- a/src/backend/access/transam/xlogrecovery.c
>>> +++ b/src/backend/access/transam/xlogrecovery.c
>>> @@ -358,6 +358,9 @@ typedef struct XLogRecoveryCtlData
>>>       RecoveryPauseState recoveryPauseState;
>>>       ConditionVariable recoveryNotPausedCV;
>>>
>>> +    /* Replay state (see getReplayedCV() for more explanation) */
>>> +    ConditionVariable replayedCV;
>>> +
>>>       slock_t        info_lck;        /* locks shared variables shown above */
>>>   } XLogRecoveryCtlData;
>>>
>>
>> getReplayedCV() doesn't seem to fit into any of the naming scheems in use for
>> xlogrecovery.h.
> 
> Changed to check_for_replay() in V46 attached.
> 
>>> -         * Sleep until something happens or we time out.  Also wait for the
>>> -         * socket becoming writable, if there's still pending output.
>>> +         * When not in recovery, sleep until something happens or we time out.
>>> +         * Also wait for the socket becoming writable, if there's still pending output.
>>
>> Hm. Is there a problem with not handling the becoming-writable case in the
>> in-recovery case?
>>
>>
> 
> Yes, when not in recovery we'd wait for the timeout to occur in  ConditionVariableTimedSleep()
> (as the CV is broadcasted only in ApplyWalRecord()).
> 
>>> +        else
>>> +        /*
>>> +         * We are in the logical decoding on standby case.
>>> +         * We are waiting for the startup process to replay wal record(s) using
>>> +         * a timeout in case we are requested to stop.
>>> +         */
>>> +        {
>>
>> I don't think pgindent will like that formatting....
> 
> Oops, fixed.
> 
>>
>>
>>> +            ConditionVariablePrepareToSleep(replayedCV);
>>> +            ConditionVariableTimedSleep(replayedCV, 1000,
>>> +                                        WAIT_EVENT_WAL_SENDER_WAIT_REPLAY);
>>> +        }
>>
>> I think this is racy, see ConditionVariablePrepareToSleep()'s comment:
>>
>>   * Caution: "before entering the loop" means you *must* test the exit
>>   * condition between calling ConditionVariablePrepareToSleep and calling
>>   * ConditionVariableSleep.  If that is inconvenient, omit calling
>>   * ConditionVariablePrepareToSleep.
>>
>> Basically, the ConditionVariablePrepareToSleep() should be before the loop
>> body.
>>
> 
> I missed it, thanks! Moved it before the loop body.
> 
>>
>> I don't think the fixed timeout here makes sense. For one, we need to wake up
>> based on WalSndComputeSleeptime(), otherwise we're ignoring wal_sender_timeout
>> (which can be quite small). 
> 
> Good point. Making use of WalSndComputeSleeptime() instead in V46.
> 
>> It's also just way too frequent - we're trying to
>> avoid constantly waking up unnecessarily.
>>
>>
>> Perhaps we could deal with the pq_is_send_pending() issue by having a version
>> of ConditionVariableTimedSleep() that accepts a WaitEventSet?
>>
> 
> What issue do you see?
> The one that I see with V46 (keeping the in/not recovery branches) is that one may need to wait
> for wal_sender_timeout to see changes that occurred right after the promotion.
> 
> Regards,
> 

Attaching a tiny rebase (V47) due to f9bc34fcb6.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/11/23 4:23 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 1/11/23 8:32 AM, Bharath Rupireddy wrote:
>> On Tue, Jan 10, 2023 at 2:03 PM Drouvot, Bertrand
>> <bertranddrouvot.pg@gmail.com> wrote:
>>>
>>> Please find attached, V37 taking care of:
>>
>> Thanks. I started to digest the design specified in the commit message
>> and these patches. 
> 
> Thanks for looking at it!
> 
>> Here are some quick comments:
>>
>> 1. Does logical decoding on standby work without any issues if the
>> standby is set for cascading replication?
>>
> 
> Without "any issues" is hard to guarantee ;-) But according to my tests:
> 
> Primary -> Standby1 with or without logical replication slot -> Standby2 with or without logical replication slot
> 
> works as expected (and also with cascading promotion).
> We can add some TAP tests in 0004 though.

Cascading standby tests have been added in V48 attached.

It does test that:

- a SQL logical decoding session on the cascading standby gets the expected output before/after the standby promotion
- a pg_recvlogical logical decoding session on the cascading standby (started before the standby promotion) gets the
expected output before/after the standby promotion

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 1/19/23 10:43 AM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 1/19/23 3:46 AM, Andres Freund wrote:
>> Hi,
>>
>> On 2023-01-18 11:24:19 +0100, Drouvot, Bertrand wrote:
>>> On 1/6/23 4:40 AM, Andres Freund wrote:
>>>> Hm, that's quite expensive. Perhaps worth adding a C helper that can do that
>>>> for us instead? This will likely also be needed in real applications after all.
>>>>
>>>
>>> Not sure I got it. What the C helper would be supposed to do?
>>
>> Call LogStandbySnapshot().
>>
> 
> Got it, I like the idea, will do.
> 

0005 in V49 attached is introducing a new pg_log_standby_snapshot() function
and the TAP test is making use of it.

Documentation about this new function is also added in the "Snapshot Synchronization Functions"
section. I'm not sure that's the best place for it but did not find a better place yet.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 2/7/23 4:29 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 1/19/23 10:43 AM, Drouvot, Bertrand wrote:
>> Hi,
>>
>> On 1/19/23 3:46 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2023-01-18 11:24:19 +0100, Drouvot, Bertrand wrote:
>>>> On 1/6/23 4:40 AM, Andres Freund wrote:
>>>>> Hm, that's quite expensive. Perhaps worth adding a C helper that can do that
>>>>> for us instead? This will likely also be needed in real applications after all.
>>>>>
>>>>
>>>> Not sure I got it. What the C helper would be supposed to do?
>>>
>>> Call LogStandbySnapshot().
>>>
>>
>> Got it, I like the idea, will do.
>>
> 
> 0005 in V49 attached is introducing a new pg_log_standby_snapshot() function
> and the TAP test is making use of it.
> 
> Documentation about this new function is also added in the "Snapshot Synchronization Functions"
> section. I'm not sure that's the best place for it but did not find a better place yet.
> 

Attaching V50, tiny update in the TAP test (aka 0005) to make use of the wait_for_replay_catchup()
wrapper just added in a1acdacada.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Ashutosh Sharma
Date:
On Thu, Jan 12, 2023 at 11:46 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-01-12 20:08:55 +0530, Ashutosh Sharma wrote:
> > I previously participated in the discussion on "Synchronizing the
> > logical replication slots from Primary to Standby" and one of the
> > purposes of that project was to synchronize logical slots from primary
> > to standby so that if failover occurs, it will not affect the logical
> > subscribers of the old primary much. Can someone help me understand
> > how we are going to solve this problem with this patch? Are we going
> > to encourage users to do LR from standby instead of primary to get rid
> > of such problems during failover?
>
> It only provides a building block towards that. The "Synchronizing the logical
> replication slots from Primary to Standby" project IMO needs all of the
> infrastructure in this patch. With the patch, a logical rep solution can
> e.g. maintain one slot on the primary and one on the standby, and occasionally
> forward the slot on the standby to the position of the slot on the primary. In
> case of a failover it can just start consuming changes from the former
> standby, all the necessary changes are guaranteed to be present.
>
>
> > Also, one small observation:
> >
> > I just played around with the latest (v38) patch a bit and found that
> > when a new logical subscriber of standby is created, it actually
> > creates two logical replication slots for it on the standby server.
> > May I know the reason for creating an extra replication slot other
> > than the one created by create subscription command? See below:
>
> That's unrelated to this patch. There's no changes to the "higher level"
> logical replication code dealing with pubs and subs, it's all on the "logical
> decoding" level.
>
> I think this because logical rep wants to be able to concurrently perform
> ongoing replication, and synchronize tables added to the replication set. The
> pg_16399_sync_16392_7187728548042694423 slot should vanish after the initial
> synchronization.
>

Thanks Andres. I have one more query (both for you and Bertrand). I
don't know if this has already been answered somewhere in this mail
thread, if yes, please let me know the mail that answers this query.

Will there be a problem if we mandate the use of physical replication
slots and hot_standby_feedback to support minimum LD on standby. I
know people can do a physical replication setup without a replication
slot or even with hot_standby_feedback turned off, but are we going to
have any issue if we ask them to use a physical replication slot and
turn on hot_standby_feedback for LD on standby. This will reduce the
code changes required to do conflict handling for logical slots on
standby which is being done by v50-0001 and v50-0002* patches
currently.

IMHO even in normal scenarios i.e. when we are not doing LD on
standby, we should mandate the use of a physical replication slot.

--
With Regards,
Ashutosh Sharma.



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 2/15/23 1:32 PM, Ashutosh Sharma wrote:
> Thanks Andres. I have one more query (both for you and Bertrand). I
> don't know if this has already been answered somewhere in this mail
> thread, if yes, please let me know the mail that answers this query.
> 
> Will there be a problem if we mandate the use of physical replication
> slots and hot_standby_feedback to support minimum LD on standby. 

I don't think we have to make it mandatory. There are use cases
where it's not needed, as mentioned by Andres up-thread [1] (see the comment
"The patch deals with this...").

> I know people can do a physical replication setup without a replication
> slot or even with hot_standby_feedback turned off, but are we going to
> have any issue if we ask them to use a physical replication slot and
> turn on hot_standby_feedback for LD on standby. This will reduce the
> code changes required to do conflict handling for logical slots on
> standby which is being done by v50-0001 and v50-0002* patches
> currently.
> 

But on the other hand we'd need to ensure that the physical slot on the primary
and hot_standby_feedback enabled on the standby are always in place. That would
probably lead to extra code too.

I'm -1 on that but +1 on the fact that it should be documented (as done in 0006).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

[1]: https://www.postgresql.org/message-id/20211028210755.afmwcvpo6ajwdx6n%40alap3.anarazel.de



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-02-15 18:02:11 +0530, Ashutosh Sharma wrote:
> Thanks Andres. I have one more query (both for you and Bertrand). I
> don't know if this has already been answered somewhere in this mail
> thread, if yes, please let me know the mail that answers this query.
> 
> Will there be a problem if we mandate the use of physical replication
> slots and hot_standby_feedback to support minimum LD on standby. I
> know people can do a physical replication setup without a replication
> slot or even with hot_standby_feedback turned off, but are we going to
> have any issue if we ask them to use a physical replication slot and
> turn on hot_standby_feedback for LD on standby. This will reduce the
> code changes required to do conflict handling for logical slots on
> standby which is being done by v50-0001 and v50-0002* patches
> currently.

I don't think it would. E.g. while restoring from archives we can't rely on
knowing that the slot still exists on the primary.

We can't just do corrupt things, including potentially crashing, when the
configuration is wrong. We can't ensure that the configuration is accurate all
the time. So we need to detect this case. Hence needing to detect conflicts.


> IMHO even in normal scenarios i.e. when we are not doing LD on
> standby, we should mandate the use of a physical replication slot.

I don't think that's going to fly. There are plenty of scenarios where you don't
want to use a slot, e.g. when you want to limit space use on the primary.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Ashutosh Sharma
Date:
On Wed, Feb 15, 2023 at 11:48 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-02-15 18:02:11 +0530, Ashutosh Sharma wrote:
> > Thanks Andres. I have one more query (both for you and Bertrand). I
> > don't know if this has already been answered somewhere in this mail
> > thread, if yes, please let me know the mail that answers this query.
> >
> > Will there be a problem if we mandate the use of physical replication
> > slots and hot_standby_feedback to support minimum LD on standby. I
> > know people can do a physical replication setup without a replication
> > slot or even with hot_standby_feedback turned off, but are we going to
> > have any issue if we ask them to use a physical replication slot and
> > turn on hot_standby_feedback for LD on standby. This will reduce the
> > code changes required to do conflict handling for logical slots on
> > standby which is being done by v50-0001 and v50-0002* patches
> > currently.
>
> I don't think it would. E.g. while restoring from archives we can't rely on
> knowing that the slot still exists on the primary.
>
> We can't just do corrupt things, including potentially crashing, when the
> configuration is wrong. We can't ensure that the configuration is accurate all
> the time. So we need to detect this case. Hence needing to detect conflicts.
>

OK. Got it, thanks.

>
> > IMHO even in normal scenarios i.e. when we are not doing LD on
> > standby, we should mandate the use of a physical replication slot.
>
> I don't think that's going to fly. There plenty scenarios where you e.g. don't
> want to use a slot, e.g. when you want to limit space use on the primary.
>

I think this can be controlled via max_slot_wal_keep_size GUC, if I
understand it correctly.

--
With Regards,
Ashutosh Sharma.



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 2/13/23 4:27 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 2/7/23 4:29 PM, Drouvot, Bertrand wrote:
>> Hi,
>>
>> On 1/19/23 10:43 AM, Drouvot, Bertrand wrote:
>>> Hi,
>>>
>>> On 1/19/23 3:46 AM, Andres Freund wrote:
>>>> Hi,
>>>>
>>>> On 2023-01-18 11:24:19 +0100, Drouvot, Bertrand wrote:
>>>>> On 1/6/23 4:40 AM, Andres Freund wrote:
>>>>>> Hm, that's quite expensive. Perhaps worth adding a C helper that can do that
>>>>>> for us instead? This will likely also be needed in real applications after all.
>>>>>>
>>>>>
>>>>> Not sure I got it. What the C helper would be supposed to do?
>>>>
>>>> Call LogStandbySnapshot().
>>>>
>>>
>>> Got it, I like the idea, will do.
>>>
>>
>> 0005 in V49 attached is introducing a new pg_log_standby_snapshot() function
>> and the TAP test is making use of it.
>>
>> Documentation about this new function is also added in the "Snapshot Synchronization Functions"
>> section. I'm not sure that's the best place for it but did not find a better place yet.
>>
> 
> Attaching V50, tiny update in the TAP test (aka 0005) to make use of the wait_for_replay_catchup()
> wrapper just added in a1acdacada.
> 


Please find attached V51 tiny rebase due to a6cd1fc692 (for 0001) and 8a8661828a (for 0005).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Jeff Davis
Date:
On Mon, 2023-02-27 at 09:40 +0100, Drouvot, Bertrand wrote:
> Please find attached V51 tiny rebase due to a6cd1fc692 (for 0001) and
> 8a8661828a (for 0005).

[ Jumping into this thread late, so I apologize if these comments have
already been covered. ]

Regarding v51-0004:

* Why is the CV sleep not being canceled?
* Comments on WalSndWaitForWal need to be updated to explain the
difference between the flush (primary) and the replay (standby) cases.

Overall, it seems like what you really want for the sleep/wakeup logic
in WalSndWaitForLSN is something like this:

   condVar = RecoveryInProgress() ? replayCV : flushCV;
   waitEvent = RecoveryInProgress() ?
       WAIT_EVENT_WAL_SENDER_WAIT_REPLAY :
       WAIT_EVENT_WAL_SENDER_WAIT_FLUSH;

   ConditionVariablePrepareToSleep(condVar);
   for(;;)
   {
      ...
      sleeptime = WalSndComputeSleepTime(GetCurrentTimestamp());
      socketEvents = WL_SOCKET_READABLE;
      if (pq_is_send_pending())
          socketEvents = WL_SOCKET_WRITABLE;
      ConditionVariableTimedSleepOrEvents(
          condVar, sleeptime, socketEvents, waitEvent);
   }
   ConditionVariableCancelSleep();


But the problem is that ConditionVariableTimedSleepOrEvents() doesn't
exist, and I think that's what Andres was suggesting here[1].
WalSndWait() only waits for a timeout or a socket event, but not a CV;
ConditionVariableTimedSleep() only waits for a timeout or a CV, but not
a socket event.

I'm also missing how WalSndWait() works currently. It calls
ModifyWaitEvent() with NULL for the latch, so how does WalSndWakeup()
wake it up?

Assuming I'm wrong, and WalSndWait() does use the latch, then I guess
it could be extended by having two different latches in the WalSnd
structure, and waking them up separately and waiting on the right one.
Not sure if that's a good idea though.

[1]
https://www.postgresql.org/message-id/20230106034036.2m4qnn7ep7b5ipet@awork3.anarazel.de

--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 3/1/23 1:48 AM, Jeff Davis wrote:
> On Mon, 2023-02-27 at 09:40 +0100, Drouvot, Bertrand wrote:
>> Please find attached V51 tiny rebase due to a6cd1fc692 (for 0001) and
>> 8a8661828a (for 0005).
> 
> [ Jumping into this thread late, so I apologize if these comments have
> already been covered. ]
> 

Thanks for looking at it!

> Regarding v51-0004:
> 
> * Why is the CV sleep not being canceled?

I think that's an oversight, I'll look at it.

> * Comments on WalSndWaitForWal need to be updated to explain the
> difference between the flush (primary) and the replay (standby) cases.
> 

Yeah, will do.

> Overall, it seems like what you really want for the sleep/wakeup logic
> in WalSndWaitForLSN 

Typo for WalSndWaitForWal()?

> is something like this:
> 
>     condVar = RecoveryInProgress() ? replayCV : flushCV;
>     waitEvent = RecoveryInProgress() ?
>         WAIT_EVENT_WAL_SENDER_WAIT_REPLAY :
>         WAIT_EVENT_WAL_SENDER_WAIT_FLUSH;
> 
>     ConditionVariablePrepareToSleep(condVar);
>     for(;;)
>     {
>        ...
>        sleeptime = WalSndComputeSleepTime(GetCurrentTimestamp());
>        socketEvents = WL_SOCKET_READABLE;
>        if (pq_is_send_pending())
>            socketEvents = WL_SOCKET_WRITABLE;
>        ConditionVariableTimedSleepOrEvents(
>            condVar, sleeptime, socketEvents, waitEvent);
>     }
>     ConditionVariableCancelSleep();
> 
> 
> But the problem is that ConditionVariableTimedSleepOrEvents() doesn't
> exist, and I think that's what Andres was suggesting here[1].
> WalSndWait() only waits for a timeout or a socket event, but not a CV;
> ConditionVariableTimedSleep() only waits for a timeout or a CV, but not
> a socket event.
> 
> I'm also missing how WalSndWait() works currently. It calls
> ModifyWaitEvent() with NULL for the latch, so how does WalSndWakeup()
> wake it up?

I think it works because the latch is already assigned to the FeBeWaitSet
in pq_init()->AddWaitEventToSet() (for latch_pos).

> 
> Assuming I'm wrong, and WalSndWait() does use the latch, then I guess
> it could be extended by having two different latches in the WalSnd
> structure, and waking them up separately and waiting on the right one.

I'm not sure this is needed in this particular case, because:

Why not "simply" call ConditionVariablePrepareToSleep() without any call to ConditionVariableTimedSleep() later?

In that case the walsender will be put in the wait queue (thanks to ConditionVariablePrepareToSleep())
and will be woken up by the event on the socket, the timeout or the CV broadcast (since IIUC they all
rely on the same latch).

So, something like:

    condVar = RecoveryInProgress() ? replayCV : flushCV;
    ConditionVariablePrepareToSleep(condVar);
    for(;;)
    {
            ...
            sleeptime = WalSndComputeSleepTime(GetCurrentTimestamp());
            socketEvents = WL_SOCKET_READABLE;
            if (pq_is_send_pending())
              socketEvents = WL_SOCKET_WRITABLE;
       WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_WAIT_WAL);  <-- Note: the code within the loop
       does not change at all
    }
    ConditionVariableCancelSleep();


If the walsender is woken up by the CV broadcast, then it means the flush/replay occurred and we should exit the loop
right after due to:

"
         /* check whether we're done */
         if (loc <= RecentFlushPtr)
             break;

"

meaning that in this particular case there is only one wake up due to the CV broadcast before exiting the loop.

It looks weird to use ConditionVariablePrepareToSleep() without actually using ConditionVariableTimedSleep(),
but it seems to me that it would achieve the same goal: having the walsender woken up
by the event on the socket, the timeout or the CV broadcast.

In that case we would be missing the WAIT_EVENT_WAL_SENDER_WAIT_REPLAY and/or the WAIT_EVENT_WAL_SENDER_WAIT_FLUSH
wait events though (and we'd just provide the WAIT_EVENT_WAL_SENDER_WAIT_WAL one), but I'm not sure that's a big
issue.

What do you think?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Jeff Davis
Date:
On Wed, 2023-03-01 at 11:51 +0100, Drouvot, Bertrand wrote:

>
> Why not "simply" call ConditionVariablePrepareToSleep() without any
> call to ConditionVariableTimedSleep() later?

ConditionVariableSleep() re-inserts itself into the queue if it was
previously removed. Without that, a single wakeup could remove it from
the wait queue, and the effects of ConditionVariablePrepareToSleep()
would be lost.
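
For reference, the relevant bit of ConditionVariableTimedSleep() goes roughly like this (paraphrasing from memory,
so double-check the real code in condition_variable.c):

    /* after the latch wakes us up */
    SpinLockAcquire(&cv->mutex);
    if (!proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
    {
        /* we were signaled; put ourselves back so a later signal isn't lost */
        done = true;
        proclist_push_tail(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
    }
    SpinLockRelease(&cv->mutex);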

> In that case the walsender will be put in the wait queue (thanks to
> ConditionVariablePrepareToSleep())
> and will be waked up by the event on the socket, the timeout or the
> CV broadcast

I believe it will only be awakened once, and if it enters WalSndWait()
again, future ConditionVariableBroadcast/Signal() calls won't wake it
up any more.

>  (since IIUC they all rely on the same latch).

Relying on that fact seems like too much action-at-a-distance to me. If
we change the implementation of condition variables, then it would stop
working.

Also, since they are using the same latch, that means we are still
waking up too frequently, right? We haven't really solved the problem.

> That looks weird to use ConditionVariablePrepareToSleep() without
> actually using ConditionVariableTimedSleep()
> but it looks to me that it would achieve the same goal: having the
> walsender being waked up
> by the event on the socket, the timeout or the CV broadcast.

I don't think it actually works, because something needs to keep re-
inserting it into the queue after it gets removed. You could maybe hack
it to put ConditionVariablePrepareToSleep() *in* the loop, and never
sleep. But that just seems like too much of a hack, and I didn't really
look at the details to see if that would actually work.

To use condition variables properly, I think we'd need an API like
ConditionVariableEventsSleep(), which takes a WaitEventSet and a
timeout. I think this is what Andres was suggesting and seems like a
good idea. I looked into it and I don't think it's too hard to
implement -- we just need to WaitEventSetWait instead of WaitLatch.
There are a few details to sort out, like how to enable callers to
easily create the right WaitEventSet (it obviously needs to include
MyLatch, for instance) and update it with the right socket events.



--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 3/2/23 1:40 AM, Jeff Davis wrote:
> On Wed, 2023-03-01 at 11:51 +0100, Drouvot, Bertrand wrote:
> 
>>
>> Why not "simply" call ConditionVariablePrepareToSleep() without any
>> call to ConditionVariableTimedSleep() later?
> 
> ConditionVariableSleep() re-inserts itself into the queue if it was
> previously removed. Without that, a single wakeup could remove it from
> the wait queue, and the effects of ConditionVariablePrepareToSleep()
> would be lost.

Right, but in our case, right after the wakeup (the one due to the CV broadcast,
aka the one that will remove it from the wait queue) we'll exit the loop due to:

"
         /* check whether we're done */
         if (loc <= RecentFlushPtr)
             break;
"

as the CV broadcast means that a flush/replay occurred.

So I don't see any issue in this particular case (as we are removed from the queue
but we'll not have to wait anymore).

> 
>> In that case the walsender will be put in the wait queue (thanks to
>> ConditionVariablePrepareToSleep())
>> and will be waked up by the event on the socket, the timeout or the
>> CV broadcast
> 
> I believe it will only be awakened once, and if it enters WalSndWait()
> again, future ConditionVariableBroadcast/Signal() calls won't wake it
> up any more.

I don't think that's right and that's not what my testing shows (please find attached 0004-CV-POC.txt,
a .txt file to not break the CF bot), as:

- If it is awakened due to the CV broadcast, then we'll exit the loop right after (see above)
- If it is awakened due to the timeout or the socket event, then we're still in the CV wait queue
(as nothing removed it from the CV wait queue).

> 
>>   (since IIUC they all rely on the same latch).
> 
> Relying on that fact seems like too much action-at-a-distance to me
> If
> we change the implementation of condition variables, then it would stop
> working.
> 

I'm not sure about this one. I mean, it would depend on what the implementation changes are.
Also the related TAP test (0005) would probably fail or start taking a long time due to
the corner case we are trying to solve here coming back (like it was detected in [1]).

> Also, since they are using the same latch, that means we are still
> waking up too frequently, right? We haven't really solved the problem.
> 

I don't think so, as the first CV broadcast will make us exit the loop.
So, ISTM that we'll wake up as we currently do, except when there is a flush/replay,
which is what we want, right?

>> That looks weird to use ConditionVariablePrepareToSleep() without
>> actually using ConditionVariableTimedSleep()
>> but it looks to me that it would achieve the same goal: having the
>> walsender being waked up
>> by the event on the socket, the timeout or the CV broadcast.
> 
> I don't think it actually works, because something needs to keep re-
> inserting it into the queue after it gets removed.

I think that's not needed as we'd exit the loop right after we are awakened by a CV broadcast.
> 
> To use condition variables properly, I think we'd need an API like
> ConditionVariableEventsSleep(), which takes a WaitEventSet and a
> timeout. I think this is what Andres was suggesting and seems like a
> good idea. I looked into it and I don't think it's too hard to
> implement -- we just need to WaitEventSetWait instead of WaitLatch.
> There are a few details to sort out, like how to enable callers to
> easily create the right WaitEventSet (it obviously needs to include
> MyLatch, for instance) and update it with the right socket events.
> 

I agree that's a good idea and that it should/would work too. I just wanted to highlight that in this particular
case it might not be necessary to build this new API.

[1]: https://www.postgresql.org/message-id/47606911-cf44-5a62-21d5-366d3bc6e445%40enterprisedb.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Jeff Davis
Date:
On Thu, 2023-03-02 at 10:20 +0100, Drouvot, Bertrand wrote:
> Right, but in our case, right after the wakeup (the one due to the CV
> broadcast,
> aka the one that will remove it from the wait queue) we'll exit the
> loop due to:
>
> "
>          /* check whether we're done */
>          if (loc <= RecentFlushPtr)
>              break;
> "
>
> as the CV broadcast means that a flush/replay occurred.

But does it mean that the flush/replay advanced *enough* to be greater
than or equal to loc?

> - If it is awakened due to the CV broadcast, then we'll right after
> exit the loop (see above)

...

> I think that's not needed as we'd exit the loop right after we are
> awakened by a CV broadcast.

See the comment here:

 * If this process has been taken out of the wait list, then we know
 * that it has been signaled by ConditionVariableSignal (or
 * ConditionVariableBroadcast), so we should return to the caller. But
 * that doesn't guarantee that the exit condition is met, only that we
 * ought to check it.

You seem to be arguing that in this case, it doesn't matter; that
walreceiver knows what walsender is waiting for, and will never wake it
up before it's ready. I don't think that's true, and even if it is, it
needs explanation.

>
> I agree that's a good idea and that it should/would work too. I just
> wanted to highlight that in this particular
> case that might not be necessary to build this new API.

In this case it looks easier to add the right API than to be sure about
whether it's needed or not.

Regards,
    Jeff Davis




Re: Minimal logical decoding on standbys

From
Jeff Davis
Date:
On Thu, 2023-03-02 at 11:45 -0800, Jeff Davis wrote:
> In this case it looks easier to add the right API than to be sure
> about
> whether it's needed or not.

I attached a sketch of one approach. I'm not very confident that it's
the right API or even that it works as I intended it, but if others
like the approach I can work on it some more.


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 3/2/23 8:45 PM, Jeff Davis wrote:
> On Thu, 2023-03-02 at 10:20 +0100, Drouvot, Bertrand wrote:
>> Right, but in our case, right after the wakeup (the one due to the CV
>> broadcast,
>> aka the one that will remove it from the wait queue) we'll exit the
>> loop due to:
>>
>> "
>>           /* check whether we're done */
>>           if (loc <= RecentFlushPtr)
>>               break;
>> "
>>
>> as the CV broadcast means that a flush/replay occurred.
> 
> But does it mean that the flush/replay advanced *enough* to be greater
> than or equal to loc?
> 

Yes I think so: loc is when we started waiting initially
and RecentFlushPtr is >= to when the broadcast has been sent.

>> - If it is awakened due to the CV broadcast, then we'll right after
>> exit the loop (see above)
> 
> ...
> 
>> I think that's not needed as we'd exit the loop right after we are
>> awakened by a CV broadcast.
> 
> See the comment here:
> WalSndWaitForWal
>   * If this process has been taken out of the wait list, then we know
>   * that it has been signaled by ConditionVariableSignal (or
>   * ConditionVariableBroadcast), so we should return to the caller. But
>   * that doesn't guarantee that the exit condition is met, only that we
>   * ought to check it.
> 
> You seem to be arguing that in this case, it doesn't matter; that
> walreceiver knows what walsender is waiting for, and will never wake it
> up before it's ready. I don't think that's true, and even if it is, it
> needs explanation.
> 

What I think is that, in this particular case, we are sure that
the loop exit condition is met as we know that loc <= RecentFlushPtr.

>>
>> I agree that's a good idea and that it should/would work too. I just
>> wanted to highlight that in this particular
>> case that might not be necessary to build this new API.
> 
> In this case it looks easier to add the right API than to be sure about
> whether it's needed or not.
> 

What I meant is that of course I might be wrong.

If we don't agree that the new API is unnecessary in this particular case, then
I agree that building the new API is the way to go ;-) (plus it offers the advantage of
being able to report the wait event more precisely).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 3/3/23 8:58 AM, Jeff Davis wrote:
> On Thu, 2023-03-02 at 11:45 -0800, Jeff Davis wrote:
>> In this case it looks easier to add the right API than to be sure
>> about
>> whether it's needed or not.
> 
> I attached a sketch of one approach. 

Oh, that's very cool, thanks a lot!

> I'm not very confident that it's
> the right API or even that it works as I intended it, but if others
> like the approach I can work on it some more.
> 

I'll look at it early next week.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 3/3/23 5:26 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 3/3/23 8:58 AM, Jeff Davis wrote:
>> On Thu, 2023-03-02 at 11:45 -0800, Jeff Davis wrote:
>>> In this case it looks easier to add the right API than to be sure
>>> about
>>> whether it's needed or not.
>>
>> I attached a sketch of one approach. 
> 
> Oh, that's very cool, thanks a lot!
> 
>> I'm not very confident that it's
>> the right API or even that it works as I intended it, but if others
>> like the approach I can work on it some more.
>>
> 
> I'll look at it early next week.
> 

Just attaching a tiny rebase due to ebd551f586 breaking 0001 (did not look at your patch yet).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 3/3/23 5:26 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 3/3/23 8:58 AM, Jeff Davis wrote:
>> On Thu, 2023-03-02 at 11:45 -0800, Jeff Davis wrote:
>>> In this case it looks easier to add the right API than to be sure
>>> about
>>> whether it's needed or not.
>>
>> I attached a sketch of one approach. 
> 
> Oh, that's very cool, thanks a lot!
> 
>> I'm not very confident that it's
>> the right API or even that it works as I intended it, but if others
>> like the approach I can work on it some more.
>>
> 
> I'll look at it early next week.
> 

So, I took your patch and, as an example, I tried a quick integration in 0004
(see 0004_new_API.txt attached) to put it in the logical decoding on standby context.

Based on this, I have 3 comments:

- Maybe ConditionVariableEventSleep() should take care of the “WaitEventSetWait returns 1 and cvEvent.event ==
WL_POSTMASTER_DEATH” case?

- Maybe ConditionVariableEventSleep() could accept and deal with the CV being NULL?
I used it in the POC attached to handle logical decoding on the primary server case.
One option should be to create a dedicated CV for that case though.

- In the POC attached I had to add this extra condition “(cv && !RecoveryInProgress())” to avoid waiting on the timeout
when there is a promotion.

That makes me think that we may want to add 2 extra parameters (as 2 functions returning a bool?) to
ConditionVariableEventSleep() to check whether or not we still want to test the socket or the CV wake up in each loop
iteration.
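
Something along these lines, for example (pure sketch; the callback type and parameter names are made up):

    /* hypothetical: per-iteration checks supplied by the caller */
    typedef bool (*CVSleepCheck) (void);

    extern bool ConditionVariableEventSleep(ConditionVariable *cv,
                                            CVSleepCheck test_cv_wakeup,
                                            WaitEventSet *waitset,
                                            CVSleepCheck test_socket,
                                            long timeout,
                                            uint32 wait_event_info);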

Also 3 additional remarks:

1) About InitializeConditionVariableWaitSet() and ConditionVariableWaitSetCreate(): I'm not sure about the naming as
there is no CV yet (they "just" deal with WaitEventSet).

So, what about renaming?

+static WaitEventSet *ConditionVariableWaitSet = NULL;

to say, "LocalWaitSet", and then rename ConditionVariableWaitSetLatchPos, InitializeConditionVariableWaitSet() and
ConditionVariableWaitSetCreate() accordingly?

But it might not be needed (see 3) below).

2)

  /*
   * Prepare to wait on a given condition variable.
   *
@@ -97,7 +162,8 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
  void
  ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
  {
-       (void) ConditionVariableTimedSleep(cv, -1 /* no timeout */ ,
+       (void) ConditionVariableEventSleep(cv, ConditionVariableWaitSet,
+                                                                          -1 /* no timeout */ ,
                                                                            wait_event_info);
  }

@@ -111,11 +177,27 @@ ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
  bool
  ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
                                                         uint32 wait_event_info)
+{
+       return ConditionVariableEventSleep(cv, ConditionVariableWaitSet, timeout,
+                                                                          wait_event_info);
+}
+

I like the idea of making use of the new ConditionVariableEventSleep() here, but on the other hand...

3)

I wonder if there are race conditions: ConditionVariableWaitSet is being initialized with PGINVALID_SOCKET
as WL_LATCH_SET and might also (if IsUnderPostmaster) be initialized with PGINVALID_SOCKET as WL_EXIT_ON_PM_DEATH.

So IIUC, the patch is introducing 2 new possible sources of wake up.

Then, what about:

- not creating ConditionVariableWaitSet, ConditionVariableWaitSetLatchPos, InitializeConditionVariableWaitSet() and
ConditionVariableWaitSetCreate() at all?
- calling ConditionVariableEventSleep() with a NULL parameter in ConditionVariableSleep() and
ConditionVariableTimedSleep()?
- handling the case where the WaitEventSet parameter is NULL in ConditionVariableEventSleep()? (That could also make
sense if we handle the case of the CV being NULL as proposed above)

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 3/8/23 11:25 AM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 3/3/23 5:26 PM, Drouvot, Bertrand wrote:
>> Hi,
>>
>> On 3/3/23 8:58 AM, Jeff Davis wrote:
>>> On Thu, 2023-03-02 at 11:45 -0800, Jeff Davis wrote:
>>>> In this case it looks easier to add the right API than to be sure
>>>> about
>>>> whether it's needed or not.
>>>
>>> I attached a sketch of one approach. 
>>
>> Oh, that's very cool, thanks a lot!
>>
>>> I'm not very confident that it's
>>> the right API or even that it works as I intended it, but if others
>>> like the approach I can work on it some more.
>>>
>>
>> I'll look at it early next week.
>>
> 
> So, I took your patch and as an example I tried a quick integration in 0004,
> (see 0004_new_API.txt attached) to put it in the logical decoding on standby context.
> 
> Based on this, I've 3 comments:
> 
> - Maybe ConditionVariableEventSleep() should take care of the “WaitEventSetWait returns 1 and cvEvent.event ==
> WL_POSTMASTER_DEATH” case?
> 
> - Maybe ConditionVariableEventSleep() could accept and deal with the CV being NULL?
> I used it in the POC attached to handle logical decoding on the primary server case.
> One option should be to create a dedicated CV for that case though.
> 
> - In the POC attached I had to add this extra condition “(cv && !RecoveryInProgress())” to avoid waiting on the
> timeout when there is a promotion.
> That makes me think that we may want to add 2 extra parameters (as 2 functions returning a bool?) to
ConditionVariableEventSleep()
> to check whether or not we still want to test the socket or the CV wake up in each loop iteration.
> 
> Also 3 additional remarks:
> 
> 1) About InitializeConditionVariableWaitSet() and ConditionVariableWaitSetCreate(): I'm not sure about the naming as
> there is no CV yet (they "just" deal with WaitEventSet).
> 
> So, what about renaming?
> 
> +static WaitEventSet *ConditionVariableWaitSet = NULL;
> 
> to say, "LocalWaitSet" and then rename ConditionVariableWaitSetLatchPos, InitializeConditionVariableWaitSet() and
> ConditionVariableWaitSetCreate() accordingly?
> 
> But it might be not needed (see 3) below).
> 
> 2)
> 
>   /*
>    * Prepare to wait on a given condition variable.
>    *
> @@ -97,7 +162,8 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
>   void
>   ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
>   {
> -       (void) ConditionVariableTimedSleep(cv, -1 /* no timeout */ ,
> +       (void) ConditionVariableEventSleep(cv, ConditionVariableWaitSet,
> +                                                                          -1 /* no timeout */ ,
>                                                                             wait_event_info);
>   }
> 
> @@ -111,11 +177,27 @@ ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
>   bool
>   ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
>                                                          uint32 wait_event_info)
> +{
> +       return ConditionVariableEventSleep(cv, ConditionVariableWaitSet, timeout,
> +                                                                          wait_event_info);
> +}
> +
> 
> I like the idea of making use of the new ConditionVariableEventSleep() here, but on the other hand...
> 
> 3)
> 
> I wonder if there is no race conditions: ConditionVariableWaitSet is being initialized with PGINVALID_SOCKET
> as WL_LATCH_SET and might be also (if IsUnderPostmaster) be initialized with PGINVALID_SOCKET as
> WL_EXIT_ON_PM_DEATH.
> 
> So IIUC, the patch is introducing 2 new possible source of wake up.
> 
> Then, what about?
> 
> - not create ConditionVariableWaitSet, ConditionVariableWaitSetLatchPos, InitializeConditionVariableWaitSet() and
> ConditionVariableWaitSetCreate() at all?
> - call ConditionVariableEventSleep() with a NULL parameter in ConditionVariableSleep() and
> ConditionVariableTimedSleep()?
> - handle the case where the WaitEventSet parameter is NULL in ConditionVariableEventSleep()? (That could also make
> sense if we handle the case of the CV being NULL as proposed above)
> 

I gave it a try, so please find attached v2-0001-Introduce-ConditionVariableEventSleep.txt (implementing the comments
above) and 0004_new_API.txt to put the new API in the logical decoding on standby context.

There is no change in v2-0001-Introduce-ConditionVariableEventSleep.txt regarding the up-thread comment related to
WL_POSTMASTER_DEATH.
  
What do you think?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Jeff Davis
Date:
On Thu, 2023-03-02 at 23:58 -0800, Jeff Davis wrote:
> On Thu, 2023-03-02 at 11:45 -0800, Jeff Davis wrote:
> > In this case it looks easier to add the right API than to be sure
> > about
> > whether it's needed or not.
>
> I attached a sketch of one approach. I'm not very confident that it's
> the right API or even that it works as I intended it, but if others
> like the approach I can work on it some more.

Another approach might be to extend WaitEventSets() to be able to wait
on Condition Variables, rather than Condition Variables waiting on
WaitEventSets. Thoughts?

Regards,
    Jeff Davis





Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-03-04 12:19:57 +0100, Drouvot, Bertrand wrote:
> Subject: [PATCH v52 1/6] Add info in WAL records in preparation for logical
>  slot conflict handling.
> 
> Overall design:
> 
> 1. We want to enable logical decoding on standbys, but replay of WAL
> from the primary might remove data that is needed by logical decoding,
> causing error(s) on the standby. To prevent those errors, a new replication
> conflict scenario needs to be addressed (as much as hot standby does).
> 
> 2. Our chosen strategy for dealing with this type of replication slot
> is to invalidate logical slots for which needed data has been removed.
> 
> 3. To do this we need the latestRemovedXid for each change, just as we
> do for physical replication conflicts, but we also need to know
> whether any particular change was to data that logical replication
> might access. That way, during WAL replay, we know when there is a risk of
> conflict and, if so, if there is a conflict.
> 
> 4. We can't rely on the standby's relcache entries for this purpose in
> any way, because the startup process can't access catalog contents.
> 
> 5. Therefore every WAL record that potentially removes data from the
> index or heap must carry a flag indicating whether or not it is one
> that might be accessed during logical decoding.
> 
> Why do we need this for logical decoding on standby?
> 
> First, let's forget about logical decoding on standby and recall that
> on a primary database, any catalog rows that may be needed by a logical
> decoding replication slot are not removed.
> 
> This is done thanks to the catalog_xmin associated with the logical
> replication slot.
> 
> But, with logical decoding on standby, in the following cases:
> 
> - hot_standby_feedback is off
> - hot_standby_feedback is on but there is no a physical slot between
>   the primary and the standby. Then, hot_standby_feedback will work,
>   but only while the connection is alive (for example a node restart
>   would break it)
> 
> Then, the primary may delete system catalog rows that could be needed
> by the logical decoding on the standby (as it does not know about the
> catalog_xmin on the standby).
> 
> So, it’s mandatory to identify those rows and invalidate the slots
> that may need them if any. Identifying those rows is the purpose of
> this commit.

This is a very nice commit message.


> Implementation:
> 
> When a WAL replay on standby indicates that a catalog table tuple is
> to be deleted by an xid that is greater than a logical slot's
> catalog_xmin, then that means the slot's catalog_xmin conflicts with
> the xid, and we need to handle the conflict. While subsequent commits
> will do the actual conflict handling, this commit adds a new field
> isCatalogRel in such WAL records (and a new bit set in the
> xl_heap_visible flags field), that is true for catalog tables, so as to
> arrange for conflict handling.
> 
> The affected WAL records are the ones that already contain the
> snapshotConflictHorizon field, namely:
> 
> - gistxlogDelete
> - gistxlogPageReuse
> - xl_hash_vacuum_one_page
> - xl_heap_prune
> - xl_heap_freeze_page
> - xl_heap_visible
> - xl_btree_reuse_page
> - xl_btree_delete
> - spgxlogVacuumRedirect
> 
> Due to this new field being added, xl_hash_vacuum_one_page and
> gistxlogDelete do now contain the offsets to be deleted as a
> FLEXIBLE_ARRAY_MEMBER. This is needed to ensure correct alignement.
> It's not needed on the others struct where isCatalogRel has
> been added.
> 
> Author: Andres Freund (in an older version), Amit Khandekar, Bertrand
> Drouvot

I think you're first author on this one by now.


I think this commit is ready to go. Unless somebody thinks differently, I
think I might push it tomorrow.


> Subject: [PATCH v52 2/6] Handle logical slot conflicts on standby.


> @@ -6807,7 +6808,8 @@ CreateCheckPoint(int flags)
>       */
>      XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
>      KeepLogSeg(recptr, &_logSegNo);
> -    if (InvalidateObsoleteReplicationSlots(_logSegNo))
> +    InvalidateObsoleteReplicationSlots(_logSegNo, &invalidated, InvalidOid, NULL);
> +    if (invalidated)
>      {
>          /*
>           * Some slots have been invalidated; recalculate the old-segment

I don't really understand why you changed InvalidateObsoleteReplicationSlots
to return void instead of bool, and then added an output boolean argument via
a pointer?



> @@ -7964,6 +7968,22 @@ xlog_redo(XLogReaderState *record)
>          /* Update our copy of the parameters in pg_control */
>          memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
>  
> +        /*
> +         * Invalidate logical slots if we are in hot standby and the primary does not
> +         * have a WAL level sufficient for logical decoding. No need to search
> +         * for potentially conflicting logically slots if standby is running
> +         * with wal_level lower than logical, because in that case, we would
> +         * have either disallowed creation of logical slots or invalidated existing
> +         * ones.
> +         */
> +        if (InRecovery && InHotStandby &&
> +            xlrec.wal_level < WAL_LEVEL_LOGICAL &&
> +            wal_level >= WAL_LEVEL_LOGICAL)
> +        {
> +            TransactionId ConflictHorizon = InvalidTransactionId;
> +            InvalidateObsoleteReplicationSlots(InvalidXLogRecPtr, NULL, InvalidOid, &ConflictHorizon);
> +        }
> +

Are there races around changing wal_level?



> @@ -855,8 +855,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
>          SpinLockAcquire(&s->mutex);
>          effective_xmin = s->effective_xmin;
>          effective_catalog_xmin = s->effective_catalog_xmin;
> -        invalidated = (!XLogRecPtrIsInvalid(s->data.invalidated_at) &&
> -                       XLogRecPtrIsInvalid(s->data.restart_lsn));
> +        invalidated = ((!XLogRecPtrIsInvalid(s->data.invalidated_at) &&
> +                        XLogRecPtrIsInvalid(s->data.restart_lsn))
> +                       || (!TransactionIdIsValid(s->data.xmin) &&
> +                           !TransactionIdIsValid(s->data.catalog_xmin)));
>          SpinLockRelease(&s->mutex);
>  
>          /* invalidated slots need not apply */

I still would like a wrapper function to determine whether a slot has been
invalidated. This is too complicated to be repeated in other places.
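
Something like this, mirroring the condition above (just a sketch to illustrate, name/location up to you):

    /* caller holds the slot's spinlock where the existing code does */
    static inline bool
    ReplicationSlotIsInvalidated(ReplicationSlot *s)
    {
        return (!XLogRecPtrIsInvalid(s->data.invalidated_at) &&
                XLogRecPtrIsInvalid(s->data.restart_lsn)) ||
               (!TransactionIdIsValid(s->data.xmin) &&
                !TransactionIdIsValid(s->data.catalog_xmin));
    }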


> @@ -1224,20 +1226,21 @@ ReplicationSlotReserveWal(void)
>  }
>  
>  /*
> - * Helper for InvalidateObsoleteReplicationSlots -- acquires the given slot
> - * and mark it invalid, if necessary and possible.
> + * Helper for InvalidateObsoleteReplicationSlots
> + *
> + * Acquires the given slot and mark it invalid, if necessary and possible.
>   *
>   * Returns whether ReplicationSlotControlLock was released in the interim (and
>   * in that case we're not holding the lock at return, otherwise we are).
>   *
> - * Sets *invalidated true if the slot was invalidated. (Untouched otherwise.)
> + * Sets *invalidated true if an obsolete slot was invalidated. (Untouched otherwise.)
>   *
>   * This is inherently racy, because we release the LWLock
>   * for syscalls, so caller must restart if we return true.
>   */
>  static bool
> -InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
> -                               bool *invalidated)
> +InvalidatePossiblyObsoleteOrConflictingLogicalSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
> +                                                   bool *invalidated, TransactionId *xid)

This is too long a name. I'd probably just leave it at the old name.



> @@ -1261,18 +1267,33 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>           * Check if the slot needs to be invalidated. If it needs to be
>           * invalidated, and is not currently acquired, acquire it and mark it
>           * as having been invalidated.  We do this with the spinlock held to
> -         * avoid race conditions -- for example the restart_lsn could move
> -         * forward, or the slot could be dropped.
> +         * avoid race conditions -- for example the restart_lsn (or the
> +         * xmin(s) could) move forward or the slot could be dropped.
>           */
>          SpinLockAcquire(&s->mutex);
>  
>          restart_lsn = s->data.restart_lsn;
> +        slot_xmin = s->data.xmin;
> +        slot_catalog_xmin = s->data.catalog_xmin;
> +
> +        /* slot has been invalidated (logical decoding conflict case) */
> +        if ((xid &&
> +             ((LogicalReplicationSlotIsInvalid(s))
> +              ||
>  

Uh, huh?

That's very odd formatting.

>          /*
> -         * If the slot is already invalid or is fresh enough, we don't need to
> -         * do anything.
> +         * We are not forcing for invalidation because the xid is valid and
> +         * this is a non conflicting slot.
>           */
> -        if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
> +              (TransactionIdIsValid(*xid) && !(
> +                                               (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, *xid))
> +                                               ||
> +                                               (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, *xid))
> +                                               ))
> +              ))
> +            ||
> +        /* slot has been invalidated (obsolete LSN case) */
> +            (!xid && (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)))
>          {
>              SpinLockRelease(&s->mutex);
>              if (released_lock)


This needs some cleanup.


> @@ -1292,9 +1313,16 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>          {
>              MyReplicationSlot = s;
>              s->active_pid = MyProcPid;
> -            s->data.invalidated_at = restart_lsn;
> -            s->data.restart_lsn = InvalidXLogRecPtr;
> -
> +            if (xid)
> +            {
> +                s->data.xmin = InvalidTransactionId;
> +                s->data.catalog_xmin = InvalidTransactionId;
> +            }
> +            else
> +            {
> +                s->data.invalidated_at = restart_lsn;
> +                s->data.restart_lsn = InvalidXLogRecPtr;
> +            }
>              /* Let caller know */
>              *invalidated = true;
>          }
> @@ -1327,15 +1355,39 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>               */
>              if (last_signaled_pid != active_pid)
>              {
> -                ereport(LOG,
> -                        errmsg("terminating process %d to release replication slot \"%s\"",
> -                               active_pid, NameStr(slotname)),
> -                        errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
> -                                  LSN_FORMAT_ARGS(restart_lsn),
> -                                  (unsigned long long) (oldestLSN - restart_lsn)),
> -                        errhint("You might need to increase max_slot_wal_keep_size."));
> +                if (xid)
> +                {
> +                    if (TransactionIdIsValid(*xid))
> +                    {
> +                        ereport(LOG,
> +                                errmsg("terminating process %d because replication slot \"%s\" conflicts with recovery",
> +                                       active_pid, NameStr(slotname)),
> +                                errdetail("The slot conflicted with xid horizon %u.",
> +                                          *xid));
> +                    }
> +                    else
> +                    {
> +                        ereport(LOG,
> +                                errmsg("terminating process %d because replication slot \"%s\" conflicts with recovery",
> +                                       active_pid, NameStr(slotname)),
> +                                errdetail("Logical decoding on standby requires wal_level to be at least logical on the primary server"));
> +                    }
> +
> +                    (void) SendProcSignal(active_pid, PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT, InvalidBackendId);
> +                }
> +                else
> +                {
> +                    ereport(LOG,
> +                            errmsg("terminating process %d to release replication slot \"%s\"",
> +                                   active_pid, NameStr(slotname)),
> +                            errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
> +                                      LSN_FORMAT_ARGS(restart_lsn),
> +                                      (unsigned long long) (oldestLSN - restart_lsn)),
> +                            errhint("You might need to increase max_slot_wal_keep_size."));
> +
> +                    (void) kill(active_pid, SIGTERM);

I think it ought to be possible to deduplicate this a fair bit. For one, two of
the errmsg()s above are identical.  But I think this could be consolidated
further, e.g. by using the same message style for the three cases, and passing
in a separately translated reason for the termination?
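
I.e. something vaguely like this (hypothetical sketch, not actual patch code):

    /* pick the (translated) reason once, then emit a single message per action */
    if (xid && TransactionIdIsValid(*xid))
        reason = psprintf(_("The slot conflicted with xid horizon %u."), *xid);
    else if (xid)
        reason = _("Logical decoding on standby requires wal_level to be at least logical on the primary server.");
    else
        reason = psprintf(_("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes."),
                          LSN_FORMAT_ARGS(restart_lsn),
                          (unsigned long long) (oldestLSN - restart_lsn));

    ereport(LOG,
            errmsg("terminating process %d to release replication slot \"%s\"",
                   active_pid, NameStr(slotname)),
            errdetail_internal("%s", reason));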


> +                }
>  
> -                (void) kill(active_pid, SIGTERM);
>                  last_signaled_pid = active_pid;
>              }
>  
> @@ -1369,13 +1421,33 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>              ReplicationSlotSave();
>              ReplicationSlotRelease();
>  
> -            ereport(LOG,
> -                    errmsg("invalidating obsolete replication slot \"%s\"",
> -                           NameStr(slotname)),
> -                    errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
> -                              LSN_FORMAT_ARGS(restart_lsn),
> -                              (unsigned long long) (oldestLSN - restart_lsn)),
> -                    errhint("You might need to increase max_slot_wal_keep_size."));
> +            if (xid)
> +            {
> +                pgstat_drop_replslot(s);

Why is this done here now?


> +                if (TransactionIdIsValid(*xid))
> +                {
> +                    ereport(LOG,
> +                            errmsg("invalidating slot \"%s\" because it conflicts with recovery", NameStr(slotname)),
> +                            errdetail("The slot conflicted with xid horizon %u.", *xid));
> +                }
> +                else
> +                {
> +                    ereport(LOG,
> +                            errmsg("invalidating slot \"%s\" because it conflicts with recovery", NameStr(slotname)),
> +                            errdetail("Logical decoding on standby requires wal_level to be at least logical on the primary server"));
> +                }
> +            }
> +            else
> +            {
> +                ereport(LOG,
> +                        errmsg("invalidating obsolete replication slot \"%s\"",
> +                               NameStr(slotname)),
> +                        errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
> +                                  LSN_FORMAT_ARGS(restart_lsn),
> +                                  (unsigned long long) (oldestLSN - restart_lsn)),
> +                        errhint("You might need to increase max_slot_wal_keep_size."));
> +            }
>

I don't like all these repeated elogs...



> @@ -3057,6 +3060,27 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
>              case PROCSIG_RECOVERY_CONFLICT_LOCK:
>              case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
>              case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
> +            case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
> +
> +                /*
> +                 * For conflicts that require a logical slot to be
> +                 * invalidated, the requirement is for the signal receiver to
> +                 * release the slot, so that it could be invalidated by the
> +                 * signal sender. So for normal backends, the transaction
> +                 * should be aborted, just like for other recovery conflicts.
> +                 * But if it's walsender on standby, we don't want to go
> +                 * through the following IsTransactionOrTransactionBlock()
> +                 * check, so break here.
> +                 */
> +                if (am_cascading_walsender &&
> +                    reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
> +                    MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
> +                {
> +                    RecoveryConflictPending = true;
> +                    QueryCancelPending = true;
> +                    InterruptPending = true;
> +                    break;
> +                }
>  
>                  /*
>                   * If we aren't in a transaction any longer then ignore.

I can't see any reason for this to be mixed into the same case "body" as LOCK
etc?
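(For what it's worth, a dedicated case could look roughly like the sketch
below; the fall-through keeps the normal abort-the-transaction handling for
regular backends. A sketch only, not actual patch code.)

			case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:

				/*
				 * A cascading walsender only needs to release its logical
				 * slot so that the startup process can invalidate it.
				 */
				if (am_cascading_walsender &&
					MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
				{
					RecoveryConflictPending = true;
					QueryCancelPending = true;
					InterruptPending = true;
					break;
				}

				/*
				 * For normal backends the transaction has to be aborted,
				 * just like for the other recovery conflicts below.
				 */
				/* FALLTHROUGH */

			case PROCSIG_RECOVERY_CONFLICT_LOCK:
			case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
			case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT: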


> diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> index 38c6f18886..290d4b45f4 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -51,6 +51,7 @@
>  #include "storage/proc.h"
>  #include "storage/procarray.h"
>  #include "utils/builtins.h"
> +#include "access/xlogrecovery.h"

Add new includes in the "alphabetically" right place...



Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 3/30/23 9:04 AM, Andres Freund wrote:
> Hi,
> 
> On 2023-03-04 12:19:57 +0100, Drouvot, Bertrand wrote:
>> Subject: [PATCH v52 1/6] Add info in WAL records in preparation for logical
>>   slot conflict handling.
> 
> This is a very nice commit message.

Thanks! Melanie and Robert provided great feedback/input to help make it
what it is now.
  
> I think this commit is ready to go. Unless somebody thinks differently, I
> think I might push it tomorrow.

Great! Once done, I'll submit a new patch so that GlobalVisTestFor() can make
use of the heap relation in vacuumRedirectAndPlaceholder() (which will be possible
once 0001 is committed).

> 
>> Subject: [PATCH v52 2/6] Handle logical slot conflicts on standby.
> 
> 
>> @@ -6807,7 +6808,8 @@ CreateCheckPoint(int flags)
>>        */
>>       XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
>>       KeepLogSeg(recptr, &_logSegNo);
>> -    if (InvalidateObsoleteReplicationSlots(_logSegNo))
>> +    InvalidateObsoleteReplicationSlots(_logSegNo, &invalidated, InvalidOid, NULL);
>> +    if (invalidated)
>>       {
>>           /*
>>            * Some slots have been invalidated; recalculate the old-segment
> 
> I don't really understand why you changed InvalidateObsoleteReplicationSlots
> to return void instead of bool, and then added an output boolean argument via
> a pointer?
> 
> 

I gave it a second thought and it looks like I overcomplicated that part. I removed the
pointer parameter in V53 attached (and it now returns bool as before).

> 
>> @@ -7964,6 +7968,22 @@ xlog_redo(XLogReaderState *record)
>>           /* Update our copy of the parameters in pg_control */
>>           memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
>>   
>> +        /*
>> +         * Invalidate logical slots if we are in hot standby and the primary does not
>> +         * have a WAL level sufficient for logical decoding. No need to search
>> +         * for potentially conflicting logically slots if standby is running
>> +         * with wal_level lower than logical, because in that case, we would
>> +         * have either disallowed creation of logical slots or invalidated existing
>> +         * ones.
>> +         */
>> +        if (InRecovery && InHotStandby &&
>> +            xlrec.wal_level < WAL_LEVEL_LOGICAL &&
>> +            wal_level >= WAL_LEVEL_LOGICAL)
>> +        {
>> +            TransactionId ConflictHorizon = InvalidTransactionId;
>> +            InvalidateObsoleteReplicationSlots(InvalidXLogRecPtr, NULL, InvalidOid, &ConflictHorizon);
>> +        }
>> +
> 
> Are there races around changing wal_level?
> 

Humm, not that I can think of right now. Do you have one/some in mind?

> 
>> @@ -855,8 +855,10 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
>>           SpinLockAcquire(&s->mutex);
>>           effective_xmin = s->effective_xmin;
>>           effective_catalog_xmin = s->effective_catalog_xmin;
>> -        invalidated = (!XLogRecPtrIsInvalid(s->data.invalidated_at) &&
>> -                       XLogRecPtrIsInvalid(s->data.restart_lsn));
>> +        invalidated = ((!XLogRecPtrIsInvalid(s->data.invalidated_at) &&
>> +                        XLogRecPtrIsInvalid(s->data.restart_lsn))
>> +                       || (!TransactionIdIsValid(s->data.xmin) &&
>> +                           !TransactionIdIsValid(s->data.catalog_xmin)));
>>           SpinLockRelease(&s->mutex);
>>   
>>           /* invalidated slots need not apply */
> 
> I still would like a wrapper function to determine whether a slot has been
> invalidated. This is too complicated to be repeated in other places.
> 
> 

Agree, so adding ObsoleteSlotIsInvalid() and SlotIsInvalid() in V53 attached.

ObsoleteSlotIsInvalid() could also be done in a dedicated patch outside this patch series, though.
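For readers without V53 at hand, a minimal sketch of what such wrappers could
look like, based on the condition quoted above (names follow the ones used in
this thread; the actual V53 definitions may differ):

/* Has the slot been invalidated because its required WAL was removed? */
static inline bool
ObsoleteSlotIsInvalid(ReplicationSlot *s)
{
	return (!XLogRecPtrIsInvalid(s->data.invalidated_at) &&
			XLogRecPtrIsInvalid(s->data.restart_lsn));
}

/* Has a logical slot been invalidated because of a recovery conflict? */
static inline bool
LogicalReplicationSlotIsInvalid(ReplicationSlot *s)
{
	return (!TransactionIdIsValid(s->data.xmin) &&
			!TransactionIdIsValid(s->data.catalog_xmin));
}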


>> @@ -1224,20 +1226,21 @@ ReplicationSlotReserveWal(void)
>>   }
>>   
>>   /*
>> - * Helper for InvalidateObsoleteReplicationSlots -- acquires the given slot
>> - * and mark it invalid, if necessary and possible.
>> + * Helper for InvalidateObsoleteReplicationSlots
>> + *
>> + * Acquires the given slot and mark it invalid, if necessary and possible.
>>    *
>>    * Returns whether ReplicationSlotControlLock was released in the interim (and
>>    * in that case we're not holding the lock at return, otherwise we are).
>>    *
>> - * Sets *invalidated true if the slot was invalidated. (Untouched otherwise.)
>> + * Sets *invalidated true if an obsolete slot was invalidated. (Untouched otherwise.)
>>    *
>>    * This is inherently racy, because we release the LWLock
>>    * for syscalls, so caller must restart if we return true.
>>    */
>>   static bool
>> -InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>> -                               bool *invalidated)
>> +InvalidatePossiblyObsoleteOrConflictingLogicalSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>> +                                                   bool *invalidated, TransactionId *xid)
> 
> This is too long a name. I'd probably just leave it at the old name.
> 
> 

Done in V53 attached.

> 
>> @@ -1261,18 +1267,33 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>>            * Check if the slot needs to be invalidated. If it needs to be
>>            * invalidated, and is not currently acquired, acquire it and mark it
>>            * as having been invalidated.  We do this with the spinlock held to
>> -         * avoid race conditions -- for example the restart_lsn could move
>> -         * forward, or the slot could be dropped.
>> +         * avoid race conditions -- for example the restart_lsn (or the
>> +         * xmin(s) could) move forward or the slot could be dropped.
>>            */
>>           SpinLockAcquire(&s->mutex);
>>   
>>           restart_lsn = s->data.restart_lsn;
>> +        slot_xmin = s->data.xmin;
>> +        slot_catalog_xmin = s->data.catalog_xmin;
>> +
>> +        /* slot has been invalidated (logical decoding conflict case) */
>> +        if ((xid &&
>> +             ((LogicalReplicationSlotIsInvalid(s))
>> +              ||
>>   
> 
> Uh, huh?
> 
> That's very odd formatting.
> 
>>           /*
>> -         * If the slot is already invalid or is fresh enough, we don't need to
>> -         * do anything.
>> +         * We are not forcing for invalidation because the xid is valid and
>> +         * this is a non conflicting slot.
>>            */
>> -        if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
>> +              (TransactionIdIsValid(*xid) && !(
>> +                                               (TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin,*xid))
>> +                                               ||
>> +                                               (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin,*xid))
>> +                                               ))
>> +              ))
>> +            ||
>> +        /* slot has been invalidated (obsolete LSN case) */
>> +            (!xid && (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)))
>>           {
>>               SpinLockRelease(&s->mutex);
>>               if (released_lock)
> 
> 
> This needs some cleanup.

Added a new macro LogicalReplicationSlotXidsConflict() and reformatted a bit.
Also ran pgindent on it, hope it's cleaner now.
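For reference, based on the condition quoted above, the macro presumably boils
down to something like this (the exact V53 definition may differ):

#define LogicalReplicationSlotXidsConflict(slot_xmin, slot_catalog_xmin, xid) \
	((TransactionIdIsValid(slot_xmin) && \
	  TransactionIdPrecedesOrEquals(slot_xmin, xid)) || \
	 (TransactionIdIsValid(slot_catalog_xmin) && \
	  TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid)))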

> 
> 
>> @@ -1292,9 +1313,16 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>>           {
>>               MyReplicationSlot = s;
>>               s->active_pid = MyProcPid;
>> -            s->data.invalidated_at = restart_lsn;
>> -            s->data.restart_lsn = InvalidXLogRecPtr;
>> -
>> +            if (xid)
>> +            {
>> +                s->data.xmin = InvalidTransactionId;
>> +                s->data.catalog_xmin = InvalidTransactionId;
>> +            }
>> +            else
>> +            {
>> +                s->data.invalidated_at = restart_lsn;
>> +                s->data.restart_lsn = InvalidXLogRecPtr;
>> +            }
>>               /* Let caller know */
>>               *invalidated = true;
>>           }
>> @@ -1327,15 +1355,39 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>>                */
>>               if (last_signaled_pid != active_pid)
>>               {
>> -                ereport(LOG,
>> -                        errmsg("terminating process %d to release replication slot \"%s\"",
>> -                               active_pid, NameStr(slotname)),
>> -                        errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
>> -                                  LSN_FORMAT_ARGS(restart_lsn),
>> -                                  (unsigned long long) (oldestLSN - restart_lsn)),
>> -                        errhint("You might need to increase max_slot_wal_keep_size."));
>> +                if (xid)
>> +                {
>> +                    if (TransactionIdIsValid(*xid))
>> +                    {
>> +                        ereport(LOG,
>> +                                errmsg("terminating process %d because replication slot \"%s\" conflicts with
recovery",
>> +                                       active_pid, NameStr(slotname)),
>> +                                errdetail("The slot conflicted with xid horizon %u.",
>> +                                          *xid));
>> +                    }
>> +                    else
>> +                    {
>> +                        ereport(LOG,
>> +                                errmsg("terminating process %d because replication slot \"%s\" conflicts with
recovery",
>> +                                       active_pid, NameStr(slotname)),
>> +                                errdetail("Logical decoding on standby requires wal_level to be at least logical on
theprimary server"));
 
>> +                    }
>> +
>> +                    (void) SendProcSignal(active_pid, PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT, InvalidBackendId);
>> +                }
>> +                else
>> +                {
>> +                    ereport(LOG,
>> +                            errmsg("terminating process %d to release replication slot \"%s\"",
>> +                                   active_pid, NameStr(slotname)),
>> +                            errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
>> +                                      LSN_FORMAT_ARGS(restart_lsn),
>> +                                      (unsigned long long) (oldestLSN - restart_lsn)),
>> +                            errhint("You might need to increase max_slot_wal_keep_size."));
>> +
>> +                    (void) kill(active_pid, SIGTERM);
> 
> I think it ought be possible to deduplicate this a fair bit. For one, two of
> the errmsg()s above are identical.  But I think this could be consolidated
> further, e.g. by using the same message style for the three cases, and passing
> in a separately translated reason for the termination?
> 

deduplication done in V53 so that there is a single ereport() call.
I'm not sure the translation is handled correctly the way I did it, so please advise if that's not right.

> 
>> +                }
>>   
>> -                (void) kill(active_pid, SIGTERM);
>>                   last_signaled_pid = active_pid;
>>               }
>>   
>> @@ -1369,13 +1421,33 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>>               ReplicationSlotSave();
>>               ReplicationSlotRelease();
>>   
>> -            ereport(LOG,
>> -                    errmsg("invalidating obsolete replication slot \"%s\"",
>> -                           NameStr(slotname)),
>> -                    errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
>> -                              LSN_FORMAT_ARGS(restart_lsn),
>> -                              (unsigned long long) (oldestLSN - restart_lsn)),
>> -                    errhint("You might need to increase max_slot_wal_keep_size."));
>> +            if (xid)
>> +            {
>> +                pgstat_drop_replslot(s);
> 
> Why is this done here now?
> 
> 

Oops, moved above the if() in V53.

>> +                if (TransactionIdIsValid(*xid))
>> +                {
>> +                    ereport(LOG,
>> +                            errmsg("invalidating slot \"%s\" because it conflicts with recovery",
NameStr(slotname)),
>> +                            errdetail("The slot conflicted with xid horizon %u.", *xid));
>> +                }
>> +                else
>> +                {
>> +                    ereport(LOG,
>> +                            errmsg("invalidating slot \"%s\" because it conflicts with recovery",
NameStr(slotname)),
>> +                            errdetail("Logical decoding on standby requires wal_level to be at least logical on the
primaryserver"));
 
>> +                }
>> +            }
>> +            else
>> +            {
>> +                ereport(LOG,
>> +                        errmsg("invalidating obsolete replication slot \"%s\"",
>> +                               NameStr(slotname)),
>> +                        errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
>> +                                  LSN_FORMAT_ARGS(restart_lsn),
>> +                                  (unsigned long long) (oldestLSN - restart_lsn)),
>> +                        errhint("You might need to increase max_slot_wal_keep_size."));
>> +            }
>>
> 
> I don't like all these repeated elogs...

deduplication done in V53 so that there is a single ereport() call.
I'm not sure the translation is handled correctly the way I did it, so please advise if that's not right.

> 
> 
> 
>> @@ -3057,6 +3060,27 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
>>               case PROCSIG_RECOVERY_CONFLICT_LOCK:
>>               case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
>>               case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
>> +            case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
>> +
>> +                /*
>> +                 * For conflicts that require a logical slot to be
>> +                 * invalidated, the requirement is for the signal receiver to
>> +                 * release the slot, so that it could be invalidated by the
>> +                 * signal sender. So for normal backends, the transaction
>> +                 * should be aborted, just like for other recovery conflicts.
>> +                 * But if it's walsender on standby, we don't want to go
>> +                 * through the following IsTransactionOrTransactionBlock()
>> +                 * check, so break here.
>> +                 */
>> +                if (am_cascading_walsender &&
>> +                    reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
>> +                    MyReplicationSlot && SlotIsLogical(MyReplicationSlot))
>> +                {
>> +                    RecoveryConflictPending = true;
>> +                    QueryCancelPending = true;
>> +                    InterruptPending = true;
>> +                    break;
>> +                }
>>   
>>                   /*
>>                    * If we aren't in a transaction any longer then ignore.
> 
> I can't see any reason for this to be mixed into the same case "body" as LOCK
> etc?
> 

Oh right, nice catch. I don't know how it ended up done that way. Fixed in V53.

> 
>> diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
>> index 38c6f18886..290d4b45f4 100644
>> --- a/src/backend/replication/slot.c
>> +++ b/src/backend/replication/slot.c
>> @@ -51,6 +51,7 @@
>>   #include "storage/proc.h"
>>   #include "storage/procarray.h"
>>   #include "utils/builtins.h"
>> +#include "access/xlogrecovery.h"
> 
> Add new includes in the "alphabetically" right place...

Fixed in 0003 in V53 and the other places (aka other sub-patches) where it was needed.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Masahiko Sawada
Date:
Hi,

On Thu, Mar 30, 2023 at 2:45 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Thu, 2023-03-02 at 23:58 -0800, Jeff Davis wrote:
> > On Thu, 2023-03-02 at 11:45 -0800, Jeff Davis wrote:
> > > In this case it looks easier to add the right API than to be sure
> > > about
> > > whether it's needed or not.
> >
> > I attached a sketch of one approach. I'm not very confident that it's
> > the right API or even that it works as I intended it, but if others
> > like the approach I can work on it some more.
>
> Another approach might be to extend WaitEventSets() to be able to wait
> on Condition Variables, rather than Condition Variables waiting on
> WaitEventSets. Thoughts?
>

+1 to extend CV. If we extended WaitEventSet() to be able to wait on a CV,
it could make the code simpler, but we would need to change
both CV and WaitEventSet().

On Fri, Mar 10, 2023 at 8:34 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> I gave it a try, so please find attached v2-0001-Introduce-ConditionVariableEventSleep.txt (implementing the comments
> above) and 0004_new_API.txt to put the new API in the logical decoding on standby context.

@@ -180,13 +203,25 @@ ConditionVariableTimedSleep(ConditionVariable *cv, long timeout,
                 * by something other than ConditionVariableSignal; though we don't
                 * guarantee not to return spuriously, we'll avoid this obvious case.
                 */
-               SpinLockAcquire(&cv->mutex);
-               if (!proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
+
+               if (cv)
                {
-                       done = true;
-                       proclist_push_tail(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
+                       SpinLockAcquire(&cv->mutex);
+                       if (!proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
+                       {
+                               done = true;
+                               proclist_push_tail(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
+                       }
+                       SpinLockRelease(&cv->mutex);
                }

This change looks odd to me since it accepts cv being NULL in spite of
calling ConditionVariableEventSleep() for cv. I think that this is
because in 0004_new_API.txt, we use ConditionVariableEventSleep() in
both the not-in-recovery and recovery-in-progress cases in
WalSndWaitForWal() as follows:

-               WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_WAIT_WAL);
+               ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, wakeEvents, NULL);
+               ConditionVariableEventSleep(cv, RecoveryInProgress, FeBeWaitSet, NULL,
+                                           sleeptime, wait_event);
        }

But I don't think we need to use ConditionVariableEventSleep() in
not-in-recovery cases. If I correctly understand the problem this
patch wants to deal with, in logical decoding on standby cases, the
walsender needs to be woken up on the following events:

* condition variable
* timeout
* socket writable (if pq_is_send_pending() is true)
(socket readable event should also be included to avoid
wal_receiver_timeout BTW?)

On the other hand, in not-in-recovery case, the events are:

* socket readable
* socket writable (if pq_is_send_pending() is true)
* latch
* timeout

I think that we don't need to change the latter case, as
WalSndWait() works perfectly there. As for the former case, since we need
to wait for the CV, a timeout, or the socket becoming writable, we can use
ConditionVariableEventSleep().

Regards,


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-03-30 18:23:41 +0200, Drouvot, Bertrand wrote:
> On 3/30/23 9:04 AM, Andres Freund wrote:
> > I think this commit is ready to go. Unless somebody thinks differently, I
> > think I might push it tomorrow.
>
> Great! Once done, I'll submit a new patch so that GlobalVisTestFor() can make
> use of the heap relation in vacuumRedirectAndPlaceholder() (which will be possible
> once 0001 is committed).

Unfortunately I did find an issue doing a pre-commit review of the patch.

The patch adds VISIBILITYMAP_IS_CATALOG_REL to xl_heap_visible.flags - but it
does not remove the bit before calling visibilitymap_set().

This ends up corrupting the visibilitymap, because we'll set a bit for
another page.

It's unfortunate that visibilitymap_set() doesn't assert that just the correct
bits are passed in. It does assert that at least one valid bit is set, but
that's not enough, as this case shows.


I noticed this when looking into the changes to visibilitymapdefs.h in more
detail. I don't like how it ends up in the patch:

> --- a/src/include/access/visibilitymapdefs.h
> +++ b/src/include/access/visibilitymapdefs.h
> @@ -17,9 +17,11 @@
>  #define BITS_PER_HEAPBLOCK 2
>
>  /* Flags for bit map */
> -#define VISIBILITYMAP_ALL_VISIBLE    0x01
> -#define VISIBILITYMAP_ALL_FROZEN    0x02
> -#define VISIBILITYMAP_VALID_BITS    0x03    /* OR of all valid visibilitymap
> -                                             * flags bits */
> +#define VISIBILITYMAP_ALL_VISIBLE                                0x01
> +#define VISIBILITYMAP_ALL_FROZEN                                0x02
> +#define VISIBILITYMAP_VALID_BITS                                0x03    /* OR of all valid visibilitymap
> +                                                                         * flags bits */
> +#define VISIBILITYMAP_IS_CATALOG_REL                            0x04    /* to handle recovery conflict during logical
> +                                                                         * decoding on standby */

On a casual read, one very well might think that VISIBILITYMAP_IS_CATALOG_REL
is a valid bit that could be set in the VM.

I am thinking of instead creating a separate namespace for the "xlog only"
bits:

/*
 * To detect recovery conflicts during logical decoding on a standby, we need
 * to know if a table is a user catalog table. For that we add an additional
 * bit into xl_heap_visible.flags, in addition to the above.
 *
 * NB: VISIBILITYMAP_XLOG_* may not be passed to visibilitymap_set().
 */
#define VISIBILITYMAP_XLOG_CATALOG_REL    0x04
#define VISIBILITYMAP_XLOG_VALID_BITS    (VISIBILITYMAP_VALID_BITS | VISIBILITYMAP_XLOG_CATALOG_REL)


That allows heap_xlog_visible() to do:

        Assert((xlrec->flags & VISIBILITYMAP_XLOG_VALID_BITS) == xlrec->flags);
        vmbits = (xlrec->flags & VISIBILITYMAP_VALID_BITS);

and pass vmbits instead of xlrec->flags to visibilitymap_set().


I'm also thinking of splitting the patch into two. One patch to pass down the
heap relation into the new places, and another for the rest. As evidenced
above, looking at the actual behaviour changes is important...


Given how the patch changes the struct for XLOG_GIST_DELETE:

> diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
> index 2ce9366277..93fb9d438a 100644
> --- a/src/include/access/gistxlog.h
> +++ b/src/include/access/gistxlog.h
> @@ -51,11 +51,14 @@ typedef struct gistxlogDelete
>  {
>      TransactionId snapshotConflictHorizon;
>      uint16        ntodelete;        /* number of deleted offsets */
> +    bool        isCatalogRel;    /* to handle recovery conflict during logical
> +                                 * decoding on standby */
>
> -    /* TODELETE OFFSET NUMBER ARRAY FOLLOWS */
> +    /* TODELETE OFFSET NUMBERS */
> +    OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
>  } gistxlogDelete;
>
> -#define SizeOfGistxlogDelete    (offsetof(gistxlogDelete, ntodelete) + sizeof(uint16))
> +#define SizeOfGistxlogDelete    offsetof(gistxlogDelete, offsets)


and XLOG_HASH_VACUUM_ONE_PAGE:

> diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
> index 9894ab9afe..6c5535fe73 100644
> --- a/src/include/access/hash_xlog.h
> +++ b/src/include/access/hash_xlog.h
> @@ -252,12 +252,14 @@ typedef struct xl_hash_vacuum_one_page
>  {
>      TransactionId snapshotConflictHorizon;
>      uint16            ntuples;
> +    bool        isCatalogRel;   /* to handle recovery conflict during logical
> +                                 * decoding on standby */
>
> -    /* TARGET OFFSET NUMBERS FOLLOW AT THE END */
> +    /* TARGET OFFSET NUMBERS */
> +    OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
>  } xl_hash_vacuum_one_page;
>
> -#define SizeOfHashVacuumOnePage \
> -    (offsetof(xl_hash_vacuum_one_page, ntuples) + sizeof(uint16))
> +#define SizeOfHashVacuumOnePage offsetof(xl_hash_vacuum_one_page, offsets)


I don't think the changes are quite sufficient:

for gist:

> @@ -672,11 +668,12 @@ gistXLogUpdate(Buffer buffer,
>   */
>  XLogRecPtr
>  gistXLogDelete(Buffer buffer, OffsetNumber *todelete, int ntodelete,
> -               TransactionId snapshotConflictHorizon)
> +               TransactionId snapshotConflictHorizon, Relation heaprel)
>  {
>      gistxlogDelete xlrec;
>      XLogRecPtr    recptr;
>
> +    xlrec.isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
>      xlrec.snapshotConflictHorizon = snapshotConflictHorizon;
>      xlrec.ntodelete = ntodelete;

Note that gistXLogDelete() continues to register data with two different
XLogRegisterData() calls. This will append data without any padding:

    XLogRegisterData((char *) &xlrec, SizeOfGistxlogDelete);

    /*
     * We need the target-offsets array whether or not we store the whole
     * buffer, to allow us to find the snapshotConflictHorizon on a standby
     * server.
     */
    XLogRegisterData((char *) todelete, ntodelete * sizeof(OffsetNumber));


But replay now uses the new offset member:

> @@ -177,6 +177,7 @@ gistRedoDeleteRecord(XLogReaderState *record)
>      gistxlogDelete *xldata = (gistxlogDelete *) XLogRecGetData(record);
>      Buffer        buffer;
>      Page        page;
> +    OffsetNumber *toDelete = xldata->offsets;
>
>      /*
>       * If we have any conflict processing to do, it must happen before we


That doesn't look right. If there's any padding before offsets, we'll afaict
read completely bogus data?

As it turns out, there is padding:

struct gistxlogDelete {
        TransactionId              snapshotConflictHorizon; /*     0     4 */
        uint16                     ntodelete;            /*     4     2 */
        _Bool                      isCatalogRel;         /*     6     1 */

        /* XXX 1 byte hole, try to pack */

        OffsetNumber               offsets[];            /*     8     0 */

        /* size: 8, cachelines: 1, members: 4 */
        /* sum members: 7, holes: 1, sum holes: 1 */
        /* last cacheline: 8 bytes */
};


I am frankly baffled how this works at all, this should just about immediately
crash?


Oh, I see. We apparently don't reach the gist deletion code in the tests:
https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html#674
https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html#174

And indeed, if I add an abort() into , it's not reached.

And it's not because tests use a temp table, the caller is also unreachable:
https://coverage.postgresql.org/src/backend/access/gist/gist.c.gcov.html#1643

Whut?


And the same issue exists for hash as well.

Logging:
            XLogRegisterData((char *) &xlrec, SizeOfHashVacuumOnePage);

            /*
             * We need the target-offsets array whether or not we store the
             * whole buffer, to allow us to find the snapshotConflictHorizon
             * on a standby server.
             */
            XLogRegisterData((char *) deletable,
                             ndeletable * sizeof(OffsetNumber));

Redo:

>      xldata = (xl_hash_vacuum_one_page *) XLogRecGetData(record);
> +    toDelete = xldata->offsets;


And there also are no tests:
https://coverage.postgresql.org/src/backend/access/hash/hashinsert.c.gcov.html#372


I'm not going to commit a nontrivial change to these WAL records without some
minimal tests.


Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Jeff Davis
Date:
On Wed, 2023-03-08 at 11:25 +0100, Drouvot, Bertrand wrote:
> - Maybe ConditionVariableEventSleep() should take care of the
> “WaitEventSetWait returns 1 and cvEvent.event == WL_POSTMASTER_DEATH”
> case?

Thank you, done. I think the nearby line was also wrong, returning true
when there was no timeout. I combined the lines and got rid of the
early return so it can check the list and timeout condition like
normal. Attached.

> - Maybe ConditionVariableEventSleep() could accept and deal with the
> CV being NULL?
> I used it in the POC attached to handle logical decoding on the
> primary server case.
> One option should be to create a dedicated CV for that case though.

I don't think it's a good idea to have a CV-based API that doesn't need
a CV. Wouldn't that just be a normal WaitEventSet?

> - In the POC attached I had to add this extra condition “(cv &&
> !RecoveryInProgress())” to avoid waiting on the timeout when there is
> a promotion.
> That makes me think that we may want to add 2 extra parameters (as 2
> functions returning a bool?) to ConditionVariableEventSleep()
> to check whether or not we still want to test the socket or the CV
> wake up in each loop iteration.

That seems like a complex API. Would it work to signal the CV during
promotion instead?

> Also 3 additional remarks:
>
> 1) About InitializeConditionVariableWaitSet() and
> ConditionVariableWaitSetCreate(): I'm not sure about the naming as
> there is no CV yet (they "just" deal with WaitEventSet).

It's a WaitEventSet that contains the conditions always required for
any CV, and allows you to add in more.

> 3)
>
> I wonder if there is no race conditions: ConditionVariableWaitSet is
> being initialized with PGINVALID_SOCKET
> as WL_LATCH_SET and might be also (if IsUnderPostmaster) be
> initialized with PGINVALID_SOCKET as WL_EXIT_ON_PM_DEATH.
>
> So IIUC, the patch is introducing 2 new possible source of wake up.

Those should be the same conditions already required by
ConditionVariableTimedSleep() in master, right?


Regards,
    Jeff Davis

Attachment

Re: Minimal logical decoding on standbys

From
Jeff Davis
Date:
On Fri, 2023-03-31 at 01:31 +0900, Masahiko Sawada wrote:
> I think that we don't need to change for the latter case as
> WalSndWait() perfectly works. As for the former cases, since we need
> to wait for CV, timeout, or socket writable we can use
> ConditionVariableEventSleep().

For this patch series, I agree.

But if the ConditionVariableEventSleep() API is added, then I think we
should change the non-recovery case to use a CV as well for
consistency, and it would avoid the need for WalSndWakeup().

Regards,
    Jeff Davis




Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 3/31/23 6:33 AM, Andres Freund wrote:
> Hi,
> 
> On 2023-03-30 18:23:41 +0200, Drouvot, Bertrand wrote:
>> On 3/30/23 9:04 AM, Andres Freund wrote:
>>> I think this commit is ready to go. Unless somebody thinks differently, I
>>> think I might push it tomorrow.
>>
>> Great! Once done, I'll submit a new patch so that GlobalVisTestFor() can make
>> use of the heap relation in vacuumRedirectAndPlaceholder() (which will be possible
>> once 0001 is committed).
> 
> Unfortunately I did find an issue doing a pre-commit review of the patch.
> 
> The patch adds VISIBILITYMAP_IS_CATALOG_REL to xl_heap_visible.flags - but it
> does not remove the bit before calling visibilitymap_set().
> 
> This ends up corrupting the visibilitymap, because we'll set a bit for
> another page.
> 

Oh I see, I did not think about that (not enough experience in the VM area).
Nice catch and thanks for pointing out!

> On a casual read, one very well might think that VISIBILITYMAP_IS_CATALOG_REL
> is a valid bit that could be set in the VM.
> 

I see what you're saying now and do agree that's confusing.

> I am thinking of instead creating a separate namespace for the "xlog only"
> bits:
> 
> /*
>   * To detect recovery conflicts during logical decoding on a standby, we need
>   * to know if a table is a user catalog table. For that we add an additional
>   * bit into xl_heap_visible.flags, in addition to the above.
>   *
>   * NB: VISIBILITYMAP_XLOG_* may not be passed to visibilitymap_set().
>   */
> #define VISIBILITYMAP_XLOG_CATALOG_REL    0x04
> #define VISIBILITYMAP_XLOG_VALID_BITS    (VISIBILITYMAP_VALID_BITS | VISIBILITYMAP_XLOG_CATALOG_REL)
> 
> 
> That allows heap_xlog_visible() to do:
> 
>         Assert((xlrec->flags & VISIBILITYMAP_XLOG_VALID_BITS) == xlrec->flags);
>         vmbits = (xlrec->flags & VISIBILITYMAP_VALID_BITS);
> 
> and pass vmbits instead of xlrec->flags to visibilitymap_set().
> 

That sounds good to me. That way you'd ensure that VISIBILITYMAP_XLOG_CATALOG_REL is not
passed to visibilitymap_set().

> 
> I'm also thinking of splitting the patch into two. One patch to pass down the
> heap relation into the new places, and another for the rest.

I think that makes sense. I don't know how far you've worked on the split, but please
find attached V54 doing such a split + implementing your VISIBILITYMAP_XLOG_VALID_BITS
suggestion.

> 
> Note that gistXLogDelete() continues to register data with two different
> XLogRegisterData() calls. This will append data without any padding:
> 
>     XLogRegisterData((char *) &xlrec, SizeOfGistxlogDelete);
> 
>     /*
>      * We need the target-offsets array whether or not we store the whole
>      * buffer, to allow us to find the snapshotConflictHorizon on a standby
>      * server.
>      */
>     XLogRegisterData((char *) todelete, ntodelete * sizeof(OffsetNumber));
> 
> 
> But replay now uses the new offset member:
> 
>> @@ -177,6 +177,7 @@ gistRedoDeleteRecord(XLogReaderState *record)
>>       gistxlogDelete *xldata = (gistxlogDelete *) XLogRecGetData(record);
>>       Buffer        buffer;
>>       Page        page;
>> +    OffsetNumber *toDelete = xldata->offsets;
>>
>>       /*
>>        * If we have any conflict processing to do, it must happen before we
> 
> 
> That doesn't look right. If there's any padding before offsets, we'll afaict
> read completely bogus data?
> 
> As it turns out, there is padding:
> 
> struct gistxlogDelete {
>          TransactionId              snapshotConflictHorizon; /*     0     4 */
>          uint16                     ntodelete;            /*     4     2 */
>          _Bool                      isCatalogRel;         /*     6     1 */
> 
>          /* XXX 1 byte hole, try to pack */
> 
>          OffsetNumber               offsets[];            /*     8     0 */
> 
>          /* size: 8, cachelines: 1, members: 4 */
>          /* sum members: 7, holes: 1, sum holes: 1 */
>          /* last cacheline: 8 bytes */
> };
> 
> 
> I am frankly baffled how this works at all, this should just about immediately
> crash?
> 
> 

Oh, I see. Hm, don't we already have the same issue for spgxlogVacuumRoot / vacuumLeafRoot() / spgRedoVacuumRoot()?

> 
> I'm not going to commit a nontrivial change to these WAL records without some
> minimal tests.
> 

That makes complete sense.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Fri, Mar 31, 2023 at 4:17 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>

+ * This is needed for logical decoding on standby. Indeed the "problem" is that
+ * WalSndWaitForWal() waits for the *replay* LSN to increase, but gets woken up
+ * by walreceiver when new WAL has been flushed. Which means that typically
+ * walsenders will get woken up at the same time that the startup process
+ * will be - which means that by the time the logical walsender checks
+ * GetXLogReplayRecPtr() it's unlikely that the startup process
already replayed
+ * the record and updated XLogCtl->lastReplayedEndRecPtr.
+ *
+ * The ConditionVariable XLogRecoveryCtl->replayedCV solves this corner case.

IIUC we are introducing condition variables as we can't rely on
current wait events because they will lead to spurious wakeups for
logical walsenders due to the below code in walreceiver:
XLogWalRcvFlush()
{
...
/* Signal the startup process and walsender that new WAL has arrived */
WakeupRecovery();
if (AllowCascadeReplication())
WalSndWakeup();

Is my understanding correct?

Can't we simply avoid waking up logical walsenders at this place and
rather wake them up at ApplyWalRecord() where the 0005 patch does
a condition variable broadcast? Now, there doesn't seem to be anything
that distinguishes between logical and physical walsender but I guess
we can add a variable in WalSnd structure to identify it.
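Something like the following sketch, perhaps (the is_logical field and the
LogicalWalSndWakeup() name are made up here purely for illustration):

/* hypothetical addition to WalSnd in walsender_private.h */
	bool		is_logical;		/* does this walsender do logical decoding? */

/* hypothetical helper, called from ApplyWalRecord() */
static void
LogicalWalSndWakeup(void)
{
	for (int i = 0; i < max_wal_senders; i++)
	{
		WalSnd	   *walsnd = &WalSndCtl->walsnds[i];
		Latch	   *latch = NULL;

		SpinLockAcquire(&walsnd->mutex);
		if (walsnd->is_logical)
			latch = walsnd->latch;
		SpinLockRelease(&walsnd->mutex);

		if (latch != NULL)
			SetLatch(latch);
	}
}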

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 3/31/23 1:58 PM, Amit Kapila wrote:
> On Fri, Mar 31, 2023 at 4:17 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
> 
> + * This is needed for logical decoding on standby. Indeed the "problem" is that
> + * WalSndWaitForWal() waits for the *replay* LSN to increase, but gets woken up
> + * by walreceiver when new WAL has been flushed. Which means that typically
> + * walsenders will get woken up at the same time that the startup process
> + * will be - which means that by the time the logical walsender checks
> + * GetXLogReplayRecPtr() it's unlikely that the startup process already replayed
> + * the record and updated XLogCtl->lastReplayedEndRecPtr.
> + *
> + * The ConditionVariable XLogRecoveryCtl->replayedCV solves this corner case.
> 
> IIUC we are introducing condition variables as we can't rely on
> current wait events because they will lead to spurious wakeups for
> logical walsenders due to the below code in walreceiver:
> XLogWalRcvFlush()
> {
> ...
> /* Signal the startup process and walsender that new WAL has arrived */
> WakeupRecovery();
> if (AllowCascadeReplication())
> WalSndWakeup();
> 
> Is my understanding correct?
> 

Both the walsender and the startup process are woken up at the
same time. If the walsender does not find any new record that has been replayed
(because the startup process did not replay yet), then it will sleep for its
timeout duration (thus delaying the decoding).

The CV helps to wake up the walsender as soon as a replay is done.

> Can't we simply avoid waking up logical walsenders at this place and
> rather wake them up at ApplyWalRecord() where the 0005 patch does
> conditionvariable broadcast? Now, there doesn't seem to be anything
> that distinguishes between logical and physical walsender but I guess
> we can add a variable in WalSnd structure to identify it.
> 

That sounds like a good idea. We could imagine creating a LogicalWalSndWakeup()
that triages the walsenders based on a new variable (as you suggest).

But, it looks to me that we:

- would need to go through the list of all the walsenders to do the triage
- could wake up some logical walsender(s) unnecessarily

This extra work would occur during each replay.

whereas with the CV, only the walsenders in the CV wait queue would be woken up.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-03-30 21:33:00 -0700, Andres Freund wrote:
> > diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
> > index 2ce9366277..93fb9d438a 100644
> > --- a/src/include/access/gistxlog.h
> > +++ b/src/include/access/gistxlog.h
> > @@ -51,11 +51,14 @@ typedef struct gistxlogDelete
> >  {
> >      TransactionId snapshotConflictHorizon;
> >      uint16        ntodelete;        /* number of deleted offsets */
> > +    bool        isCatalogRel;    /* to handle recovery conflict during logical
> > +                                 * decoding on standby */
> >
> > -    /* TODELETE OFFSET NUMBER ARRAY FOLLOWS */
> > +    /* TODELETE OFFSET NUMBERS */
> > +    OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
> >  } gistxlogDelete;
> >
> > -#define SizeOfGistxlogDelete    (offsetof(gistxlogDelete, ntodelete) + sizeof(uint16))
> > +#define SizeOfGistxlogDelete    offsetof(gistxlogDelete, offsets)

> 
> I don't think the changes are quite sufficient:
> 
> for gist:
> 
> > @@ -672,11 +668,12 @@ gistXLogUpdate(Buffer buffer,
> >   */
> >  XLogRecPtr
> >  gistXLogDelete(Buffer buffer, OffsetNumber *todelete, int ntodelete,
> > -               TransactionId snapshotConflictHorizon)
> > +               TransactionId snapshotConflictHorizon, Relation heaprel)
> >  {
> >      gistxlogDelete xlrec;
> >      XLogRecPtr    recptr;
> >
> > +    xlrec.isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
> >      xlrec.snapshotConflictHorizon = snapshotConflictHorizon;
> >      xlrec.ntodelete = ntodelete;
> 
> Note that gistXLogDelete() continues to register data with two different
> XLogRegisterData() calls. This will append data without any padding:
> 
>     XLogRegisterData((char *) &xlrec, SizeOfGistxlogDelete);
> 
>     /*
>      * We need the target-offsets array whether or not we store the whole
>      * buffer, to allow us to find the snapshotConflictHorizon on a standby
>      * server.
>      */
>     XLogRegisterData((char *) todelete, ntodelete * sizeof(OffsetNumber));
> 
> 
> But replay now uses the new offset member:
> 
> > @@ -177,6 +177,7 @@ gistRedoDeleteRecord(XLogReaderState *record)
> >      gistxlogDelete *xldata = (gistxlogDelete *) XLogRecGetData(record);
> >      Buffer        buffer;
> >      Page        page;
> > +    OffsetNumber *toDelete = xldata->offsets;
> >
> >      /*
> >       * If we have any conflict processing to do, it must happen before we
> 
> 
> That doesn't look right. If there's any padding before offsets, we'll afaict
> read completely bogus data?
> 
> As it turns out, there is padding:
> 
> struct gistxlogDelete {
>         TransactionId              snapshotConflictHorizon; /*     0     4 */
>         uint16                     ntodelete;            /*     4     2 */
>         _Bool                      isCatalogRel;         /*     6     1 */
> 
>         /* XXX 1 byte hole, try to pack */
> 
>         OffsetNumber               offsets[];            /*     8     0 */
> 
>         /* size: 8, cachelines: 1, members: 4 */
>         /* sum members: 7, holes: 1, sum holes: 1 */
>         /* last cacheline: 8 bytes */
> };
> 
> 
> I am frankly baffled how this works at all, this should just about immediately
> crash?
> 
> 
> Oh, I see. We apparently don't reach the gist deletion code in the tests:
> https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html#674
> https://coverage.postgresql.org/src/backend/access/gist/gistxlog.c.gcov.html#174
> 
> And indeed, if I add an abort() into , it's not reached.
> 
> And it's not because tests use a temp table, the caller is also unreachable:
> https://coverage.postgresql.org/src/backend/access/gist/gist.c.gcov.html#1643

After writing a minimal test to reach it, it turns out to actually work - I
missed that SizeOfGistxlogDelete now includes the padding, where commonly that
pattern tries to *exclude* trailing padding.  Sorry for the noise on this one
:(
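In other words (just restating why the current coding lines up; not patch code):

/*
 * With isCatalogRel added there is a 1-byte hole before offsets[], so
 *
 *     SizeOfGistxlogDelete == offsetof(gistxlogDelete, offsets) == 8
 *
 * and the main-data chunk registered via
 *     XLogRegisterData((char *) &xlrec, SizeOfGistxlogDelete);
 * is 8 bytes long.  The separately registered offset array therefore starts
 * exactly where xldata->offsets points during redo.  Had the macro excluded
 * the trailing padding (e.g. offsetof(gistxlogDelete, isCatalogRel) +
 * sizeof(bool) == 7), the registered main data would be one byte shorter
 * than where xldata->offsets points, and redo would read the offset array
 * at the wrong position.
 */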

I am writing two minimal test cases to reach this code for hash and
gist. Not to commit as part of this, but to be able to verify that it
works. I'll post them in the separate thread I started about the lack of
regression test coverage in the area.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Fri, Mar 31, 2023 at 7:14 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 3/31/23 1:58 PM, Amit Kapila wrote:
> > On Fri, Mar 31, 2023 at 4:17 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
> >>
> >
> > + * This is needed for logical decoding on standby. Indeed the "problem" is that
> > + * WalSndWaitForWal() waits for the *replay* LSN to increase, but gets woken up
> > + * by walreceiver when new WAL has been flushed. Which means that typically
> > + * walsenders will get woken up at the same time that the startup process
> > + * will be - which means that by the time the logical walsender checks
> > + * GetXLogReplayRecPtr() it's unlikely that the startup process already replayed
> > + * the record and updated XLogCtl->lastReplayedEndRecPtr.
> > + *
> > + * The ConditionVariable XLogRecoveryCtl->replayedCV solves this corner case.
> >
> > IIUC we are introducing condition variables as we can't rely on
> > current wait events because they will lead to spurious wakeups for
> > logical walsenders due to the below code in walreceiver:
> > XLogWalRcvFlush()
> > {
> > ...
> > /* Signal the startup process and walsender that new WAL has arrived */
> > WakeupRecovery();
> > if (AllowCascadeReplication())
> > WalSndWakeup();
> >
> > Is my understanding correct?
> >
>
> Both the walsender and the startup process are woken up at the
> same time. If the walsender does not find any new record that has been replayed
> (because the startup process did not replay yet), then it will sleep for its
> timeout duration (thus delaying the decoding).
>
> The CV helps to wake up the walsender as soon as a replay is done.
>
> > Can't we simply avoid waking up logical walsenders at this place and
> > rather wake them up at ApplyWalRecord() where the 0005 patch does
> > conditionvariable broadcast? Now, there doesn't seem to be anything
> > that distinguishes between logical and physical walsender but I guess
> > we can add a variable in WalSnd structure to identify it.
> >
>
> That sounds like a good idea. We could imagine creating a LogicalWalSndWakeup()
> doing the Walsender(s) triage based on a new variable (as you suggest).
>
> But, it looks to me that we:
>
> - would need to go through the list of all the walsenders to do the triage
> - could wake up some logical walsender(s) unnecessary
>

Why would it wake up unnecessarily?

> This extra work would occur during each replay.
>
> while with the CV, only the ones in the CV wait queue would be waked up.
>

Currently, we wake up walsenders only after writing some WAL records
at the time of flush, so won't it be better to wake up only after
applying some WAL records rather than after applying each record?

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-03-31 12:45:51 +0200, Drouvot, Bertrand wrote:
> On 3/31/23 6:33 AM, Andres Freund wrote:
> > Hi,
> > 
> > On 2023-03-30 18:23:41 +0200, Drouvot, Bertrand wrote:
> > > On 3/30/23 9:04 AM, Andres Freund wrote:
> > > > I think this commit is ready to go. Unless somebody thinks differently, I
> > > > think I might push it tomorrow.
> > > 
> > > Great! Once done, I'll submit a new patch so that GlobalVisTestFor() can make
> > > use of the heap relation in vacuumRedirectAndPlaceholder() (which will be possible
> > > once 0001 is committed).
> > 
> > Unfortunately I did find an issue doing a pre-commit review of the patch.
> > 
> > The patch adds VISIBILITYMAP_IS_CATALOG_REL to xl_heap_visible.flags - but it
> > does not remove the bit before calling visibilitymap_set().
> > 
> > This ends up corrupting the visibilitymap, because we'll set a bit for
> > another page.
> > 
> 
> Oh I see, I did not think about that (not enough experience in the VM area).
> Nice catch and thanks for pointing out!

I pushed a commit just adding an assertion that only valid bits are passed in.


> > I'm also thinking of splitting the patch into two. One patch to pass down the
> > heap relation into the new places, and another for the rest.
> 
> I think that makes sense. I don't know how far you've work on the split but please
> find attached V54 doing such a split + implementing your VISIBILITYMAP_XLOG_VALID_BITS
> suggestion.

I pushed the pass-the-relation part.  I removed an include of catalog.h that
was in the patch - I suspect it might have slipped in there from a later patch
in the series...

I was a bit bothered by using 'heap' instead of 'table' in so many places
(eventually we imo should standardize on the latter), but looking around the
changed places, heap was used for things like buffers etc. So I left it at
heap.

Glad we split 0001 - the rest is a lot easier to review.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/1/23 6:50 AM, Amit Kapila wrote:
> On Fri, Mar 31, 2023 at 7:14 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> That sounds like a good idea. We could imagine creating a LogicalWalSndWakeup()
>> doing the Walsender(s) triage based on a new variable (as you suggest).
>>
>> But, it looks to me that we:
>>
>> - would need to go through the list of all the walsenders to do the triage
>> - could wake up some logical walsender(s) unnecessarily
>>
> 
> Why would it wake up unnecessarily?

I was thinking that, if a new LogicalWalSndWakeup() replaces
"ConditionVariableBroadcast(&XLogRecoveryCtl->replayedCV);"
in ApplyWalRecord(), then it could be possible that some walsender(s)
are requested to wake up while they are actually doing decoding (but I might be wrong).

> 
>> This extra work would occur during each replay.
>>
>> while with the CV, only the ones in the CV wait queue would be waked up.
>>
> 
> Currently, we wake up walsenders only after writing some WAL records
> at the time of flush, so won't it be better to wake up only after
> applying some WAL records rather than after applying each record?

Yeah that would be better.

Do you have any idea about how (and where) we could define the "some WAL records replayed"?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/2/23 5:42 AM, Andres Freund wrote:
> Hi,
> 
> On 2023-03-31 12:45:51 +0200, Drouvot, Bertrand wrote:
> 
> I pushed a commit just adding an assertion that only valid bits are passed in.
> 
> 

Thanks!

>>> I'm also thinking of splitting the patch into two. One patch to pass down the
>>> heap relation into the new places, and another for the rest.
>>
>> I think that makes sense. I don't know how far you've work on the split but please
>> find attached V54 doing such a split + implementing your VISIBILITYMAP_XLOG_VALID_BITS
>> suggestion.
> 
> I pushed the pass-the-relation part. 

Thanks! I just created a new thread [1] for passing down the heap relation to GlobalVisTestFor() in
vacuumRedirectAndPlaceholder().

> I removed an include of catalog.h that
> was in the patch - I suspect it might have slipped in there from a later patch
> in the series...
> 

Oops, my bad. Thanks! Yeah, indeed it's due to the split and it's in fact needed in
"Add-info-in-WAL-records-in-preparation-for-logic.patch".

Please find enclosed v55 with the correction (re-adding it to Add-info-in-WAL-records-in-preparation-for-logic.patch as
compared to v54).

> I was a bit bothered by using 'heap' instead of 'table' in so many places
> (eventually we imo should standardize on the latter), but looking around the
> changed places, heap was used for things like buffers etc. 

yup


[1]: https://www.postgresql.org/message-id/flat/02392033-f030-a3c8-c7d0-5c27eb529fec%40gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Jeff Davis
Date:
On Sun, 2023-04-02 at 10:11 +0200, Drouvot, Bertrand wrote:
> I was thinking that, if a new LogicalWalSndWakeup() replaces
> "ConditionVariableBroadcast(&XLogRecoveryCtl->replayedCV);"
> in ApplyWalRecord() then, it could be possible that some walsender(s)
> are requested to wake up while they are actually doing decoding (but
> I might be wrong).

I don't think that's a problem, right?

We are concerned about wakeups when they happen repeatedly when there's
no work to do, or when the wakeup doesn't happen when it should (and we
need to wait for a timeout).

> >
> > Currently, we wake up walsenders only after writing some WAL
> > records
> > at the time of flush, so won't it be better to wake up only after
> > applying some WAL records rather than after applying each record?
>
> Yeah that would be better.

Why? If the walsender is asleep, and there's work to be done, why not
wake it up?

If it's already doing work, and the latch gets repeatedly set, that
doesn't look like a problem either. The comment on SetLatch() says:

  /*
   * Sets a latch and wakes up anyone waiting on it.
   *
   * This is cheap if the latch is already set, otherwise not so much.

Regards,
    Jeff Davis








Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

Btw, most of the patches have some things that pgindent will change (and some
that my editor will highlight). It wouldn't hurt to run pgindent for the later
patches...

Pushed the WAL format change.


On 2023-04-02 10:27:45 +0200, Drouvot, Bertrand wrote:
> During WAL replay on standby, when slot conflict is identified,
> invalidate such slots. Also do the same thing if wal_level on the primary server
> is reduced to below logical and there are existing logical slots
> on standby. Introduce a new ProcSignalReason value for slot
> conflict recovery. Arrange for a new pg_stat_database_conflicts field:
> confl_active_logicalslot.
> 
> Add a new field "conflicting" in pg_replication_slots.
> 
> Author: Andres Freund (in an older version), Amit Khandekar, Bertrand Drouvot
> Reviewed-By: Bertrand Drouvot, Andres Freund, Robert Haas, Fabrizio de Royes Mello,
> Bharath Rupireddy
> ---
>  doc/src/sgml/monitoring.sgml                  |  11 +
>  doc/src/sgml/system-views.sgml                |  10 +
>  src/backend/access/gist/gistxlog.c            |   2 +
>  src/backend/access/hash/hash_xlog.c           |   1 +
>  src/backend/access/heap/heapam.c              |   3 +
>  src/backend/access/nbtree/nbtxlog.c           |   2 +
>  src/backend/access/spgist/spgxlog.c           |   1 +
>  src/backend/access/transam/xlog.c             |  20 +-
>  src/backend/catalog/system_views.sql          |   6 +-
>  .../replication/logical/logicalfuncs.c        |  13 +-
>  src/backend/replication/slot.c                | 189 ++++++++++++++----
>  src/backend/replication/slotfuncs.c           |  16 +-
>  src/backend/replication/walsender.c           |   7 +
>  src/backend/storage/ipc/procsignal.c          |   3 +
>  src/backend/storage/ipc/standby.c             |  13 +-
>  src/backend/tcop/postgres.c                   |  28 +++
>  src/backend/utils/activity/pgstat_database.c  |   4 +
>  src/backend/utils/adt/pgstatfuncs.c           |   3 +
>  src/include/catalog/pg_proc.dat               |  11 +-
>  src/include/pgstat.h                          |   1 +
>  src/include/replication/slot.h                |  14 +-
>  src/include/storage/procsignal.h              |   1 +
>  src/include/storage/standby.h                 |   2 +
>  src/test/regress/expected/rules.out           |   8 +-
>  24 files changed, 308 insertions(+), 61 deletions(-)
>    5.3% doc/src/sgml/
>    6.2% src/backend/access/transam/
>    4.6% src/backend/replication/logical/
>   55.6% src/backend/replication/
>    4.4% src/backend/storage/ipc/
>    6.9% src/backend/tcop/
>    5.3% src/backend/
>    3.8% src/include/catalog/
>    5.3% src/include/replication/

I think it might be worth trying to split this up a bit.


>          restart_lsn = s->data.restart_lsn;
> -
> -        /*
> -         * If the slot is already invalid or is fresh enough, we don't need to
> -         * do anything.
> -         */
> -        if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
> +        slot_xmin = s->data.xmin;
> +        slot_catalog_xmin = s->data.catalog_xmin;
> +
> +        /* the slot has been invalidated (logical decoding conflict case) */
> +        if ((xid && ((LogicalReplicationSlotIsInvalid(s)) ||
> +        /* or the xid is valid and this is a non conflicting slot */
> +                     (TransactionIdIsValid(*xid) && !(LogicalReplicationSlotXidsConflict(slot_xmin, slot_catalog_xmin, *xid))))) ||
> +        /* or the slot has been invalidated (obsolete LSN case) */
> +            (!xid && (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)))
>          {

This still looks nearly unreadable. I suggest moving comments outside of the
if (), remove redundant parentheses, use a function to detect if the slot has
been invalidated.
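
To illustrate the kind of restructuring meant here (the helper below is
hypothetical and only a sketch against the quoted patch, not code from it):

    /* Hypothetical helper: has this slot already been invalidated? */
    static bool
    SlotAlreadyInvalidated(ReplicationSlot *s, bool logical_conflict,
                           XLogRecPtr oldestLSN)
    {
        if (logical_conflict)
            return LogicalReplicationSlotIsInvalid(s);

        return XLogRecPtrIsInvalid(s->data.restart_lsn) ||
            s->data.restart_lsn >= oldestLSN;
    }

    ...
    /* with the explanatory comments hoisted above it, the test becomes: */
    if (SlotAlreadyInvalidated(s, xid != NULL, oldestLSN) ||
        (xid != NULL && TransactionIdIsValid(*xid) &&
         !LogicalReplicationSlotXidsConflict(s->data.xmin,
                                             s->data.catalog_xmin, *xid)))
    {
        /* already invalid, or not conflicting: nothing to do */
    }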


> @@ -1329,16 +1345,45 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>               */
>              if (last_signaled_pid != active_pid)
>              {
> +                bool        send_signal = false;
> +
> +                initStringInfo(&err_msg);
> +                initStringInfo(&err_detail);
> +
> +                appendStringInfo(&err_msg, "terminating process %d to release replication slot \"%s\"",
> +                                 active_pid,
> +                                 NameStr(slotname));

For this to be translatable you need to use _("message").
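
(That is, roughly this shape; and, as comes up further down-thread, the
already-translated buffer then has to be emitted with errmsg_internal() /
errdetail_internal() so it is not pushed through translation a second time:)

    appendStringInfo(&err_msg,
                     _("terminating process %d to release replication slot \"%s\""),
                     active_pid, NameStr(slotname));
    ...
    ereport(LOG,
            errmsg_internal("%s", err_msg.data),
            errdetail_internal("%s", err_detail.data));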


> +                if (xid)
> +                {
> +                    appendStringInfo(&err_msg, " because it conflicts with recovery");
> +                    send_signal = true;
> +
> +                    if (TransactionIdIsValid(*xid))
> +                        appendStringInfo(&err_detail, "The slot conflicted with xid horizon %u.", *xid);
> +                    else
> +                        appendStringInfo(&err_detail, "Logical decoding on standby requires wal_level to be at least logical on the primary server");
> +                }
> +                else
> +                {
> +                    appendStringInfo(&err_detail, "The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
> +                                     LSN_FORMAT_ARGS(restart_lsn),
> +                                     (unsigned long long) (oldestLSN - restart_lsn));
> +                }
> +
>                  ereport(LOG,
> -                        errmsg("terminating process %d to release replication slot \"%s\"",
> -                               active_pid, NameStr(slotname)),
> -                        errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
> -                                  LSN_FORMAT_ARGS(restart_lsn),
> -                                  (unsigned long long) (oldestLSN - restart_lsn)),
> -                        errhint("You might need to increase max_slot_wal_keep_size."));
> -
> -                (void) kill(active_pid, SIGTERM);
> +                        errmsg("%s", err_msg.data),
> +                        errdetail("%s", err_detail.data),
> +                        send_signal ? 0 : errhint("You might need to increase max_slot_wal_keep_size."));
> +
> +                if (send_signal)
> +                    (void) SendProcSignal(active_pid, PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT, InvalidBackendId);
> +                else
> +                    (void) kill(active_pid, SIGTERM);
> +
>                  last_signaled_pid = active_pid;
> +
> +                pfree(err_msg.data);
> +                pfree(err_detail.data);
>              }
>  
>              /* Wait until the slot is released. */
> @@ -1355,6 +1400,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>          }
>          else
>          {
> +            bool        hint = false;;
> +
> +            initStringInfo(&err_msg);
> +            initStringInfo(&err_detail);
> +
>              /*
>               * We hold the slot now and have already invalidated it; flush it
>               * to ensure that state persists.
> @@ -1370,14 +1420,37 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>              ReplicationSlotMarkDirty();
>              ReplicationSlotSave();
>              ReplicationSlotRelease();
> +            pgstat_drop_replslot(s);
> +
> +            appendStringInfo(&err_msg, "invalidating");
> +
> +            if (xid)
> +            {
> +                if (TransactionIdIsValid(*xid))
> +                    appendStringInfo(&err_detail, "The slot conflicted with xid horizon %u.", *xid);
> +                else
> +                    appendStringInfo(&err_detail, "Logical decoding on standby requires wal_level to be at least logical on the primary server");
> +            }
> +            else

These are nearly the same messages as above. This is too much code to duplicate
between terminating and invalidating. Put this into a helper or such.


> @@ -3099,6 +3102,31 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
>                  /* Intentional fall through to session cancel */
>                  /* FALLTHROUGH */
>  
> +            case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:

The case: above is explicitly falling through. This makes no sense here as far
as I can tell. I thought you did change this in response to my last comment
about it?

> index 8872c80cdf..013cd2b4d0 100644
> --- a/src/include/replication/slot.h
> +++ b/src/include/replication/slot.h
> @@ -17,6 +17,17 @@
>  #include "storage/spin.h"
>  #include "replication/walreceiver.h"
>  
> +#define ObsoleteSlotIsInvalid(s) (!XLogRecPtrIsInvalid(s->data.invalidated_at) && \
> +                                  XLogRecPtrIsInvalid(s->data.restart_lsn))
> +
> +#define LogicalReplicationSlotIsInvalid(s) (!TransactionIdIsValid(s->data.xmin) && \
> +                                            !TransactionIdIsValid(s->data.catalog_xmin))
> +
> +#define SlotIsInvalid(s) (ObsoleteSlotIsInvalid(s) || LogicalReplicationSlotIsInvalid (s))
> +
> +#define LogicalReplicationSlotXidsConflict(slot_xmin, catalog_xmin, xid) \
> +        ((TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid)) || \
> +        (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid)))

Can you make these static inlines instead?
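
For instance, a sketch of what the conversion could look like (mirroring the
quoted macros, not code from any posted patch):

    static inline bool
    ObsoleteSlotIsInvalid(ReplicationSlot *s)
    {
        return !XLogRecPtrIsInvalid(s->data.invalidated_at) &&
            XLogRecPtrIsInvalid(s->data.restart_lsn);
    }

    static inline bool
    LogicalReplicationSlotIsInvalid(ReplicationSlot *s)
    {
        return !TransactionIdIsValid(s->data.xmin) &&
            !TransactionIdIsValid(s->data.catalog_xmin);
    }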




> diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
> index 8fe7bb65f1..8457eec4c4 100644
> --- a/src/backend/replication/logical/decode.c
> +++ b/src/backend/replication/logical/decode.c
> @@ -152,11 +152,31 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
>               * can restart from there.
>               */
>              break;
> +        case XLOG_PARAMETER_CHANGE:
> +        {
> +            xl_parameter_change *xlrec =
> +                (xl_parameter_change *) XLogRecGetData(buf->record);
> +
> +            /*
> +             * If wal_level on primary is reduced to less than logical, then we
> +             * want to prevent existing logical slots from being used.
> +             * Existing logical slots on standby get invalidated when this WAL
> +             * record is replayed; and further, slot creation fails when the
> +             * wal level is not sufficient; but all these operations are not
> +             * synchronized, so a logical slot may creep in while the wal_level
> +             * is being reduced. Hence this extra check.
> +             */
> +            if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
> +                ereport(ERROR,
> +                        (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                         errmsg("logical decoding on standby requires wal_level "
> +                                "to be at least logical on the primary server")));

Please don't break error messages into multiple lines, makes it harder to grep
for.
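
(That is, keep the format string on a single line so it stays greppable, even
if it exceeds the usual line-length limit:)

    errmsg("logical decoding on standby requires wal_level to be at least logical on the primary server")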

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Jeff Davis
Date:
On Fri, 2023-03-31 at 02:44 -0700, Jeff Davis wrote:
> Thank you, done. I think the nearby line was also wrong, returning
> true
> when there was no timeout. I combined the lines and got rid of the
> early return so it can check the list and timeout condition like
> normal. Attached.

On second (third?) thought, I think I was right the first time. It
passes the flag WL_EXIT_ON_PM_DEATH (included in the
ConditionVariableWaitSet), so a WL_POSTMASTER_DEATH event should not be
returned.

Also, I think the early return is correct. The current code in
ConditionVariableTimedSleep() still checks the wait list even if
WaitLatch() returns WL_TIMEOUT (it ignores the return), but I don't see
why it can't early return true. For a socket event in
ConditionVariableEventSleep() I think it should early return false.

Regards,
    Jeff Davis




Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-03-31 02:44:33 -0700, Jeff Davis wrote:
> From 2f05cab9012950ae9290fccbb6366d50fc01553e Mon Sep 17 00:00:00 2001
> From: Jeff Davis <jeff@j-davis.com>
> Date: Wed, 1 Mar 2023 20:02:42 -0800
> Subject: [PATCH v2] Introduce ConditionVariableEventSleep().
> 
> The new API takes a WaitEventSet which can include socket events. The
> WaitEventSet must have been created by
> ConditionVariableWaitSetCreate(), another new function, so that it
> includes the wait events necessary for a condition variable.

Why not offer a function to add a CV to a WES? It seems somehow odd to require
going through condition_variable.c to create a WES.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Jeff Davis
Date:
On Sun, 2023-04-02 at 14:35 -0700, Andres Freund wrote:
> Why not offer a function to add a CV to a WES? It seems somehow odd
> to require
> going through condition_variable.c to create a WES.

I agree that it's a bit odd, but remember that after waiting on a CV's
latch, it needs to re-insert itself into the CV's wait list.

A WaitEventSetWait() can't do that, unless we move the details of re-
adding to the wait list into latch.c. I considered that, but latch.c
already implements the APIs for WaitEventSet and Latch, so it felt
complex to also make it responsible for ConditionVariable.

I'm open to suggestion.

Regards,
    Jeff Davis




Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-02 15:15:44 -0700, Jeff Davis wrote:
> On Sun, 2023-04-02 at 14:35 -0700, Andres Freund wrote:
> > Why not offer a function to add a CV to a WES? It seems somehow odd
> > to require
> > going through condition_variable.c to create a WES.
> 
> I agree that it's a bit odd, but remember that after waiting on a CV's
> latch, it needs to re-insert itself into the CV's wait list.
> 
> A WaitEventSetWait() can't do that, unless we move the details of re-
> adding to the wait list into latch.c. I considered that, but latch.c
> already implements the APIs for WaitEventSet and Latch, so it felt
> complex to also make it responsible for ConditionVariable.

I agree that the *wait* has to go through condition_variable.c, but it doesn't
seem right that creation of the WES needs to go through condition_variable.c.

The only thing that ConditionVariableEventSleep() seems to require is that the
WES is waiting for MyLatch. You don't even need a separate WES for that, the
already existing WES should suffice.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Jeff Davis
Date:
On Fri, 2023-03-31 at 02:50 -0700, Jeff Davis wrote:
> But if the ConditionVariableEventSleep() API is added, then I think
> we
> should change the non-recovery case to use a CV as well for
> consistency, and it would avoid the need for WalSndWakeup().

It seems like what we ultimately want is for WalSndWakeup() to
selectively wake up physical and/or logical walsenders depending on the
caller. For instance:

   WalSndWakeup(bool physical, bool logical)

The callers:

  * On promotion, StartupXLog would call:
    - WalSndWakeup(true, true)
  * XLogFlush/XLogBackgroundFlush/XLogWalRcvFlush would call:
    - WalSndWakeup(true, !RecoveryInProgress())
  * ApplyWalRecord would call:
    - WalSndWakeup(switchedTLI, switchedTLI || RecoveryInProgress())

There seem to be two approaches to making that work:

1. Use two ConditionVariables, and WalSndWakeup would broadcast to one
or both depending on its arguments.

2. Have a "replication_kind" variable in WalSnd (either set based on
MyDatabaseId==InvalidOid, or set at START_REPLICATION time) to indicate
whether it's a physical or logical walsender. WalSndWakeup would wake
up the right walsenders based on its arguments.

#2 seems simpler at least for now. Would that work?
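
A rough sketch of what option #2 could look like (the kind field and its
reads are assumptions for illustration, not code from a posted patch; the
rest follows the shape of the existing WalSndWakeup() loop):

    void
    WalSndWakeup(bool physical, bool logical)
    {
        for (int i = 0; i < max_wal_senders; i++)
        {
            WalSnd     *walsnd = &WalSndCtl->walsnds[i];
            Latch      *latch;
            ReplicationKind kind;

            /* read the fields under the spinlock, set the latch outside it */
            SpinLockAcquire(&walsnd->mutex);
            latch = walsnd->latch;
            kind = walsnd->kind;    /* hypothetical new field */
            SpinLockRelease(&walsnd->mutex);

            if (latch == NULL)
                continue;

            if ((physical && kind == REPLICATION_KIND_PHYSICAL) ||
                (logical && kind == REPLICATION_KIND_LOGICAL))
                SetLatch(latch);
        }
    }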

Regards,
    Jeff Davis




Re: Minimal logical decoding on standbys

From
Jeff Davis
Date:
On Sun, 2023-04-02 at 15:29 -0700, Andres Freund wrote:
> I agree that the *wait* has to go through condition_variable.c, but
> it doesn't
> seem right that creation of the WES needs to go through
> condition_variable.c.

The kind of WES required by a CV is an implementation detail, so I was
concerned about making too many assumptions across different APIs.

But what I ended up with is arguably not better, so perhaps I should do
it your way and then just have some comments about what assumptions are
being made?

> The only thing that ConditionVariableEventSleep() seems to require is
> that the
> WES is waiting for MyLatch. You don't even need a separate WES for
> that, the
> already existing WES should suffice.

By "already existing" WES, I assume you mean FeBeWaitSet? Yes, that
mostly matches, but it uses WL_POSTMASTER_DEATH instead of
WL_EXIT_ON_PM_DEATH, so I'd need to handle PM death in
condition_variable.c. That's trivial to do, though.

Regards,
    Jeff Davis




Re: Minimal logical decoding on standbys

From
Alvaro Herrera
Date:
> From 56a9559555918a99c202a0924f7b2ede9de4e75d Mon Sep 17 00:00:00 2001
> From: bdrouvotAWS <bdrouvot@amazon.com>
> Date: Tue, 7 Feb 2023 08:59:47 +0000
> Subject: [PATCH v52 3/6] Allow logical decoding on standby.
> 
> Allow a logical slot to be created on standby. Restrict its usage
> or its creation if wal_level on primary is less than logical.
> During slot creation, its restart_lsn is set to the last replayed
> LSN. Effectively, a logical slot creation on standby waits for an
> xl_running_xact record to arrive from primary.

Hmm, not sure if it really applies here, but this sounds similar to
issues with track_commit_timestamps: namely, if the primary has it
enabled and you start a standby with it enabled, that's fine; but if the
primary is later shut down (but the standby isn't) and then the primary
restarted with a lesser value, then the standby would misbehave without
any obvious errors.  If that is a real problem, then perhaps you can
solve it by copying some of the logic from track_commit_timestamps,
which took a large number of iterations to get right.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No hay ausente sin culpa ni presente sin disculpa" (Prov. francés)



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Mon, Apr 3, 2023 at 1:31 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Sun, 2023-04-02 at 10:11 +0200, Drouvot, Bertrand wrote:
> > I was thinking that, if a new LogicalWalSndWakeup() replaces
> > "ConditionVariableBroadcast(&XLogRecoveryCtl->replayedCV);"
> > in ApplyWalRecord() then, it could be possible that some walsender(s)
> > are requested to wake up while they are actually doing decoding (but
> > I might be wrong).
>
> I don't think that's a problem, right?
>

Agreed, I also don't see a problem because of the reason you mentioned
below that if the latch is already set, we won't do anything in
SetLatch.

> We are concerned about wakeups when they happen repeatedly when there's
> no work to do, or when the wakeup doesn't happen when it should (and we
> need to wait for a timeout).
>
> > >
> > > Currently, we wake up walsenders only after writing some WAL
> > > records
> > > at the time of flush, so won't it be better to wake up only after
> > > applying some WAL records rather than after applying each record?
> >
> > Yeah that would be better.
>
> Why? If the walsender is asleep, and there's work to be done, why not
> wake it up?
>

I think we can wake it up when there is work to be done even if the
work unit is smaller. The reason why I mentioned waking up the
walsender only after processing some records is to avoid the situation
where it may not need to wait again after decoding very few records.
But probably the logic in WalSndWaitForWal() will help us to exit
before starting to wait by checking the replay location.
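
(Paraphrasing the relevant bit of WalSndWaitForWal() from memory, not a
verbatim quote: before sleeping, the requested location is compared against
the flush pointer, or the replay pointer when in recovery, and the function
returns without waiting if enough WAL is already available:)

    if (!RecoveryInProgress())
        RecentFlushPtr = GetFlushRecPtr(NULL);
    else
        RecentFlushPtr = GetXLogReplayRecPtr(NULL);

    if (loc <= RecentFlushPtr)
        return RecentFlushPtr;      /* no need to wait */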

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Mon, Apr 3, 2023 at 4:26 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Fri, 2023-03-31 at 02:50 -0700, Jeff Davis wrote:
> > But if the ConditionVariableEventSleep() API is added, then I think
> > we
> > should change the non-recovery case to use a CV as well for
> > consistency, and it would avoid the need for WalSndWakeup().
>
> It seems like what we ultimately want is for WalSndWakeup() to
> selectively wake up physical and/or logical walsenders depending on the
> caller. For instance:
>
>    WalSndWakeup(bool physical, bool logical)
>
> The callers:
>
>   * On promotion, StartupXLog would call:
>     - WalSndWakeup(true, true)
>   * XLogFlush/XLogBackgroundFlush/XLogWalRcvFlush would call:
>     - WalSndWakeup(true, !RecoveryInProgress())
>   * ApplyWalRecord would call:
>     - WalSndWakeup(switchedTLI, switchedTLI || RecoveryInProgress())
>
> There seem to be two approaches to making that work:
>
> 1. Use two ConditionVariables, and WalSndWakeup would broadcast to one
> or both depending on its arguments.
>
> 2. Have a "replication_kind" variable in WalSnd (either set based on
> MyDatabaseId==InvalidOid, or set at START_REPLICATION time) to indicate
> whether it's a physical or logical walsender. WalSndWakeup would wake
> up the right walsenders based on its arguments.
>
> #2 seems simpler at least for now. Would that work?
>

Agreed, even Bertrand and myself discussed the same approach few
emails above. BTW, if we have this selective logic to wake
physical/logical walsenders and for standby's, we only wake logical
walsenders at the time of  ApplyWalRecord() then do we need the new
conditional variable enhancement being discussed, and if so, why?

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/3/23 7:20 AM, Amit Kapila wrote:
> On Mon, Apr 3, 2023 at 1:31 AM Jeff Davis <pgsql@j-davis.com> wrote:
>>
>> On Sun, 2023-04-02 at 10:11 +0200, Drouvot, Bertrand wrote:
>>> I was thinking that, if a new LogicalWalSndWakeup() replaces
>>> "ConditionVariableBroadcast(&XLogRecoveryCtl->replayedCV);"
>>> in ApplyWalRecord() then, it could be possible that some walsender(s)
>>> are requested to wake up while they are actually doing decoding (but
>>> I might be wrong).
>>
>> I don't think that's a problem, right?
>>
> 
> Agreed, I also don't see a problem because of the reason you mentioned
> below that if the latch is already set, we won't do anything in
> SetLatch.

Thanks for the feedback, I do agree too after Jeff's and your explanation.

> 
>> We are concerned about wakeups when they happen repeatedly when there's
>> no work to do, or when the wakeup doesn't happen when it should (and we
>> need to wait for a timeout).
>>
>>>>
>>>> Currently, we wake up walsenders only after writing some WAL
>>>> records
>>>> at the time of flush, so won't it be better to wake up only after
>>>> applying some WAL records rather than after applying each record?
>>>
>>> Yeah that would be better.
>>
>> Why? If the walsender is asleep, and there's work to be done, why not
>> wake it up?
>>
> 
> I think we can wake it up when there is work to be done even if the
> work unit is smaller. The reason why I mentioned waking up the
> walsender only after processing some records is to avoid the situation
> where it may not need to wait again after decoding very few records.
> But probably the logic in WalSndWaitForWal() will help us to exit
> before starting to wait by checking the replay location.
> 

Okay, I'll re-write the sub-patch related to the startup/walsender corner
case with this new approach.

Regards,


-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/3/23 7:35 AM, Amit Kapila wrote:
> On Mon, Apr 3, 2023 at 4:26 AM Jeff Davis <pgsql@j-davis.com> wrote:
>>
>> On Fri, 2023-03-31 at 02:50 -0700, Jeff Davis wrote:
>>> But if the ConditionVariableEventSleep() API is added, then I think
>>> we
>>> should change the non-recovery case to use a CV as well for
>>> consistency, and it would avoid the need for WalSndWakeup().
>>
>> It seems like what we ultimately want is for WalSndWakeup() to
>> selectively wake up physical and/or logical walsenders depending on the
>> caller. For instance:
>>
>>     WalSndWakeup(bool physical, bool logical)
>>
>> The callers:
>>
>>    * On promotion, StartupXLog would call:
>>      - WalSndWakeup(true, true)
>>    * XLogFlush/XLogBackgroundFlush/XLogWalRcvFlush would call:
>>      - WalSndWakeup(true, !RecoveryInProgress())
>>    * ApplyWalRecord would call:
>>      - WalSndWakeup(switchedTLI, switchedTLI || RecoveryInProgress())
>>
>> There seem to be two approaches to making that work:
>>
>> 1. Use two ConditionVariables, and WalSndWakeup would broadcast to one
>> or both depending on its arguments.
>>
>> 2. Have a "replication_kind" variable in WalSnd (either set based on
>> MyDatabaseId==InvalidOid, or set at START_REPLICATION time) to indicate
>> whether it's a physical or logical walsender. WalSndWakeup would wake
>> up the right walsenders based on its arguments.
>>
>> #2 seems simpler at least for now. Would that work?
>>
> 
> Agreed, even Bertrand and myself discussed the same approach few
> emails above. BTW, if we have this selective logic to wake
> physical/logical walsenders and for standby's, we only wake logical
> walsenders at the time of  ApplyWalRecord() then do we need the new
> conditional variable enhancement being discussed, and if so, why?
> 

Thank you both for this new idea and discussion. In that case I don't think
we need the new CV API and the use of a CV anymore. As just said up-thread I'll submit
a new proposal with this new approach.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Mon, Apr 3, 2023 at 4:39 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> > From 56a9559555918a99c202a0924f7b2ede9de4e75d Mon Sep 17 00:00:00 2001
> > From: bdrouvotAWS <bdrouvot@amazon.com>
> > Date: Tue, 7 Feb 2023 08:59:47 +0000
> > Subject: [PATCH v52 3/6] Allow logical decoding on standby.
> >
> > Allow a logical slot to be created on standby. Restrict its usage
> > or its creation if wal_level on primary is less than logical.
> > During slot creation, its restart_lsn is set to the last replayed
> > LSN. Effectively, a logical slot creation on standby waits for an
> > xl_running_xact record to arrive from primary.
>
> Hmm, not sure if it really applies here, but this sounds similar to
> issues with track_commit_timestamps: namely, if the primary has it
> enabled and you start a standby with it enabled, that's fine; but if the
> primary is later shut down (but the standby isn't) and then the primary
> restarted with a lesser value, then the standby would misbehave without
> any obvious errors.
>

IIUC, the patch deals it by invalidating logical slots while replaying
the XLOG_PARAMETER_CHANGE record on standby. Then later during
decoding, if it encounters XLOG_PARAMETER_CHANGE, and wal_level from
primary has been reduced, it will return an error. There is a race
condition here as explained in the patch as follows:

+ /*
+ * If wal_level on primary is reduced to less than logical, then we
+ * want to prevent existing logical slots from being used.
+ * Existing logical slots on standby get invalidated when this WAL
+ * record is replayed; and further, slot creation fails when the
+ * wal level is not sufficient; but all these operations are not
+ * synchronized, so a logical slot may creep in while the wal_level
+ * is being reduced. Hence this extra check.
+ */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires "
+ "wal_level >= logical on master")));

Now, during this race condition, suppose not only does a logical slot
creep in but someone also tries to decode WAL with it; then some
misbehavior is expected. I have not tried this, so I am not sure whether
this is really a problem, but are you worried about something along those
lines?

>  If that is a real problem, then perhaps you can
> solve it by copying some of the logic from track_commit_timestamps,
> which took a large number of iterations to get right.
>

IIUC, track_commit_timestamps deactivates the CommitTs module (by
using state in the shared memory) when replaying the
XLOG_PARAMETER_CHANGE record. Then later using that state it gives an
error from the appropriate place in the CommitTs module. If my
understanding is correct then that appears to be a better design than
what the patch is currently doing. Also, the error message used in
error_commit_ts_disabled() seems to be better than the current one.

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/2/23 10:10 PM, Andres Freund wrote:
> Hi,
> 
> Btw, most of the patches have some things that pgindent will change (and some
> that my editor will highlight). It wouldn't hurt to run pgindent for the later
> patches...

done.

> 
> Pushed the WAL format change.
> 

Thanks!

> 
> On 2023-04-02 10:27:45 +0200, Drouvot, Bertrand wrote:
>>     5.3% doc/src/sgml/
>>     6.2% src/backend/access/transam/
>>     4.6% src/backend/replication/logical/
>>    55.6% src/backend/replication/
>>     4.4% src/backend/storage/ipc/
>>     6.9% src/backend/tcop/
>>     5.3% src/backend/
>>     3.8% src/include/catalog/
>>     5.3% src/include/replication/
> 
> I think it might be worth trying to split this up a bit.
> 

Okay. Split in 2 parts in V56 enclosed.

One part to handle logical slot conflicts on standby, and one part
to arrange for the new pg_stat_database_conflicts and pg_replication_slots fields.

> 
>>           restart_lsn = s->data.restart_lsn;
>> -
>> -        /*
>> -         * If the slot is already invalid or is fresh enough, we don't need to
>> -         * do anything.
>> -         */
>> -        if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
>> +        slot_xmin = s->data.xmin;
>> +        slot_catalog_xmin = s->data.catalog_xmin;
>> +
>> +        /* the slot has been invalidated (logical decoding conflict case) */
>> +        if ((xid && ((LogicalReplicationSlotIsInvalid(s)) ||
>> +        /* or the xid is valid and this is a non conflicting slot */
>> +                     (TransactionIdIsValid(*xid) && !(LogicalReplicationSlotXidsConflict(slot_xmin, slot_catalog_xmin, *xid))))) ||
>> +        /* or the slot has been invalidated (obsolete LSN case) */
>> +            (!xid && (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)))
>>           {
> 
> This still looks nearly unreadable. I suggest moving comments outside of the
> if (), remove redundant parentheses, use a function to detect if the slot has
> been invalidated.
> 

I made it as simple as:

         /*
          * If the slot is already invalid or is a non conflicting slot, we don't
          * need to do anything.
          */
         islogical = xid ? true : false;

         if (SlotIsInvalid(s, islogical) || SlotIsNotConflicting(s, islogical, xid, &oldestLSN))

in V56 attached.

> 
>> @@ -1329,16 +1345,45 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>>                */
>>               if (last_signaled_pid != active_pid)
>>               {
>> +                bool        send_signal = false;
>> +
>> +                initStringInfo(&err_msg);
>> +                initStringInfo(&err_detail);
>> +
>> +                appendStringInfo(&err_msg, "terminating process %d to release replication slot \"%s\"",
>> +                                 active_pid,
>> +                                 NameStr(slotname));
> 
> For this to be translatable you need to use _("message").

Thanks!

> 
> 
>> +                if (xid)
>> +                {
>> +                    appendStringInfo(&err_msg, " because it conflicts with recovery");
>> +                    send_signal = true;
>> +
>> +                    if (TransactionIdIsValid(*xid))
>> +                        appendStringInfo(&err_detail, "The slot conflicted with xid horizon %u.", *xid);
>> +                    else
>> +                        appendStringInfo(&err_detail, "Logical decoding on standby requires wal_level to be at least logical on the primary server");
>> +                }
>> +                else
>> +                {
>> +                    appendStringInfo(&err_detail, "The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
>> +                                     LSN_FORMAT_ARGS(restart_lsn),
>> +                                     (unsigned long long) (oldestLSN - restart_lsn));
>> +                }
>> +
>>                   ereport(LOG,
>> -                        errmsg("terminating process %d to release replication slot \"%s\"",
>> -                               active_pid, NameStr(slotname)),
>> -                        errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
>> -                                  LSN_FORMAT_ARGS(restart_lsn),
>> -                                  (unsigned long long) (oldestLSN - restart_lsn)),
>> -                        errhint("You might need to increase max_slot_wal_keep_size."));
>> -
>> -                (void) kill(active_pid, SIGTERM);
>> +                        errmsg("%s", err_msg.data),
>> +                        errdetail("%s", err_detail.data),
>> +                        send_signal ? 0 : errhint("You might need to increase max_slot_wal_keep_size."));
>> +
>> +                if (send_signal)
>> +                    (void) SendProcSignal(active_pid, PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT, InvalidBackendId);
>> +                else
>> +                    (void) kill(active_pid, SIGTERM);
>> +
>>                   last_signaled_pid = active_pid;
>> +
>> +                pfree(err_msg.data);
>> +                pfree(err_detail.data);
>>               }
>>   
>>               /* Wait until the slot is released. */
>> @@ -1355,6 +1400,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>>           }
>>           else
>>           {
>> +            bool        hint = false;;
>> +
>> +            initStringInfo(&err_msg);
>> +            initStringInfo(&err_detail);
>> +
>>               /*
>>                * We hold the slot now and have already invalidated it; flush it
>>                * to ensure that state persists.
>> @@ -1370,14 +1420,37 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>>               ReplicationSlotMarkDirty();
>>               ReplicationSlotSave();
>>               ReplicationSlotRelease();
>> +            pgstat_drop_replslot(s);
>> +
>> +            appendStringInfo(&err_msg, "invalidating");
>> +
>> +            if (xid)
>> +            {
>> +                if (TransactionIdIsValid(*xid))
>> +                    appendStringInfo(&err_detail, "The slot conflicted with xid horizon %u.", *xid);
>> +                else
>> +                    appendStringInfo(&err_detail, "Logical decoding on standby requires wal_level to be at least logical on the primary server");
>> +            }
>> +            else
> 
> These are nearly the same messags as above. This is too much code to duplicate
> between terminating and invalidating. Put this into a helper or such.
> 

ReportTerminationInvalidation() added in V56 for this purpose.

> 
>> @@ -3099,6 +3102,31 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
>>                   /* Intentional fall through to session cancel */
>>                   /* FALLTHROUGH */
>>   
>> +            case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
> 
> The case: above is explicitl falling through. This makes no sense here as far
> as I can tell.

There is an "if (reason == PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT)" check in the
PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT case, so that seems OK to me.

Or are you saying that you'd prefer to see the PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT case somewhere else?
If so, where?

> I thought you did change this in response to my last comment
> about it?
> 

Yes.


>> index 8872c80cdf..013cd2b4d0 100644
>> --- a/src/include/replication/slot.h
>> +++ b/src/include/replication/slot.h
>> @@ -17,6 +17,17 @@
>>   #include "storage/spin.h"
>>   #include "replication/walreceiver.h"
>>   
>> +#define ObsoleteSlotIsInvalid(s) (!XLogRecPtrIsInvalid(s->data.invalidated_at) && \
>> +                                  XLogRecPtrIsInvalid(s->data.restart_lsn))
>> +
>> +#define LogicalReplicationSlotIsInvalid(s) (!TransactionIdIsValid(s->data.xmin) && \
>> +                                            !TransactionIdIsValid(s->data.catalog_xmin))
>> +
>> +#define SlotIsInvalid(s) (ObsoleteSlotIsInvalid(s) || LogicalReplicationSlotIsInvalid (s))
>> +
>> +#define LogicalReplicationSlotXidsConflict(slot_xmin, catalog_xmin, xid) \
>> +        ((TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid)) || \
>> +        (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid)))
> 
> Can you make these static inlines instead?
> 
> 

Done.

> 
> 
>> diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
>> index 8fe7bb65f1..8457eec4c4 100644
>> --- a/src/backend/replication/logical/decode.c
>> +++ b/src/backend/replication/logical/decode.c
>> @@ -152,11 +152,31 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
>>                * can restart from there.
>>                */
>>               break;
>> +        case XLOG_PARAMETER_CHANGE:
>> +        {
>> +            xl_parameter_change *xlrec =
>> +                (xl_parameter_change *) XLogRecGetData(buf->record);
>> +
>> +            /*
>> +             * If wal_level on primary is reduced to less than logical, then we
>> +             * want to prevent existing logical slots from being used.
>> +             * Existing logical slots on standby get invalidated when this WAL
>> +             * record is replayed; and further, slot creation fails when the
>> +             * wal level is not sufficient; but all these operations are not
>> +             * synchronized, so a logical slot may creep in while the wal_level
>> +             * is being reduced. Hence this extra check.
>> +             */
>> +            if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
>> +                ereport(ERROR,
>> +                        (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> +                         errmsg("logical decoding on standby requires wal_level "
>> +                                "to be at least logical on the primary server")));
> 
> Please don't break error messages into multiple lines, makes it harder to grep
> for.
>
done.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Alvaro Herrera
Date:
On 2023-Apr-03, Drouvot, Bertrand wrote:

> +/*
> + * Report terminating or conflicting message.
> + *
> + * For both, logical conflict on standby and obsolete slot are handled.
> + */
> +static void
> +ReportTerminationInvalidation(bool terminating, bool islogical, int pid,
> +                              NameData slotname, TransactionId *xid,
> +                              XLogRecPtr restart_lsn, XLogRecPtr oldestLSN)
> +{

> +    if (terminating)
> +        appendStringInfo(&err_msg, _("terminating process %d to release replication slot \"%s\""),
> +                         pid,
> +                         NameStr(slotname));
> +    else
> +        appendStringInfo(&err_msg, _("invalidating"));
> +
> +    if (islogical)
> +    {
> +        if (terminating)
> +            appendStringInfo(&err_msg, _(" because it conflicts with recovery"));

You can't build the strings this way, because it's not possible to put
the strings into the translation machinery.  You need to write full
strings for each separate case instead, without appending other string
parts later.

Thanks

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Hay quien adquiere la mala costumbre de ser infeliz" (M. A. Evans)



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/3/23 8:10 AM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 4/3/23 7:35 AM, Amit Kapila wrote:
>> On Mon, Apr 3, 2023 at 4:26 AM Jeff Davis <pgsql@j-davis.com> wrote:
>>
>> Agreed, even Bertrand and myself discussed the same approach few
>> emails above. BTW, if we have this selective logic to wake
>> physical/logical walsenders and for standby's, we only wake logical
>> walsenders at the time of  ApplyWalRecord() then do we need the new
>> conditional variable enhancement being discussed, and if so, why?
>>
> 
> Thank you both for this new idea and discussion. In that case I don't think
> we need the new CV API and the use of a CV anymore. As just said up-thread I'll submit
> a new proposal with this new approach.
> 

Please find enclosed V57 implementing the new approach in 0004. With the new approach in place
the TAP tests (0005) work like a charm (no delay and even after a promotion).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-03 17:34:52 +0200, Alvaro Herrera wrote:
> On 2023-Apr-03, Drouvot, Bertrand wrote:
> 
> > +/*
> > + * Report terminating or conflicting message.
> > + *
> > + * For both, logical conflict on standby and obsolete slot are handled.
> > + */
> > +static void
> > +ReportTerminationInvalidation(bool terminating, bool islogical, int pid,
> > +                              NameData slotname, TransactionId *xid,
> > +                              XLogRecPtr restart_lsn, XLogRecPtr oldestLSN)
> > +{
> 
> > +    if (terminating)
> > +        appendStringInfo(&err_msg, _("terminating process %d to release replication slot \"%s\""),
> > +                         pid,
> > +                         NameStr(slotname));
> > +    else
> > +        appendStringInfo(&err_msg, _("invalidating"));
> > +
> > +    if (islogical)
> > +    {
> > +        if (terminating)
> > +            appendStringInfo(&err_msg, _(" because it conflicts with recovery"));
> 
> You can't build the strings this way, because it's not possible to put
> the strings into the translation machinery.  You need to write full
> strings for each separate case instead, without appending other string
> parts later.

Hm? That's what the _'s do. We build strings in parts in other places too.

You do need to use errmsg_internal() later, to prevent that format string from
being translated as well.

I'm not saying that this is exactly the right way, don't get me wrong.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Masahiko Sawada
Date:
On Tue, Apr 4, 2023 at 3:17 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On 4/3/23 8:10 AM, Drouvot, Bertrand wrote:
> > Hi,
> >
> > On 4/3/23 7:35 AM, Amit Kapila wrote:
> >> On Mon, Apr 3, 2023 at 4:26 AM Jeff Davis <pgsql@j-davis.com> wrote:
> >>
> >> Agreed, even Bertrand and myself discussed the same approach few
> >> emails above. BTW, if we have this selective logic to wake
> >> physical/logical walsenders and for standby's, we only wake logical
> >> walsenders at the time of  ApplyWalRecord() then do we need the new
> >> conditional variable enhancement being discussed, and if so, why?
> >>
> >
> > Thank you both for this new idea and discussion. In that case I don't think
> > we need the new CV API and the use of a CV anymore. As just said up-thread I'll submit
> > a new proposal with this new approach.
> >
>
> Please find enclosed V57 implementing the new approach in 0004.

Regarding 0004 patch:

@@ -2626,6 +2626,12 @@ InitWalSenderSlot(void)
                        walsnd->sync_standby_priority = 0;
                        walsnd->latch = &MyProc->procLatch;
                        walsnd->replyTime = 0;
+
+                       if (MyDatabaseId == InvalidOid)
+                               walsnd->kind = REPLICATION_KIND_PHYSICAL;
+                       else
+                               walsnd->kind = REPLICATION_KIND_LOGICAL;
+

I think we might want to set the replication kind when processing the
START_REPLICATION command. The walsender using a logical replication
slot is not necessarily streaming (e.g. when COPYing table data).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Mon, Apr 3, 2023 at 8:51 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 4/2/23 10:10 PM, Andres Freund wrote:
> > Hi,
> >>              restart_lsn = s->data.restart_lsn;
> >> -
> >> -            /*
> >> -             * If the slot is already invalid or is fresh enough, we don't need to
> >> -             * do anything.
> >> -             */
> >> -            if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
> >> +            slot_xmin = s->data.xmin;
> >> +            slot_catalog_xmin = s->data.catalog_xmin;
> >> +
> >> +            /* the slot has been invalidated (logical decoding conflict case) */
> >> +            if ((xid && ((LogicalReplicationSlotIsInvalid(s)) ||
> >> +            /* or the xid is valid and this is a non conflicting slot */
> >> +                                     (TransactionIdIsValid(*xid) && !(LogicalReplicationSlotXidsConflict(slot_xmin, slot_catalog_xmin, *xid))))) ||
> >> +            /* or the slot has been invalidated (obsolete LSN case) */
> >> +                    (!xid && (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)))
> >>              {
> >
> > This still looks nearly unreadable. I suggest moving comments outside of the
> > if (), remove redundant parentheses, use a function to detect if the slot has
> > been invalidated.
> >
>
> I made it as simple as:
>
>          /*
>           * If the slot is already invalid or is a non conflicting slot, we don't
>           * need to do anything.
>           */
>          islogical = xid ? true : false;
>
>          if (SlotIsInvalid(s, islogical) || SlotIsNotConflicting(s, islogical, xid, &oldestLSN))
>
> in V56 attached.
>

Here the variable 'islogical' doesn't seem to convey its actual
meaning because one can imagine that it indicates whether the slot is
logical which I don't think is the actual intent. One idea to simplify
this is to introduce a single function CanInvalidateSlot() or
something like that and move the logic from both the functions
SlotIsInvalid() and SlotIsNotConflicting() into the new function.
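
Something along these lines, perhaps (a hypothetical sketch of that
suggestion, not code from any posted patch; xid == NULL stands for the
obsolete-restart_lsn case):

    static bool
    CanInvalidateSlot(ReplicationSlot *s, TransactionId *xid, XLogRecPtr oldestLSN)
    {
        if (xid != NULL)
        {
            /* logical-conflict case: already-invalidated slots are skipped */
            if (!TransactionIdIsValid(s->data.catalog_xmin))
                return false;

            /* invalid *xid means wal_level dropped below logical: all conflict */
            if (!TransactionIdIsValid(*xid))
                return true;

            return TransactionIdPrecedesOrEquals(s->data.catalog_xmin, *xid);
        }

        /* obsolete-LSN case */
        return !XLogRecPtrIsInvalid(s->data.restart_lsn) &&
            s->data.restart_lsn < oldestLSN;
    }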

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
Masahiko Sawada
Date:
On Tue, Apr 4, 2023 at 10:55 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Apr 4, 2023 at 3:17 AM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> >
> > Hi,
> >
> > On 4/3/23 8:10 AM, Drouvot, Bertrand wrote:
> > > Hi,
> > >
> > > On 4/3/23 7:35 AM, Amit Kapila wrote:
> > >> On Mon, Apr 3, 2023 at 4:26 AM Jeff Davis <pgsql@j-davis.com> wrote:
> > >>
> > >> Agreed, even Bertrand and myself discussed the same approach few
> > >> emails above. BTW, if we have this selective logic to wake
> > >> physical/logical walsenders and for standby's, we only wake logical
> > >> walsenders at the time of  ApplyWalRecord() then do we need the new
> > >> conditional variable enhancement being discussed, and if so, why?
> > >>
> > >
> > > Thank you both for this new idea and discussion. In that case I don't think
> > > we need the new CV API and the use of a CV anymore. As just said up-thread I'll submit
> > > a new proposal with this new approach.
> > >
> >
> > Please find enclosed V57 implementing the new approach in 0004.
>
> Regarding 0004 patch:
>
> @@ -2626,6 +2626,12 @@ InitWalSenderSlot(void)
>                         walsnd->sync_standby_priority = 0;
>                         walsnd->latch = &MyProc->procLatch;
>                         walsnd->replyTime = 0;
> +
> +                       if (MyDatabaseId == InvalidOid)
> +                               walsnd->kind = REPLICATION_KIND_PHYSICAL;
> +                       else
> +                               walsnd->kind = REPLICATION_KIND_LOGICAL;
> +
>
> I think we might want to set the replication kind when processing the
> START_REPLICATION command. The walsender using a logical replication
> slot is not necessarily streaming (e.g. when COPYing table data).
>

After discussing with Bertrand off-list, that turns out to be wrong: the
logical replication slot creation also needs to read WAL records, so a
walsender that is creating a logical replication slot needs to be woken
up as well. We could set the replication kind when processing
START_REPLICATION and CREATE_REPLICATION_SLOT, but it seems better to
set it in one place. So I agree with setting it in InitWalSenderSlot().

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/4/23 7:57 AM, Amit Kapila wrote:
> On Mon, Apr 3, 2023 at 8:51 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> I made it as simple as:
>>
>>           /*
>>            * If the slot is already invalid or is a non conflicting slot, we don't
>>            * need to do anything.
>>            */
>>           islogical = xid ? true : false;
>>
>>           if (SlotIsInvalid(s, islogical) || SlotIsNotConflicting(s, islogical, xid, &oldestLSN))
>>
>> in V56 attached.
>>
> 
> Here the variable 'islogical' doesn't seem to convey its actual
> meaning because one can imagine that it indicates whether the slot is
> logical which I don't think is the actual intent.

Good point. Just renamed it to 'check_on_xid' (as it is still needed outside of
the "CanInvalidateSlot" context) in V58 attached.

> One idea to simplify
> this is to introduce a single function CanInvalidateSlot() or
> something like that and move the logic from both the functions
> SlotIsInvalid() and SlotIsNotConflicting() into the new function.
> 

Oh right, even better, thanks!
Done in V58 and now this is as simple as:

+               if (DoNotInvalidateSlot(s, xid, &oldestLSN))
                 {
                         /* then, we are not forcing for invalidation */


Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/4/23 9:48 AM, Masahiko Sawada wrote:
> On Tue, Apr 4, 2023 at 10:55 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>>
>> Regarding 0004 patch:
>>
>> @@ -2626,6 +2626,12 @@ InitWalSenderSlot(void)
>>                          walsnd->sync_standby_priority = 0;
>>                          walsnd->latch = &MyProc->procLatch;
>>                          walsnd->replyTime = 0;
>> +
>> +                       if (MyDatabaseId == InvalidOid)
>> +                               walsnd->kind = REPLICATION_KIND_PHYSICAL;
>> +                       else
>> +                               walsnd->kind = REPLICATION_KIND_LOGICAL;
>> +
>>
>> I think we might want to set the replication kind when processing the
>> START_REPLICATION command. The walsender using a logical replication
>> slot is not necessarily streaming (e.g. when COPYing table data).
>>
> 
> Discussing with Bertrand off-list, it's wrong as the logical
> replication slot creation also needs to read WAL records so a
> walsender who is creating a logical replication slot needs to be woken
> up. We can set it the replication kind when processing
> START_REPLICATION and CREATE_REPLICATION_SLOT, but it seems better to
> set it in one place. So I agree to set it in InitWalSenderSlot().
> 

Thanks for the review and feedback!
Added a comment in 0004 in V58 just posted up-thread to explain the reason
why the walsnd->kind assignment is done in InitWalSenderSlot().

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Alvaro Herrera
Date:
Hi,

On 2023-Apr-03, Andres Freund wrote:

> Hm? That's what the _'s do. We build strings in parts in other places too.

No, what _() does is mark each piece for translation separately.  But a
translation cannot be done on string pieces, and later have all the
pieces appended together to form a full sentence.  Let me show the
"!terminating" case as example and grab some translations for it from
src/backend/po/de.po:

"invalidating" -> "... wird ungültig gemacht" (?)

(if logical) " obsolete replication" -> " obsolete Replikation"

" slot \"%s\" because it conflicts with recovery" -> " Slot \"%s\", weil sie in Konflikt mit Wiederherstellung steht"

If you just concatenate all the translated phrases together, the
resulting string will make no sense; keep in mind the "obsolete
replication" part may or not may not be there.  And there's no way to
make that work: even if you found an ordering of the English parts that
allows you to translate each piece separately and have it make sense for
German, the same won't work for Spanish or Japanese.

You have to give the translator a complete phrase and let them turn it into
a complete translated phrase.  Building from parts doesn't work.  We're
very good at avoiding string building; we have a couple of cases, but
they are *very* minor.

string 1 "invalidating slot \"%s\" because it conflicts with recovery"

string 2 "invalidating obsolete replication slot \"%s\" because it conflicts with recovery"

(I'm not clear on why Bertrand omitted the word "replication" in the
case where the slot is not logical)

I think the errdetail() are okay, it's the errmsg() bits that are bogus.

And yes, well caught on having to use errmsg_internal and
errdetail_internal() to avoid double translation.

Cheers

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Tue, Apr 4, 2023 at 3:14 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>


+static inline bool
+LogicalReplicationSlotXidsConflict(ReplicationSlot *s, TransactionId xid)
+{
+ TransactionId slot_xmin;
+ TransactionId slot_catalog_xmin;
+
+ slot_xmin = s->data.xmin;
+ slot_catalog_xmin = s->data.catalog_xmin;
+
+ return (((TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid)) ||

For logical slots, slot->data.xmin will always be an
InvalidTransactionId. It will only be set/updated for physical slots.
So, it is not clear to me why in this and other related functions, you
are referring to and or invalidating it.
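
(In other words, the conflict test could presumably be reduced to the catalog
xmin alone -- a sketch:)

    static inline bool
    LogicalReplicationSlotXidsConflict(ReplicationSlot *s, TransactionId xid)
    {
        TransactionId slot_catalog_xmin = s->data.catalog_xmin;

        return TransactionIdIsValid(slot_catalog_xmin) &&
            TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid);
    }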

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/4/23 1:43 PM, Amit Kapila wrote:
> On Tue, Apr 4, 2023 at 3:14 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
> 
> 
> +static inline bool
> +LogicalReplicationSlotXidsConflict(ReplicationSlot *s, TransactionId xid)
> +{
> + TransactionId slot_xmin;
> + TransactionId slot_catalog_xmin;
> +
> + slot_xmin = s->data.xmin;
> + slot_catalog_xmin = s->data.catalog_xmin;
> +
> + return (((TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid)) ||
> 
> For logical slots, slot->data.xmin will always be an
> InvalidTransactionId. It will only be set/updated for physical slots.
> So, it is not clear to me why in this and other related functions, you
> are referring to and or invalidating it.
> 

I think you're right that invalidating/checking only on the catalog xmin is
enough for logical slots (I'm not sure how I ended up taking the xmin into account, but
it does indeed seem unnecessary).

I'll submit a new version to deal with the catalog xmin only, thanks!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Tue, Apr 4, 2023 at 6:05 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 4/4/23 1:43 PM, Amit Kapila wrote:
> > On Tue, Apr 4, 2023 at 3:14 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
> >>
> >
> >
> > +static inline bool
> > +LogicalReplicationSlotXidsConflict(ReplicationSlot *s, TransactionId xid)
> > +{
> > + TransactionId slot_xmin;
> > + TransactionId slot_catalog_xmin;
> > +
> > + slot_xmin = s->data.xmin;
> > + slot_catalog_xmin = s->data.catalog_xmin;
> > +
> > + return (((TransactionIdIsValid(slot_xmin) && TransactionIdPrecedesOrEquals(slot_xmin, xid)) ||
> >
> > For logical slots, slot->data.xmin will always be an
> > InvalidTransactionId. It will only be set/updated for physical slots.
> > So, it is not clear to me why in this and other related functions, you
> > are referring to and or invalidating it.
> >
>
> I think you're right that invalidating/checking only on the catalog xmin is
> enough for logical slots (I'm not sure how I ended up taking the xmin into account, but
> it does indeed seem unnecessary).
>

I think we might want to consider slot's effective_xmin instead of
data.xmin as we use that to store xmin_horizon when we build the full
snapshot.
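
(That is, something like this in the conflict check -- a sketch using the
ReplicationSlot.effective_xmin field instead of data.xmin:)

    if (TransactionIdIsValid(s->effective_xmin) &&
        TransactionIdPrecedesOrEquals(s->effective_xmin, xid))
        conflicts = true;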

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/4/23 3:43 PM, Amit Kapila wrote:
> On Tue, Apr 4, 2023 at 6:05 PM Drouvot, Bertrand
> 
> I think we might want to consider slot's effective_xmin instead of
> data.xmin as we use that to store xmin_horizon when we build the full
> snapshot.
> 

Oh, I did not know about the 'effective_xmin' and was going to rely only on the catalog xmin.

Reading the comment in the ReplicationSlot struct about the 'effective_xmin' I do think it makes sense to use it
(instead of data.xmin).

Please find attached v59 doing so.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/4/23 1:21 PM, Alvaro Herrera wrote:
> Hi,
> 
> On 2023-Apr-03, Andres Freund wrote:
> 
>> Hm? That's what the _'s do. We build strings in parts in other places too.
> 
> No, what _() does is mark each piece for translation separately.  But a
> translation cannot be done on string pieces, and later have all the
> pieces appended together to form a full sentence.  Let me show the
> "!terminating" case as example and grab some translations for it from
> src/backend/po/de.po:
> 
> "invalidating" -> "... wird ungültig gemacht" (?)
> 
> (if logical) " obsolete replication" -> " obsolete Replikation"
> 
> " slot \"%s\" because it conflicts with recovery" -> " Slot \"%s\", weil sie in Konflikt mit Wiederherstellung
steht"
> 
> If you just concatenate all the translated phrases together, the
> resulting string will make no sense; keep in mind the "obsolete
> replication" part may or not may not be there.  And there's no way to
> make that work: even if you found an ordering of the English parts that
> allows you to translate each piece separately and have it make sense for
> German, the same won't work for Spanish or Japanese.
> 
> You have to give the translator a complete phrase and let them turn it into
> a complete translated phrase.  Building from parts doesn't work.  We're
> very good at avoiding string building; we have a couple of cases, but
> they are *very* minor.
> 
> string 1 "invalidating slot \"%s\" because it conflicts with recovery"
> 
> string 2 "invalidating obsolete replication slot \"%s\" because it conflicts with recovery"
> 

Thanks for looking at it and the explanations!

> (I'm not clear on why Bertrand omitted the word "replication" in the
> case where the slot is not logical)

It makes more sense to add it, will do, thanks!

> 
> I think the errdetail() are okay, it's the errmsg() bits that are bogus.
  
> And yes, well caught on having to use errmsg_internal and
> errdetail_internal() to avoid double translation.
> 

So, IIUC having something like this would be fine?

"
     if (check_on_xid)
     {
         if (terminating)
             appendStringInfo(&err_msg, _("terminating process %d to release replication slot \"%s\" because it conflicts with recovery"),
                              pid,
                              NameStr(slotname));
         else
             appendStringInfo(&err_msg, _("invalidating replication slot \"%s\" because it conflicts with recovery"),
                              NameStr(slotname));

         if (TransactionIdIsValid(*xid))
             appendStringInfo(&err_detail, _("The slot conflicted with xid horizon %u."), *xid);
         else
             appendStringInfo(&err_detail, _("Logical decoding on standby requires wal_level to be at least logical on the primary server"));
     }
     else
     {
         if (terminating)
             appendStringInfo(&err_msg, _("terminating process %d to release replication slot \"%s\""),
                              pid,
                              NameStr(slotname));
         else
             appendStringInfo(&err_msg, _("invalidating obsolete replication slot \"%s\""),
                              NameStr(slotname));

         appendStringInfo(&err_detail, _("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes."),
                          LSN_FORMAT_ARGS(restart_lsn),
                          (unsigned long long) (oldestLSN - restart_lsn));

         hint = true;
     }

     ereport(LOG,
             errmsg_internal("%s", err_msg.data),
             errdetail_internal("%s", err_detail.data),
             hint ? errhint("You might need to increase max_slot_wal_keep_size.") : 0);
"

as err_msg is not concatenated anymore (I mean it's just one sentence built in one go)
and this makes use of errmsg_internal() and errdetail_internal().

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-04 13:21:38 +0200, Alvaro Herrera wrote:
> On 2023-Apr-03, Andres Freund wrote:
> 
> > Hm? That's what the _'s do. We build strings in parts in other places too.
> 
> No, what _() does is mark each piece for translation separately.  But a
> translation cannot be done on string pieces, and later have all the
> pieces appended together to form a full sentence.  Let me show the
> "!terminating" case as example and grab some translations for it from
> src/backend/po/de.po:
> 
> "invalidating" -> "... wird ungültig gemacht" (?)
> 
> (if logical) " obsolete replication" -> " obsolete Replikation"
> 
> " slot \"%s\" because it conflicts with recovery" -> " Slot \"%s\", weil sie in Konflikt mit Wiederherstellung steht""
> 
> If you just concatenate all the translated phrases together, the
> resulting string will make no sense; keep in mind the "obsolete
> replication" part may or may not be there.  And there's no way to
> make that work: even if you found an ordering of the English parts that
> allows you to translate each piece separately and have it make sense for
> German, the same won't work for Spanish or Japanese.

Ah, I misunderstood the angle you're coming from. Yes, the pieces need to be
reasonable fragments, instead of half-sentences.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-04 18:54:33 +0200, Drouvot, Bertrand wrote:
>     if (check_on_xid)
>     {
>         if (terminating)
>             appendStringInfo(&err_msg, _("terminating process %d to release replication slot \"%s\" because it conflicts with recovery"),
>                              pid,
>                              NameStr(slotname));

FWIW, I would just use exactly the same error message as today here.

                        errmsg("terminating process %d to release replication slot \"%s\"",
                               active_pid, NameStr(slotname)),

This is accurate for both the existing and the new case. Then there's no need
to put that string into a stringinfo either.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Jeff Davis
Date:
On Tue, 2023-04-04 at 11:42 +0200, Drouvot, Bertrand wrote:
> Done in V58 and now this is as simple as:


Minor comments on 0004 (address if you agree):

* Consider static inline for WalSndWakeupProcessRequests()?
* Is the WalSndWakeup() in KeepFileRestoredFromArchive() more like the
flush case? Why is the second argument unconditionally true? I don't
think the cascading logical walsenders have anything to do until the
WAL is actually applied.

Otherwise, looks good!

Regards,
    Jeff Davis




Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Tue, Apr 4, 2023 at 5:44 AM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
> Oh right, even better, thanks!
> Done in V58 and now this is as simple as:
>
> +               if (DoNotInvalidateSlot(s, xid, &oldestLSN))
>                  {
>                          /* then, we are not forcing for invalidation */

Thanks for your continued work on $SUBJECT. I just took a look at
0004, and I think that at the very least the commit message needs
work. Nobody who is not a hacker is going to understand what problem
this is fixing, because it makes reference to the names of functions
and structure members rather than user-visible behavior. In fact, I'm
not really sure that I understand the problem myself. It seems like
the problem is that on a standby, WAL senders will get woken up too
early, before we have any WAL to send. That's presumably OK, in the
sense that they'll go back to sleep and eventually wake up again, but
it means they might end up chronically behind sending out WAL to
cascading standbys. If that's right, I think it should be spelled out
more clearly in the commit message, and maybe also in the code
comments.

But the weird thing is that most (all?) of the patch doesn't seem to
be about that issue at all. Instead, it's about separating wakeups of
physical walsenders from wakeups of logical walsenders. I don't see
how that could ever fix the kind of problem I mentioned in the
preceding paragraph, so my guess is that this is a separate change.
But this change doesn't really seem adequately justified. The commit
message says that it "helps to filter what kind of walsender
we want to wakeup based on the code path" but that's awfully vague
about what the actual benefit is. I wonder whether many people have a
mix of physical and logical systems connecting to the same machine
such that this would even help, and if they do have that, would this
really do enough to solve any performance problem that might be caused
by too many wakeups?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Minimal logical decoding on standbys

From
Jeff Davis
Date:
On Tue, 2023-04-04 at 14:55 -0400, Robert Haas wrote:
> Thanks for your continued work on $SUBJECT. I just took a look at
> 0004, and I think that at the very least the commit message needs
> work. Nobody who is not a hacker is going to understand what problem
> this is fixing, because it makes reference to the names of functions
> and structure members rather than user-visible behavior. In fact, I'm
> not really sure that I understand the problem myself. It seems like
> the problem is that on a standby, WAL senders will get woken up too
> early, before we have any WAL to send.

Logical walsenders on the standby, specifically, which didn't exist
before this patch series.

>  That's presumably OK, in the
> sense that they'll go back to sleep and eventually wake up again, but
> it means they might end up chronically behind sending out WAL to
> cascading standbys.

Without 0004, cascading logical walsenders would have worse wakeup
behavior than logical walsenders on the primary. Assuming the fix is
small in scope and otherwise acceptable, I think it belongs as a part
of this overall series.

> If that's right, I think it should be spelled out
> more clearly in the commit message, and maybe also in the code
> comments.

Perhaps a commit message like:

"For cascading replication, wake up physical walsenders separately from
logical walsenders.

Physical walsenders can't send data until it's been flushed; logical
walsenders can't decode and send data until it's been applied. On the
standby, the WAL is flushed first, which will only wake up physical
walsenders; and then applied, which will only wake up logical
walsenders.

Previously, all walsenders were awakened when the WAL was flushed. That
was fine for logical walsenders on the primary; but on the standby the
flushed WAL would not have been applied yet, so logical walsenders were
awakened too early."

(I'm not sure if I quite got the verb tenses right.)

For comments, I agree that WalSndWakeup() clearly needs a comment
update. The call site in ApplyWalRecord() could also use a comment. You
could add a comment at every call site, but I don't think that's
necessary if there's a good comment over WalSndWakeup().
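
To spell the flush/apply split out in code form, the two wakeup paths could end
up looking roughly like this (a simplified sketch assuming the patch's
two-argument WalSndWakeup(physical, logical); the apply-side call in the patch
also has to account for timeline switches for cascading physical walsenders):

    /* walreceiver, after flushing newly received WAL */
    WalSndWakeup(true, false);      /* physical walsenders have new WAL to send */

    /* startup process, after applying a WAL record */
    WalSndWakeup(false, true);      /* logical walsenders can decode it now */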

Regards,
    Jeff Davis




Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-04 17:33:25 -0700, Jeff Davis wrote:
> On Tue, 2023-04-04 at 14:55 -0400, Robert Haas wrote:
> >  That's presumably OK, in the
> > sense that they'll go back to sleep and eventually wake up again, but
> > it means they might end up chronically behind sending out WAL to
> > cascading standbys.
> 
> Without 0004, cascading logical walsenders would have worse wakeup
> behavior than logical walsenders on the primary. Assuming the fix is
> small in scope and otherwise acceptable, I think it belongs as a part
> of this overall series.

FWIW, personally, I wouldn't feel ok with committing 0003 without 0004. And
IMO they ought to be committed the other way round. The stalls you *can* get,
depending on the speed of WAL apply and OS scheduling, can be long.

This is actually why a predecessor version of the feature had a bunch of
sleeps and retries in the tests, just to avoid those stalls. Obviously that's
not a good path...

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Mon, Apr 3, 2023 at 12:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Apr 3, 2023 at 4:39 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > > From 56a9559555918a99c202a0924f7b2ede9de4e75d Mon Sep 17 00:00:00 2001
> > > From: bdrouvotAWS <bdrouvot@amazon.com>
> > > Date: Tue, 7 Feb 2023 08:59:47 +0000
> > > Subject: [PATCH v52 3/6] Allow logical decoding on standby.
> > >
> > > Allow a logical slot to be created on standby. Restrict its usage
> > > or its creation if wal_level on primary is less than logical.
> > > During slot creation, its restart_lsn is set to the last replayed
> > > LSN. Effectively, a logical slot creation on standby waits for an
> > > xl_running_xact record to arrive from primary.
> >
> > Hmm, not sure if it really applies here, but this sounds similar to
> > issues with track_commit_timestamps: namely, if the primary has it
> > enabled and you start a standby with it enabled, that's fine; but if the
> > primary is later shut down (but the standby isn't) and then the primary
> > restarted with a lesser value, then the standby would misbehave without
> > any obvious errors.
> >
>
> IIUC, the patch deals it by invalidating logical slots while replaying
> the XLOG_PARAMETER_CHANGE record on standby. Then later during
> decoding, if it encounters XLOG_PARAMETER_CHANGE, and wal_level from
> primary has been reduced, it will return an error. There is a race
> condition here as explained in the patch as follows:
>
> + /*
> + * If wal_level on primary is reduced to less than logical, then we
> + * want to prevent existing logical slots from being used.
> + * Existing logical slots on standby get invalidated when this WAL
> + * record is replayed; and further, slot creation fails when the
> + * wal level is not sufficient; but all these operations are not
> + * synchronized, so a logical slot may creep in while the wal_level
> + * is being reduced. Hence this extra check.
> + */
> + if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("logical decoding on standby requires "
> + "wal_level >= logical on master")));
>
> Now, during this race condition, say not only does a logical slot
> creep in but also one tries to decode WAL using the same then some
> misbehavior is expected. I have not tried this so not sure if this is
> really a problem but are you worried about something along those
> lines?
>

On further thinking, as such this shouldn't be a problem because all
the WAL records before PARAMETER_CHANGE record will have sufficient
information so that they can get decoded. However, with the current
approach, the subscriber may not even receive the valid records before
PARAMETER_CHANGE record. This is because the startup process will
terminate the walsenders while invalidating the slots, and after restart
the walsenders will exit because the corresponding slot will be an
invalid slot. So, it is quite possible that a walsender was lagging and
wouldn't have sent records before the PARAMETER_CHANGE record, making
the subscriber never receive those records that it should have received. I
don't know whether this is what one would expect.

One other observation is that once this error has been raised both
standby and subscriber will keep on getting this error in the loop
unless the user manually disables the subscription on the subscriber.

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/4/23 7:53 PM, Andres Freund wrote:
> Hi,
> 
> On 2023-04-04 18:54:33 +0200, Drouvot, Bertrand wrote:
>>      if (check_on_xid)
>>      {
>>          if (terminating)
>>             appendStringInfo(&err_msg, _("terminating process %d to release replication slot \"%s\" because it conflicts with recovery"),
>>                               pid,
>>                               NameStr(slotname));
> 
> FWIW, I would just use exactly the same error message as today here.
> 
>                         errmsg("terminating process %d to release replication slot \"%s\"",
>                                active_pid, NameStr(slotname)),
> 
> This is accurate for both the existing and the new case. Then there's no need
> to put that string into a stringinfo either.
> 

Right, thanks! Did it that way in V60 attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/4/23 8:13 PM, Jeff Davis wrote:
> On Tue, 2023-04-04 at 11:42 +0200, Drouvot, Bertrand wrote:
>> Done in V58 and now this is as simple as:
> 
> 
> Minor comments on 0004 (address if you agree):
> 

Thanks for the review!

> * Consider static inline for WalSndWakeupProcessRequests()?

Agree and done in V60 just shared up-thread.

> * Is the WalSndWakeup() in KeepFileRestoredFromArchive() more like the
> flush case? Why is the second argument unconditionally true? I don't
> think the cascading logical walsenders have anything to do until the
> WAL is actually applied.
> 

Agree and changed it to "(true, false)" in V60.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/5/23 2:33 AM, Jeff Davis wrote:
> On Tue, 2023-04-04 at 14:55 -0400, Robert Haas wrote:
>> Thanks for your continued work on $SUBJECT. I just took a look at
>> 0004,

Thanks Robert for the feedback!

>> and I think that at the very least the commit message needs
>> work.

Agree.

> Perhaps a commit message like:
> 
> "For cascading replication, wake up physical walsenders separately from
> logical walsenders.
> 
> Physical walsenders can't send data until it's been flushed; logical
> walsenders can't decode and send data until it's been applied. On the
> standby, the WAL is flushed first, which will only wake up physical
> walsenders; and then applied, which will only wake up logical
> walsenders.
> 
> Previously, all walsenders were awakened when the WAL was flushed. That
> was fine for logical walsenders on the primary; but on the standby the
> flushed WAL would not have been applied yet, so logical walsenders were
> awakened too early."

Thanks Jeff for the commit message proposal! It looks good to me,
except that I think "flushed WAL could have been not applied yet" is better than
"flushed WAL would not have been applied yet", but it's obviously open to discussion.

I've changed it that way for now and used it in V60 shared up-thread.

> 
> For comments, I agree that WalSndWakeup() clearly needs a comment
> update. The call site in ApplyWalRecord() could also use a comment. You
> could add a comment at every call site, but I don't think that's
> necessary if there's a good comment over WalSndWakeup().

Agree, added a comment over WalSndWakeup() and one before calling WalSndWakeup()
in ApplyWalRecord() in V60.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/5/23 8:59 AM, Amit Kapila wrote:
> On Mon, Apr 3, 2023 at 12:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> On further thinking, as such this shouldn't be a problem because all
> the WAL records before PARAMETER_CHANGE record will have sufficient
> information so that they can get decoded. However, with the current
> approach, the subscriber may not even receive the valid records before
> PARAMETER_CHANGE record. This is because the startup process will
> terminate the walsenders while invalidating the slots, and after restart
> the walsenders will exit because the corresponding slot will be an
> invalid slot. So, it is quite possible that a walsender was lagging and
> wouldn't have sent records before the PARAMETER_CHANGE record, making
> the subscriber never receive those records that it should have received.

Agree that would behave that way.

> I don't know whether this is what one would expect.

If one changes wal_level to < logical on the primary, they should at least
know that:

"
Existing
+     logical slots on standby also get invalidated if wal_level on primary is reduced to
+     less than 'logical'.
"

That is, assuming the doc has been read (the quote above comes from 0006).

I think what is missing is *when* the slots are invalidated.

Maybe we could change the doc to something along those lines instead?

"
Existing logical slots on standby also get invalidated if wal_level on primary is reduced to
less than 'logical'. This is done as soon as the standby detects such a change in the WAL stream.

It means that for walsenders that are lagging (if any), some WAL records up to the parameter change on the
primary won't be decoded".

I don't know whether this is what one would expect but that should be less of a surprise if documented.

What do you think?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Wed, Apr 5, 2023 at 2:41 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 4/5/23 8:59 AM, Amit Kapila wrote:
> > On Mon, Apr 3, 2023 at 12:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > On further thinking, as such this shouldn't be a problem because all
> > the WAL records before PARAMETER_CHANGE record will have sufficient
> > information so that they can get decoded. However, with the current
> > approach, the subscriber may not even receive the valid records before
> > PARAMETER_CHANGE record. This is because the startup process will
> > terminate the walsenders while invalidating the slots, and after restart
> > the walsenders will exit because the corresponding slot will be an
> > invalid slot. So, it is quite possible that a walsender was lagging and
> > wouldn't have sent records before the PARAMETER_CHANGE record, making
> > the subscriber never receive those records that it should have received.
>
> Agree that would behave that way.
>
> > I don't know whether this is what one would expect.
>
> If one change wal_level to < logical on the primary, he should at least
> know that:
>
> "
> Existing
> +     logical slots on standby also get invalidated if wal_level on primary is reduced to
> +     less than 'logical'.
> "
>
> If the doc has been read (as the quote above is coming from 0006).
>
> I think that what is missing is the "when" the slots are invalidated.
>
> Maybe we could change the doc with something among those lines instead?
>
> "
> Existing logical slots on standby also get invalidated if wal_level on primary is reduced to
> less than 'logical'. This is done as soon as the standby detects such a change in the WAL stream.
>
> It means, that for walsenders that are lagging (if any), some WAL records up to the parameter change on the
> primary won't be decoded".
>
> I don't know whether this is what one would expect but that should be less of a surprise if documented.
>
> What do you think?
>

Yeah, I think it is better to document this to avoid any surprises, if
nobody else sees any problem with it. BTW, another thought that
crosses my mind is: let's not invalidate the slots when the
standby startup process processes the parameter_change record, and rather
do it when the walsender decodes the parameter_change record, if we think
that is safe. I am sharing this as it crossed my mind while
thinking about this part of the patch and I wanted to validate my
thoughts; we don't need to change anything even if the idea is valid.

minor nitpick:
+
+ /* Intentional fall through to session cancel */
+ /* FALLTHROUGH */

Do we need to repeat fall through twice in different ways?

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Wed, Apr 5, 2023 at 3:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Apr 5, 2023 at 2:41 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>
> minor nitpick:
> +
> + /* Intentional fall through to session cancel */
> + /* FALLTHROUGH */
>
> Do we need to repeat fall through twice in different ways?
>

Few minor comments on 0003:
========================
1.
+ case XLOG_PARAMETER_CHANGE:
+ {
+ xl_parameter_change *xlrec =
+ (xl_parameter_change *) XLogRecGetData(buf->record);
+
+ /*
+ * If wal_level on primary is reduced to less than logical,
+ * then we want to prevent existing logical slots from being
+ * used. Existing logical slots on standby get invalidated
+ * when this WAL record is replayed; and further, slot
+ * creation fails when the wal level is not sufficient; but
+ * all these operations are not synchronized, so a logical
+ * slot may creep in while the wal_level is being reduced.
+ * Hence this extra check.
+ */
+ if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("logical decoding on standby requires wal_level to be at
least logical on the primary server")));

By looking at this change, it is not very clear that this can occur
only on standby. I understand that on primary, we will not allow
restarting the server after changing wal_level if there is a
pre-existing slot but still this looks a bit odd. Shall we have an
Assert to indicate that this will occur only on standby?

2.
/*
- * Since logical decoding is only permitted on a primary server, we know
- * that the current timeline ID can't be changing any more. If we did this
- * on a standby, we'd have to worry about the values we compute here
- * becoming invalid due to a promotion or timeline change.
+ * Since logical decoding is also permitted on a standby server, we need
+ * to check if the server is in recovery to decide how to get the current
+ * timeline ID (so that it also cover the promotion or timeline change
+ * cases).
  */
+
+ /* make sure we have enough WAL available */
+ flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
+
+ /* the standby could have been promoted, so check if still in recovery */
+ am_cascading_walsender = RecoveryInProgress();

The first part of the comment explains why it is important to check
RecoveryInProgress() and then immediately after that, the patch
invokes WalSndWaitForWal(). It may be better to move the comment after
WalSndWaitForWal() invocation. Also, it will be better to write a
comment as to why you need to do WalSndWaitForWal() before retrieving
the current timeline as previously that was done afterward.

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/5/23 12:28 PM, Amit Kapila wrote:
> On Wed, Apr 5, 2023 at 2:41 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>> Maybe we could change the doc with something among those lines instead?
>>
>> "
>> Existing logical slots on standby also get invalidated if wal_level on primary is reduced to
>> less than 'logical'. This is done as soon as the standby detects such a change in the WAL stream.
>>
>> It means, that for walsenders that are lagging (if any), some WAL records up to the parameter change on the
>> primary won't be decoded".
>>
>> I don't know whether this is what one would expect but that should be less of a surprise if documented.
>>
>> What do you think?
>>
> 
> Yeah, I think it is better to document to avoid any surprises if
> nobody else sees any problem with it.

Ack.

> BTW, another thought that
> crosses my mind is that let's not invalidate the slots when the
> standby startup process processes parameter_change record and rather
> do it when walsender decodes the parameter_change record, if we think
> that is safe. I have shared this as this crosses my mind while
> thinking about this part of the patch and wanted to validate my
> thoughts, we don't need to change even if the idea is valid.
>

I think this is a valid idea, but I do prefer the current one (where the
startup process triggers the invalidations) because:

   - I think it is better to invalidate as soon as possible. In case of an inactive logical
replication slot (walsenders stopped) it could take time to get "notified", while with the current
approach you get notified in the logfile and in pg_replication_slots even if the walsenders are stopped.

   - This is not a "slot"-dependent invalidation (as opposed to the xid invalidation case).

   - This is "somehow" the same behavior as on the primary: if one changes wal_level to < logical
then the engine will not start (if a logical slot is in place), so what has been decoded is only up
to the time the engine was stopped. If there is walsender lag, you'd not see some records.

> minor nitpick:
> +
> + /* Intentional fall through to session cancel */
> + /* FALLTHROUGH */
> 
> Do we need to repeat fall through twice in different ways?
> 

Do you mean, you'd prefer what was done in v52/0002?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Wed, Apr 5, 2023 at 6:14 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 4/5/23 12:28 PM, Amit Kapila wrote:
> > On Wed, Apr 5, 2023 at 2:41 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
>
> > minor nitpick:
> > +
> > + /* Intentional fall through to session cancel */
> > + /* FALLTHROUGH */
> >
> > Do we need to repeat fall through twice in different ways?
> >
>
> Do you mean, you'd prefer what was done in v52/0002?
>

No, I was thinking that instead of two comments, we need one here.
But, now thinking about it, do we really need to fall through in this
case, if so why? Shouldn't this case be handled after
PROCSIG_RECOVERY_CONFLICT_DATABASE?

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
Robert Haas
Date:
On Tue, Apr 4, 2023 at 8:33 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Perhaps a commit message like:
>
> "For cascading replication, wake up physical walsenders separately from
> logical walsenders.
>
> Physical walsenders can't send data until it's been flushed; logical
> walsenders can't decode and send data until it's been applied. On the
> standby, the WAL is flushed first, which will only wake up physical
> walsenders; and then applied, which will only wake up logical
> walsenders.
>
> Previously, all walsenders were awakened when the WAL was flushed. That
> was fine for logical walsenders on the primary; but on the standby the
> flushed WAL would not have been applied yet, so logical walsenders were
> awakened too early."

This sounds great. I think it's very clear about what is being changed
and why. I see that Bertrand already pulled this language into v60.

> For comments, I agree that WalSndWakeup() clearly needs a comment
> update. The call site in ApplyWalRecord() could also use a comment. You
> could add a comment at every call site, but I don't think that's
> necessary if there's a good comment over WalSndWakeup().

Right, we don't want to go overboard, but I think putting some of the
text you wrote above for the commit message, or something with a
similar theme, in the comment for WalSndWakeup() would be quite
helpful. We want people to understand why the physical and logical
cases are different.

I agree with you that ApplyWalRecord() is the other place where we
need a good comment. I think the one in v60 needs more word-smithing.
It should probably be a bit more detailed and clear about not only
what we're doing but why we're doing it.

The comment in InitWalSenderSlot() seems like it might be slightly
overdone, but I don't have a big problem with it so if we leave it
as-is that's fine.

Now that I understand what's going on here a bit better, I'm inclined
to think that this patch is basically fine. At least, I don't see any
obvious problem with it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/5/23 1:59 PM, Amit Kapila wrote:
> On Wed, Apr 5, 2023 at 3:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Wed, Apr 5, 2023 at 2:41 PM Drouvot, Bertrand
>> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> minor nitpick:
>> +
>> + /* Intentional fall through to session cancel */
>> + /* FALLTHROUGH */
>>
>> Do we need to repeat fall through twice in different ways?
>>
> 
> Few minor comments on 0003:
> ========================
> 1.
> + case XLOG_PARAMETER_CHANGE:
> + {
> + xl_parameter_change *xlrec =
> + (xl_parameter_change *) XLogRecGetData(buf->record);
> +
> + /*
> + * If wal_level on primary is reduced to less than logical,
> + * then we want to prevent existing logical slots from being
> + * used. Existing logical slots on standby get invalidated
> + * when this WAL record is replayed; and further, slot
> + * creation fails when the wal level is not sufficient; but
> + * all these operations are not synchronized, so a logical
> + * slot may creep in while the wal_level is being reduced.
> + * Hence this extra check.
> + */
> + if (xlrec->wal_level < WAL_LEVEL_LOGICAL)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("logical decoding on standby requires wal_level to be at
> least logical on the primary server")));
> 
> By looking at this change, it is not very clear that this can occur
> only on standby. I understand that on primary, we will not allow
> restarting the server after changing wal_level if there is a
> pre-existing slot but still this looks a bit odd. Shall we have an
> Assert to indicate that this will occur only on standby?

I think that's a fair point. Adding an Assert and a comment before the
Assert in V61 attached.
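
For reference, the addition boils down to something like the following, placed
right before the ereport(ERROR, ...); this is just a sketch, the actual comment
wording in V61 may differ:

    /*
     * This can only happen on a standby: on the primary, the server
     * refuses to restart with wal_level < logical while logical slots
     * exist.
     */
    Assert(RecoveryInProgress());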

> 
> 2.
> /*
> - * Since logical decoding is only permitted on a primary server, we know
> - * that the current timeline ID can't be changing any more. If we did this
> - * on a standby, we'd have to worry about the values we compute here
> - * becoming invalid due to a promotion or timeline change.
> + * Since logical decoding is also permitted on a standby server, we need
> + * to check if the server is in recovery to decide how to get the current
> + * timeline ID (so that it also cover the promotion or timeline change
> + * cases).
>    */
> +
> + /* make sure we have enough WAL available */
> + flushptr = WalSndWaitForWal(targetPagePtr + reqLen);
> +
> + /* the standby could have been promoted, so check if still in recovery */
> + am_cascading_walsender = RecoveryInProgress();
> 
> The first part of the comment explains why it is important to check
> RecoveryInProgress() and then immediately after that, the patch
> invokes WalSndWaitForWal(). It may be better to move the comment after
> WalSndWaitForWal() invocation.

Good catch, thanks! Done in V61.

> Also, it will be better to write a
> comment as to why you need to do WalSndWaitForWal() before retrieving
> the current timeline as previously that was done afterward.
> 

Agree, done in V61.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/5/23 3:15 PM, Amit Kapila wrote:
> On Wed, Apr 5, 2023 at 6:14 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> On 4/5/23 12:28 PM, Amit Kapila wrote:
>>> On Wed, Apr 5, 2023 at 2:41 PM Drouvot, Bertrand
>>> <bertranddrouvot.pg@gmail.com> wrote:
>>
>>> minor nitpick:
>>> +
>>> + /* Intentional fall through to session cancel */
>>> + /* FALLTHROUGH */
>>>
>>> Do we need to repeat fall through twice in different ways?
>>>
>>
>> Do you mean, you'd prefer what was done in v52/0002?
>>
> 
> No, I was thinking that instead of two comments, we need one here.
> But, now thinking about it, do we really need to fall through in this
> case, if so why? Shouldn't this case be handled after
> PROCSIG_RECOVERY_CONFLICT_DATABASE?
> 

Indeed, thanks! Done in V61 posted up-thread.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/5/23 4:24 PM, Robert Haas wrote:
> On Tue, Apr 4, 2023 at 8:33 PM Jeff Davis <pgsql@j-davis.com> wrote:
>> For comments, I agree that WalSndWakeup() clearly needs a comment
>> update. The call site in ApplyWalRecord() could also use a comment. You
>> could add a comment at every call site, but I don't think that's
>> necessary if there's a good comment over WalSndWakeup().
> 
> Right, we don't want to go overboard, but I think putting some of the
> text you wrote above for the commit message, or something with a
> similar theme, in the comment for WalSndWakeup() would be quite
> helpful. We want people to understand why the physical and logical
> cases are different.

Gave it a try in V61 posted up-thread.

> I agree with you that ApplyWalRecord() is the other place where we
> need a good comment. I think the one in v60 needs more word-smithing.
> It should probably be a bit more detailed and clear about not only
> what we're doing but why we're doing it.

Gave it a try in V61 posted up-thread.

> 
> Now that I understand what's going on here a bit better, I'm inclined
> to think that this patch is basically fine. At least, I don't see any
> obvious problem with it.

Thanks for the review and feedback!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-05 17:56:14 +0200, Drouvot, Bertrand wrote:

> @@ -7963,6 +7963,23 @@ xlog_redo(XLogReaderState *record)
>          /* Update our copy of the parameters in pg_control */
>          memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
>
> +        /*
> +         * Invalidate logical slots if we are in hot standby and the primary
> +         * does not have a WAL level sufficient for logical decoding. No need
> +         * to search for potentially conflicting logically slots if standby is
> +         * running with wal_level lower than logical, because in that case, we
> +         * would have either disallowed creation of logical slots or
> +         * invalidated existing ones.
> +         */
> +        if (InRecovery && InHotStandby &&
> +            xlrec.wal_level < WAL_LEVEL_LOGICAL &&
> +            wal_level >= WAL_LEVEL_LOGICAL)
> +        {
> +            TransactionId ConflictHorizon = InvalidTransactionId;
> +
> +            InvalidateObsoleteReplicationSlots(InvalidXLogRecPtr, InvalidOid, &ConflictHorizon);
> +        }

I mentioned this before, but I still don't understand why
InvalidateObsoleteReplicationSlots() accepts ConflictHorizon as a
pointer. It's not even modified, as far as I can see?


>  /*
>   * Report shared-memory space needed by ReplicationSlotsShmemInit.
>   */
> @@ -855,8 +862,7 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
>          SpinLockAcquire(&s->mutex);
>          effective_xmin = s->effective_xmin;
>          effective_catalog_xmin = s->effective_catalog_xmin;
> -        invalidated = (!XLogRecPtrIsInvalid(s->data.invalidated_at) &&
> -                       XLogRecPtrIsInvalid(s->data.restart_lsn));
> +        invalidated = ObsoleteSlotIsInvalid(s, true) || LogicalReplicationSlotIsInvalid(s);
>          SpinLockRelease(&s->mutex);

I don't understand why we need to have two different functions for this.


>          /* invalidated slots need not apply */
> @@ -1225,28 +1231,92 @@ ReplicationSlotReserveWal(void)
>      }
>  }
>
> +
> +/*
> + * Report terminating or conflicting message.
> + *
> + * For both, logical conflict on standby and obsolete slot are handled.
> + */
> +static void
> +ReportTerminationInvalidation(bool terminating, bool check_on_xid, int pid,
> +                              NameData slotname, TransactionId *xid,
> +                              XLogRecPtr restart_lsn, XLogRecPtr oldestLSN)
> +{
> +    StringInfoData err_msg;
> +    StringInfoData err_detail;
> +    bool        hint = false;
> +
> +    initStringInfo(&err_detail);
> +
> +    if (check_on_xid)
> +    {
> +        if (!terminating)
> +        {
> +            initStringInfo(&err_msg);
> +            appendStringInfo(&err_msg, _("invalidating replication slot \"%s\" because it conflicts with recovery"),
> +                             NameStr(slotname));

I still don't think the main error message should differ between invalidating
a slot due to recovery and due to max_slot_wal_keep_size.

> +
>  /*
> - * Helper for InvalidateObsoleteReplicationSlots -- acquires the given slot
> - * and mark it invalid, if necessary and possible.
> + * Helper for InvalidateObsoleteReplicationSlots
> + *
> + * Acquires the given slot and mark it invalid, if necessary and possible.
>   *
>   * Returns whether ReplicationSlotControlLock was released in the interim (and
>   * in that case we're not holding the lock at return, otherwise we are).
>   *
> - * Sets *invalidated true if the slot was invalidated. (Untouched otherwise.)
> + * Sets *invalidated true if an obsolete slot was invalidated. (Untouched otherwise.)

What's the point of making this specific to "obsolete slots"?


>   * This is inherently racy, because we release the LWLock
>   * for syscalls, so caller must restart if we return true.
>   */
>  static bool
>  InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
> -                               bool *invalidated)
> +                               bool *invalidated, TransactionId *xid)
>  {
>      int            last_signaled_pid = 0;
>      bool        released_lock = false;
> +    bool        check_on_xid;
> +
> +    check_on_xid = xid ? true : false;
>
>      for (;;)
>      {
>          XLogRecPtr    restart_lsn;
> +
>          NameData    slotname;
>          int            active_pid = 0;
>
> @@ -1263,19 +1333,20 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>           * Check if the slot needs to be invalidated. If it needs to be
>           * invalidated, and is not currently acquired, acquire it and mark it
>           * as having been invalidated.  We do this with the spinlock held to
> -         * avoid race conditions -- for example the restart_lsn could move
> -         * forward, or the slot could be dropped.
> +         * avoid race conditions -- for example the restart_lsn (or the
> +         * xmin(s) could) move forward or the slot could be dropped.
>           */
>          SpinLockAcquire(&s->mutex);
>
>          restart_lsn = s->data.restart_lsn;
>
>          /*
> -         * If the slot is already invalid or is fresh enough, we don't need to
> -         * do anything.
> +         * If the slot is already invalid or is a non conflicting slot, we
> +         * don't need to do anything.
>           */
> -        if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
> +        if (DoNotInvalidateSlot(s, xid, &oldestLSN))

DoNotInvalidateSlot() seems odd to me, and makes the code harder to
understand. I'd make it something like:

if (!SlotIsInvalid(s) && (
      LogicalSlotConflictsWith(s, xid) ||
      SlotConflictsWithLSN(s, lsn)))


>  /*
> - * Mark any slot that points to an LSN older than the given segment
> - * as invalid; it requires WAL that's about to be removed.
> + * Invalidate Obsolete slots or resolve recovery conflicts with logical slots.

I don't like that this spreads "obsolete slots" around further - it's very
unspecific. A logical slot that needs to be removed due to an xid conflict is
just as obsolete as one that needs to be removed due to max_slot_wal_keep_size.

I'd rephrase this to be about required resources getting removed or such; one
case of that is WAL, another case is xids.

>  restart:
>      LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> @@ -1414,21 +1505,35 @@ restart:
>          if (!s->in_use)
>              continue;
>
> -        if (InvalidatePossiblyObsoleteSlot(s, oldestLSN, &invalidated))
> +        if (xid)
>          {
> -            /* if the lock was released, start from scratch */
> -            goto restart;
> +            /* we are only dealing with *logical* slot conflicts */
> +            if (!SlotIsLogical(s))
> +                continue;
> +
> +            /*
> +             * not the database of interest and we don't want all the
> +             * database, skip
> +             */
> +            if (s->data.database != dboid && TransactionIdIsValid(*xid))
> +                continue;

ISTM that this should be in InvalidatePossiblyObsoleteSlot().


>      /*
> -     * If any slots have been invalidated, recalculate the resource limits.
> +     * If any slots have been invalidated, recalculate the required xmin and
> +     * the required lsn (if appropriate).
>       */
>      if (invalidated)
>      {
>          ReplicationSlotsComputeRequiredXmin(false);
> -        ReplicationSlotsComputeRequiredLSN();
> +        if (!xid)
> +            ReplicationSlotsComputeRequiredLSN();
>      }

Why make this conditional? If we invalidated a logical slot, we also don't
require as much WAL anymore, no?
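
That is, presumably the recomputation could simply stay unconditional, as it
is today:

    if (invalidated)
    {
        ReplicationSlotsComputeRequiredXmin(false);
        ReplicationSlotsComputeRequiredLSN();
    }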


> @@ -491,6 +493,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
>                                             PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
>                                             WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
>                                             true);
> +
> +    if (wal_level >= WAL_LEVEL_LOGICAL && isCatalogRel)
> +        InvalidateObsoleteReplicationSlots(InvalidXLogRecPtr, locator.dbOid, &snapshotConflictHorizon);
>  }

Hm. Is there a reason for doing this before resolving conflicts with existing
sessions?


Another issue: ResolveRecoveryConflictWithVirtualXIDs() takes
WaitExceedsMaxStandbyDelay() into account, but
InvalidateObsoleteReplicationSlots() does not. I think that's ok, because the
setup should prevent this case from being reached in normal paths, but at
least there should be a comment documenting this.



> +static inline bool
> +LogicalReplicationSlotXidsConflict(ReplicationSlot *s, TransactionId xid)
> +{
> +    TransactionId slot_effective_xmin;
> +    TransactionId slot_catalog_xmin;
> +
> +    slot_effective_xmin = s->effective_xmin;
> +    slot_catalog_xmin = s->data.catalog_xmin;
> +
> +    return (((TransactionIdIsValid(slot_effective_xmin) && TransactionIdPrecedesOrEquals(slot_effective_xmin, xid)) ||
> +             (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))));
> +}

return -ETOOMANYPARENS


> +static inline bool
> +SlotIsFreshEnough(ReplicationSlot *s, XLogRecPtr oldestLSN)
> +{
> +    return (s->data.restart_lsn >= oldestLSN);
> +}
> +
> +static inline bool
> +LogicalSlotIsNotConflicting(ReplicationSlot *s, TransactionId *xid)
> +{
> +    return (TransactionIdIsValid(*xid) && !LogicalReplicationSlotXidsConflict(s, *xid));
> +}
> +
> +static inline bool
> +DoNotInvalidateSlot(ReplicationSlot *s, TransactionId *xid, XLogRecPtr *oldestLSN)
> +{
> +    if (xid)
> +        return (LogicalReplicationSlotIsInvalid(s) || LogicalSlotIsNotConflicting(s, xid));
> +    else
> +        return (ObsoleteSlotIsInvalid(s, false) || SlotIsFreshEnough(s, *oldestLSN));
> +
> +}

See above for some more comments. But please don't accept stuff via pointer if
you don't have a reason for it. There's no reason for it for xid and oldestLSN
afaict.


> diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
> index dbe9394762..186e4ef600 100644
> --- a/src/backend/access/transam/xlogrecovery.c
> +++ b/src/backend/access/transam/xlogrecovery.c
> @@ -1935,6 +1935,30 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
>      XLogRecoveryCtl->lastReplayedTLI = *replayTLI;
>      SpinLockRelease(&XLogRecoveryCtl->info_lck);
>
> +    /*
> +     * Wakeup walsenders:
> +     *
> +     * On the standby, the WAL is flushed first (which will only wake up
> +     * physical walsenders) and then applied, which will only wake up logical
> +     * walsenders.
> +     * Indeed, logical walsenders on standby can't decode and send data until
> +     * it's been applied.
> +     *
> +     * Physical walsenders don't need to be waked up during replay unless

s/waked/woken/

> +     * cascading replication is allowed and time line change occured (so that
> +     * they can notice that they are on a new time line).
> +     *
> +     * That's why the wake up conditions are for:
> +     *
> +     *  - physical walsenders in case of new time line and cascade
> +     *  replication is allowed.
> +     *  - logical walsenders in case of new time line or recovery is in progress
> +     *  (logical decoding on standby).
> +     */
> +    WalSndWakeup(switchedTLI && AllowCascadeReplication(),
> +                 switchedTLI || RecoveryInProgress());

I don't think it's possible to get here without RecoveryInProgress() being
true. So we don't need that condition.


> @@ -1010,7 +1010,7 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
>          /* Signal the startup process and walsender that new WAL has arrived */
>          WakeupRecovery();
>          if (AllowCascadeReplication())
> -            WalSndWakeup();
> +            WalSndWakeup(true, !RecoveryInProgress());

Same comment as earlier.


>          /* Report XLOG streaming progress in PS display */
>          if (update_process_title)
> diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
> index 2d908d1de2..5c68ebb79e 100644
> --- a/src/backend/replication/walsender.c
> +++ b/src/backend/replication/walsender.c
> @@ -2628,6 +2628,23 @@ InitWalSenderSlot(void)
>              walsnd->sync_standby_priority = 0;
>              walsnd->latch = &MyProc->procLatch;
>              walsnd->replyTime = 0;
> +
> +            /*
> +             * The kind assignment is done here and not in StartReplication()
> +             * and StartLogicalReplication(). Indeed, the logical walsender
> +             * needs to read WAL records (like snapshot of running
> +             * transactions) during the slot creation. So it needs to be woken
> +             * up based on its kind.
> +             *
> +             * The kind assignment could also be done in StartReplication(),
> +             * StartLogicalReplication() and CREATE_REPLICATION_SLOT but it
> +             * seems better to set it on one place.
> +             */

Doesn't that mean we'll wake up logical walsenders even if they're doing
normal query processing?


> +            if (MyDatabaseId == InvalidOid)
> +                walsnd->kind = REPLICATION_KIND_PHYSICAL;
> +            else
> +                walsnd->kind = REPLICATION_KIND_LOGICAL;
> +
>              SpinLockRelease(&walsnd->mutex);
>              /* don't need the lock anymore */
>              MyWalSnd = (WalSnd *) walsnd;
> @@ -3310,30 +3327,39 @@ WalSndShmemInit(void)
>  }
>
>  /*
> - * Wake up all walsenders
> + * Wake up physical, logical or both walsenders kind
> + *
> + * The distinction between physical and logical walsenders is done, because:
> + * - physical walsenders can't send data until it's been flushed
> + * - logical walsenders on standby can't decode and send data until it's been
> + * applied
> + *
> + * For cascading replication we need to wake up physical
> + * walsenders separately from logical walsenders (see the comment before calling
> + * WalSndWakeup() in ApplyWalRecord() for more details).
>   *
>   * This will be called inside critical sections, so throwing an error is not
>   * advisable.
>   */
>  void
> -WalSndWakeup(void)
> +WalSndWakeup(bool physical, bool logical)
>  {
>      int            i;
>
>      for (i = 0; i < max_wal_senders; i++)
>      {
>          Latch       *latch;
> +        ReplicationKind kind;
>          WalSnd       *walsnd = &WalSndCtl->walsnds[i];
>
> -        /*
> -         * Get latch pointer with spinlock held, for the unlikely case that
> -         * pointer reads aren't atomic (as they're 8 bytes).
> -         */
> +        /* get latch pointer and kind with spinlock helds */
>          SpinLockAcquire(&walsnd->mutex);
>          latch = walsnd->latch;
> +        kind = walsnd->kind;
>          SpinLockRelease(&walsnd->mutex);
>
> -        if (latch != NULL)
> +        if (latch != NULL && ((physical && kind == REPLICATION_KIND_PHYSICAL) ||
> +                              (logical && kind == REPLICATION_KIND_LOGICAL)))
>              SetLatch(latch);
>      }
>  }

I'd consider rewriting this to something like:

if (latch == NULL)
    continue;

if ((physical && kind == REPLICATION_KIND_PHYSICAL) ||
    (logical && kind == REPLICATION_KIND_LOGICAL))
    SetLatch(latch);



Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Wed, Apr 5, 2023 at 6:14 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 4/5/23 12:28 PM, Amit Kapila wrote:
> > On Wed, Apr 5, 2023 at 2:41 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
> >> Maybe we could change the doc with something among those lines instead?
> >>
> >> "
> >> Existing logical slots on standby also get invalidated if wal_level on primary is reduced to
> >> less than 'logical'. This is done as soon as the standby detects such a change in the WAL stream.
> >>
> >> It means, that for walsenders that are lagging (if any), some WAL records up to the parameter change on the
> >> primary won't be decoded".
> >>
> >> I don't know whether this is what one would expect but that should be less of a surprise if documented.
> >>
> >> What do you think?
> >>
> >
> > Yeah, I think it is better to document to avoid any surprises if
> > nobody else sees any problem with it.
>
> Ack.
>

This doesn't seem to be addressed in the latest version. And today, I
think I see one more point about this doc change:
+    <para>
+     A logical replication slot can also be created on a hot standby.
To prevent
+     <command>VACUUM</command> from removing required rows from the system
+     catalogs, <varname>hot_standby_feedback</varname> should be set on the
+     standby. In spite of that, if any required rows get removed, the slot gets
+     invalidated. It's highly recommended to use a physical slot
between the primary
+     and the standby. Otherwise, hot_standby_feedback will work, but
only while the
+     connection is alive (for example a node restart would break it). Existing
+     logical slots on standby also get invalidated if wal_level on
primary is reduced to
+     less than 'logical'.

If hot_standby_feedback is not set, can logical decoding on the
standby misbehave? If so, it is not very clear from this doc change
whether that is acceptable. One scenario where I think it can misbehave is
if applying WAL records generated after changing wal_level from
'logical' to 'replica' physically removes catalog tuples that could be
referenced by logical decoding on the standby. Now, as mentioned
in patch 0003's comment in decode.c, it is possible that some
slots may creep in even after we invalidate the slots on a parameter
change; so while decoding using such a slot, if some required catalog
tuple has been removed by physical replication, then the decoding can
misbehave even before reaching the XLOG_PARAMETER_CHANGE record.

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Wed, Apr 5, 2023 at 9:27 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> On 4/5/23 3:15 PM, Amit Kapila wrote:
> > On Wed, Apr 5, 2023 at 6:14 PM Drouvot, Bertrand
> > <bertranddrouvot.pg@gmail.com> wrote:
> >>
> >> On 4/5/23 12:28 PM, Amit Kapila wrote:
> >>> On Wed, Apr 5, 2023 at 2:41 PM Drouvot, Bertrand
> >>> <bertranddrouvot.pg@gmail.com> wrote:
> >>
> >>> minor nitpick:
> >>> +
> >>> + /* Intentional fall through to session cancel */
> >>> + /* FALLTHROUGH */
> >>>
> >>> Do we need to repeat fall through twice in different ways?
> >>>
> >>
> >> Do you mean, you'd prefer what was done in v52/0002?
> >>
> >
> > No, I was thinking that instead of two comments, we need one here.
> > But, now thinking about it, do we really need to fall through in this
> > case, if so why? Shouldn't this case be handled after
> > PROCSIG_RECOVERY_CONFLICT_DATABASE?
> >
>
> Indeed, thanks! Done in V61 posted up-thread.
>

After this, I think for backends that have active slots, it would
simply cancel the current query. Will that be sufficient? We
want the backend process to exit and release the slot so that the
startup process can mark it invalid. For a walsender, an ERROR will lead
to its exit, so that is fine. If this understanding is correct, then
if 'am_cascading_walsender' is false, we should set ProcDiePending
apart from the other parameters. Sorry, I haven't tested this, so I could
be wrong here. Also, it seems you have removed the checks related to
slots; is it because PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT is only
used for logical slots? If so, do you think an Assert would make
sense?

Another comment on 0001.
 extern void CheckSlotRequirements(void);
 extern void CheckSlotPermissions(void);
+extern void ResolveRecoveryConflictWithLogicalSlots(Oid dboid,
TransactionId xid, char *reason);

This doesn't seem to be called from anywhere.

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Thu, Apr 6, 2023 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Apr 5, 2023 at 9:27 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
> >
>
> Another comment on 0001.
>  extern void CheckSlotRequirements(void);
>  extern void CheckSlotPermissions(void);
> +extern void ResolveRecoveryConflictWithLogicalSlots(Oid dboid,
> TransactionId xid, char *reason);
>
> This doesn't seem to be called from anywhere.
>

Few other comments:
==================
0004
1.
+ *  - physical walsenders in case of new time line and cascade
+ *  replication is allowed.
+ *  - logical walsenders in case of new time line or recovery is in progress
+ *  (logical decoding on standby).
+ */
+ WalSndWakeup(switchedTLI && AllowCascadeReplication(),
+ switchedTLI || RecoveryInProgress());

Do we need AllowCascadeReplication() check specifically for physical
walsenders? I think this should be true for both physical and logical
walsenders.

0005
2.
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -38,6 +38,7 @@
 #include "utils/pg_lsn.h"
 #include "utils/timestamp.h"
 #include "utils/tuplestore.h"
+#include "storage/standby.h"

The header includes should be in alphabetical order.

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Thu, Apr 6, 2023 at 11:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
>
> This doesn't seem to be addressed in the latest version. And today, I
> think I see one more point about this doc change:
> +    <para>
> +     A logical replication slot can also be created on a hot standby. To prevent
> +     <command>VACUUM</command> from removing required rows from the system
> +     catalogs, <varname>hot_standby_feedback</varname> should be set on the
> +     standby. In spite of that, if any required rows get removed, the slot gets
> +     invalidated. It's highly recommended to use a physical slot between the primary
> +     and the standby. Otherwise, hot_standby_feedback will work, but only while the
> +     connection is alive (for example a node restart would break it). Existing
> +     logical slots on standby also get invalidated if wal_level on primary is reduced to
> +     less than 'logical'.
>
> If hot_standby_feedback is not set, can logical decoding on the
> standby misbehave? If so, this doc change does not make it very clear
> whether that is acceptable. One scenario where I think it can misbehave
> is if applying WAL records generated after changing wal_level from
> 'logical' to 'replica' physically removes catalog tuples that could be
> referenced by logical decoding on the standby. Now, as mentioned in
> patch 0003's comment in decode.c, it is possible that some slots may
> creep in even after we invalidate the slots on the parameter change; so,
> while decoding using such a slot, if some required catalog tuple has
> been removed by physical replication, the decoding can misbehave even
> before reaching the XLOG_PARAMETER_CHANGE record.
>

Thinking some more on this, I think such a slot won't decode any other
records. During CreateInitDecodingContext->ReplicationSlotReserveWal,
for standbys, we use lastReplayedEndRecPtr as restart_lsn. This
should be a record before the parameter_change record in the above
scenario. So, ideally, the first record decoded by such a walsender
should be the parameter_change record, which will anyway error out. So,
this shouldn't be a problem.
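
For reference, the restart_lsn choice I am referring to looks roughly like
this in ReplicationSlotReserveWal() (simplified sketch, not the exact code):

    if (SlotIsPhysical(slot))
        restart_lsn = GetRedoRecPtr();
    else if (RecoveryInProgress())
        restart_lsn = GetXLogReplayRecPtr(NULL);    /* lastReplayedEndRecPtr */
    else
        restart_lsn = GetXLogInsertRecPtr();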

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/5/23 8:28 PM, Andres Freund wrote:
> Hi,
> 
> On 2023-04-05 17:56:14 +0200, Drouvot, Bertrand wrote:
> 
>> @@ -7963,6 +7963,23 @@ xlog_redo(XLogReaderState *record)
>>           /* Update our copy of the parameters in pg_control */
>>           memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
>>
>> +        /*
>> +         * Invalidate logical slots if we are in hot standby and the primary
>> +         * does not have a WAL level sufficient for logical decoding. No need
>> +         * to search for potentially conflicting logically slots if standby is
>> +         * running with wal_level lower than logical, because in that case, we
>> +         * would have either disallowed creation of logical slots or
>> +         * invalidated existing ones.
>> +         */
>> +        if (InRecovery && InHotStandby &&
>> +            xlrec.wal_level < WAL_LEVEL_LOGICAL &&
>> +            wal_level >= WAL_LEVEL_LOGICAL)
>> +        {
>> +            TransactionId ConflictHorizon = InvalidTransactionId;
>> +
>> +            InvalidateObsoleteReplicationSlots(InvalidXLogRecPtr, InvalidOid, &ConflictHorizon);
>> +        }
> 
> I mentioned this before, 

Sorry, I probably missed it.

> but I still don't understand why
> InvalidateObsoleteReplicationSlots() accepts ConflictHorizon as a
> pointer. It's not even modified, as far as I can see?
> 

The initial goal was to be able to check whether the
xid pointer was NULL and also whether *xid was a valid xid or not. So basically being able to
do three checks with the same parameter.

That's how we decided whether or not we are in the "wal_level < logical on the primary" conflict case in
ReportTerminationInvalidation().

I agree that passing a pointer is not the best approach (as there is a "risk" of modifying the value it points to),
so I added an extra bool to InvalidateObsoleteReplicationSlots() in the attached V62 instead.

I also replaced InvalidXLogRecPtr by 0, as it sounds odd to use "InvalidXLogRecPtr"
naming for an XLogSegNo.

> 
>>   /*
>>    * Report shared-memory space needed by ReplicationSlotsShmemInit.
>>    */
>> @@ -855,8 +862,7 @@ ReplicationSlotsComputeRequiredXmin(bool already_locked)
>>           SpinLockAcquire(&s->mutex);
>>           effective_xmin = s->effective_xmin;
>>           effective_catalog_xmin = s->effective_catalog_xmin;
>> -        invalidated = (!XLogRecPtrIsInvalid(s->data.invalidated_at) &&
>> -                       XLogRecPtrIsInvalid(s->data.restart_lsn));
>> +        invalidated = ObsoleteSlotIsInvalid(s, true) || LogicalReplicationSlotIsInvalid(s);
>>           SpinLockRelease(&s->mutex);
> 
> I don't understand why we need to have two different functions for this.
> 

LogicalReplicationSlotIsInvalid() has been created to provide a different error message
than in ".....because it exceeded the maximum reserved size" in StartLogicalReplication()
and "This slot has never previously reserved WAL" in pg_logical_slot_get_changes_guts().

So basically, to distinguish it from the max_slot_wal_keep_size-related messages.

> 
>>           /* invalidated slots need not apply */
>> @@ -1225,28 +1231,92 @@ ReplicationSlotReserveWal(void)
>>       }
>>   }
>>
>> +
>> +/*
>> + * Report terminating or conflicting message.
>> + *
>> + * For both, logical conflict on standby and obsolete slot are handled.
>> + */
>> +static void
>> +ReportTerminationInvalidation(bool terminating, bool check_on_xid, int pid,
>> +                              NameData slotname, TransactionId *xid,
>> +                              XLogRecPtr restart_lsn, XLogRecPtr oldestLSN)
>> +{
>> +    StringInfoData err_msg;
>> +    StringInfoData err_detail;
>> +    bool        hint = false;
>> +
>> +    initStringInfo(&err_detail);
>> +
>> +    if (check_on_xid)
>> +    {
>> +        if (!terminating)
>> +        {
>> +            initStringInfo(&err_msg);
>> +            appendStringInfo(&err_msg, _("invalidating replication slot \"%s\" because it conflicts with recovery"),
>> +                             NameStr(slotname));
> 
> I still don't think the main error message should differ between invalidating
> a slot due recovery and max_slot_wal_keep_size.

Okay. I gave it a second thought and I agree that "obsolete" also makes
sense for the xid conflict case. So, done that way in V62.

> 
>> +
>>   /*
>> - * Helper for InvalidateObsoleteReplicationSlots -- acquires the given slot
>> - * and mark it invalid, if necessary and possible.
>> + * Helper for InvalidateObsoleteReplicationSlots
>> + *
>> + * Acquires the given slot and mark it invalid, if necessary and possible.
>>    *
>>    * Returns whether ReplicationSlotControlLock was released in the interim (and
>>    * in that case we're not holding the lock at return, otherwise we are).
>>    *
>> - * Sets *invalidated true if the slot was invalidated. (Untouched otherwise.)
>> + * Sets *invalidated true if an obsolete slot was invalidated. (Untouched otherwise.)
> 
> What's the point of making this specific to "obsolete slots"?

There is no. Should be coming from a previous version/experiment.
Removed in V62, thanks!

> 
> 
>>    * This is inherently racy, because we release the LWLock
>>    * for syscalls, so caller must restart if we return true.
>>    */
>>   static bool
>>   InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>> -                               bool *invalidated)
>> +                               bool *invalidated, TransactionId *xid)
>>   {
>>       int            last_signaled_pid = 0;
>>       bool        released_lock = false;
>> +    bool        check_on_xid;
>> +
>> +    check_on_xid = xid ? true : false;
>>
>>       for (;;)
>>       {
>>           XLogRecPtr    restart_lsn;
>> +
>>           NameData    slotname;
>>           int            active_pid = 0;
>>
>> @@ -1263,19 +1333,20 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>>            * Check if the slot needs to be invalidated. If it needs to be
>>            * invalidated, and is not currently acquired, acquire it and mark it
>>            * as having been invalidated.  We do this with the spinlock held to
>> -         * avoid race conditions -- for example the restart_lsn could move
>> -         * forward, or the slot could be dropped.
>> +         * avoid race conditions -- for example the restart_lsn (or the
>> +         * xmin(s) could) move forward or the slot could be dropped.
>>            */
>>           SpinLockAcquire(&s->mutex);
>>
>>           restart_lsn = s->data.restart_lsn;
>>
>>           /*
>> -         * If the slot is already invalid or is fresh enough, we don't need to
>> -         * do anything.
>> +         * If the slot is already invalid or is a non conflicting slot, we
>> +         * don't need to do anything.
>>            */
>> -        if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
>> +        if (DoNotInvalidateSlot(s, xid, &oldestLSN))
> 
> DoNotInvalidateSlot() seems odd to me, and makes the code harder to
> understand. I'd make it something like:
> 
> if (!SlotIsInvalid(s) && (
>        LogicalSlotConflictsWith(s, xid) ||
>        SlotConflictsWithLSN(s, lsn)))
> 

I think that's a matter of taste (having a single function was suggested
by Amit up-thread).

I think I prefer having one single function, as it seems to me easier to
understand whether we want to check on the xid or not.

> 
>>   /*
>> - * Mark any slot that points to an LSN older than the given segment
>> - * as invalid; it requires WAL that's about to be removed.
>> + * Invalidate Obsolete slots or resolve recovery conflicts with logical slots.
> 
> I don't like that this spreads "obsolete slots" around further - it's very
> unspecific. A logical slot that needs to be removed due to an xid conflict is
> just as obsolete as one that needs to be removed due to max_slot_wal_keep_size.
> 
> I'd rephrase this to be about required resources getting removed or such, one
> case of that is WAL another case is xids.
> 

Agree. Re-worded in V62.


>>   restart:
>>       LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
>> @@ -1414,21 +1505,35 @@ restart:
>>           if (!s->in_use)
>>               continue;
>>
>> -        if (InvalidatePossiblyObsoleteSlot(s, oldestLSN, &invalidated))
>> +        if (xid)
>>           {
>> -            /* if the lock was released, start from scratch */
>> -            goto restart;
>> +            /* we are only dealing with *logical* slot conflicts */
>> +            if (!SlotIsLogical(s))
>> +                continue;
>> +
>> +            /*
>> +             * not the database of interest and we don't want all the
>> +             * database, skip
>> +             */
>> +            if (s->data.database != dboid && TransactionIdIsValid(*xid))
>> +                continue;
> 
> ISTM that this should be in InvalidatePossiblyObsoleteSlot().
> 

Agree, done in V62.

> 
>>       /*
>> -     * If any slots have been invalidated, recalculate the resource limits.
>> +     * If any slots have been invalidated, recalculate the required xmin and
>> +     * the required lsn (if appropriate).
>>        */
>>       if (invalidated)
>>       {
>>           ReplicationSlotsComputeRequiredXmin(false);
>> -        ReplicationSlotsComputeRequiredLSN();
>> +        if (!xid)
>> +            ReplicationSlotsComputeRequiredLSN();
>>       }
> 
> Why make this conditional? If we invalidated a logical slot, we also don't
> require as much WAL anymore, no?
> 

Agree, done in V62.

> 
>> @@ -491,6 +493,9 @@ ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
>>                                              PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
>>                                              WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
>>                                              true);
>> +
>> +    if (wal_level >= WAL_LEVEL_LOGICAL && isCatalogRel)
>> +        InvalidateObsoleteReplicationSlots(InvalidXLogRecPtr, locator.dbOid, &snapshotConflictHorizon);
>>   }
> 
> Hm. Is there a reason for doing this before resolving conflicts with existing
> sessions?
> 

Do you mean you'd prefer to call InvalidateObsoleteReplicationSlots() before ResolveRecoveryConflictWithVirtualXIDs()?

> 
> Another issue: ResolveRecoveryConflictWithVirtualXIDs() takes
> WaitExceedsMaxStandbyDelay() into account, but
> InvalidateObsoleteReplicationSlots() does not.

humm, good point.

> I think that's ok, because the
> setup should prevent this case from being reached in normal paths, but at
> least there should be a comment documenting this.
> 

I started to add the comment to InvalidateObsoleteReplicationSlots(), but I'm not
sure what you mean by "the setup should prevent this case from being reached in normal paths"
(so I left "XXXX" in the comment for now).

Did you mean that hot_standby_feedback and a physical slot between the primary and the standby should be in place?
Could you please elaborate?

> 
> 
>> +static inline bool
>> +LogicalReplicationSlotXidsConflict(ReplicationSlot *s, TransactionId xid)
>> +{
>> +    TransactionId slot_effective_xmin;
>> +    TransactionId slot_catalog_xmin;
>> +
>> +    slot_effective_xmin = s->effective_xmin;
>> +    slot_catalog_xmin = s->data.catalog_xmin;
>> +
>> +    return (((TransactionIdIsValid(slot_effective_xmin) && TransactionIdPrecedesOrEquals(slot_effective_xmin, xid)) ||
>> +             (TransactionIdIsValid(slot_catalog_xmin) && TransactionIdPrecedesOrEquals(slot_catalog_xmin, xid))));
>> +}
> 
> return -ETOOMANYPARENS
> 

I gave it a try to make it better in V62.
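
For instance, splitting the expression up along these lines already reads
better (just a sketch of one possible rewrite, not necessarily what V62 ends
up with):

    static inline bool
    LogicalReplicationSlotXidsConflict(ReplicationSlot *s, TransactionId xid)
    {
        TransactionId effective_xmin = s->effective_xmin;
        TransactionId catalog_xmin = s->data.catalog_xmin;

        if (TransactionIdIsValid(effective_xmin) &&
            TransactionIdPrecedesOrEquals(effective_xmin, xid))
            return true;

        return TransactionIdIsValid(catalog_xmin) &&
            TransactionIdPrecedesOrEquals(catalog_xmin, xid);
    }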

> 
>> +static inline bool
>> +SlotIsFreshEnough(ReplicationSlot *s, XLogRecPtr oldestLSN)
>> +{
>> +    return (s->data.restart_lsn >= oldestLSN);
>> +}
>> +
>> +static inline bool
>> +LogicalSlotIsNotConflicting(ReplicationSlot *s, TransactionId *xid)
>> +{
>> +    return (TransactionIdIsValid(*xid) && !LogicalReplicationSlotXidsConflict(s, *xid));
>> +}
>> +
>> +static inline bool
>> +DoNotInvalidateSlot(ReplicationSlot *s, TransactionId *xid, XLogRecPtr *oldestLSN)
>> +{
>> +    if (xid)
>> +        return (LogicalReplicationSlotIsInvalid(s) || LogicalSlotIsNotConflicting(s, xid));
>> +    else
>> +        return (ObsoleteSlotIsInvalid(s, false) || SlotIsFreshEnough(s, *oldestLSN));
>> +
>> +}
> 
> See above for some more comments. But please don't accept stuff via pointer if
> you don't have a reason for it. There's no reason for it for xid and oldestLSN
> afaict.

Agree that there is no reason for oldestLSN. Changing in V62.
As for the xid, I explained why I used a pointer above but found a way to remove the need for it
in V62 (as explained above).

> 
> 
>> diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
>> index dbe9394762..186e4ef600 100644
>> --- a/src/backend/access/transam/xlogrecovery.c
>> +++ b/src/backend/access/transam/xlogrecovery.c
>> @@ -1935,6 +1935,30 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
>>       XLogRecoveryCtl->lastReplayedTLI = *replayTLI;
>>       SpinLockRelease(&XLogRecoveryCtl->info_lck);
>>
>> +    /*
>> +     * Wakeup walsenders:
>> +     *
>> +     * On the standby, the WAL is flushed first (which will only wake up
>> +     * physical walsenders) and then applied, which will only wake up logical
>> +     * walsenders.
>> +     * Indeed, logical walsenders on standby can't decode and send data until
>> +     * it's been applied.
>> +     *
>> +     * Physical walsenders don't need to be waked up during replay unless
> 
> s/waked/woken/

Thanks, fixed.

>> +     * cascading replication is allowed and time line change occured (so that
>> +     * they can notice that they are on a new time line).
>> +     *
>> +     * That's why the wake up conditions are for:
>> +     *
>> +     *  - physical walsenders in case of new time line and cascade
>> +     *  replication is allowed.
>> +     *  - logical walsenders in case of new time line or recovery is in progress
>> +     *  (logical decoding on standby).
>> +     */
>> +    WalSndWakeup(switchedTLI && AllowCascadeReplication(),
>> +                 switchedTLI || RecoveryInProgress());
> 
> I don't think it's possible to get here without RecoveryInProgress() being
> true. So we don't need that condition.

Right, so I'm using "true" instead, as we don't want to rely only on a timeline change
for a logical walsender.
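
So the call in ApplyWalRecord() becomes, roughly:

    WalSndWakeup(switchedTLI && AllowCascadeReplication(), true);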

> 
> 
>> @@ -1010,7 +1010,7 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
>>           /* Signal the startup process and walsender that new WAL has arrived */
>>           WakeupRecovery();
>>           if (AllowCascadeReplication())
>> -            WalSndWakeup();
>> +            WalSndWakeup(true, !RecoveryInProgress());
> 
> Same comment as earlier.

done.

> 
> 
>>           /* Report XLOG streaming progress in PS display */
>>           if (update_process_title)
>> diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
>> index 2d908d1de2..5c68ebb79e 100644
>> --- a/src/backend/replication/walsender.c
>> +++ b/src/backend/replication/walsender.c
>> @@ -2628,6 +2628,23 @@ InitWalSenderSlot(void)
>>               walsnd->sync_standby_priority = 0;
>>               walsnd->latch = &MyProc->procLatch;
>>               walsnd->replyTime = 0;
>> +
>> +            /*
>> +             * The kind assignment is done here and not in StartReplication()
>> +             * and StartLogicalReplication(). Indeed, the logical walsender
>> +             * needs to read WAL records (like snapshot of running
>> +             * transactions) during the slot creation. So it needs to be woken
>> +             * up based on its kind.
>> +             *
>> +             * The kind assignment could also be done in StartReplication(),
>> +             * StartLogicalReplication() and CREATE_REPLICATION_SLOT but it
>> +             * seems better to set it on one place.
>> +             */
> 
> Doesn't that mean we'll wake up logical walsenders even if they're doing
> normal query processing?
> 

I'm not following what you mean here.

> 
>> +            if (MyDatabaseId == InvalidOid)
>> +                walsnd->kind = REPLICATION_KIND_PHYSICAL;
>> +            else
>> +                walsnd->kind = REPLICATION_KIND_LOGICAL;
>> +
>>               SpinLockRelease(&walsnd->mutex);
>>               /* don't need the lock anymore */
>>               MyWalSnd = (WalSnd *) walsnd;
>> @@ -3310,30 +3327,39 @@ WalSndShmemInit(void)
>>   }
>>
>>   /*
>> - * Wake up all walsenders
>> + * Wake up physical, logical or both walsenders kind
>> + *
>> + * The distinction between physical and logical walsenders is done, because:
>> + * - physical walsenders can't send data until it's been flushed
>> + * - logical walsenders on standby can't decode and send data until it's been
>> + * applied
>> + *
>> + * For cascading replication we need to wake up physical
>> + * walsenders separately from logical walsenders (see the comment before calling
>> + * WalSndWakeup() in ApplyWalRecord() for more details).
>>    *
>>    * This will be called inside critical sections, so throwing an error is not
>>    * advisable.
>>    */
>>   void
>> -WalSndWakeup(void)
>> +WalSndWakeup(bool physical, bool logical)
>>   {
>>       int            i;
>>
>>       for (i = 0; i < max_wal_senders; i++)
>>       {
>>           Latch       *latch;
>> +        ReplicationKind kind;
>>           WalSnd       *walsnd = &WalSndCtl->walsnds[i];
>>
>> -        /*
>> -         * Get latch pointer with spinlock held, for the unlikely case that
>> -         * pointer reads aren't atomic (as they're 8 bytes).
>> -         */
>> +        /* get latch pointer and kind with spinlock helds */
>>           SpinLockAcquire(&walsnd->mutex);
>>           latch = walsnd->latch;
>> +        kind = walsnd->kind;
>>           SpinLockRelease(&walsnd->mutex);
>>
>> -        if (latch != NULL)
>> +        if (latch != NULL && ((physical && kind == REPLICATION_KIND_PHYSICAL) ||
>> +                              (logical && kind == REPLICATION_KIND_LOGICAL)))
>>               SetLatch(latch);
>>       }
>>   }
> 
> I'd consider rewriting this to something like:
> 
> if (latch == NULL)
>      continue;
> 
> if ((physical && kind == REPLICATION_KIND_PHYSICAL)) ||
>     (logical && kind == REPLICATION_KIND_LOGICAL)
>      SetLatch(latch)
> 

Yeah better, done.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/6/23 8:40 AM, Amit Kapila wrote:
> On Wed, Apr 5, 2023 at 9:27 PM Drouvot, Bertrand
> 
> After this, I think for backends that have active slots, it would
> simply cancel the current query. Will that be sufficient? Because we
> want the backend process to exit and release the slot so that the
> startup process can mark it invalid. 
> For walsender, an ERROR will lead
> to its exit, so that is fine. If this understanding is correct, then
> if 'am_cascading_walsender' is false, we should set ProcDiePending
> apart from other parameters. Sorry, I haven't tested this, so I could
> be wrong here. 

Oops, my bad. You are fully right. Fixed in V62 posted up-thread.

> Also, it seems you have removed the checks related to
> slots, is it because PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT is only
> used for logical slots? If so, do you think an Assert would make
> sense?
> 

Yes, indeed adding an Assert makes sense: done in V62 posted up-thread.


> Another comment on 0001.
>   extern void CheckSlotRequirements(void);
>   extern void CheckSlotPermissions(void);
> +extern void ResolveRecoveryConflictWithLogicalSlots(Oid dboid,
> TransactionId xid, char *reason);
> 
> This doesn't seem to be called from anywhere.
>
Good catch, removed in V62 posted up-thread.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/6/23 11:55 AM, Amit Kapila wrote:
> On Thu, Apr 6, 2023 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Wed, Apr 5, 2023 at 9:27 PM Drouvot, Bertrand
>> <bertranddrouvot.pg@gmail.com> wrote:
>>>
>>
>> Another comment on 0001.
>>   extern void CheckSlotRequirements(void);
>>   extern void CheckSlotPermissions(void);
>> +extern void ResolveRecoveryConflictWithLogicalSlots(Oid dboid,
>> TransactionId xid, char *reason);
>>
>> This doesn't seem to be called from anywhere.
>>
> 
> Few other comments:
> ==================
> 0004
> 1.
> + *  - physical walsenders in case of new time line and cascade
> + *  replication is allowed.
> + *  - logical walsenders in case of new time line or recovery is in progress
> + *  (logical decoding on standby).
> + */
> + WalSndWakeup(switchedTLI && AllowCascadeReplication(),
> + switchedTLI || RecoveryInProgress());
> 
> Do we need AllowCascadeReplication() check specifically for physical
> walsenders? I think this should be true for both physical and logical
> walsenders.
> 

I don't think it could be possible to create logical walsenders on a standby if
AllowCascadeReplication() is not true, or am I missing something?

If so, I think it has to be set to true for the logical walsenders in all cases (as
done in V62 posted up-thread).

Andres made the point up-thread that RecoveryInProgress() is always true, and
as we don't want to be woken up only when there is a timeline change, I think
it has to always be true for logical walsenders.

> 0005
> 2.
> --- a/src/backend/access/transam/xlogfuncs.c
> +++ b/src/backend/access/transam/xlogfuncs.c
> @@ -38,6 +38,7 @@
>   #include "utils/pg_lsn.h"
>   #include "utils/timestamp.h"
>   #include "utils/tuplestore.h"
> +#include "storage/standby.h"
> 
> The header includes should be in alphabetical order.
> 

Good catch, thanks! Done in V62.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/6/23 7:59 AM, Amit Kapila wrote:
> On Wed, Apr 5, 2023 at 6:14 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> On 4/5/23 12:28 PM, Amit Kapila wrote:
>>> On Wed, Apr 5, 2023 at 2:41 PM Drouvot, Bertrand
>>> <bertranddrouvot.pg@gmail.com> wrote:
>>>> Maybe we could change the doc with something among those lines instead?
>>>>
>>>> "
>>>> Existing logical slots on standby also get invalidated if wal_level on primary is reduced to
>>>> less than 'logical'. This is done as soon as the standby detects such a change in the WAL stream.
>>>>
>>>> It means, that for walsenders that are lagging (if any), some WAL records up to the parameter change on the
>>>> primary won't be decoded".
>>>>
>>>> I don't know whether this is what one would expect but that should be less of a surprise if documented.
>>>>
>>>> What do you think?
>>>>
>>>
>>> Yeah, I think it is better to document to avoid any surprises if
>>> nobody else sees any problem with it.
>>
>> Ack.
>>
> 
> This doesn't seem to be addressed in the latest version.

Right, I was waiting to see if "nobody else sees any problem with it".

Added it now in V62 posted up-thread.

> And today, I
> think I see one more point about this doc change:
> +    <para>
> +     A logical replication slot can also be created on a hot standby. To prevent
> +     <command>VACUUM</command> from removing required rows from the system
> +     catalogs, <varname>hot_standby_feedback</varname> should be set on the
> +     standby. In spite of that, if any required rows get removed, the slot gets
> +     invalidated. It's highly recommended to use a physical slot between the primary
> +     and the standby. Otherwise, hot_standby_feedback will work, but only while the
> +     connection is alive (for example a node restart would break it). Existing
> +     logical slots on standby also get invalidated if wal_level on primary is reduced to
> +     less than 'logical'.
> 
> If hot_standby_feedback is not set, can logical decoding on the
> standby misbehave? If so, this doc change does not make it very clear
> whether that is acceptable.

I don't think it would misbehave, but the primary may delete system catalog rows
that could be needed by logical decoding on the standby (as it does not know about the
catalog_xmin on the standby).

Added this remark in V62.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/6/23 2:23 PM, Amit Kapila wrote:
> On Thu, Apr 6, 2023 at 11:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> Thinking some more on this, I think such a slot won't decode any other
> records. During CreateInitDecodingContext->ReplicationSlotReserveWal,
> for standby's, we use lastReplayedEndRecPtr as restart_lsn. This
> should be a record before parameter_change record in the above
> scenario. So, ideally, the first record to decode by such a walsender
> should be parameter_change which will anyway error out. So, this
> shouldn't be a problem.
> 

Agree.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Thu, Apr 6, 2023 at 6:32 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On 4/6/23 11:55 AM, Amit Kapila wrote:
> > On Thu, Apr 6, 2023 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Wed, Apr 5, 2023 at 9:27 PM Drouvot, Bertrand
> >> <bertranddrouvot.pg@gmail.com> wrote:
> >>>
> >>
> >> Another comment on 0001.
> >>   extern void CheckSlotRequirements(void);
> >>   extern void CheckSlotPermissions(void);
> >> +extern void ResolveRecoveryConflictWithLogicalSlots(Oid dboid,
> >> TransactionId xid, char *reason);
> >>
> >> This doesn't seem to be called from anywhere.
> >>
> >
> > Few other comments:
> > ==================
> > 0004
> > 1.
> > + *  - physical walsenders in case of new time line and cascade
> > + *  replication is allowed.
> > + *  - logical walsenders in case of new time line or recovery is in progress
> > + *  (logical decoding on standby).
> > + */
> > + WalSndWakeup(switchedTLI && AllowCascadeReplication(),
> > + switchedTLI || RecoveryInProgress());
> >
> > Do we need AllowCascadeReplication() check specifically for physical
> > walsenders? I think this should be true for both physical and logical
> > walsenders.
> >
>
> I don't think it could be possible to create logical walsenders on a standby if
> AllowCascadeReplication() is not true, or am I missing something?
>

Right, so why even traverse the walsenders for that case? What I was
imagining is code like:
if (AllowCascadeReplication())
    WalSndWakeup(switchedTLI, true);

Do you see any problem with this change?

Few more minor comments on 0005
=============================
0005
1.
+       <para>
+        Take a snapshot of running transactions and write this to WAL without
+        having to wait bgwriter or checkpointer to log one.

/wait bgwriter/wait for bgwriter

2.
+use Test::More tests => 67;

We no longer use the number of tests. Please refer to other similar tests.

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/6/23 3:39 PM, Amit Kapila wrote:
> On Thu, Apr 6, 2023 at 6:32 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>>
>> I don't think it could be possible to create logical walsenders on a standby if
>> AllowCascadeReplication() is not true, or am I missing something?
>>
> 
> Right, so why even traverse the walsenders for that case? What I was
> imagining is code like:
> if (AllowCascadeReplication())
>      WalSndWakeup(switchedTLI, true);
> 
> Do you see any problem with this change?

Not at all, it looks good to me.

> 
> Few more minor comments on 0005
> =============================
> 0005
> 1.
> +       <para>
> +        Take a snapshot of running transactions and write this to WAL without
> +        having to wait bgwriter or checkpointer to log one.
> 
> /wait bgwriter/wait for bgwriter
> 
> 2.
> +use Test::More tests => 67;
> 
> We no longer use the number of tests. Please refer to other similar tests.
> 

Thanks! Will update 0005.

Regards,


-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-06 12:10:57 +0530, Amit Kapila wrote:
> After this, I think for backends that have active slots, it would
> simply cancel the current query. Will that be sufficient? Because we
> want the backend process to exit and release the slot so that the
> startup process can mark it invalid.

We don't need them to exit, we just need them to release the slot. Which does
happen when the query is cancelled. Imagine if that weren't the case - if a
cancellation of pg_logical_slot_* wouldn't release the slot, we couldn't call
it again before disconnecting. I also did verify that indeed the slot is
released upon a cancellation.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Fri, Apr 7, 2023 at 6:55 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-04-06 12:10:57 +0530, Amit Kapila wrote:
> > After this, I think for backends that have active slots, it would
> > simply cancel the current query. Will that be sufficient? Because we
> > want the backend process to exit and release the slot so that the
> > startup process can mark it invalid.
>
> We don't need them to exit, we just need them to release the slot. Which does
> happen when the query is cancelled. Imagine if that weren't the case - if a
> cancellation of pg_logical_slot_* wouldn't release the slot, we couldn't call
> it again before disconnecting. I also did verify that indeed the slot is
> released upon a cancellation.
>

makes sense. Thanks for the clarification!

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

TBH, I don't like the state of 0001 much. I'm working on polishing it now.

A lot of the new functions in slot.h don't seem right to me:
- ObsoleteSlotIsInvalid() - isn't an obsolete slot by definition invalid?
- Why does ObsoleteSlotIsInvalid() sometime check invalidated_at and sometimes
  not?
- DoNotInvalidateSlot() seems too generic a name for a function exposed to the
  outside of slot.c
- TransactionIdIsValidPrecedesOrEquals() shouldn't be defined in slot.h -
  also, it's not actually clear what semantics it's trying to have.
- there's no commonality in naming between the functions used to test if a
  slot needs to be invalidated (SlotIsFreshEnough() vs
  LogicalSlotIsNotConflicting()).

Leaving naming etc aside, most of these don't seem to belong in slot.h, but
should just be in slot.c - there aren't conceivable users from outside slot.c.


Independent of this patch: What's the point of invalidated_at? The only reads
of it are done like
                invalidated = (!XLogRecPtrIsInvalid(s->data.invalidated_at) &&
                                           XLogRecPtrIsInvalid(s->data.restart_lsn));
i.e. the actual LSN is not used.

ISTM that we should just have it be a boolean, and that it should be used by
the different kinds of invalidating a slot.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-06 12:10:57 +0530, Amit Kapila wrote:
> Also, it seems you have removed the checks related to
> slots, is it because PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT is only
> used for logical slots? If so, do you think an Assert would make
> sense?

The asserts that have been added aren't correct. There's no guarantee that the
receiver of the procsignal still holds the same slot or any slot at all.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Fri, Apr 7, 2023 at 8:43 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-04-06 12:10:57 +0530, Amit Kapila wrote:
> > Also, it seems you have removed the checks related to
> > slots, is it because PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT is only
> > used for logical slots? If so, do you think an Assert would make
> > sense?
>
> The asserts that have been added aren't correct. There's no guarantee that the
> receiver of the procsignal still holds the same slot or any slot at all.
>

For backends that don't hold any slot, can we skip setting
RecoveryConflictPending and the other flags?

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Thu, Apr 6, 2023 at 7:50 PM Drouvot, Bertrand
<bertranddrouvot.pg@gmail.com> wrote:
>
> Thanks! Will update 0005.
>

I noticed a few typos in the latest patches.

0004
1.
+ * Physical walsenders don't need to be wakon up during replay unless

Typo.

0005
2.
+# Check if all the slots on standby are dropped. These include the 'activeslot'
+# that was acquired by make_slot_active(), and the non-active 'inactiveslot'.
+sub check_slots_dropped
+{
+ my ($slot_user_handle) = @_;
+
+ is($node_standby->slot('inactiveslot')->{'slot_type'}, '', 'inactiveslot on standby dropped');
+ is($node_standby->slot('activeslot')->{'slot_type'}, '', 'activeslot on standby dropped');
+
+ check_pg_recvlogical_stderr($slot_user_handle, "conflict with recovery");
+}
+
+# Check if all the slots on standby are dropped. These include the 'activeslot'
+# that was acquired by make_slot_active(), and the non-active 'inactiveslot'.
+sub change_hot_standby_feedback_and_wait_for_xmins
+{
+ my ($hsf, $invalidated) = @_;
+
+ $node_standby->append_conf('postgresql.conf',qq[
+ hot_standby_feedback = $hsf
+ ]);
+
+ $node_standby->reload;
+
+ if ($hsf && $invalidated)
+ {
...

The comment above change_hot_standby_feedback_and_wait_for_xmins seems
to be wrong. It seems to be copied from the previous test.

3.
+# Verify that pg_stat_database_conflicts.confl_active_logicalslot has been updated
+# we now expect 3 conflicts reported as the counter persist across reloads
+ok( $node_standby->poll_query_until(
+ 'postgres',
+ "select (confl_active_logicalslot = 4) from pg_stat_database_conflicts where datname = 'testdb'", 't'),
+ 'confl_active_logicalslot updated') or die "Timed out waiting confl_active_logicalslot to be updated";

The comment incorrectly mentions 3 conflicts whereas the query expects 4.

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
On 4/7/23 3:59 AM, Amit Kapila wrote:
> On Fri, Apr 7, 2023 at 6:55 AM Andres Freund <andres@anarazel.de> wrote:
>>
>> On 2023-04-06 12:10:57 +0530, Amit Kapila wrote:
>>> After this, I think for backends that have active slots, it would
>>> simply cancel the current query. Will that be sufficient? Because we
>>> want the backend process to exit and release the slot so that the
>>> startup process can mark it invalid.
>>
>> We don't need them to exit, we just need them to release the slot. Which does
>> happen when the query is cancelled. Imagine if that weren't the case - if a
>> cancellation of pg_logical_slot_* wouldn't release the slot, we couldn't call
>> it again before disconnecting. I also did verify that indeed the slot is
>> released upon a cancellation.
>>
> 
> makes sense. Thanks for the clarification!
> 

+1, thanks Andres!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/7/23 4:18 AM, Andres Freund wrote:
> Hi,
> 
> TBH, I don't like the state of 0001 much. I'm working on polishing it now.
> 

Thanks Andres!

> A lot of the new functions in slot.h don't seem right to me:
> - ObsoleteSlotIsInvalid() - isn't an obsolete slot by definition invalid?

bad naming, agree.

> - Why does ObsoleteSlotIsInvalid() sometime check invalidated_at and sometimes
>    not?

because part of the existing code was doing so (checking if s->data.restart_lsn is valid
with/without checking if data.invalidated_at is valid) and I thought it was better not
to change it.

> - TransactionIdIsValidPrecedesOrEquals() shouldn't be defined in slot.h -
>    also, it's not actually clear what semantics it's trying to have.

Oh right, my bad for the location.

> - there's no commonality in naming between the functions used to test if a
>    slot needs to be invalidated (SlotIsFreshEnough() vs
>    LogicalSlotIsNotConflicting()).

Agree, my bad.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/6/23 4:20 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 4/6/23 3:39 PM, Amit Kapila wrote:
>> On Thu, Apr 6, 2023 at 6:32 PM Drouvot, Bertrand
>> <bertranddrouvot.pg@gmail.com> wrote:
>>>
>>>
>>> I don't think it could be possible to create logical walsenders on a standby if
>>> AllowCascadeReplication() is not true, or am I missing something?
>>>
>>
>> Right, so why even traverse the walsenders for that case? What I was
>> imagining is code like:
>> if (AllowCascadeReplication())
>>      WalSndWakeup(switchedTLI, true);
>>
>> Do you see any problem with this change?
> 
> Not at all, it looks good to me.
> 
>>

Done in V63 attached and did change the associated comment a bit.


>> Few more minor comments on 0005
>> =============================
>> 0005
>> 1.
>> +       <para>
>> +        Take a snapshot of running transactions and write this to WAL without
>> +        having to wait bgwriter or checkpointer to log one.
>>
>> /wait bgwriter/wait for bgwriter
>>
>> 2.
>> +use Test::More tests => 67;
>>
>> We no longer use the number of tests. Please refer to other similar tests.
>>
> 
> Thanks! Will update 0005.
> 

Done in V63.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/7/23 5:47 AM, Amit Kapila wrote:
> On Thu, Apr 6, 2023 at 7:50 PM Drouvot, Bertrand
> <bertranddrouvot.pg@gmail.com> wrote:
>>
>> Thanks! Will update 0005.
>>
> 
> I noticed a few typos in the latest patches.
> 
> 0004
> 1.
> + * Physical walsenders don't need to be wakon up during replay unless
> 
> Typo.

Thanks! Fixed in V63 just posted up-thread.

> 
> 0005
> 2.
> +# Check if all the slots on standby are dropped. These include the 'activeslot'
> +# that was acquired by make_slot_active(), and the non-active 'inactiveslot'.
> +sub check_slots_dropped
> +{
> + my ($slot_user_handle) = @_;
> +
> + is($node_standby->slot('inactiveslot')->{'slot_type'}, '',
> 'inactiveslot on standby dropped');
> + is($node_standby->slot('activeslot')->{'slot_type'}, '', 'activeslot
> on standby dropped');
> +
> + check_pg_recvlogical_stderr($slot_user_handle, "conflict with recovery");
> +}
> +
> +# Check if all the slots on standby are dropped. These include the 'activeslot'
> +# that was acquired by make_slot_active(), and the non-active 'inactiveslot'.
> +sub change_hot_standby_feedback_and_wait_for_xmins
> +{
> + my ($hsf, $invalidated) = @_;
> +
> + $node_standby->append_conf('postgresql.conf',qq[
> + hot_standby_feedback = $hsf
> + ]);
> +
> + $node_standby->reload;
> +
> + if ($hsf && $invalidated)
> + {
> ...
> 
> The comment above change_hot_standby_feedback_and_wait_for_xmins seems
> to be wrong. It seems to be copied from the previous test.
> 

Good catch! Fixed in V63.


> 3.
> +# Verify that pg_stat_database_conflicts.confl_active_logicalslot has
> been updated
> +# we now expect 3 conflicts reported as the counter persist across reloads
> +ok( $node_standby->poll_query_until(
> + 'postgres',
> + "select (confl_active_logicalslot = 4) from
> pg_stat_database_conflicts where datname = 'testdb'", 't'),
> + 'confl_active_logicalslot updated') or die "Timed out waiting
> confl_active_logicalslot to be updated";
> 
> The comment incorrectly mentions 3 conflicts whereas the query expects 4.
> 

Good catch, fixed in v63.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-07 07:02:04 +0200, Drouvot, Bertrand wrote:
> Done in V63 attached and did change the associated comment a bit.

Can you send your changes incrementally, relative to V62? I'm polishing them
right now, and that'd make it a lot easier to apply your changes ontop.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/7/23 7:56 AM, Andres Freund wrote:
> Hi,
> 
> On 2023-04-07 07:02:04 +0200, Drouvot, Bertrand wrote:
>> Done in V63 attached and did change the associated comment a bit.
> 
> Can you send your changes incrementally, relative to V62? I'm polishing them
> right now, and that'd make it a lot easier to apply your changes ontop.
> 

Sure, please find them enclosed.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-07 08:09:50 +0200, Drouvot, Bertrand wrote:
> Hi,
>
> On 4/7/23 7:56 AM, Andres Freund wrote:
> > Hi,
> >
> > On 2023-04-07 07:02:04 +0200, Drouvot, Bertrand wrote:
> > > Done in V63 attached and did change the associated comment a bit.
> >
> > Can you send your changes incrementally, relative to V62? I'm polishing them
> > right now, and that'd make it a lot easier to apply your changes ontop.
> >
>
> Sure, please find them enclosed.

Thanks.


Here's my current working state - I'll go to bed soon.

Changes:

- shared catalog relations weren't handled correctly, because the dboid is
  InvalidOid for them. I wrote a test for that as well.

- ReplicationSlotsComputeRequiredXmin() took invalidated logical slots into
  account (ReplicationSlotsComputeLogicalRestartLSN() too, but it never looks
  at logical slots)

- I don't think the subset of slot xids that were checked when invalidating
  was right. We need to check effective_xmin and effective_catalog_xmin - the
  latter was using catalog_xmin.

- similarly, it wasn't right that specifically those two fields were
  overwritten when invalidated - as that was done, I suspect the changes might
  get lost on a restart...

- As mentioned previously, I did not like all the functions in slot.h, nor
  their naming. Not yet quite finished with that, but a good bit further

- There were a lot of unrelated changes, e.g. removing comments like
 * NB - this runs as part of checkpoint, so avoid raising errors if possible.

- I still don't like the order of the patches, fixing the walsender patches
  after introducing support for logical decoding on standby. Reordered.

- I don't think logical slots being invalidated is checked, e.g. in
  pg_logical_replication_slot_advance()

- I didn't like much that InvalidatePossiblyObsoleteSlot() switched between
  kill() and SendProcSignal() based on the "conflict". There very well could
  be reasons to use InvalidatePossiblyObsoleteSlot() with an xid from outside
  of the startup process in the future. Instead I made it differentiate based
  on MyBackendType == B_STARTUP.


I also:

Added new patch that replaces invalidated_at with a new enum, 'invalidated',
listing the reason for the invalidation. I added a check for !invalidated to
ReplicationSlotsComputeRequiredLSN() etc.
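
For illustration, the enum is along these lines (exact names may still
change):

    typedef enum ReplicationSlotInvalidationCause
    {
        RS_INVAL_NONE,          /* slot is not invalidated */
        RS_INVAL_WAL_REMOVED,   /* required WAL has been removed
                                 * (max_slot_wal_keep_size) */
        RS_INVAL_HORIZON,       /* required rows have been removed
                                 * (conflict with recovery) */
        RS_INVAL_WAL_LEVEL,     /* wal_level insufficient on the primary */
    } ReplicationSlotInvalidationCause;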

Added new patch moving checks for invalid logical slots into
CreateDecodingContext(). Otherwise we end up with 5 or so checks, which makes
no sense. As far as I can tell the old message in
pg_logical_slot_get_changes_guts() was bogus, one couldn't get there having
"never previously reserved WAL"

Split "Handle logical slot conflicts on standby." into two. I'm not sure that
should stay that way, but it made it easier to hack on
InvalidateObsoleteReplicationSlots.


Todo:
- write a test that invalidated logical slots stay invalidated across a restart
- write a test that invalidated logical slots do not lead to retaining WAL
- Further evolve the API of InvalidateObsoleteReplicationSlots()
  - pass in the ReplicationSlotInvalidationCause we're trying to conflict on?
  - rename xid to snapshotConflictHorizon, that'd be more in line with the
    ResolveRecoveryConflictWithSnapshot and easier to understand, I think

- The test could stand a bit of cleanup and consolidation
  - No need to start 4 psql processes to do 4 updates, just do it in one
    safe_psql()
  - the sequence of drop_logical_slots(), create_logical_slots(),
    change_hot_standby_feedback_and_wait_for_xmins(), make_slot_active() is
    repeated quite a few times
  - the stats queries checking for specific conflict counts, including
    preceding tests, is pretty painful. I suggest to reset the stats at the
    end of the test instead (likely also do the drop_logical_slot() there).
  - it's hard to correlate postgres log and the tap test, because the slots
    are named the same across all tests. Perhaps they could have a per-test
    prefix?
  - numbering tests is a PITA, I had to renumber the later ones, when adding a
    test for shared catalog tables


My attached version does include your v62-63 incremental changes.

Greetings,

Andres Freund

Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/7/23 9:50 AM, Andres Freund wrote:
> Hi,
> Here's my current working state - I'll go to bed soon.

Thanks a lot for this Andres!

> 
> Changes:
> 
> - shared catalog relations weren't handled correctly, because the dboid is
>    InvalidOid for them. I wrote a test for that as well.
> 
> - ReplicationSlotsComputeRequiredXmin() took invalidated logical slots into
>    account (ReplicationSlotsComputeLogicalRestartLSN() too, but it never looks
>    at logical slots)
> 
> - I don't think the subset of slot xids that were checked when invalidating
>    was right. We need to check effective_xmin and effective_catalog_xmin - the
>    latter was using catalog_xmin.
> 
> - similarly, it wasn't right that specifically those two fields were
>    overwritten when invalidated - as that was done, I suspect the changes might
>    get lost on a restart...
> 
> - As mentioned previously, I did not like all the functions in slot.h, nor
>    their naming. Not yet quite finished with that, but a good bit further
> 
> - There were a lot of unrelated changes, e.g. removing comments like
>   * NB - this runs as part of checkpoint, so avoid raising errors if possible.
> 
> - I still don't like the order of the patches, fixing the walsender patches
>    after introducing support for logical decoding on standby. Reordered.
> 
> - I don't think logical slots being invalidated is checked, e.g. in
>    pg_logical_replication_slot_advance()
> 
> - I didn't like much that InvalidatePossiblyObsoleteSlot() switched between
>    kill() and SendProcSignal() based on the "conflict". There very well could
>    be reasons to use InvalidatePossiblyObsoleteSlot() with an xid from outside
>    of the startup process in the future. Instead I made it differentiate based
>    on MyBackendType == B_STARTUP.
> 

Thanks for all of this and the above explanations.

> 
> I also:
> 
> Added new patch that replaces invalidated_at with a new enum, 'invalidated',
> listing the reason for the invalidation.

Yeah, that's a great idea.

> I added a check for !invalidated to
> ReplicationSlotsComputeRequiredLSN() etc.
> 

looked at 65-0001 and it looks good to me.

> Added new patch moving checks for invalid logical slots into
> CreateDecodingContext(). Otherwise we end up with 5 or so checks, which makes
> no sense. As far as I can tell the old message in
> pg_logical_slot_get_changes_guts() was bogus, one couldn't get there having
> "never previously reserved WAL"
> 

looked at 65-0002 and it looks good to me.

> Split "Handle logical slot conflicts on standby." into two. I'm not sure that
> should stay that way, but it made it easier to hack on
> InvalidateObsoleteReplicationSlots.
> 

looked at 65-0003 and the others.

It's easier to understand/read the code now that the ReplicationSlotInvalidationCause
enum has been created and data.invalidated also makes use of the enum. It does "simplify"
the review, and that looks good to me.

> 
> Todo:
> - write a test that invalidated logical slots stay invalidated across a restart

Done in 65-66-0008 attached.

> - write a test that invalidated logical slots do not lead to retaining WAL

I'm not sure how to do that since pg_switch_wal() and friends can't be executed on
a standby.

> - Further evolve the API of InvalidateObsoleteReplicationSlots()
>    - pass in the ReplicationSlotInvalidationCause we're trying to conflict on?
>    - rename xid to snapshotConflictHorizon, that'd be more in line with the
>      ResolveRecoveryConflictWithSnapshot and easier to understand, I think
> 

Done. The new API can be found in v65-66-InvalidateObsoleteReplicationSlots_API.patch
attached. It propagates the cause to InvalidatePossiblyObsoleteSlot() where a switch/case
can now be used. The "default" case does not emit an error since this code runs as part
of checkpoint.

> - The test could stand a bit of cleanup and consolidation
>    - No need to start 4 psql processes to do 4 updates, just do it in one
>      safe_psql()

Right, done in v65-66-0008-New-TAP-test-for-logical-decoding-on-standby.patch attached.

>    - the sequence of drop_logical_slots(), create_logical_slots(),
>      change_hot_standby_feedback_and_wait_for_xmins(), make_slot_active() is
>      repeated quite a few times

grouped in reactive_slots_change_hfs_and_wait_for_xmins() in 65-66-0008 attached.

>    - the stats queries checking for specific conflict counts, including
>      preceding tests, is pretty painful. I suggest to reset the stats at the
>      end of the test instead (likely also do the drop_logical_slot() there).

Good idea, done in 65-66-0008 attached.

>    - it's hard to correlate postgres log and the tap test, because the slots
>      are named the same across all tests. Perhaps they could have a per-test
>      prefix?

Good point. Done in 65-66-0008 attached. Thanks to that and the stats reset, the
check for invalidation is now done in a single function, "check_for_invalidation", that looks
for invalidation messages in the logfile and in pg_stat_database_conflicts.

Thanks for the suggestions: the TAP test is now easier to read/understand.

>    - numbering tests is a PITA, I had to renumber the later ones, when adding a
>      test for shared catalog tables

Yeah, sorry about that, it has been fixed in V63.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-07 17:13:13 +0200, Drouvot, Bertrand wrote:
> On 4/7/23 9:50 AM, Andres Freund wrote:
> > I added a check for !invalidated to
> > ReplicationSlotsComputeRequiredLSN() etc.
> > 
> 
> looked at 65-0001 and it looks good to me.
> 
> > Added new patch moving checks for invalid logical slots into
> > CreateDecodingContext(). Otherwise we end up with 5 or so checks, which makes
> > no sense. As far as I can tell the old message in
> > pg_logical_slot_get_changes_guts() was bogus, one couldn't get there having
> > "never previously reserved WAL"
> > 
> 
> looked at 65-0002 and it looks good to me.
> 
> > Split "Handle logical slot conflicts on standby." into two. I'm not sure that
> > should stay that way, but it made it easier to hack on
> > InvalidateObsoleteReplicationSlots.
> > 
> 
> looked at 65-0003 and the others.

Thanks for checking!


> > Todo:
> > - write a test that invalidated logical slots stay invalidated across a restart
> 
> Done in 65-66-0008 attached.

Cool.


> > - write a test that invalidated logical slots do not lead to retaining WAL
> 
> I'm not sure how to do that since pg_switch_wal() and friends can't be executed on
> a standby.

You can do it on the primary and wait for the records to have been applied.


> > - Further evolve the API of InvalidateObsoleteReplicationSlots()
> >    - pass in the ReplicationSlotInvalidationCause we're trying to conflict on?
> >    - rename xid to snapshotConflictHorizon, that'd be more in line with the
> >      ResolveRecoveryConflictWithSnapshot and easier to understand, I think
> > 
> 
> Done. The new API can be found in v65-66-InvalidateObsoleteReplicationSlots_API.patch
> attached. It propagates the cause to InvalidatePossiblyObsoleteSlot() where a switch/case
> can now be used.

Integrated. I moved the cause to the first argument, makes more sense to me
that way.


> The "default" case does not emit an error since this code runs as part
> of checkpoint.

I made it an error - it's a programming error, not some data level
inconsistency if that ever happens.


> > - The test could stand a bit of cleanup and consolidation
> >    - No need to start 4 psql processes to do 4 updates, just do it in one
> >      safe_psql()
> 
> Right, done in v65-66-0008-New-TAP-test-for-logical-decoding-on-standby.patch attached.

> >    - the sequence of drop_logical_slots(), create_logical_slots(),
> >      change_hot_standby_feedback_and_wait_for_xmins(), make_slot_active() is
> >      repeated quite a few times
> 
> grouped in reactive_slots_change_hfs_and_wait_for_xmins() in 65-66-0008 attached.
> 
> >    - the stats queries checking for specific conflict counts, including
> >      preceding tests, is pretty painful. I suggest to reset the stats at the
> >      end of the test instead (likely also do the drop_logical_slot() there).
> 
> Good idea, done in 65-66-0008 attached.
> 
> >    - it's hard to correlate postgres log and the tap test, because the slots
> >      are named the same across all tests. Perhaps they could have a per-test
> >      prefix?
> 
> Good point. Done in 65-66-0008 attached. Thanks to that and the stats reset the
> check for invalidation is now done in a single function "check_for_invalidation" that looks
> for invalidation messages in the logfile and in pg_stat_database_conflicts.
> 
> Thanks for the suggestions: the TAP test is now easier to read/understand.

Integrated all of these.


I think pg_log_standby_snapshot() should be added in "Allow logical decoding
on standby", not the commit adding the tests.


Is this patchset sufficient to subscribe to a publication on a physical
standby, assuming the publication is created on the primary? If so, we should
have at least a minimal test. If not, we should note that restriction
explicitly.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-07 08:47:57 -0700, Andres Freund wrote:
> Integrated all of these.

Here's my current version. Changes:
- Integrated Bertrand's changes
- polished commit messages of 0001-0003
- edited code comments for 0003, including
  InvalidateObsoleteReplicationSlots()'s header
- added a bump of SLOT_VERSION to 0001
- moved addition of pg_log_standby_snapshot() to 0007
- added a catversion bump for pg_log_standby_snapshot()
- moved all the bits dealing with procsignals from 0003 to 0004, now the split
  makes sense IMO
- combined a few more successive ->safe_psql() calls

I see occasional failures in the tests, particularly in the new test using
pg_authid, but not solely. cfbot also seems to have seen these:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3740

I made a bogus attempt at a workaround for the pg_authid case last night. But
that didn't actually fix anything, it just changed the timing.

I think the issue is that VACUUM does not force WAL to be flushed at the end
(since it does not assign an xid). wait_for_replay_catchup() uses
$node->lsn('flush'), which, due to VACUUM not flushing, can be an LSN from
before VACUUM completed.

The problem can be made more likely by adding pg_usleep(1000000); before
walwriter.c's call to XLogBackgroundFlush().

We probably should introduce some infrastructure in Cluster.pm for this, but
for now I just added a 'flush_wal' table that we insert into after a
VACUUM. That guarantees a WAL flush.
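
Concretely, the workaround looks roughly like this (a sketch only; database and
node names are whatever the test already uses):

    # VACUUM does not assign an xid and therefore does not force a WAL flush
    # when it finishes, so $node->lsn('flush') can point before its completion.
    # Inserting into a dummy table right afterwards guarantees the flush.
    $node_primary->safe_psql('testdb', 'CREATE TABLE flush_wal();');

    $node_primary->safe_psql('testdb', qq[
        VACUUM pg_class;
        INSERT INTO flush_wal DEFAULT VALUES;
    ]);
    $node_primary->wait_for_replay_catchup($node_standby);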


I think some of the patches might list more reviewers than really applicable,
and might also miss some. I'd appreciate it if you could go over that...

Greetings,

Andres Freund

Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/7/23 5:47 PM, Andres Freund wrote:
> Hi,
> 
>>> - write a test that invalidated logical slots do not lead to retaining WAL
>>
>> I'm not sure how to do that since pg_switch_wal() and friends can't be executed on
>> a standby.
> 
> You can do it on the primary and wait for the records to have been applied.
> 

Thanks, will give it a try in a couple of hours.

> 
>>> - Further evolve the API of InvalidateObsoleteReplicationSlots()
>>>     - pass in the ReplicationSlotInvalidationCause we're trying to conflict on?
>>>     - rename xid to snapshotConflictHorizon, that'd be more in line with the
>>>       ResolveRecoveryConflictWithSnapshot and easier to understand, I think
>>>
>>
>> Done. The new API can be found in v65-66-InvalidateObsoleteReplicationSlots_API.patch
>> attached. It propagates the cause to InvalidatePossiblyObsoleteSlot() where a switch/case
>> can now be used.
> 
> Integrated. I moved the cause to the first argument, makes more sense to me
> that way.

thanks!

> 
> I made it an error - it's a programming error, not some data level
> inconsistency if that ever happens.

okay, makes sense.
> 
> Integrated all of these.

Thanks!

> 
> 
> I think pg_log_standby_snapshot() should be added in "Allow logical decoding
> on standby", not the commit adding the tests.

Yeah, that's a good point, I do agree.

> 
> Is this patchset sufficient to subscribe to a publication on a physical
> standby, assuming the publication is created on the primary? If so, we should
> have at least a minimal test. If not, we should note that restriction
> explicitly.

I gave it a try and it does work.

"
node3 subscribes to node2 (standby).
Insert done in node1 (primary) where the publication is created => node3 sees the changes.
"

I started to create the TAP test but I'm currently stuck, as the "create subscription" waits for a
checkpoint / pg_log_standby_snapshot() on the primary.

So, trying to make use of things like:

"my %psql_subscriber = ('stdin' => '', 'stdout' => '');
$psql_subscriber{run} =
   $node_subscriber->background_psql('postgres', \$psql_subscriber{stdin},
     \$psql_subscriber{stdout},
     $psql_timeout);
$psql_subscriber{stdout} = '';
"

But in vain so far...
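
For the record, the direction I'm trying to get working looks roughly like this
(only a sketch; the subscription, publication and connection string names are
placeholders):

    # Issue CREATE SUBSCRIPTION through the background psql so the test is not
    # blocked, then run pg_log_standby_snapshot() on the primary so the standby
    # can finish building the snapshot that the slot creation is waiting for.
    $psql_subscriber{stdin} .= qq[
        CREATE SUBSCRIPTION tap_sub
          CONNECTION '$standby_connstr'
          PUBLICATION tap_pub
          WITH (copy_data = off);
    ];
    $psql_subscriber{run}->pump_nb;

    # Unblock the logical slot creation on the standby.
    $node_primary->safe_psql('postgres', 'SELECT pg_log_standby_snapshot()');

    # Wait until the subscription's slot exists on the standby.
    $node_standby->poll_query_until('postgres',
        "SELECT count(*) = 1 FROM pg_replication_slots WHERE slot_name = 'tap_sub'");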

Will resume working on it in a couple of hours.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/7/23 8:12 PM, Andres Freund wrote:
> Hi,
> 
> On 2023-04-07 08:47:57 -0700, Andres Freund wrote:
>> Integrated all of these.
> 
> Here's my current version. Changes:
> - Integrated Bertrand's changes
> - polished commit messages of 0001-0003
> - edited code comments for 0003, including
>    InvalidateObsoleteReplicationSlots()'s header
> - added a bump of SLOT_VERSION to 0001
> - moved addition of pg_log_standby_snapshot() to 0007
> - added a catversion bump for pg_log_standby_snapshot()
> - moved all the bits dealing with procsignals from 0003 to 0004, now the split
>    makes sense IMO
> - combined a few more sucessive ->safe_psql() calls
> 

Thanks!

> I see occasional failures in the tests, particularly in the new test using
> pg_authid, but not solely. cfbot also seems to have seen these:
> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3740
> 
> I made a bogus attempt at a workaround for the pg_authid case last night. But
> that didn't actually fix anything, it just changed the timing.
> 
> I think the issue is that VACUUM does not force WAL to be flushed at the end
> (since it does not assign an xid). wait_for_replay_catchup() uses
> $node->lsn('flush'), which, due to VACUUM not flushing, can be an LSN from
> before VACUUM completed.
> 
> The problem can be made more likely by adding pg_usleep(1000000); before
> walwriter.c's call to XLogBackgroundFlush().
> 
> We probably should introduce some infrastructure in Cluster.pm for this, but
> for now I just added a 'flush_wal' table that we insert into after a
> VACUUM. That guarantees a WAL flush.
> 
> 
Ack for the Cluster.pm "improvement" and thanks for the "workaround"!

> I think some of the patches might have more reviewers than really applicable,
> and might also miss some. I'd appreciate if you could go over that...
> 

Sure, will do in a couple of hours.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/7/23 8:27 PM, Drouvot, Bertrand wrote:
> Hi,
> 

>> I think some of the patches might have more reviewers than really applicable,
>> and might also miss some. I'd appreciate if you could go over that...
>>
> 
> Sure, will do in a couple of hours.
> 

That looks good to me, just few remarks:

0005 is missing author/reviewer, I'd propose:

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de

0006, I'd propose:

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>

0007, I'd propose:

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version)
Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>

0009, I'd propose:

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version)
Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>

It's hard (given the number of emails that have been sent during all this time),
but I do hope it's correct and that nobody has been missed.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/7/23 8:24 PM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 4/7/23 5:47 PM, Andres Freund wrote:
>> Hi,
>>
>>>> - write a test that invalidated logical slots do not lead to retaining WAL
>>>
>>> I'm not sure how to do that since pg_switch_wal() and friends can't be executed on
>>> a standby.
>>
>> You can do it on the primary and wait for the records to have been applied.
>>
> 
> Thanks, will give it a try in a couple of hours.

I looked at it, but I think we'd also need things like pg_walfile_name() on the standby, which is not allowed.
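
One way around that might be to compute the file name on the primary (where
pg_walfile_name() is allowed) and then look for it under the standby's pg_wal,
along these lines (untested sketch; the slot name is a placeholder):

    # Remember the slot's restart_lsn while it is still valid, translate it
    # into a WAL file name on the primary, and after the invalidation plus a
    # restartpoint on the standby check that the file is gone there.
    my $restart_lsn = $node_standby->safe_psql('postgres',
        "SELECT restart_lsn FROM pg_replication_slots WHERE slot_name = 'inactiveslot'");
    my $walfile = $node_primary->safe_psql('postgres',
        "SELECT pg_walfile_name('$restart_lsn')");

    # ... trigger the conflict, then CHECKPOINT on the standby ...

    ok(!-f $node_standby->data_dir . "/pg_wal/$walfile",
        "invalidated slot does not prevent WAL removal on the standby");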

>> Is this patchset sufficient to subscribe to a publication on a physical
>> standby, assuming the publication is created on the primary? If so, we should
>> have at least a minimal test. If not, we should note that restriction
>> explicitly.
> 
> I gave it a try and it does work.
> 
> "
> node3 subscribes to node2 (standby).
> Insert done in node1 (primary) where the publication is created => node3 see the changes.
> "
> 
> I started to create the TAP test but I'm currently stuck, as the "create subscription" waits for a
> checkpoint / pg_log_standby_snapshot() on the primary.
> 
> So, trying to make use of things like:
> 
> "my %psql_subscriber = ('stdin' => '', 'stdout' => '');
> $psql_subscriber{run} =
>    $node_subscriber->background_psql('postgres', \$psql_subscriber{stdin},
>      \$psql_subscriber{stdout},
>      $psql_timeout);
> $psql_subscriber{stdout} = '';
> "
> 
> But in vain so far...
> 

please find attached sub_in_progress.patch that "should work" but "does not" because
the wait_for_subscription_sync() call produces:

"
error running SQL: 'psql:<stdin>:1: ERROR:  recovery is in progress
HINT:  WAL control functions cannot be executed during recovery.'
while running 'psql -XAtq -d port=61441 host=/tmp/45dt3wqs2p dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql
'SELECT pg_current_wal_lsn()'
"

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-07 22:54:01 +0200, Drouvot, Bertrand wrote:
> That looks good to me

Cool.

I think I'll push these in a few hours. While this needed more changes than
I'd like shortly before the freeze, I think they're largely not in very
interesting bits and pieces - and this feature has been in the works for about
three eternities, and it is blocking a bunch of highly requested features.

If anybody still has energy, I would appreciate a look at 0001, 0002, the new
pieces I added, to make what's now 0003 and 0004 cleaner.


> 0005 is missing author/reviewer, I'd propose:
> [...]

Thanks, I'll integrate them...


> It's hard (given the amount of emails that have been send during all this time),

Indeed.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Alvaro Herrera
Date:
I gave a very quick look at 0001 and 0003.  I find no fault with 0001.
It was clear back when we added that stuff that invalidated_at was not
terribly useful -- I was just too conservative to leave it out -- but now
that a lot of time has passed and we haven't done anything with it,
removing it seems perfectly OK.

As for 0003, I have no further concerns about the translatability.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El miedo atento y previsor es la madre de la seguridad" (E. Burke)



Re: Minimal logical decoding on standbys

From
Melanie Plageman
Date:
Code review only of 0001-0005.

I noticed you had two 0008 patches, btw.

On Fri, Apr 07, 2023 at 11:12:26AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2023-04-07 08:47:57 -0700, Andres Freund wrote:
> > Integrated all of these.
> 
> From 0e038eb5dfddec500fbf4625775d1fa508a208f6 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Thu, 6 Apr 2023 20:00:07 -0700
> Subject: [PATCH va67 1/9] Replace a replication slot's invalidated_at LSN with
>  an enum
> 
> diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
> index 8872c80cdfe..ebcb637baed 100644
> --- a/src/include/replication/slot.h
> +++ b/src/include/replication/slot.h
> @@ -37,6 +37,17 @@ typedef enum ReplicationSlotPersistency
>      RS_TEMPORARY
>  } ReplicationSlotPersistency;
>  
> +/*
> + * Slots can be invalidated, e.g. due to max_slot_wal_keep_size. If so, the
> + * 'invalidated' field is set to a value other than _NONE.
> + */
> +typedef enum ReplicationSlotInvalidationCause
> +{
> +    RS_INVAL_NONE,
> +    /* required WAL has been removed */

I just wonder if RS_INVAL_WAL is too generic. Something like
RS_INVAL_WAL_MISSING or similar may be better, since it seems there are
other invalidation causes that may be related to WAL.

> +    RS_INVAL_WAL,
> +} ReplicationSlotInvalidationCause;
> +

0002 LGTM

> From 52c25cc15abc4470d19e305d245b9362e6b8d6a3 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Fri, 7 Apr 2023 09:32:48 -0700
> Subject: [PATCH va67 3/9] Support invalidating replication slots due to
>  horizon and wal_level
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> Needed for supporting logical decoding on a standby. The new invalidation
> methods will be used in a subsequent commit.
> 

You probably are aware, but applying 0003 and 0004 both gives me two
warnings:

warning: 1 line adds whitespace errors.
Warning: commit message did not conform to UTF-8.
You may want to amend it after fixing the message, or set the config
variable i18n.commitEncoding to the encoding your project uses.

> diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> index df23b7ed31e..c2a9accebf6 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -1241,8 +1241,58 @@ ReplicationSlotReserveWal(void)
>  }
>  
>  /*
> - * Helper for InvalidateObsoleteReplicationSlots -- acquires the given slot
> - * and mark it invalid, if necessary and possible.
> + * Report that replication slot needs to be invalidated
> + */
> +static void
> +ReportSlotInvalidation(ReplicationSlotInvalidationCause cause,
> +                       bool terminating,
> +                       int pid,
> +                       NameData slotname,
> +                       XLogRecPtr restart_lsn,
> +                       XLogRecPtr oldestLSN,
> +                       TransactionId snapshotConflictHorizon)
> +{
> +    StringInfoData err_detail;
> +    bool        hint = false;
> +
> +    initStringInfo(&err_detail);
> +
> +    switch (cause)
> +    {
> +        case RS_INVAL_WAL:
> +            hint = true;
> +            appendStringInfo(&err_detail, _("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes."),
> +                             LSN_FORMAT_ARGS(restart_lsn),

I'm not sure what the below cast is meant to do. If you are trying to
protect against overflow/underflow, I think you'd need to cast before
doing the subtraction.

> +                             (unsigned long long) (oldestLSN - restart_lsn));
> +            break;
> +        case RS_INVAL_HORIZON:
> +            appendStringInfo(&err_detail, _("The slot conflicted with xid horizon %u."),
> +                             snapshotConflictHorizon);
> +            break;
> +
> +        case RS_INVAL_WAL_LEVEL:
> +            appendStringInfo(&err_detail, _("Logical decoding on standby requires wal_level to be at least logical on the primary server"));
> +            break;
> +        case RS_INVAL_NONE:
> +            pg_unreachable();
> +    }

This ereport is quite hard to read. Is there any simplification you can
do of the ternaries without undue duplication?

> +    ereport(LOG,
> +            terminating ?
> +            errmsg("terminating process %d to release replication slot \"%s\"",
> +                   pid, NameStr(slotname)) :
> +            errmsg("invalidating obsolete replication slot \"%s\"",
> +                   NameStr(slotname)),
> +            errdetail_internal("%s", err_detail.data),
> +            hint ? errhint("You might need to increase max_slot_wal_keep_size.") : 0);
> +
> +    pfree(err_detail.data);
> +}
> +
> +/*
> + * Helper for InvalidateObsoleteReplicationSlots
> + *
> + * Acquires the given slot and mark it invalid, if necessary and possible.
>   *
>   * Returns whether ReplicationSlotControlLock was released in the interim (and
>   * in that case we're not holding the lock at return, otherwise we are).
> @@ -1253,7 +1303,10 @@ ReplicationSlotReserveWal(void)
>   * for syscalls, so caller must restart if we return true.
>   */
>  static bool
> -InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
> +InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
> +                               ReplicationSlot *s,
> +                               XLogRecPtr oldestLSN,
> +                               Oid dboid, TransactionId snapshotConflictHorizon,
>                                 bool *invalidated)
>  {
>      int            last_signaled_pid = 0;
> @@ -1264,6 +1317,7 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>          XLogRecPtr    restart_lsn;
>          NameData    slotname;
>          int            active_pid = 0;
> +        ReplicationSlotInvalidationCause conflict = RS_INVAL_NONE;
>  
>          Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
>  
> @@ -1286,10 +1340,45 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>          restart_lsn = s->data.restart_lsn;
>  
>          /*
> -         * If the slot is already invalid or is fresh enough, we don't need to
> -         * do anything.
> +         * If the slot is already invalid or is a non conflicting slot, we
> +         * don't need to do anything.
>           */
> -        if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
> +        if (s->data.invalidated == RS_INVAL_NONE)
> +        {
> +            switch (cause)
> +            {
> +                case RS_INVAL_WAL:
> +                    if (s->data.restart_lsn != InvalidXLogRecPtr &&
> +                        s->data.restart_lsn < oldestLSN)
> +                        conflict = cause;
> +                    break;

Should the below be an error? a physical slot with RS_INVAL_HORIZON
invalidation cause?

> +                case RS_INVAL_HORIZON:
> +                    if (!SlotIsLogical(s))
> +                        break;
> +                    /* invalid DB oid signals a shared relation */
> +                    if (dboid != InvalidOid && dboid != s->data.database)
> +                        break;
> +                    if (TransactionIdIsValid(s->effective_xmin) &&
> +                        TransactionIdPrecedesOrEquals(s->effective_xmin,
> +                                                      snapshotConflictHorizon))
> +                        conflict = cause;
> +                    else if (TransactionIdIsValid(s->effective_catalog_xmin) &&
> +                             TransactionIdPrecedesOrEquals(s->effective_catalog_xmin,
> +                                                           snapshotConflictHorizon))
> +                        conflict = cause;
> +                    break;
> +                case RS_INVAL_WAL_LEVEL:
> +                    if (SlotIsLogical(s))
> +                        conflict = cause;
> +                    break;

All three of default, pg_unreachable(), and break seem a bit like
overkill. Perhaps remove the break?

> +                default:
> +                    pg_unreachable();
> +                    break;
> +            }
> +        }
> +

> @@ -1390,14 +1476,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>              ReplicationSlotMarkDirty();
>              ReplicationSlotSave();
>              ReplicationSlotRelease();
> +            pgstat_drop_replslot(s);
>  
> -            ereport(LOG,
> -                    errmsg("invalidating obsolete replication slot \"%s\"",
> -                           NameStr(slotname)),
> -                    errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
> -                              LSN_FORMAT_ARGS(restart_lsn),
> -                              (unsigned long long) (oldestLSN - restart_lsn)),
> -                    errhint("You might need to increase max_slot_wal_keep_size."));
> +            ReportSlotInvalidation(conflict, false, active_pid,
> +                                   slotname, restart_lsn,
> +                                   oldestLSN, snapshotConflictHorizon);
>  
>              /* done with this slot for now */
>              break;
> @@ -1410,19 +1493,33 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
>  }
>  
>  /*
> - * Mark any slot that points to an LSN older than the given segment
> - * as invalid; it requires WAL that's about to be removed.
> + * Invalidate slots that require resources about to be removed.
>   *
>   * Returns true when any slot have got invalidated.
>   *
> + * Whether a slot needs to be invalidated depends on the cause. A slot is
> + * removed if it:
> + * - RS_INVAL_WAL: requires a LSN older than the given segment
> + * - RS_INVAL_HORIZON: requires a snapshot <= the given horizon, in the given db
> +     dboid may be InvalidOid for shared relations

the comma above reduces readability

is this what you mean?

RS_INVAL_HORIZON: requires a snapshot <= the given horizon in the given
db; dboid may be InvalidOid for shared relations

> From 311a1d8f9c2d1acf0c22e091d53f7a533073c8b7 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Fri, 7 Apr 2023 09:56:02 -0700
> Subject: [PATCH va67 4/9] Handle logical slot conflicts on standby
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> During WAL replay on standby, when slot conflict is identified, invalidate
> such slots. Also do the same thing if wal_level on the primary server is
> reduced to below logical and there are existing logical slots on
> standby. Introduce a new ProcSignalReason value for slot conflict recovery.
> 
> Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
> Author: Andres Freund <andres@anarazel.de>
> Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version)
> Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
> Reviewed-by: Andres Freund <andres@anarazel.de>
> Reviewed-by: Robert Haas <robertmhaas@gmail.com>
> Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com>
> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
> Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de
> ---
>  src/include/storage/procsignal.h     |  1 +
>  src/include/storage/standby.h        |  2 ++
>  src/backend/access/gist/gistxlog.c   |  2 ++
>  src/backend/access/hash/hash_xlog.c  |  1 +
>  src/backend/access/heap/heapam.c     |  3 +++
>  src/backend/access/nbtree/nbtxlog.c  |  2 ++
>  src/backend/access/spgist/spgxlog.c  |  1 +
>  src/backend/access/transam/xlog.c    | 15 +++++++++++++++
>  src/backend/replication/slot.c       |  8 +++++++-
>  src/backend/storage/ipc/procsignal.c |  3 +++
>  src/backend/storage/ipc/standby.c    | 20 +++++++++++++++++++-
>  src/backend/tcop/postgres.c          |  9 +++++++++
>  12 files changed, 65 insertions(+), 2 deletions(-)
> 
> diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
> index 905af2231ba..2f52100b009 100644
> --- a/src/include/storage/procsignal.h
> +++ b/src/include/storage/procsignal.h
> @@ -42,6 +42,7 @@ typedef enum
>      PROCSIG_RECOVERY_CONFLICT_TABLESPACE,
>      PROCSIG_RECOVERY_CONFLICT_LOCK,
>      PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
> +    PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT,
>      PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
>      PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
>  
> diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
> index 2effdea126f..41f4dc372e6 100644
> --- a/src/include/storage/standby.h
> +++ b/src/include/storage/standby.h
> @@ -30,8 +30,10 @@ extern void InitRecoveryTransactionEnvironment(void);
>  extern void ShutdownRecoveryTransactionEnvironment(void);
>  
>  extern void ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
> +                                                bool isCatalogRel,
>                                                  RelFileLocator locator);
>  extern void ResolveRecoveryConflictWithSnapshotFullXid(FullTransactionId snapshotConflictHorizon,
> +                                                       bool isCatalogRel,
>                                                         RelFileLocator locator);
>  extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
>  extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
> diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
> index b7678f3c144..9a86fb3feff 100644
> --- a/src/backend/access/gist/gistxlog.c
> +++ b/src/backend/access/gist/gistxlog.c
> @@ -197,6 +197,7 @@ gistRedoDeleteRecord(XLogReaderState *record)
>          XLogRecGetBlockTag(record, 0, &rlocator, NULL, NULL);
>  
>          ResolveRecoveryConflictWithSnapshot(xldata->snapshotConflictHorizon,
> +                                            xldata->isCatalogRel,
>                                              rlocator);
>      }
>  
> @@ -390,6 +391,7 @@ gistRedoPageReuse(XLogReaderState *record)
>       */
>      if (InHotStandby)
>          ResolveRecoveryConflictWithSnapshotFullXid(xlrec->snapshotConflictHorizon,
> +                                                   xlrec->isCatalogRel,
>                                                     xlrec->locator);
>  }
>  
> diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
> index f2dd9be8d3f..e8e06c62a95 100644
> --- a/src/backend/access/hash/hash_xlog.c
> +++ b/src/backend/access/hash/hash_xlog.c
> @@ -1003,6 +1003,7 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
>  
>          XLogRecGetBlockTag(record, 0, &rlocator, NULL, NULL);
>          ResolveRecoveryConflictWithSnapshot(xldata->snapshotConflictHorizon,
> +                                            xldata->isCatalogRel,
>                                              rlocator);
>      }
>  
> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index 8b13e3f8925..f389ceee1ea 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -8769,6 +8769,7 @@ heap_xlog_prune(XLogReaderState *record)
>       */
>      if (InHotStandby)
>          ResolveRecoveryConflictWithSnapshot(xlrec->snapshotConflictHorizon,
> +                                            xlrec->isCatalogRel,
>                                              rlocator);
>  
>      /*
> @@ -8940,6 +8941,7 @@ heap_xlog_visible(XLogReaderState *record)
>       */
>      if (InHotStandby)
>          ResolveRecoveryConflictWithSnapshot(xlrec->snapshotConflictHorizon,
> +                                            xlrec->flags & VISIBILITYMAP_XLOG_CATALOG_REL,
>                                              rlocator);
>  
>      /*
> @@ -9061,6 +9063,7 @@ heap_xlog_freeze_page(XLogReaderState *record)
>  
>          XLogRecGetBlockTag(record, 0, &rlocator, NULL, NULL);
>          ResolveRecoveryConflictWithSnapshot(xlrec->snapshotConflictHorizon,
> +                                            xlrec->isCatalogRel,
>                                              rlocator);
>      }
>  
> diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
> index 414ca4f6deb..c87e46ed66e 100644
> --- a/src/backend/access/nbtree/nbtxlog.c
> +++ b/src/backend/access/nbtree/nbtxlog.c
> @@ -669,6 +669,7 @@ btree_xlog_delete(XLogReaderState *record)
>          XLogRecGetBlockTag(record, 0, &rlocator, NULL, NULL);
>  
>          ResolveRecoveryConflictWithSnapshot(xlrec->snapshotConflictHorizon,
> +                                            xlrec->isCatalogRel,
>                                              rlocator);
>      }
>  
> @@ -1007,6 +1008,7 @@ btree_xlog_reuse_page(XLogReaderState *record)
>  
>      if (InHotStandby)
>          ResolveRecoveryConflictWithSnapshotFullXid(xlrec->snapshotConflictHorizon,
> +                                                   xlrec->isCatalogRel,
>                                                     xlrec->locator);
>  }
>  
> diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
> index b071b59c8ac..459ac929ba5 100644
> --- a/src/backend/access/spgist/spgxlog.c
> +++ b/src/backend/access/spgist/spgxlog.c
> @@ -879,6 +879,7 @@ spgRedoVacuumRedirect(XLogReaderState *record)
>  
>          XLogRecGetBlockTag(record, 0, &locator, NULL, NULL);
>          ResolveRecoveryConflictWithSnapshot(xldata->snapshotConflictHorizon,
> +                                            xldata->isCatalogRel,
>                                              locator);
>      }
>  
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 1485e8f9ca9..5227fc675c8 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -7965,6 +7965,21 @@ xlog_redo(XLogReaderState *record)
>          /* Update our copy of the parameters in pg_control */
>          memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
>  
> +        /*
> +         * Invalidate logical slots if we are in hot standby and the primary
> +         * does not have a WAL level sufficient for logical decoding. No need
> +         * to search for potentially conflicting logically slots if standby is
> +         * running with wal_level lower than logical, because in that case, we
> +         * would have either disallowed creation of logical slots or
> +         * invalidated existing ones.
> +         */
> +        if (InRecovery && InHotStandby &&
> +            xlrec.wal_level < WAL_LEVEL_LOGICAL &&
> +            wal_level >= WAL_LEVEL_LOGICAL)
> +            InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_LEVEL,
> +                                               0, InvalidOid,
> +                                               InvalidTransactionId);
> +
>          LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
>          ControlFile->MaxConnections = xlrec.MaxConnections;
>          ControlFile->max_worker_processes = xlrec.max_worker_processes;
> diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> index c2a9accebf6..1b1b51e21ed 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -1443,7 +1443,13 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
>                                         slotname, restart_lsn,
>                                         oldestLSN, snapshotConflictHorizon);
>  
> -                (void) kill(active_pid, SIGTERM);
> +                if (MyBackendType == B_STARTUP)

Is SendProcSignal() marked warn_unused_result or something? I don't see
other callers that ignore its return value casting it to void.

> +                    (void) SendProcSignal(active_pid,
> +                                          PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT,
> +                                          InvalidBackendId);
> +                else
> +                    (void) kill(active_pid, SIGTERM);
> +
>                  last_signaled_pid = active_pid;
>              }

> diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
> index 9f56b4e95cf..3b5d654347e 100644
> --- a/src/backend/storage/ipc/standby.c
> +++ b/src/backend/storage/ipc/standby.c
> @@ -24,6 +24,7 @@
>  #include "access/xlogutils.h"
>  #include "miscadmin.h"
>  #include "pgstat.h"
> +#include "replication/slot.h"
>  #include "storage/bufmgr.h"
>  #include "storage/lmgr.h"
>  #include "storage/proc.h"
> @@ -466,6 +467,7 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
>   */
>  void
>  ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
> +                                    bool isCatalogRel,
>                                      RelFileLocator locator)
>  {
>      VirtualTransactionId *backends;
> @@ -491,6 +493,16 @@ ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
>                                             PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
>                                             WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
>                                             true);
> +
> +    /*
> +     * Note that WaitExceedsMaxStandbyDelay() is not taken into account here
> +     * (as opposed to ResolveRecoveryConflictWithVirtualXIDs() above). That
> +     * seems OK, given that this kind of conflict should not normally be

do you mean "when using a physical replication slot"?

> +     * reached, e.g. by using a physical replication slot.
> +     */
> +    if (wal_level >= WAL_LEVEL_LOGICAL && isCatalogRel)
> +        InvalidateObsoleteReplicationSlots(RS_INVAL_HORIZON, 0, locator.dbOid,
> +                                           snapshotConflictHorizon);
>  }


0005 LGTM

- Melanie



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-07 18:32:04 -0400, Melanie Plageman wrote:
> Code review only of 0001-0005.
> 
> I noticed you had two 0008, btw.

Yea, sorry for that. One was the older version. Just before sending the patch
I saw another occurrence of a test failure, which I then fixed. In the course
of that I changed something in the patch subject.


> On Fri, Apr 07, 2023 at 11:12:26AM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2023-04-07 08:47:57 -0700, Andres Freund wrote:
> > > Integrated all of these.
> > 
> > From 0e038eb5dfddec500fbf4625775d1fa508a208f6 Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres@anarazel.de>
> > Date: Thu, 6 Apr 2023 20:00:07 -0700
> > Subject: [PATCH va67 1/9] Replace a replication slot's invalidated_at LSN with
> >  an enum
> > 
> > diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
> > index 8872c80cdfe..ebcb637baed 100644
> > --- a/src/include/replication/slot.h
> > +++ b/src/include/replication/slot.h
> > @@ -37,6 +37,17 @@ typedef enum ReplicationSlotPersistency
> >      RS_TEMPORARY
> >  } ReplicationSlotPersistency;
> >  
> > +/*
> > + * Slots can be invalidated, e.g. due to max_slot_wal_keep_size. If so, the
> > + * 'invalidated' field is set to a value other than _NONE.
> > + */
> > +typedef enum ReplicationSlotInvalidationCause
> > +{
> > +    RS_INVAL_NONE,
> > +    /* required WAL has been removed */
> 
> I just wonder if RS_INVAL_WAL is too generic. Something like
> RS_INVAL_WAL_MISSING or similar may be better since it seems there are
> other inavlidation causes that may be related to WAL.

Renamed to RS_INVAL_WAL_REMOVED



> > From 52c25cc15abc4470d19e305d245b9362e6b8d6a3 Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres@anarazel.de>
> > Date: Fri, 7 Apr 2023 09:32:48 -0700
> > Subject: [PATCH va67 3/9] Support invalidating replication slots due to
> >  horizon and wal_level
> > MIME-Version: 1.0
> > Content-Type: text/plain; charset=UTF-8
> > Content-Transfer-Encoding: 8bit
> > 
> > Needed for supporting logical decoding on a standby. The new invalidation
> > methods will be used in a subsequent commit.
> > 
> 
> You probably are aware, but applying 0003 and 0004 both gives me two
> warnings:
> 
> warning: 1 line adds whitespace errors.
> Warning: commit message did not conform to UTF-8.
> You may want to amend it after fixing the message, or set the config
> variable i18n.commitEncoding to the encoding your project uses.

I did see the whitespace error, but not the encoding error. We have a bunch of
other commit messages with Fabrizio's name "properly spelled" in them, so I think
that's ok.



> > diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> > index df23b7ed31e..c2a9accebf6 100644
> > --- a/src/backend/replication/slot.c
> > +++ b/src/backend/replication/slot.c
> > @@ -1241,8 +1241,58 @@ ReplicationSlotReserveWal(void)
> >  }
> >  
> >  /*
> > - * Helper for InvalidateObsoleteReplicationSlots -- acquires the given slot
> > - * and mark it invalid, if necessary and possible.
> > + * Report that replication slot needs to be invalidated
> > + */
> > +static void
> > +ReportSlotInvalidation(ReplicationSlotInvalidationCause cause,
> > +                       bool terminating,
> > +                       int pid,
> > +                       NameData slotname,
> > +                       XLogRecPtr restart_lsn,
> > +                       XLogRecPtr oldestLSN,
> > +                       TransactionId snapshotConflictHorizon)
> > +{
> > +    StringInfoData err_detail;
> > +    bool        hint = false;
> > +
> > +    initStringInfo(&err_detail);
> > +
> > +    switch (cause)
> > +    {
> > +        case RS_INVAL_WAL:
> > +            hint = true;
> > +            appendStringInfo(&err_detail, _("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes."),
> > +                             LSN_FORMAT_ARGS(restart_lsn),
> 
> I'm not sure what the below cast is meant to do. If you are trying to
> protect against overflow/underflow, I think you'd need to cast before
> doing the subtraction.


> > +                             (unsigned long long) (oldestLSN - restart_lsn));
> > +            break;

That's our current way of passing 64bit numbers to format string
functions. It ends up as a 64bit number everywhere, even windows (with its
stupid LLP64 model, where long is only 32 bits).



> > +        case RS_INVAL_HORIZON:
> > +            appendStringInfo(&err_detail, _("The slot conflicted with xid horizon %u."),
> > +                             snapshotConflictHorizon);
> > +            break;
> > +
> > +        case RS_INVAL_WAL_LEVEL:
> > +            appendStringInfo(&err_detail, _("Logical decoding on standby requires wal_level to be at least logical on the primary server"));
> > +            break;
> > +        case RS_INVAL_NONE:
> > +            pg_unreachable();
> > +    }
> 
> This ereport is quite hard to read. Is there any simplification you can
> do of the ternaries without undue duplication?

I tried a bunch, and none really seemed an improvement.


> > @@ -1286,10 +1340,45 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
> >          restart_lsn = s->data.restart_lsn;
> >  
> >          /*
> > -         * If the slot is already invalid or is fresh enough, we don't need to
> > -         * do anything.
> > +         * If the slot is already invalid or is a non conflicting slot, we
> > +         * don't need to do anything.
> >           */
> > -        if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
> > +        if (s->data.invalidated == RS_INVAL_NONE)
> > +        {
> > +            switch (cause)
> > +            {
> > +                case RS_INVAL_WAL:
> > +                    if (s->data.restart_lsn != InvalidXLogRecPtr &&
> > +                        s->data.restart_lsn < oldestLSN)
> > +                        conflict = cause;
> > +                    break;
> 
> Should the below be an error? a physical slot with RS_INVAL_HORIZON
> invalidation cause?

InvalidatePossiblyObsoleteSlot() gets called for all existing slots, so it's
normal for RS_INVAL_HORIZON to encounter a physical slot.


> > +                case RS_INVAL_HORIZON:
> > +                    if (!SlotIsLogical(s))
> > +                        break;
> > +                    /* invalid DB oid signals a shared relation */
> > +                    if (dboid != InvalidOid && dboid != s->data.database)
> > +                        break;
> > +                    if (TransactionIdIsValid(s->effective_xmin) &&
> > +                        TransactionIdPrecedesOrEquals(s->effective_xmin,
> > +                                                      snapshotConflictHorizon))
> > +                        conflict = cause;
> > +                    else if (TransactionIdIsValid(s->effective_catalog_xmin) &&
> > +                             TransactionIdPrecedesOrEquals(s->effective_catalog_xmin,
> > +                                                           snapshotConflictHorizon))
> > +                        conflict = cause;
> > +                    break;
> > +                case RS_INVAL_WAL_LEVEL:
> > +                    if (SlotIsLogical(s))
> > +                        conflict = cause;
> > +                    break;
> 
> All three of default, pg_unreachable(), and break seems a bit like
> overkill. Perhaps remove the break?

I always get nervous about case statements without a break, due to the
fallthrough behaviour of switch/case. So I add it very habitually. I replaced
it with
                case RS_INVAL_NONE:
                    pg_unreachable();
that way we get warnings about unhandled cases too. Not sure why I hadn't done
that.

> > @@ -1390,14 +1476,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
> >              ReplicationSlotMarkDirty();
> >              ReplicationSlotSave();
> >              ReplicationSlotRelease();
> > +            pgstat_drop_replslot(s);
> >  
> > -            ereport(LOG,
> > -                    errmsg("invalidating obsolete replication slot \"%s\"",
> > -                           NameStr(slotname)),
> > -                    errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
> > -                              LSN_FORMAT_ARGS(restart_lsn),
> > -                              (unsigned long long) (oldestLSN - restart_lsn)),
> > -                    errhint("You might need to increase max_slot_wal_keep_size."));
> > +            ReportSlotInvalidation(conflict, false, active_pid,
> > +                                   slotname, restart_lsn,
> > +                                   oldestLSN, snapshotConflictHorizon);
> >  
> >              /* done with this slot for now */
> >              break;
> > @@ -1410,19 +1493,33 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
> >  }
> >  
> >  /*
> > - * Mark any slot that points to an LSN older than the given segment
> > - * as invalid; it requires WAL that's about to be removed.
> > + * Invalidate slots that require resources about to be removed.
> >   *
> >   * Returns true when any slot have got invalidated.
> >   *
> > + * Whether a slot needs to be invalidated depends on the cause. A slot is
> > + * removed if it:
> > + * - RS_INVAL_WAL: requires a LSN older than the given segment
> > + * - RS_INVAL_HORIZON: requires a snapshot <= the given horizon, in the given db
> > +     dboid may be InvalidOid for shared relations
> 
> the comma above reduces readability
> 
> is this what you mean?
> 
> RS_INVAL_HORIZON: requires a snapshot <= the given horizon in the given
> db; dboid may be InvalidOid for shared relations

Yep.


> > @@ -1443,7 +1443,13 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
> >                                         slotname, restart_lsn,
> >                                         oldestLSN, snapshotConflictHorizon);
> >  
> > -                (void) kill(active_pid, SIGTERM);
> > +                if (MyBackendType == B_STARTUP)
> 
> Is SendProcSignal() marked warn_unused_result or something? I don't see
> other callers who don't use its return value void casting it.

I went back and forth about it. I think Bertrand added it. It looks a bit odd to
have it, for the reason you say. It also looks a bit odd to not have it, given
the parallel (void) kill().


> > diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
> > index 9f56b4e95cf..3b5d654347e 100644
> > --- a/src/backend/storage/ipc/standby.c
> > +++ b/src/backend/storage/ipc/standby.c
> > @@ -24,6 +24,7 @@
> >  #include "access/xlogutils.h"
> >  #include "miscadmin.h"
> >  #include "pgstat.h"
> > +#include "replication/slot.h"
> >  #include "storage/bufmgr.h"
> >  #include "storage/lmgr.h"
> >  #include "storage/proc.h"
> > @@ -466,6 +467,7 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
> >   */
> >  void
> >  ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
> > +                                    bool isCatalogRel,
> >                                      RelFileLocator locator)
> >  {
> >      VirtualTransactionId *backends;
> > @@ -491,6 +493,16 @@ ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
> >                                             PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
> >                                             WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
> >                                             true);
> > +
> > +    /*
> > +     * Note that WaitExceedsMaxStandbyDelay() is not taken into account here
> > +     * (as opposed to ResolveRecoveryConflictWithVirtualXIDs() above). That
> > +     * seems OK, given that this kind of conflict should not normally be
> 
> do you mean "when using a physical replication slot"?

> > +     * reached, e.g. by using a physical replication slot.
> > +     */
> > +    if (wal_level >= WAL_LEVEL_LOGICAL && isCatalogRel)
> > +        InvalidateObsoleteReplicationSlots(RS_INVAL_HORIZON, 0, locator.dbOid,
> > +                                           snapshotConflictHorizon);
> >  }

No. I mean that normally a physical replication slot, or some other approach,
should prevent such conflicts. I replaced 'by' with 'due to'.


Thanks a lot for the review!

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Fri, Apr 7, 2023 at 11:42 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-04-07 08:47:57 -0700, Andres Freund wrote:
> > Integrated all of these.
>
> Here's my current version. Changes:
> - Integrated Bertrand's changes
> - polished commit messages of 0001-0003
> - edited code comments for 0003, including
>   InvalidateObsoleteReplicationSlots()'s header
> - added a bump of SLOT_VERSION to 0001
> - moved addition of pg_log_standby_snapshot() to 0007
> - added a catversion bump for pg_log_standby_snapshot()
> - moved all the bits dealing with procsignals from 0003 to 0004, now the split
>   makes sense IMO
> - combined a few more sucessive ->safe_psql() calls
>

The new approach for invalidation looks clean. BTW, I see a minor
inconsistency in the following two error messages (errmsg):

if (MyReplicationSlot->data.invalidated == RS_INVAL_WAL)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("can no longer get changes from replication slot \"%s\"",
NameStr(MyReplicationSlot->data.name)),
errdetail("This slot has been invalidated because it exceeded the
maximum reserved size.")));

if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("cannot read from logical replication slot \"%s\"",
NameStr(MyReplicationSlot->data.name)),
errdetail("This slot has been invalidated because it was conflicting
with recovery.")));

Won't it be better to keep the same errmsg in the above two cases?

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-08 09:15:05 +0530, Amit Kapila wrote:
> The new approach for invalidation looks clean. BTW, I see minor
> inconsistency in the following two error messages (errmsg):

Thanks for checking.


> if (MyReplicationSlot->data.invalidated == RS_INVAL_WAL)
> ereport(ERROR,
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> errmsg("can no longer get changes from replication slot \"%s\"",
> NameStr(MyReplicationSlot->data.name)),
> errdetail("This slot has been invalidated because it exceeded the
> maximum reserved size.")));
> 
> if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
> ereport(ERROR,
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> errmsg("cannot read from logical replication slot \"%s\"",
> NameStr(MyReplicationSlot->data.name)),
> errdetail("This slot has been invalidated because it was conflicting
> with recovery.")));
> 
> Won't it be better to keep the same errmsg in the above two cases?

Probably - do you have a preference? I think the former is a bit better?

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Jonathan S. Katz"
Date:
On 4/8/23 12:01 AM, Andres Freund wrote:
> Hi,
> 
> On 2023-04-08 09:15:05 +0530, Amit Kapila wrote:
>> The new approach for invalidation looks clean. BTW, I see minor
>> inconsistency in the following two error messages (errmsg):
> 
> Thanks for checking.
> 
> 
>> if (MyReplicationSlot->data.invalidated == RS_INVAL_WAL)
>> ereport(ERROR,
>> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> errmsg("can no longer get changes from replication slot \"%s\"",
>> NameStr(MyReplicationSlot->data.name)),
>> errdetail("This slot has been invalidated because it exceeded the
>> maximum reserved size.")));
>>
>> if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
>> ereport(ERROR,
>> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> errmsg("cannot read from logical replication slot \"%s\"",
>> NameStr(MyReplicationSlot->data.name)),
>> errdetail("This slot has been invalidated because it was conflicting
>> with recovery.")));
>>
>> Won't it be better to keep the same errmsg in the above two cases?
> 
> Probably - do you have a preference? I think the former is a bit better?

+1 for the former, though perhaps "receive" instead of "get"?

Jonathan

Attachment

Re: Minimal logical decoding on standbys

From
Amit Kapila
Date:
On Sat, Apr 8, 2023 at 9:31 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-04-08 09:15:05 +0530, Amit Kapila wrote:
> > The new approach for invalidation looks clean. BTW, I see minor
> > inconsistency in the following two error messages (errmsg):
>
> Thanks for checking.
>
>
> > if (MyReplicationSlot->data.invalidated == RS_INVAL_WAL)
> > ereport(ERROR,
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > errmsg("can no longer get changes from replication slot \"%s\"",
> > NameStr(MyReplicationSlot->data.name)),
> > errdetail("This slot has been invalidated because it exceeded the
> > maximum reserved size.")));
> >
> > if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
> > ereport(ERROR,
> > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > errmsg("cannot read from logical replication slot \"%s\"",
> > NameStr(MyReplicationSlot->data.name)),
> > errdetail("This slot has been invalidated because it was conflicting
> > with recovery.")));
> >
> > Won't it be better to keep the same errmsg in the above two cases?
>
> Probably - do you have a preference? I think the former is a bit better?
>

+1 for the former.

--
With Regards,
Amit Kapila.



Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-07 11:12:26 -0700, Andres Freund wrote:
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>confl_active_logicalslot</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of active logical slots in this database that have been
> +       invalidated because they conflict with recovery (note that inactive ones
> +       are also invalidated but do not increment this counter)
> +      </para></entry>
> +     </row>
>      </tbody>
>     </tgroup>
>    </table>

This seems wrong to me. The counter is not for invalidated slots, it's for
recovery conflict interrupts. If phrased that way, the parenthetical would be
unnecessary.

I think something like
       Number of uses of logical slots in this database that have been
       canceled due to old snapshots or a too low <xref linkend="guc-wal-level"/>
       on the primary

would work and fit with the documentation of the other fields? Reads a bit
stilted, but so do several of the other fields...

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Bertrand Drouvot
Date:
Hi,

New wording works for me, thanks! 

Bertrand

On Sat, Apr 8, 2023, 08:26, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2023-04-07 11:12:26 -0700, Andres Freund wrote:
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>confl_active_logicalslot</structfield> <type>bigint</type>
> +      </para>
> +      <para>
> +       Number of active logical slots in this database that have been
> +       invalidated because they conflict with recovery (note that inactive ones
> +       are also invalidated but do not increment this counter)
> +      </para></entry>
> +     </row>
>      </tbody>
>     </tgroup>
>    </table>

This seems wrong to me. The counter is not for invalidated slots, it's for
recovery conflict interrupts. If phrased that way, the parenthetical would be
unnecessary.

I think something like
       Number of uses of logical slots in this database that have been
       canceled due to old snapshots or a too low <xref linkend="guc-wal-level"/>
       on the primary

would work and fit with the documentation of the other fields? Reads a bit
stilted, but so do several of the other fields...

Greetings,

Andres Freund

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-07 14:27:09 -0700, Andres Freund wrote:
> I think I'll push these in a few hours. While this needed more changes than
> I'd like shortly before the freeze, I think they're largely not in very
> interesting bits and pieces - and this feature has been in the works for about
> three eternities, and it is blocking a bunch of highly requested features.
>
> If anybody still has energy, I would appreciate a look at 0001, 0002, the new
> pieces I added, to make what's now 0003 and 0004 cleaner.

Pushed. Thanks all!

I squashed some of the changes. There didn't seem to be a need for a separate
stats commit, or a separate docs commit. Besides that, I did find plenty of
grammar issues, and a bunch of formatting issues.

Let's see what the buildfarm says.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
"Jonathan S. Katz"
Date:
On 4/8/23 5:27 AM, Andres Freund wrote:
> Hi,
> 
> On 2023-04-07 14:27:09 -0700, Andres Freund wrote:
>> I think I'll push these in a few hours. While this needed more changes than
>> I'd like shortly before the freeze, I think they're largely not in very
>> interesting bits and pieces - and this feature has been in the works for about
>> three eternities, and it is blocking a bunch of highly requested features.
>>
>> If anybody still has energy, I would appreciate a look at 0001, 0002, the new
>> pieces I added, to make what's now 0003 and 0004 cleaner.
> 
> Pushed. Thanks all!
> 
> I squashed some of the changes. There didn't seem to be a need for a separate
> stats commit, or a separate docs commit. Besides that, I did find plenty of
> grammar issues, and a bunch of formatting issues.
> 
> Let's see what the buildfarm says.

Thanks to everyone for working on this feature -- this should have a big 
impact on users of logical replication!

While it still needs to get through the beta period etc., this is a big
milestone for what's been a multi-year effort to support this.

Thanks,

Jonathan

Attachment

Re: Minimal logical decoding on standbys

From
Noah Misch
Date:
On Fri, Apr 07, 2023 at 11:12:26AM -0700, Andres Freund wrote:
> --- /dev/null
> +++ b/src/test/recovery/t/035_standby_logical_decoding.pl
> @@ -0,0 +1,720 @@
> +# logical decoding on standby : test logical decoding,
> +# recovery conflict and standby promotion.
...
> +$node_primary->append_conf('postgresql.conf', q{
> +wal_level = 'logical'
> +max_replication_slots = 4
> +max_wal_senders = 4
> +log_min_messages = 'debug2'
> +log_error_verbosity = verbose
> +});

Buildfarm member hoverfly stopped reporting in when this test joined the tree.
It's currently been stuck here for 140 minutes:

===
$ tail -n5 regress_log_035_standby_logical_decoding
[02:57:48.390](0.100s) ok 66 - otherslot on standby not dropped

### Reloading node "standby"
# Running: pg_ctl -D
/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_035_standby_logical_decoding_standby_data/pgdata
reload
server signaled
===

I've posted a tarball of the current logs at
https://drive.google.com/file/d/1JIZ5hSHBsKjEgU5WOGHOqXB7Z_-9XT5u/view?usp=sharing.
The test times out (PG_TEST_TIMEOUT_DEFAULT=5400), and uploading logs then
fails with 413 Request Entity Too Large.  Is the above
log_min_messages='debug2' important?  Removing that may make the logs small
enough to upload normally.



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/11/23 7:36 AM, Noah Misch wrote:
> On Fri, Apr 07, 2023 at 11:12:26AM -0700, Andres Freund wrote:
>> --- /dev/null
>> +++ b/src/test/recovery/t/035_standby_logical_decoding.pl
>> @@ -0,0 +1,720 @@
>> +# logical decoding on standby : test logical decoding,
>> +# recovery conflict and standby promotion.
> ...
>> +$node_primary->append_conf('postgresql.conf', q{
>> +wal_level = 'logical'
>> +max_replication_slots = 4
>> +max_wal_senders = 4
>> +log_min_messages = 'debug2'
>> +log_error_verbosity = verbose
>> +});
> 
> Buildfarm member hoverfly stopped reporting in when this test joined the tree.
> It's currently been stuck here for 140 minutes:
> 

Thanks for the report!

It's looping on:

2023-04-11 02:57:52.516 UTC [62718288:5] 035_standby_logical_decoding.pl LOG:  00000: statement: SELECT restart_lsn IS NOT NULL
                         FROM pg_catalog.pg_replication_slots WHERE slot_name = 'promotion_inactiveslot'

And the reason is that the slot is not being created:

$ grep "CREATE_REPLICATION_SLOT" 035_standby_logical_decoding_standby.log | tail -2
2023-04-11 02:57:47.287 UTC [9241178:15] 035_standby_logical_decoding.pl STATEMENT:  CREATE_REPLICATION_SLOT "otherslot" LOGICAL "test_decoding" ( SNAPSHOT 'nothing')
2023-04-11 02:57:47.622 UTC [9241178:23] 035_standby_logical_decoding.pl STATEMENT:  CREATE_REPLICATION_SLOT "otherslot" LOGICAL "test_decoding" ( SNAPSHOT 'nothing')

Not sure why the slot is not being created.

There is also "replication apply delay" increasing:

2023-04-11 02:57:49.183 UTC [13304488:253] DEBUG:  00000: sendtime 2023-04-11 02:57:49.111363+00 receipttime 2023-04-11 02:57:49.183512+00 replication apply delay 644 ms transfer latency 73 ms
2023-04-11 02:57:49.184 UTC [13304488:259] DEBUG:  00000: sendtime 2023-04-11 02:57:49.183461+00 receipttime 2023-04-11 02:57:49.1842+00 replication apply delay 645 ms transfer latency 1 ms
2023-04-11 02:57:49.221 UTC [13304488:265] DEBUG:  00000: sendtime 2023-04-11 02:57:49.184166+00 receipttime 2023-04-11 02:57:49.221059+00 replication apply delay 682 ms transfer latency 37 ms
2023-04-11 02:57:49.222 UTC [13304488:271] DEBUG:  00000: sendtime 2023-04-11 02:57:49.221003+00 receipttime 2023-04-11 02:57:49.222144+00 replication apply delay 683 ms transfer latency 2 ms
2023-04-11 02:57:49.222 UTC [13304488:277] DEBUG:  00000: sendtime 2023-04-11 02:57:49.222095+00 receipttime 2023-04-11 02:57:49.2228+00 replication apply delay 684 ms transfer latency 1 ms

Noah, I think hoverfly is yours, would it be possible to have access (I'm not an AIX expert though), or could you check whether you see a slot creation hanging and, if so, why?

> ===
> $ tail -n5 regress_log_035_standby_logical_decoding
> [02:57:48.390](0.100s) ok 66 - otherslot on standby not dropped
> 
> ### Reloading node "standby"
> # Running: pg_ctl -D
> /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_035_standby_logical_decoding_standby_data/pgdata
> reload
> server signaled
> ===
> 
> I've posted a tarball of the current logs at
> https://drive.google.com/file/d/1JIZ5hSHBsKjEgU5WOGHOqXB7Z_-9XT5u/view?usp=sharing.
> The test times out (PG_TEST_TIMEOUT_DEFAULT=5400), and uploading logs then
> fails with 413 Request Entity Too Large.  Is the above
> log_min_messages='debug2' important?  Removing that may make the logs small
> enough to upload normally.

I think debug2 might still be useful while investigating this issue (I'll compare a working TAP run with this one).

Regards

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/11/23 10:20 AM, Drouvot, Bertrand wrote:
> Hi,
> 
> On 4/11/23 7:36 AM, Noah Misch wrote:
>> On Fri, Apr 07, 2023 at 11:12:26AM -0700, Andres Freund wrote:
>>> --- /dev/null
>>> +++ b/src/test/recovery/t/035_standby_logical_decoding.pl
>>> @@ -0,0 +1,720 @@
>>> +# logical decoding on standby : test logical decoding,
>>> +# recovery conflict and standby promotion.
>> ...
>>> +$node_primary->append_conf('postgresql.conf', q{
>>> +wal_level = 'logical'
>>> +max_replication_slots = 4
>>> +max_wal_senders = 4
>>> +log_min_messages = 'debug2'
>>> +log_error_verbosity = verbose
>>> +});
>>
>> Buildfarm member hoverfly stopped reporting in when this test joined the tree.
>> It's currently been stuck here for 140 minutes:
>>
> 
> Thanks for the report!
> 
> It's looping on:
> 
> 2023-04-11 02:57:52.516 UTC [62718288:5] 035_standby_logical_decoding.pl LOG:  00000: statement: SELECT restart_lsn IS NOT NULL
>                          FROM pg_catalog.pg_replication_slots WHERE slot_name = 'promotion_inactiveslot'
> 
> And the reason is that the slot is not being created:
> 
> $ grep "CREATE_REPLICATION_SLOT" 035_standby_logical_decoding_standby.log | tail -2
> 2023-04-11 02:57:47.287 UTC [9241178:15] 035_standby_logical_decoding.pl STATEMENT:  CREATE_REPLICATION_SLOT "otherslot" LOGICAL "test_decoding" ( SNAPSHOT 'nothing')
> 2023-04-11 02:57:47.622 UTC [9241178:23] 035_standby_logical_decoding.pl STATEMENT:  CREATE_REPLICATION_SLOT "otherslot" LOGICAL "test_decoding" ( SNAPSHOT 'nothing')
> 
> Not sure why the slot is not being created.
> 
> There is also "replication apply delay" increasing:
> 
> 2023-04-11 02:57:49.183 UTC [13304488:253] DEBUG:  00000: sendtime 2023-04-11 02:57:49.111363+00 receipttime 2023-04-11 02:57:49.183512+00 replication apply delay 644 ms transfer latency 73 ms
> 2023-04-11 02:57:49.184 UTC [13304488:259] DEBUG:  00000: sendtime 2023-04-11 02:57:49.183461+00 receipttime 2023-04-11 02:57:49.1842+00 replication apply delay 645 ms transfer latency 1 ms
> 2023-04-11 02:57:49.221 UTC [13304488:265] DEBUG:  00000: sendtime 2023-04-11 02:57:49.184166+00 receipttime 2023-04-11 02:57:49.221059+00 replication apply delay 682 ms transfer latency 37 ms
> 2023-04-11 02:57:49.222 UTC [13304488:271] DEBUG:  00000: sendtime 2023-04-11 02:57:49.221003+00 receipttime 2023-04-11 02:57:49.222144+00 replication apply delay 683 ms transfer latency 2 ms
> 2023-04-11 02:57:49.222 UTC [13304488:277] DEBUG:  00000: sendtime 2023-04-11 02:57:49.222095+00 receipttime 2023-04-11 02:57:49.2228+00 replication apply delay 684 ms transfer latency 1 ms
> 
> Noah, I think hoverfly is yours, would it be possible to have access (I'm not an AIX expert though), or could you check whether you see a slot creation hanging and, if so, why?
> 

Well, we can see in 035_standby_logical_decoding_standby.log:

2023-04-11 02:57:49.180 UTC [62718258:5] [unknown] FATAL:  3D000: database "testdb" does not exist

While, on the primary:

2023-04-11 02:57:48.505 UTC [62718254:5] 035_standby_logical_decoding.pl LOG:  00000: statement: CREATE DATABASE
testdb

The TAP test is doing:

"
##################################################
# Test standby promotion and logical decoding behavior
# after the standby gets promoted.
##################################################

$node_standby->reload;

$node_primary->psql('postgres', q[CREATE DATABASE testdb]);
$node_primary->safe_psql('testdb', qq[CREATE TABLE decoding_test(x integer, y text);]);

# create the logical slots
create_logical_slots($node_standby, 'promotion_');
"

I think we might want to add:

$node_primary->wait_for_replay_catchup($node_standby);

before calling the slot creation.
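
In context, that would look something like this (just a sketch against the test excerpt above, not necessarily the exact contents of the attached patch):

$node_standby->reload;

$node_primary->psql('postgres', q[CREATE DATABASE testdb]);
$node_primary->safe_psql('testdb', qq[CREATE TABLE decoding_test(x integer, y text);]);

# Make sure the standby has replayed the CREATE DATABASE before we try
# to create logical slots on it.
$node_primary->wait_for_replay_catchup($node_standby);

# create the logical slots
create_logical_slots($node_standby, 'promotion_');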

It's done in the attached, would it be possible to give it a try please?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
"Drouvot, Bertrand"
Date:
Hi,

On 4/11/23 10:55 AM, Drouvot, Bertrand wrote:
> Hi,
> 
> I think we might want to add:
> 
> $node_primary->wait_for_replay_catchup($node_standby);
> 
> before calling the slot creation.
> 
> It's done in the attached, would it be possible to give it a try please?

Actually, let's also wait for the cascading standby to catch up, as done in the attached.
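
Something along these lines (the cascading standby's variable name and the exact call shape are assumptions here, just to illustrate the idea):

# Wait for the standby, and then for the cascading standby, to catch up
# before creating the slots.
$node_primary->wait_for_replay_catchup($node_standby);
$node_standby->wait_for_replay_catchup($node_cascading_standby, $node_primary);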

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment

Re: Minimal logical decoding on standbys

From
Andres Freund
Date:
Hi,

On 2023-04-11 11:04:50 +0200, Drouvot, Bertrand wrote:
> On 4/11/23 10:55 AM, Drouvot, Bertrand wrote:
> > Hi,
> > 
> > I think we might want to add:
> > 
> > $node_primary->wait_for_replay_catchup($node_standby);
> > 
> > before calling the slot creation.
> > 
> > It's done in the attached, would it be possible to give it a try please?
> 
> Actually, let's also wait for the cascading standby to catchup too, like done in the attached.

Pushed. It seems like a clear race in the test, so I didn't think it was worth
waiting to test it on hoverfly first.

I think we should lower the log level, but perhaps wait for a few more cycles
in case there are random failures?

I wonder if we should make fewer connections in poll_query_until, to reduce
verbosity - it's pretty annoying how much that can bloat the log. Perhaps also
introduce some backoff? It's really annoying to have to trawl through all
those logs when there's a problem.

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I think we should lower the log level, but perhaps wait for a few more cycles
> in case there are random failures?

Removing

-log_min_messages = 'debug2'
-log_error_verbosity = verbose

not only reduces 035's log output volume from 1.6MB to 260kB,
but also speeds it up nontrivially: on my machine it takes
about 8.50 sec as of HEAD, but 8.09 sec after silencing the
extra logging.  So I definitely want to see that out of there.

            regards, tom lane



Re: Minimal logical decoding on standbys

From
Noah Misch
Date:
On Tue, Apr 11, 2023 at 01:10:57PM -0700, Andres Freund wrote:
> On 2023-04-11 11:04:50 +0200, Drouvot, Bertrand wrote:
> > On 4/11/23 10:55 AM, Drouvot, Bertrand wrote:
> > > I think we might want to add:
> > > 
> > > $node_primary->wait_for_replay_catchup($node_standby);
> > > 
> > > before calling the slot creation.

> Pushed. It seems like a clear race in the test, so I didn't think it was worth
> waiting to test it on hoverfly first.

We'll see what happens in the next run.

> I think we should lower the log level, but perhaps wait for a few more cycles
> in case there are random failures?

Fine with me.

> I wonder if we should make fewer connections in poll_query_until, to reduce
> verbosity - it's pretty annoying how much that can bloat the log. Perhaps also
> introduce some backoff? It's really annoying to have to trawl through all
> those logs when there's a problem.

Agreed.  My ranked wish list for poll_query_until is:

1. Exponential backoff
2. Closed-loop time control via Time::HiRes or similar, instead of assuming
   that ten loops complete in ~1s.  I've seen the loop take 3x as long as the
   intended timeout.
3. Connect less often than today's once per probe
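
A rough sketch of what 1 and 2 could look like, purely illustrative (this is not the existing poll_query_until code, and the helper below is made up):

use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical polling helper: exponential backoff plus a real deadline,
# instead of assuming that N fixed-length naps add up to the timeout.
sub poll_until
{
	my ($check, $timeout) = @_;    # $check: coderef returning true on success
	my $deadline = time() + $timeout;
	my $delay    = 0.1;            # start probing quickly ...

	while (time() < $deadline)
	{
		return 1 if $check->();
		sleep($delay);
		$delay *= 2;               # ... then back off exponentially
		$delay = 5 if $delay > 5;  # cap the probe interval
	}
	return 0;                      # timed out
}

Item 3 would then mostly be a matter of reusing one connection across probes instead of reconnecting for each one.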