Thread: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

Hi hackers,

Please find attached a patch proposal to $SUBJECT.

Indeed, we have seen occurrences in [1] where some slots were
not invalidated (while we expected vacuum to remove dead rows,
leading to slot invalidation on the standby).

Though we don't have strong evidence that this
was due to transactions holding back the global xmin (as vacuum did
not run in verbose mode), suspicion is high enough, as Tom pointed
out that the test is broken on its face (see [1]).

The proposed patch:

- Set autovacuum = off on the primary (as autovacuum is the usual suspect
for holding back the global xmin).
- Ensure that vacuum is able to remove dead rows before launching
the slot invalidation tests.
- If after 10 attempts the vacuum is still not able to remove the dead
rows, the slot invalidation tests are skipped: that should be pretty
rare, as currently the majority of the tests are green (with only one attempt).
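
To illustrate the retry idea, here is a rough sketch of what it could look like
in the test (a sketch only, not the actual patch; the loop structure and the
string matched against the VACUUM VERBOSE output are assumptions):

# Retry the vacuum a few times until it reports that no dead rows were
# left behind; skip the invalidation tests otherwise.
my $removed = 0;
foreach my $i (1 .. 10)
{
	my ($stdout, $stderr);
	$node_primary->psql('testdb', 'VACUUM (VERBOSE) pg_class;',
		stdout => \$stdout, stderr => \$stderr);
	# VACUUM VERBOSE reports "N are dead but not yet removable"
	if ($stderr =~ /0 are dead but not yet removable/)
	{
		$removed = 1;
		last;
	}
	sleep(1);
}
SKIP:
{
	skip "vacuum could not remove the dead rows", 2 unless $removed;
	# ... run the slot invalidation tests here ...
}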

While at it, the patch also addresses the nitpicks mentioned by Robert in [2].

[1]:
https://www.postgresql.org/message-id/flat/OSZPR01MB6310CFFD7D0DCD60A05DB1C3FD4A9%40OSZPR01MB6310.jpnprd01.prod.outlook.com#71898e088d2a57564a1bd9c41f3e6f36
[2]: https://www.postgresql.org/message-id/CA%2BTgmobHGpU2ZkChgKifGDLaf_%2BmFA7njEpeTjfyNf_msCZYew%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment
Hi,

On 5/30/23 12:34 PM, Drouvot, Bertrand wrote:
> Hi hackers,
> 
> Please find attached a patch proposal to $SUBJECT.
> 
> Indeed, we have seen occurrences in [1] where some slots were
> not invalidated (while we expected vacuum to remove dead rows,
> leading to slot invalidation on the standby).
> 
> Though we don't have strong evidence that this
> was due to transactions holding back the global xmin (as vacuum did
> not run in verbose mode), suspicion is high enough, as Tom pointed
> out that the test is broken on its face (see [1]).
> 
> The proposed patch:
> 
> - Set autovacuum = off on the primary (as autovacuum is the usual suspect
> for holding back the global xmin).
> - Ensure that vacuum is able to remove dead rows before launching
> the slot invalidation tests.
> - If after 10 attempts the vacuum is still not able to remove the dead
> rows, the slot invalidation tests are skipped: that should be pretty
> rare, as currently the majority of the tests are green (with only one attempt).
> 
> While at it, the patch also addresses the nitpicks mentioned by Robert in [2].
> 
> [1]:
https://www.postgresql.org/message-id/flat/OSZPR01MB6310CFFD7D0DCD60A05DB1C3FD4A9%40OSZPR01MB6310.jpnprd01.prod.outlook.com#71898e088d2a57564a1bd9c41f3e6f36
> [2]: https://www.postgresql.org/message-id/CA%2BTgmobHGpU2ZkChgKifGDLaf_%2BmFA7njEpeTjfyNf_msCZYew%40mail.gmail.com
> 

Please find attached V2 that, instead of the above proposal, waits for a new snapshot
that has a newer horizon before doing the vacuum (as proposed by Andres in [1]).

So, V2:

- sets autovacuum = off on the primary (as autovacuum is the usual suspect
for holding back the global xmin).
- waits for a new snapshot that has a newer horizon before doing the vacuum(s).
- addresses the nitpicks mentioned by Robert in [2].

V2 also keeps the verbose mode for the vacuum(s) (as done in V1), as it may help
for further analysis if needed.
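
To make the idea more concrete, here is a rough sketch of what the wait could
look like (names and exact queries are illustrative, not necessarily what the
attached patch does):

# Record the current xmin horizon, run the SQL that creates the dead rows,
# then poll until a newer snapshot has a newer horizon before vacuuming.
my $xid_horizon = $node_primary->safe_psql('testdb',
	"SELECT txid_snapshot_xmin(txid_current_snapshot());");

$node_primary->safe_psql('testdb', 'DROP TABLE conflict_test;');

$node_primary->poll_query_until('testdb',
	"SELECT txid_snapshot_xmin(txid_current_snapshot()) > $xid_horizon;")
  or die "timed out waiting for a newer xmin horizon";

$node_primary->safe_psql('testdb', 'VACUUM (VERBOSE) pg_class;');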

[1]: https://www.postgresql.org/message-id/20230530152426.ensapay7pozh7ozn%40alap3.anarazel.de
[2]: https://www.postgresql.org/message-id/CA%2BTgmobHGpU2ZkChgKifGDLaf_%2BmFA7njEpeTjfyNf_msCZYew%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment
Hello Michael and Bertrand,

I'd also like to note that even with FREEZE added [1], I happened to see
the test failure:
5       #   Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_class'
5       #   at t/035_standby_logical_decoding.pl line 222.
5
5       #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
5       #   at t/035_standby_logical_decoding.pl line 227.

where 035_standby_logical_decoding_primary.log contains:
...
2024-01-09 07:44:26.480 UTC [820142] 035_standby_logical_decoding.pl LOG:  statement: DROP TABLE conflict_test;
2024-01-09 07:44:26.687 UTC [820142] 035_standby_logical_decoding.pl LOG:  statement: VACUUM (VERBOSE, FREEZE) pg_class;
2024-01-09 07:44:26.687 UTC [820142] 035_standby_logical_decoding.pl INFO:  aggressively vacuuming "testdb.pg_catalog.pg_class"
2024-01-09 07:44:27.099 UTC [820143] DEBUG:  autovacuum: processing database "testdb"
2024-01-09 07:44:27.102 UTC [820142] 035_standby_logical_decoding.pl INFO:  finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1
         pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
         tuples: 0 removed, 423 remain, 4 are dead but not yet removable
         removable cutoff: 762, which was 2 XIDs old when operation ended
         new relfrozenxid: 762, which is 2 XIDs ahead of previous value
         frozen: 1 pages from table (9.09% of total) had 1 tuples frozen
....

Thus just adding FREEZE is not enough, seemingly. It makes me wonder whether
0174c2d21 should be superseded by a patch like the one discussed here (or just
have autovacuum = off added)...

09.01.2024 07:59, Michael Paquier wrote:
> Alexander, does the test gain in stability once you begin using the
> patch posted on [2], mentioned by Bertrand?
>
> (Also, perhaps we'd better move the discussion to the other thread
> where the patch has been sent.)
>
> [2]: https://www.postgresql.org/message-id/d40d015f-03a4-1cf2-6c1f-2b9aca860762@gmail.com

09.01.2024 08:29, Bertrand Drouvot wrote:
> Alexander, please find attached v3 which is more or less a rebased version of it.

Bertrand, thank you for updating the patch!

Michael, it definitely increases stability of the test (tens of iterations
with 20 tests in parallel performed successfully), although I've managed to
see another interesting failure (twice):
13      #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
13      #   at t/035_standby_logical_decoding.pl line 227.

psql:<stdin>:1: INFO:  vacuuming "testdb.pg_catalog.pg_class"
psql:<stdin>:1: INFO:  finished vacuuming "testdb.pg_catalog.pg_class": index scans: 1
pages: 0 removed, 11 remain, 11 scanned (100.00% of total)
tuples: 4 removed, 419 remain, 0 are dead but not yet removable
removable cutoff: 754, which was 0 XIDs old when operation ended
...
Waiting for replication conn standby's replay_lsn to pass 0/403E6F8 on primary

But I see no VACUUM records in WAL:
rmgr: Transaction len (rec/tot):    222/   222, tx:          0, lsn: 0/0403E468, prev 0/0403E370, desc: INVALIDATION ; inval msgs: catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 relcache 2662 relcache 2663 relcache 3455 relcache 1259
rmgr: Standby     len (rec/tot):    234/   234, tx:          0, lsn: 0/0403E548, prev 0/0403E468, desc: INVALIDATIONS ; relcache init file inval dbid 16384 tsid 1663; inval msgs: catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 relcache 2662 relcache 2663 relcache 3455 relcache 1259
rmgr: Heap        len (rec/tot):     60/   140, tx:        754, lsn: 0/0403E638, prev 0/0403E548, desc: INSERT off: 2, flags: 0x08, blkref #0: rel 1663/16384/16385 blk 0 FPW
rmgr: Transaction len (rec/tot):     46/    46, tx:        754, lsn: 0/0403E6C8, prev 0/0403E638, desc: COMMIT 2024-01-09 13:40:59.873385 UTC
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/0403E6F8, prev 0/0403E6C8, desc: RUNNING_XACTS nextXid 755 latestCompletedXid 754 oldestRunningXid 755
rmgr: XLOG        len (rec/tot):     30/    30, tx:          0, lsn: 0/0403E730, prev 0/0403E6F8, desc: CHECKPOINT_REDO
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/0403E750, prev 0/0403E730, desc: RUNNING_XACTS nextXid 755 latestCompletedXid 754 oldestRunningXid 755
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/0403E788, prev 0/0403E750, desc: CHECKPOINT_ONLINE redo 0/403E730; tli 1; prev tli 1; fpw true; xid 0:755; oid 24576; multi 1; offset 0; oldest xid 728 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 755; online
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/0403E800, prev 0/0403E788, desc: RUNNING_XACTS nextXid 755 latestCompletedXid 754 oldestRunningXid 755

(Full logs are attached.)

[1] https://www.postgresql.org/message-id/4fd52508-54d7-0202-5bd3-546c2295967f%40gmail.com

Best regards,
Alexander
Attachment
On Tue, Jan 09, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
> Thus just adding FREEZE is not enough, seemingly. It makes me wonder if
> 0174c2d21 should be superseded by a patch like discussed (or just have
> autovacuum = off added)...

Adding an extra FREEZE offers extra insurance, so I don't see why
it would be a problem to add it to stabilize the horizon conflicts on
the standbys.

> 09.01.2024 07:59, Michael Paquier wrote:
> Bertrand, thank you for updating the patch!
>
> Michael, it definitely increases stability of the test (tens of iterations
> with 20 tests in parallel performed successfully), although I've managed to
> see another interesting failure (twice):
> 13      #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
> 13      #   at t/035_standby_logical_decoding.pl line 227.

Something I'd like to confirm here: you still see this failure with
the patch, but without an extra FREEZE, right?  If we do both, the
test would get more stable, wouldn't it?
--
Michael

Attachment
Hi,

On Tue, Jan 09, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
> Michael, it definitely increases stability of the test (tens of iterations
> with 20 tests in parallel performed successfully),

Thanks for testing!

> although I've managed to
> see another interesting failure (twice):
> 13      #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
> 13      #   at t/035_standby_logical_decoding.pl line 227.
> 

Looking at the attached log files and particularly 1/regress_log_035_standby_logical_decoding:

"
[11:05:28.118](13.993s) ok 24 - inactiveslot slot invalidation is logged with vacuum on pg_class
[11:05:28.119](0.001s) not ok 25 - activeslot slot invalidation is logged with vacuum on pg_class
"

That seems weird: the inactive slot has been invalidated while the active one is not.
While it takes a bit longer to invalidate an active slot, I don't think the test can
move on until both are invalidated (then leading to tests 24 and 25). I can
see the tests are very slow to run (13.993s for 24) but I still don't get how 24 could
succeed while 25 does not.

Looking at 2/regress_log_035_standby_logical_decoding:

"
[13:41:02.076](20.279s) ok 23 - inactiveslot slot invalidation is logged with vacuum on pg_class
[13:41:02.076](0.000s) not ok 24 - activeslot slot invalidation is logged with vacuum on pg_class
"

Same "weird" behavior, but this time the test numbering is not the same (23 and 24).
That is even more weird, as those should be tests 24 and 25.

Would it be possible to also send the standby logs?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Hi,

10.01.2024 07:26, Michael Paquier wrote:
> On Tue, Jan 09, 2024 at 08:00:00PM +0300, Alexander Lakhin wrote:
>> Thus just adding FREEZE is not enough, seemingly. It makes me wonder if
>> 0174c2d21 should be superseded by a patch like discussed (or just have
>> autovacuum = off added)...
> Adding an extra FREEZE offers an extra insurance, so I don't see why
> it would be a problem to add it to stabilize the horizon conflicts on
> the standbys.

As 0174c2d21 added FREEZE already, I meant to add "autovacuum = off" or
apply a fix similar to what we are discussing here.

>
>> Michael, it definitely increases stability of the test (tens of iterations
>> with 20 tests in parallel performed successfully), although I've managed to
>> see another interesting failure (twice):
>> 13      #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
>> 13      #   at t/035_standby_logical_decoding.pl line 227.
> Something I'd like to confirm here: you still see this failure with
> the patch, but without an extra FREEZE, right?  If we do both, the
> test would get more stable, wouldn't it?

Yes, I tested the patch as-is, without FREEZE, but it looks like it doesn't
matter in that case. And sorry for the misleading information about missing
VACUUM records in my previous message; please ignore it.

10.01.2024 12:46, Bertrand Drouvot wrote:

>> although I've managed to
>> see another interesting failure (twice):
>> 13      #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
>> 13      #   at t/035_standby_logical_decoding.pl line 227.
>>
> Looking at the attached log files and particularly 1/regress_log_035_standby_logical_decoding:
>
> "
> [11:05:28.118](13.993s) ok 24 - inactiveslot slot invalidation is logged with vacuum on pg_class
> [11:05:28.119](0.001s) not ok 25 - activeslot slot invalidation is logged with vacuum on pg_class
> "
>
> That seems weird: the inactive slot has been invalidated while the active one is not.
> While it takes a bit longer to invalidate an active slot, I don't think the test can
> move on until both are invalidated (then leading to tests 24 and 25). I can
> see the tests are very slow to run (13.993s for 24) but I still don't get how 24 could
> succeed while 25 does not.
>
> ...
>
> Would it be possible to also send the standby logs?

Yes, please look at the attached logs. This time I've built postgres with
-DWAL_DEBUG and ran tests with TEMP_CONFIG as below:
wal_keep_size=1GB
wal_debug = on
log_autovacuum_min_duration = 0
log_statement = 'all'
log_min_messages = INFO

The archive attached contains logs from four runs:
recovery-1-ok -- an example of successful run for reference
recovery-7-pruning and recovery-19-pruning -- failures with a failed
  subtest 'activeslot slot invalidation is logged with on-access pruning'
recovery-15-vacuum_pg_class -- a failure with a failed
  subtest 'activeslot slot invalidation is logged with vacuum on pg_class'

The distinction that I see in the failed run logs, for example in
recovery-15-vacuum_pg_class, 035_standby_logical_decoding_standby.log:
2024-01-10 11:00:18.700 UTC [789618] LOG:  REDO @ 0/4020220; LSN 0/4020250: prev 0/401FDE0; xid 753; len 20 - Transaction/COMMIT: 2024-01-10 11:00:18.471694+00
2024-01-10 11:00:18.797 UTC [789618] LOG:  REDO @ 0/4020250; LSN 0/4020288: prev 0/4020220; xid 0; len 24 - Standby/RUNNING_XACTS: nextXid 754 latestCompletedXid 753 oldestRunningXid 754
2024-01-10 11:00:19.013 UTC [789618] LOG:  REDO @ 0/4020288; LSN 0/40202C8: prev 0/4020250; xid 0; len 9; blkref #0: rel 1663/16384/2610, blk 0 - Heap2/PRUNE: snapshotConflictHorizon: 752, nredirected: 0, ndead: 1, isCatalogRel: T, nunused: 0, redirected: [], dead: [48], unused: []
2024-01-10 11:00:19.111 UTC [789618] LOG:  invalidating obsolete replication slot "row_removal_inactiveslot"
2024-01-10 11:00:19.111 UTC [789618] DETAIL:  The slot conflicted with xid horizon 752.
2024-01-10 11:00:19.111 UTC [789618] CONTEXT:  WAL redo at 0/4020288 for Heap2/PRUNE: snapshotConflictHorizon: 752, nredirected: 0, ndead: 1, isCatalogRel: T, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel 1663/16384/2610, blk 0
2024-01-10 11:00:29.109 UTC [789618] LOG:  terminating process 790377 to release replication slot "row_removal_activeslot"
2024-01-10 11:00:29.109 UTC [789618] DETAIL:  The slot conflicted with xid horizon 752.
2024-01-10 11:00:29.109 UTC [789618] CONTEXT:  WAL redo at 0/4020288 for Heap2/PRUNE: snapshotConflictHorizon: 752, nredirected: 0, ndead: 1, isCatalogRel: T, nunused: 0, redirected: [], dead: [48], unused: []; blkref #0: rel 1663/16384/2610, blk 0
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl ERROR:  canceling statement due to conflict with recovery
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl DETAIL:  User was using a logical replication slot that must be invalidated.
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl STATEMENT:  START_REPLICATION SLOT "row_removal_activeslot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')
2024-01-10 11:00:30.144 UTC [790377] 035_standby_logical_decoding.pl LOG:  released logical replication slot "row_removal_activeslot"

is an absent message 'obsolete replication slot "row_removal_activeslot"'
and an additional record 'Standby/RUNNING_XACTS', which can be found in
035_standby_logical_decoding_primary.log:
2024-01-10 11:00:18.515 UTC [783410] LOG:  xlog bg flush request write 0/4020250; flush: 0/4020250, current is write 0/4020220; flush 0/4020220
2024-01-10 11:00:18.646 UTC [783387] LOG:  INSERT @ 0/4020288:  - Standby/RUNNING_XACTS: nextXid 754 latestCompletedXid 753 oldestRunningXid 754
2024-01-10 11:00:18.702 UTC [790526] 035_standby_logical_decoding.pl LOG:  statement: SELECT (select txid_snapshot_xmin(txid_current_snapshot()) - 753) > 0
2024-01-10 11:00:18.724 UTC [783410] LOG:  xlog bg flush request write 0/4020288; flush: 0/4020288, current is write 0/4020250; flush 0/4020250

So perhaps it can affect an active slot invalidation?

Best regards,
Alexander
Attachment
Hi,

On Wed, Jan 10, 2024 at 05:00:01PM +0300, Alexander Lakhin wrote:
> 10.01.2024 12:46, Bertrand Drouvot wrote:
> 
> > Would it be possible to also send the standby logs?
> 
> Yes, please look at the attached logs. This time I've build postgres with
> -DWAL_DEBUG and ran tests with TEMP_CONFIG as below:
> wal_keep_size=1GB
> wal_debug = on
> log_autovacuum_min_duration = 0
> log_statement = 'all'
> log_min_messages = INFO
> 
> The archive attached contains logs from four runs:
> recovery-1-ok -- an example of successful run for reference
> recovery-7-pruning and recovery-19-pruning -- failures with a failed
>  subtest 'activeslot slot invalidation is logged with on-access pruning'
> recovery-15-vacuum_pg_class -- a failure with a failed
>  subtest 'activeslot slot invalidation is logged with vacuum on pg_class'

Thanks a lot for the testing!

> is an absent message 'obsolete replication slot "row_removal_activeslot"'

Looking at recovery-15-vacuum_pg_class/035_standby_logical_decoding_standby.log here:

Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot().

In the case of an active slot we first call ReportSlotInvalidation() with the second
parameter set to true (to emit the "terminating" message), then SIGTERM the active
process and then (later) we should call ReportSlotInvalidation() again with the
second parameter set to false (to emit the message that we don't see here).

So it means InvalidatePossiblyObsoleteSlot() did not trigger the second
ReportSlotInvalidation() call.

The thing is that it looks like we exited the loop in InvalidatePossiblyObsoleteSlot(),
because there are more messages from the startup process (789618) after the

"
2024-01-10 11:00:29.109 UTC [789618] LOG:  terminating process 790377 to release replication slot "row_removal_activeslot"
"

one.

Do you think you could try to add more debug messages in InvalidatePossiblyObsoleteSlot()
to understand why the second call to ReportSlotInvalidation() is not done and IIUC
why/how we exited the loop?

FWIW, I did try to reproduce by launching pg_recvlogical and then sending it
SIGSTOP. Then, producing a conflict, I'm able to get the first message and not the
second one (which is expected). But the startup process does not exit the loop,
which is expected here.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Hi Bertrand,

10.01.2024 19:32, Bertrand Drouvot wrote:
>
>> is an absent message 'obsolete replication slot "row_removal_activeslot"'
> Looking at recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log here:
>
> Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot().
>
> In case of an active slot we first call ReportSlotInvalidation with the second
> parameter set to true (to emit the "terminating" message), then SIGTERM the active
> process and then (later) we should call the other ReportSlotInvalidation()
> call with the second parameter set to false (to emit the message that we don't
> see here).
>
> So it means InvalidatePossiblyObsoleteSlot() did not trigger the second ReportSlotInvalidation()
> call.

I've found a way to reproduce the issue without slowing down a machine or
running multiple tests in parallel. It's enough to add a delay that allows
BackgroundWriterMain() to execute LogStandbySnapshot():
@@ -692,2 +690,3 @@
  $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
+$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(15);]);
  $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);

With this delay, I get the failure immediately:
$ PROVE_TESTS="t/035*" TEMP_CONFIG=/tmp/extra.config make check -s -C src/test/recovery
# +++ tap check in src/test/recovery +++
t/035_standby_logical_decoding.pl .. 47/?
#   Failed test 'activeslot slot invalidation is logged with on-access pruning'
#   at t/035_standby_logical_decoding.pl line 227.

_primary.log contains:
2024-01-11 09:37:01.731 UTC [67656] 035_standby_logical_decoding.pl STATEMENT:  UPDATE prun SET s = 'D';
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl LOG:  statement: SELECT pg_sleep(15);
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl LOG:  xlog flush request 0/404D8F0; write 0/0; flush 0/0 at character 8
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl CONTEXT:  writing block 14 of relation base/16384/1247
2024-01-11 09:37:01.738 UTC [67664] 035_standby_logical_decoding.pl STATEMENT:  SELECT pg_sleep(15);
2024-01-11 09:37:01.905 UTC [67204] LOG:  xlog flush request 0/404DA58; write 0/404BB00; flush 0/404BB00
2024-01-11 09:37:01.905 UTC [67204] CONTEXT:  writing block 4 of relation base/16384/2673
2024-01-11 09:37:12.514 UTC [67204] LOG:  INSERT @ 0/4057768:  - Standby/RUNNING_XACTS: nextXid 769 latestCompletedXid 768 oldestRunningXid 769
2024-01-11 09:37:12.514 UTC [67206] LOG:  xlog bg flush request write 0/4057768; flush: 0/4057768, current is write 0/4057730; flush 0/4057730
2024-01-11 09:37:16.760 UTC [67712] 035_standby_logical_decoding.pl LOG:  statement: UPDATE prun SET s = 'E';
2024-01-11 09:37:16.760 UTC [67712] 035_standby_logical_decoding.pl LOG:  INSERT @ 0/40577A8:  - Heap2/PRUNE: snapshotConflictHorizon: 768,...

Note RUNNING_XACTS here...

_standby.log contains:
2024-01-11 09:37:16.842 UTC [67606] LOG:  invalidating obsolete replication slot "pruning_inactiveslot"
2024-01-11 09:37:16.842 UTC [67606] DETAIL:  The slot conflicted with xid horizon 768.
2024-01-11 09:37:16.842 UTC [67606] CONTEXT:  WAL redo at 0/4057768 for Heap2/PRUNE: snapshotConflictHorizon: 768, ...
and no 'invalidating obsolete replication slot "pruning_activeslot"' below.

Debug logging added (see attached) gives more information:
2024-01-11 09:37:16.842 UTC [67606] LOG:  invalidating obsolete replication slot "pruning_inactiveslot"
2024-01-11 09:37:16.842 UTC [67606] DETAIL:  The slot conflicted with xid horizon 768.
...
2024-01-11 09:37:16.842 UTC [67606] LOG: !!!InvalidatePossiblyObsoleteSlot| RS_INVAL_HORIZON, s: 0x7f7985475c10, s->effective_xmin: 0, s->effective_catalog_xmin: 769, snapshotConflictHorizon: 768
...
2024-01-11 09:37:16.842 UTC [67606] LOG: !!!InvalidatePossiblyObsoleteSlot| conflict: 0

so the condition TransactionIdPrecedesOrEquals(s->effective_catalog_xmin, snapshotConflictHorizon)
is not satisfied, hence conflict = 0 and it breaks the loop in InvalidatePossiblyObsoleteSlot().
Several lines above in the log we can see:
2024-01-11 09:37:12.514 UTC [67606] LOG:  REDO @ 0/4057730; LSN 0/4057768: prev 0/4057700; xid 0; len 24 - Standby/RUNNING_XACTS: nextXid 769 latestCompletedXid 768 oldestRunningXid 769
2024-01-11 09:37:12.540 UTC [67643] 035_standby_logical_decoding.pl LOG:  !!!LogicalConfirmReceivedLocation| MyReplicationSlot: 0x7f7985475c10, MyReplicationSlot->effective_catalog_xmin: 769

and that's the first occurrence of xid 769 in the log.

The decoded stack trace for the LogicalConfirmReceivedLocation call is:
LogicalConfirmReceivedLocation at logical.c:1886:1
ProcessStandbyReplyMessage at walsender.c:2327:1
ProcessStandbyMessage at walsender.c:2188:1
ProcessRepliesIfAny at walsender.c:2121:5
WalSndWaitForWal at walsender.c:1735:7
logical_read_xlog_page at walsender.c:1068:13
ReadPageInternal at xlogreader.c:1062:12
XLogDecodeNextRecord at xlogreader.c:601:5
XLogReadAhead at xlogreader.c:976:5
XLogReadRecord at xlogreader.c:406:3
XLogSendLogical at walsender.c:3229:5
WalSndLoop at walsender.c:2658:7
StartLogicalReplication at walsender.c:1477:2
exec_replication_command at walsender.c:1985:6
PostgresMain at postgres.c:4649:10

Best regards,
Alexander
Attachment
Hi,

On Thu, Jan 11, 2024 at 01:00:00PM +0300, Alexander Lakhin wrote:
> Hi Bertrand,
> 
> 10.01.2024 19:32, Bertrand Drouvot wrote:
> > 
> > > is an absent message 'obsolete replication slot "row_removal_activeslot"'
> > Looking at recovery-15-vacuum_pg_class/i035_standby_logical_decoding_standby.log here:
> > 
> > Yeah, the missing message has to come from InvalidatePossiblyObsoleteSlot().
> > 
> > In case of an active slot we first call ReportSlotInvalidation with the second
> > parameter set to true (to emit the "terminating" message), then SIGTERM the active
> > process and then (later) we should call the other ReportSlotInvalidation()
> > call with the second parameter set to false (to emit the message that we don't
> > see here).
> > 
> > So it means InvalidatePossiblyObsoleteSlot() did not trigger the second ReportSlotInvalidation()
> > call.
> 
> I've found a way to reproduce the issue without slowing down a machine or
> running multiple tests in parallel. It's enough for this to add a delay to
> allow BackgroundWriterMain() to execute LogStandbySnapshot():
> @@ -692,2 +690,3 @@
>  $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'D';]);
> +$node_primary->safe_psql('testdb', qq[SELECT pg_sleep(15);]);
>  $node_primary->safe_psql('testdb', qq[UPDATE prun SET s = 'E';]);
> 
> With this delay, I get the failure immediately:
> $ PROVE_TESTS="t/035*" TEMP_CONFIG=/tmp/extra.config make check -s -C src/test/recovery
> # +++ tap check in src/test/recovery +++
> t/035_standby_logical_decoding.pl .. 47/?
> #   Failed test 'activeslot slot invalidation is logged with on-access pruning'
> #   at t/035_standby_logical_decoding.pl line 227.

Thanks a lot for the testing!

So I think we have 2 issues here:

1) The one you're mentioning above related to the on-access pruning test:

I think the engine behavior is right here and that the test is racy. I'm
proposing to bypass the active slot invalidation check for this particular test (
as I don't see any "easy" way to take care of this race condition). The active slot
invalidation is already well covered in the other tests anyway. 

I'm proposing the attached v4-0001-Fix-035_standby_logical_decoding.pl-race-conditio.patch
for it.

2) The fact that sometimes we're getting a termination message which is not followed
by an obsolete one (as discussed in [1]).

For this one, I think that InvalidatePossiblyObsoleteSlot() is racy:

In case of an active slot we proceed in 2 steps:
 - terminate the backend holding the slot
 - report the slot as obsolete

This is racy because between the two we release the mutex on the slot, which
means the effective_xmin and effective_catalog_xmin could advance during that time.

I'm proposing the attached v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
for it.

Would it be possible to re-launch your repro (the slow one, not the pg_sleep() one)
with both patches applied and see how it goes? (Please note that v4 replaces the v3
that you're already using in your tests.)

If it helps, I'll propose v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
into a dedicated hackers thread.

[1]: https://www.postgresql.org/message-id/ZZ7GpII4bAYN%2BjT5%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
11.01.2024 17:58, Bertrand Drouvot wrote:
> So I think we have 2 issues here:
>
> 1) The one you're mentioning above related to the on-access pruning test:
>
> I think the engine behavior is right here and that the test is racy. I'm
> proposing to bypass the active slot invalidation check for this particular test (
> as I don't see any "easy" way to take care of this race condition). The active slot
> invalidation is already well covered in the other tests anyway.
>
> I'm proposing the attached v4-0001-Fix-035_standby_logical_decoding.pl-race-conditio.patch
> for it.
>
> 2) The fact that sometimes we're getting a termination message which is not followed
> by an obsolete one (as discussed in [1]).
>
> For this one, I think that InvalidatePossiblyObsoleteSlot() is racy:
>
> In case of an active slot we proceed in 2 steps:
>   - terminate the backend holding the slot
>   - report the slot as obsolete
>
> This is racy because between the two we release the mutex on the slot, which
> means the effective_xmin and effective_catalog_xmin could advance during that time.
>
> I'm proposing the attached v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
> for it.
>
> Would it be possible to re-launch your repro (the slow one, not the pg_sleep() one)
> with both patches applied and see how it goes? (Please note that v4 replaces the v3
> that you're already using in your tests.)
>
> If it helps, I'll propose v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
> into a dedicated hackers thread.
>
> [1]: https://www.postgresql.org/message-id/ZZ7GpII4bAYN%2BjT5%40ip-10-97-1-34.eu-west-3.compute.internal

Bertrand, I've relaunched tests in the same slowed down VM with both
patches applied (but with no other modifications) and got a failure
with pg_class, similar to what we had seen before:
9       #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
9       #   at t/035_standby_logical_decoding.pl line 230.

Please look at the logs attached (I see there Standby/RUNNING_XACTS near
'invalidating obsolete replication slot "row_removal_inactiveslot"').

Best regards,
Alexander
Attachment
On Thu, Jan 11, 2024 at 11:00:01PM +0300, Alexander Lakhin wrote:
> Bertrand, I've relaunched tests in the same slowed down VM with both
> patches applied (but with no other modifications) and got a failure
> with pg_class, similar to what we had seen before:
> 9       #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
> 9       #   at t/035_standby_logical_decoding.pl line 230.
>
> Please look at the logs attached (I see there Standby/RUNNING_XACTS near
> 'invalidating obsolete replication slot "row_removal_inactiveslot"').

Standby/RUNNING_XACTS is exactly why 039_end_of_wal.pl uses wal_level
= minimal: these lead to unpredictable records being inserted,
impacting the reliability of the tests.  We cannot do that here,
obviously.  That may be a long shot, but could it be possible to tweak
the test with retry logic, retrying things if such a standby
snapshot is found, because we know that the invalidation is not going
to work anyway?
--
Michael

Attachment
Hi,

On Fri, Jan 12, 2024 at 07:01:55AM +0900, Michael Paquier wrote:
> On Thu, Jan 11, 2024 at 11:00:01PM +0300, Alexander Lakhin wrote:
> > Bertrand, I've relaunched tests in the same slowed down VM with both
> > patches applied (but with no other modifications) and got a failure
> > with pg_class, similar to what we had seen before:
> > 9       #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
> > 9       #   at t/035_standby_logical_decoding.pl line 230.
> > 
> > Please look at the logs attached (I see there Standby/RUNNING_XACTS near
> > 'invalidating obsolete replication slot "row_removal_inactiveslot"').

Thanks! 

For this one, the "good" news is that it looks like we don't see the
"terminating" message not followed by an "obsolete" message (so the engine
behaves correctly) anymore.

There is simply nothing related to the row_removal_activeslot at all (the catalog_xmin
advanced and there is no conflict).

And I agree that this is due to the Standby/RUNNING_XACTS that is "advancing" the
catalog_xmin of the active slot.

> Standby/RUNNING_XACTS is exactly why 039_end_of_wal.pl uses wal_level
> = minimal, because these lead to unpredictible records inserted,
> impacting the reliability of the tests.  We cannot do that here,
> obviously.  That may be a long shot, but could it be possible to tweak
> the test with a retry logic, retrying things if such a standby
> snapshot is found because we know that the invalidation is not going
> to work anyway?

I think it all depends on what the xl_running_xacts record contains (that is, whether
or not it "advances" the catalog_xmin in our case).

In our case it does advance it (should it occur) due to the "select txid_current()"
that is done in wait_until_vacuum_can_remove() in 035_standby_logical_decoding.pl.

I suggest making use of txid_current_snapshot() instead (which does not produce
a Transaction/COMMIT WAL record, as opposed to txid_current()).

I think that it could be "enough" for our case here, and it's what the attached v5 is
now doing.

Let's give v5 a try? (please apply v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
too).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
Hi,

12.01.2024 10:15, Bertrand Drouvot wrote:
>
> For this one, the "good" news is that it looks like that we don’t see the
> "terminating" message not followed by an "obsolete" message (so the engine
> behaves correctly) anymore.
>
> There is simply nothing related to the row_removal_activeslot at all (the catalog_xmin
> advanced and there is no conflict).

Yes, judging from all the failures that we see now, it looks like the
0001-Fix-race-condition...patch works as expected.

> And I agree that this is due to the Standby/RUNNING_XACTS that is "advancing" the
> catalog_xmin of the active slot.
>
>> Standby/RUNNING_XACTS is exactly why 039_end_of_wal.pl uses wal_level
>> = minimal, because these lead to unpredictible records inserted,
>> impacting the reliability of the tests.  We cannot do that here,
>> obviously.  That may be a long shot, but could it be possible to tweak
>> the test with a retry logic, retrying things if such a standby
>> snapshot is found because we know that the invalidation is not going
>> to work anyway?
> I think it all depends what the xl_running_xacts does contain (means does it
> "advance" or not the catalog_xmin in our case).
>
> In our case it does advance it (should it occurs) due to the "select txid_current()"
> that is done in wait_until_vacuum_can_remove() in 035_standby_logical_decoding.pl.
>
> I suggest to make use of txid_current_snapshot() instead (that does not produce
> a Transaction/COMMIT wal record, as opposed to txid_current()).
>
> I think that it could be "enough" for our case here, and it's what v5 attached is
> now doing.
>
> Let's give v5 a try? (please apply v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
> too).

Unfortunately, I've got the failure again (please see logs attached).
(_primary.log can confirm that I have used exactly v5 — I see no
txid_current() calls there...)

Best regards,
Alexander
Attachment
Hi,

On Fri, Jan 12, 2024 at 02:00:01PM +0300, Alexander Lakhin wrote:
> Hi,
> 
> 12.01.2024 10:15, Bertrand Drouvot wrote:
> > 
> > For this one, the "good" news is that it looks like that we don’t see the
> > "terminating" message not followed by an "obsolete" message (so the engine
> > behaves correctly) anymore.
> > 
> > There is simply nothing related to the row_removal_activeslot at all (the catalog_xmin
> > advanced and there is no conflict).
> 
> Yes, judging from all the failures that we see now, it looks like the
> 0001-Fix-race-condition...patch works as expected.

Yeah, thanks for confirming, I'll create a dedicated hackers thread for that one.

> > Let's give v5 a try? (please apply v1-0001-Fix-race-condition-in-InvalidatePossiblyObsoleteS.patch
> > too).
> 
> Unfortunately, I've got the failure again (please see logs attached).
> (_primary.log can confirm that I have used exactly v5 — I see no
> txid_current() calls there...)

Okay ;-( Thanks for the testing. Then I can think of:

1) Michael's proposal up-thread (that is, tweak the test with retry logic, retrying
things if such a standby snapshot is found).

2) Don't report a test error for an active slot in case its catalog_xmin advanced.

I'd vote for 2) as:

- this is a corner case and the vast majority of the animals don't report any
issues (meaning the active slot conflict detection is already well covered).

- even on the same animal it should be pretty rare for the active slot conflict
detection to end up not covered at all (and the "failing" scenario would probably
be moving over time).

- It may be possible that 1) ends up failing (as we'd need to put a limit on the
retry logic anyhow).

What do you think?
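
For illustration, option 2) could look something like this in the test (purely a
sketch; $conflict_horizon and the surrounding structure are hypothetical):

# On the standby, check whether the active slot's catalog_xmin moved past the
# conflict horizon; if so, skip the active slot invalidation checks instead of
# reporting a failure.
my $catalog_xmin = $node_standby->safe_psql('postgres',
	"SELECT catalog_xmin::text::bigint FROM pg_replication_slots
	 WHERE slot_name = 'row_removal_activeslot';");

if ($catalog_xmin > $conflict_horizon)
{
	note "catalog_xmin advanced past the conflict horizon, skipping active slot checks";
}
else
{
	# ... usual active slot invalidation checks ...
}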

And BTW, looking closely at wait_until_vacuum_can_remove(), I'm not sure it's
fully correct, so I'll give it another look.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Fri, Jan 12, 2024 at 01:46:08PM +0000, Bertrand Drouvot wrote:
> 1) Michael's proposal up-thread (means tweak the test with a retry logic, retrying
> things if such a standby snapshot is found).
>
> 2) Don't report a test error for active slots in case its catalog_xmin advanced.
>
> I'd vote for 2) as:
>
> - this is a corner case and the vast majority of the animals don't report any
> issues (means the active slot conflict detection is already well covered).
>
> - even on the same animal it should be pretty rare to not have an active slot
> conflict detection not covered at all (and the "failing" one would be probably
> moving over time).
>
> - It may be possible that 1) ends up failing (as we'd need to put a limit on the
> retry logic anyhow).
>
> What do you think?
>
> And BTW, looking closely at wait_until_vacuum_can_remove(), I'm not sure it's
> fully correct, so I'll give it another look.

The WAL records related to standby snapshots are playing a lot with
the randomness of the failures we are seeing.  Alexander has mentioned
offlist something else: using SIGSTOP on the bgwriter to avoid these
records and make the test more stable.  That would not be workable for
Windows, but I could live with that, knowing that logical decoding for
standbys has no platform-specific tweak for the code paths we're
testing here; the limitation would be that the test is skipped for
$windows_os.

While thinking about that, a second idea came into my mind: a
superuser-settable developer GUC to disable the generation of such WAL
records within certain areas of the test.  This requires a small
implementation, but nothing really huge, while being portable
everywhere.  And it is not the first time I've been annoyed by these
records when wanting a predictable set of WAL records for some test
case.

Another possibility would be to move these records elsewhere, outside
of the bgwriter, but we need such records at a good frequency for the
availability of read-only standbys.  And surely we'd want an on/off
switch anyway to get full control over test sequences.
--
Michael

Attachment
Michael Paquier <michael@paquier.xyz> writes:
> While thinking about that, a second idea came into my mind: a
> superuser-settable developer GUC to disable such WAL records to be
> generated within certain areas of the test.  This requires a small
> implementation, but nothing really huge, while being portable
> everywhere.  And it is not the first time I've been annoyed with these
> records when wanting a predictible set of WAL records for some test
> case.

Hmm ... I see what you are after, but to what extent would this mean
that what we are testing is not our real-world behavior?

            regards, tom lane



On Sun, Jan 14, 2024 at 11:08:39PM -0500, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> While thinking about that, a second idea came into my mind: a
>> superuser-settable developer GUC to disable such WAL records to be
>> generated within certain areas of the test.  This requires a small
>> implementation, but nothing really huge, while being portable
>> everywhere.  And it is not the first time I've been annoyed with these
>> records when wanting a predictible set of WAL records for some test
>> case.
>
> Hmm ... I see what you are after, but to what extent would this mean
> that what we are testing is not our real-world behavior?

Don't think so.  We don't care much about these records when it comes
to checking slot invalidation scenarios with a predictable XID
horizon, AFAIK.
--
Michael

Attachment
Hi,

On Mon, Jan 15, 2024 at 01:11:26PM +0900, Michael Paquier wrote:
> On Sun, Jan 14, 2024 at 11:08:39PM -0500, Tom Lane wrote:
> > Michael Paquier <michael@paquier.xyz> writes:
> >> While thinking about that, a second idea came into my mind: a
> >> superuser-settable developer GUC to disable such WAL records to be
> >> generated within certain areas of the test.  This requires a small
> >> implementation, but nothing really huge, while being portable
> >> everywhere.  And it is not the first time I've been annoyed with these
> >> records when wanting a predictible set of WAL records for some test
> >> case.
> > 
> > Hmm ... I see what you are after, but to what extent would this mean
> > that what we are testing is not our real-world behavior?
> 
> Don't think so.  We don't care much about these records when it comes
> to checking slot invalidation scenarios with a predictible XID
> horizon, AFAIK.

Yeah, we want to test the slot invalidation behavior, so we need to ensure that such
an invalidation occurs (which is not the case if we get an xl_running_xacts in the
middle) in the first place.

OTOH I also see Tom's point: for example, I think we'd not have discovered [1]
(outside of the field) with such a developer GUC in place.

We did a few things in this thread, so to sum up what we've discovered:

- a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
- we need to launch the vacuum(s) only if we are sure we got a newer XID horizon
(proposal in v6 attached)
- we need a way to control how frequently xl_running_xacts records are emitted (to
ensure they are not triggered in the middle of an active slot invalidation test).

I'm not sure it's possible to address Tom's concern and keep the test "predictable".

So, I think I'd vote for Michael's proposal to implement a superuser-settable
developer GUC (as sending a SIGSTOP to the bgwriter (and bypassing $windows_os) would
still not address Tom's concern anyway).

Another option would be to "sacrifice" the full predictability of the test (in
favor of real-world behavior testing)?

[1]: https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
Hello Michael and Bertrand,

15.01.2024 06:59, Michael Paquier wrote:
> The WAL records related to standby snapshots are playing a lot with
> the randomness of the failures we are seeing.  Alexander has mentioned
> offlist something else: using SIGSTOP on the bgwriter to avoid these
> records and make the test more stable.  That would not be workable for
> Windows, but I could live with that knowing that logical decoding for
> standbys has no platform-speficic tweak for the code paths we're
> testing here, and that would put as limitation to skip the test for
> $windows_os.

I've found a way to implement pause/resume for Windows processes and it
looks acceptable to me if we can afford "use Win32::API;" on Windows
(maybe the test could be skipped only if this perl module is absent).
Please look at the PoC patch for the test 035_standby_logical_decoding.
(The patched test passes for me.)

If this approach looks promising to you, maybe we could add a submodule to
perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
019_replslot_limit) as well.

Personally I think that having such functionality for use in tests
might be useful not only to avoid some "problematic" behaviour but also to
test the opposite cases.

> While thinking about that, a second idea came into my mind: a
> superuser-settable developer GUC to disable such WAL records to be
> generated within certain areas of the test.  This requires a small
> implementation, but nothing really huge, while being portable
> everywhere.  And it is not the first time I've been annoyed with these
> records when wanting a predictible set of WAL records for some test
> case.

I see that the test in question exists in REL_16_STABLE; does it mean that a
new GUC would not help there?

Best regards,
Alexander
Attachment
Hello,

15.01.2024 12:00, Alexander Lakhin wrote:
> If this approach looks promising to you, maybe we could add a submodule to
> perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
> 019_replslot_limit) as well.
>
> Personally I think that having such a functionality for using in tests
> might be useful not only to avoid some "problematic" behaviour but also to
> test the opposite cases.

After spending a few days on it, I've discovered two more issues:
https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com

(The latter is not related to bgwriter directly, but it was discovered
thanks to a RUNNING_XACTS record that flew into the WAL at a lucky moment.)

So it becomes clear that the 035 test is not the only one which might
suffer from bgwriter's activity, and inventing a way to stop/control
bgwriter is a different subject, out of scope of the current issue.


15.01.2024 11:49, Bertrand Drouvot wrote:
> We did a few things in this thread, so to sum up what we've discovered:
>
> - a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
> - we need to launch the vacuum(s) only if we are sure we got a newer XID horizon
> ( proposal in in v6 attached)
> - we need a way to control how frequent xl_running_xacts are emmitted (to ensure
> they are not triggered in a middle of an active slot invalidation test).
>
> I'm not sure it's possible to address Tom's concern and keep the test "predictable".
>
> So, I think I'd vote for Michael's proposal to implement a superuser-settable
> developer GUC (as sending a SIGSTOP on the bgwriter (and bypass $windows_os) would
> still not address Tom's concern anyway).
>
> Another option would be to "sacrifice" the full predictablity of the test (in
> favor of real-world behavior testing)?
>
> [1]: https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal

So, now we have the test 035 failing due to nondeterministic vacuum
activity in the first place, and due to bgwriter's activity in the second.
Maybe it would be the right move to commit the fix, and then think about
rarer failures.

Though I have a couple of questions regarding the fix left, if you don't
mind:
1) The test has minor defects in the comments, which I noted before [1];
would you like to fix them in passing?

> BTW, it looks like the comment:
> # One way to produce recovery conflict is to create/drop a relation and
> # launch a vacuum on pg_class with hot_standby_feedback turned off on the standby.
> in scenario 3 is a copy-paste error.
> Also, there are two "Scenario 4" in this test.
>

2) Shall we align the 035_standby_logical_decoding with
031_recovery_conflict in regard to improving stability of vacuum?
I see the following options for this:
a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
b) use FREEZE and autovacuum = off in both tests;
c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
  autovacuum = off in both.

I've re-tested the v6 patch today and confirmed that it makes the test
more stable. I ran (with bgwriter_delay = 10000 in temp.config) 20 tests in
parallel and got failures ('inactiveslot slot invalidation is logged with
vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
(With unlimited CPU, when average test duration is around 70 seconds.)

But with v6 applied, 60 iterations succeeded.

[1] https://www.postgresql.org/message-id/b2a1f7d0-7629-72c0-2da7-e9c4e336b010@gmail.com

Best regards,
Alexander



Hi,

On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote:
> Hello,
> 
> 15.01.2024 12:00, Alexander Lakhin wrote:
> > If this approach looks promising to you, maybe we could add a submodule to
> > perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
> > 019_replslot_limit) as well.
> > 
> > Personally I think that having such a functionality for using in tests
> > might be useful not only to avoid some "problematic" behaviour but also to
> > test the opposite cases.
> 
> After spending a few days on it, I've discovered two more issues:
> https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com
> 
> (The latter is not related to bgwriter directly, but it was discovered
> thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.)
> 
> So it becomes clear that the 035 test is not the only one, which might
> suffer from bgwriter's activity,

Yeah... thanks for sharing!

> and inventing a way to stop bgwriter/
> control it is a different subject, getting out of scope of the current
> issue.

Agree.

> 15.01.2024 11:49, Bertrand Drouvot wrote:
> > We did a few things in this thread, so to sum up what we've discovered:
> > 
> > - a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
> > - we need to launch the vacuum(s) only if we are sure we got a newer XID horizon
> > ( proposal in in v6 attached)
> > - we need a way to control how frequent xl_running_xacts are emmitted (to ensure
> > they are not triggered in a middle of an active slot invalidation test).
> > 
> > I'm not sure it's possible to address Tom's concern and keep the test "predictable".
> > 
> > So, I think I'd vote for Michael's proposal to implement a superuser-settable
> > developer GUC (as sending a SIGSTOP on the bgwriter (and bypass $windows_os) would
> > still not address Tom's concern anyway).
> > 
> > Another option would be to "sacrifice" the full predictablity of the test (in
> > favor of real-world behavior testing)?
> > 
> > [1]: https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal
> 
> So, now we have the test 035 failing due to nondeterministic vacuum
> activity in the first place, and due to bgwriter's activity in the second.

Yeah, that's also my understanding.

> Maybe it would be a right move to commit the fix, and then think about
> more rare failures.

+1

> Though I have a couple of question regarding the fix left, if you don't
> mind:
> 1) The test has minor defects in the comments, that I noted before [1];
> would you like to fix them in passing?
> 
> > BTW, it looks like the comment:
> > # One way to produce recovery conflict is to create/drop a relation and
> > # launch a vacuum on pg_class with hot_standby_feedback turned off on the standby.
> > in scenario 3 is a copy-paste error.

Nice catch, thanks! Fixed in v7 attached.

> > Also, there are two "Scenario 4" in this test.

D'oh! Fixed in v7.

> > 
> 
> 2) Shall we align the 035_standby_logical_decoding with
> 031_recovery_conflict in regard to improving stability of vacuum?

Yeah, I think that could make sense.

> I see the following options for this:
> a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
> b) use FREEZE and autovacuum = off in both tests;
> c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
>  autovacuum = off in both.
>

I'd vote for a) as I have the feeling it's "easier" to understand (and I'm not
sure using FREEZE would give full "stabilization predictability", at least for
035_standby_logical_decoding.pl). That said, I did not test what the outcome would
be for 031_recovery_conflict.pl by making use of a).

> I've re-tested the v6 patch today and confirmed that it makes the test
> more stable. I ran (with bgwriter_delay = 10000 in temp.config) 20 tests in
> parallel and got failures ('inactiveslot slot invalidation is logged with
> vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
> (With unlimited CPU, when average test duration is around 70 seconds.)
> 
> But with v6 applied, 60 iterations succeeded.

Nice! Thanks for the testing!

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
On Fri, Jan 19, 2024 at 09:03:01AM +0000, Bertrand Drouvot wrote:
> On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote:
>> 15.01.2024 12:00, Alexander Lakhin wrote:
>>> If this approach looks promising to you, maybe we could add a submodule to
>>> perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
>>> 019_replslot_limit) as well.
>>>
>>> Personally I think that having such a functionality for using in tests
>>> might be useful not only to avoid some "problematic" behaviour but also to
>>> test the opposite cases.
>>
>> After spending a few days on it, I've discovered two more issues:
>> https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
>> https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com
>>
>> (The latter is not related to bgwriter directly, but it was discovered
>> thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.)
>>
>> So it becomes clear that the 035 test is not the only one, which might
>> suffer from bgwriter's activity,
>
> Yeah... thanks for sharing!
>
>> and inventing a way to stop bgwriter/
>> control it is a different subject, getting out of scope of the current
>> issue.
>
> Agree.
>
>> 15.01.2024 11:49, Bertrand Drouvot wrote:
>> Maybe it would be a right move to commit the fix, and then think about
>> more rare failures.
>
> +1

Yeah, agreed to make things more stable before making them fancier.

>> 2) Shall we align the 035_standby_logical_decoding with
>> 031_recovery_conflict in regard to improving stability of vacuum?
>
> Yeah, I think that could make sense.

Probably.  That can always be done as a separate change, after a few
runs of the slow buildfarm members able to reproduce the failure.

>> I see the following options for this:
>> a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
>> b) use FREEZE and autovacuum = off in both tests;
>> c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
>>  autovacuum = off in both.
>
> I'd vote for a) as I've the feeling it's "easier" to understand (and I'm not
> sure using FREEZE would give full "stabilization predictability", at least for
> 035_standby_logical_decoding.pl). That said I did not test what the outcome would
> be for 031_recovery_conflict.pl by making use of a).

Yeah, I think I agree here with a), as v7 does for 035.

+# Launch $sql and wait for a new snapshot that has a newer horizon before
+# doing the vacuum with $vac_option on $to_vac.
+sub wait_until_vacuum_can_remove

This had better document what the arguments of this routine are,
because that's really unclear.  $to_vac is the relation that will be
vacuumed, for one.
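
For instance, the header could spell the arguments out along these lines (a
possible shape only; the descriptions are mine, based on how the routine is
used in the patch):

# wait_until_vacuum_can_remove($vac_option, $sql, $to_vac)
#
#   $vac_option - extra options appended to the VACUUM command (e.g. 'full')
#   $sql        - SQL launched on the primary to generate the dead rows
#   $to_vac     - the relation that will then be vacuumed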

Also, wouldn't it be better to document in the test why
txid_current_snapshot() is chosen?  Contrary to txid_current(), it
does not generate a Transaction/COMMIT record, making the test more
predictable, something you have mentioned upthread and implied in the
test.

-  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal

This removes two INSERTs on flush_wal and refactors the code to do the
INSERT in wait_until_vacuum_can_remove(), using a SQL comment to
document a reference about the reason why an INSERT is used.  Couldn't
you just use a normal comment here?
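
In other words, something like this inside the routine (wording purely
illustrative) instead of the SQL comment:

# Insert into flush_wal so that the WAL generated above gets flushed (see
# the CREATE TABLE flush_wal stanza earlier in the test for why this is
# needed).
$node_primary->safe_psql('testdb',
    qq[INSERT INTO flush_wal DEFAULT VALUES;]);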

>> I've re-tested the v6 patch today and confirmed that it makes the test
>> more stable. I ran (with bgwriter_delay = 10000 in temp.config) 20 tests in
>> parallel and got failures ('inactiveslot slot invalidation is logged with
>> vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
>> (With unlimited CPU, when average test duration is around 70 seconds.)
>>
>> But with v6 applied, 60 iterations succeeded.
>
> Nice! Thanks for the testing!

I need to review what you have more thoroughly, but is it OK to assume
that both of you are happy with the latest version of the patch in
terms of stability gained?  That's still not the full picture, but still
a good step towards it.
--
Michael

Attachment
Hi,

On Mon, Jan 22, 2024 at 03:54:44PM +0900, Michael Paquier wrote:
> On Fri, Jan 19, 2024 at 09:03:01AM +0000, Bertrand Drouvot wrote:
> > On Fri, Jan 19, 2024 at 09:00:01AM +0300, Alexander Lakhin wrote:
> +# Launch $sql and wait for a new snapshot that has a newer horizon before
> +# doing the vacuum with $vac_option on $to_vac.
> +sub wait_until_vacuum_can_remove
> 
> This had better document what the arguments of this routine are,
> because that's really unclear.  $to_vac is the relation that will be
> vacuumed, for one.

Agree, done that way in v8 attached.

> Also, wouldn't it be better to document in the test why
> txid_current_snapshot() is chosen?  Contrary to txid_current(), it
> does not generate a Transaction/COMMIT record, which makes the test
> more predictable, something you have mentioned upthread and implied in
> the test.

Good point, added more comments in v8 (but without mentioning txid_current() as,
after giving it more thought, I think it was not the right approach in any case).

> 
> -  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
> 
> This removes two INSERTs on flush_wal and refactors the code to do the
> INSERT in wait_until_vacuum_can_remove(), using a SQL comment to
> document a reference about the reason why an INSERT is used.  Couldn't
> you just use a normal comment here?

Agree, done in v8.

> >> I've re-tested the v6 patch today and confirmed that it makes the test
> >> more stable. I ran (with bgwriter_delay = 10000 in temp.config) 20 tests in
> >> parallel and got failures ('inactiveslot slot invalidation is logged with
> >> vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
> >> (With unlimited CPU, when average test duration is around 70 seconds.)
> >> 
> >> But with v6 applied, 60 iterations succeeded.
> > 
> > Nice! Thanks for the testing!
> 
> I need to review what you have more thoroughly, but is it OK to assume
> that both of you are happy with the latest version of the patch in
> terms of stability gained?  That's still not the full picture, but still
> a good step towards it.

Yeah, I can clearly see how this patch helps from a theoretical perspective, but I
rely on Alexander's testing to see how it actually stabilizes the test.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment
On Mon, Jan 22, 2024 at 09:07:45AM +0000, Bertrand Drouvot wrote:
> On Mon, Jan 22, 2024 at 03:54:44PM +0900, Michael Paquier wrote:
>> Also, wouldn't it be better to document in the test why
>> txid_current_snapshot() is chosen?  Contrary to txid_current(), it
>> does not generate a Transaction/COMMIT record, which makes the test
>> more predictable, something you have mentioned upthread and implied in
>> the test.
>
> Good point, added more comments in v8 (but without mentioning txid_current() as,
> after giving it more thought, I think it was not the right approach in any case).

Fine by me.

>> -  INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
>>
>> This removes two INSERTs on flush_wal and refactors the code to do the
>> INSERT in wait_until_vacuum_can_remove(), using a SQL comment to
>> document a reference about the reason why an INSERT is used.  Couldn't
>> you just use a normal comment here?
>
> Agree, done in v8.

I've rewritten some of that, and applied the patch down to v16.  Let's
see how this stabilizes things, but that's likely going to take some
time as it depends on skink's mood.

>> I need to review what you have more thoroughly, but is it OK to assume
>> that both of you are happy with the latest version of the patch in
>> terms of stability gained?  That's still not the full picture, but still
>> a good step towards it.
>
> Yeah, I can clearly see how this patch helps from a theoretical perspective, but I
> rely on Alexander's testing to see how it actually stabilizes the test.

Anyway, that's not the end of it.  What should we do for the snapshot
records coming from the bgwriter?  The slower the system, the
higher the odds of hitting a conflict with such records, even if the
horizon check should help.
--
Michael

Attachment
Hi,

On Tue, Jan 23, 2024 at 02:50:06PM +0900, Michael Paquier wrote:
> On Mon, Jan 22, 2024 at 09:07:45AM +0000, Bertrand Drouvot wrote:
> 
> I've rewritten some of that, and applied the patch down to v16.

Thanks!

> > Yeah, I can clearly see how this patch helps from a theoretical perspective, but I
> > rely on Alexander's testing to see how it actually stabilizes the test.
> 
> Anyway, that's not the end of it.  What should we do for the snapshot
> records coming from the bgwriter?

I have mixed feelings about it. On one hand we'll further decrease the risk of
seeing an xl_running_xacts record in the middle of a test, but on the other hand
I agree with Tom's concern [1]: I think that we could miss "corner case/bug"
detection (like the one reported in [2]).

What about?

1) Apply "wait_until_vacuum_can_remove() + autovacuum disabled" where it makes
sense and for tests that suffer from random xl_running_xacts. I can look at
031_recovery_conflict.pl (rough sketch below), do you have others in mind?
2) fix [2]
3) depending on how stable this test (and others that suffer from "random"
xl_running_xacts) is, then think about the bgwriter.
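
For 1), and assuming wait_until_vacuum_can_remove() (or a shared helper, as
Alexander suggested upthread) is made available there, the shape for
031_recovery_conflict.pl would roughly be (a sketch only, the table name is a
placeholder to be replaced by whatever the test actually uses):

# autovacuum is the usual suspect for holding back the global xmin, so
# disable it on the primary (before the node starts, or reload after
# changing it).
$node_primary->append_conf('postgresql.conf', 'autovacuum = off');

# Then only run the VACUUM once the dead rows generated by the DELETE are
# known to be removable; $conflict_table is a placeholder for the table
# used by the test.
wait_until_vacuum_can_remove('', qq[DELETE FROM $conflict_table;],
    $conflict_table);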

[1]: https://www.postgresql.org/message-id/1375923.1705291719%40sss.pgh.pa.us
[2]: https://www.postgresql.org/message-id/flat/ZaTjW2Xh+TQUCOH0@ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Hello Bertrand and Michael,

23.01.2024 11:07, Bertrand Drouvot wrote:
> On Tue, Jan 23, 2024 at 02:50:06PM +0900, Michael Paquier wrote:
>
>> Anyway, that's not the end of it.  What should we do for the snapshot
>> records coming from the bgwriter?
> What about?
>
> 3) depending on how stable this test (and others that suffer from "random"
> xl_running_xacts) is, then think about the bgwriter.

A recent buildfarm failure [1] reminds me of that remaining question.
Here we have a slow machine (a successful run, for example [2], shows a
test duration of 541.13s) and the following information logged:

[13:55:13.725](34.411s) ok 25 - inactiveslot slot invalidation is logged with vacuum on pg_class
[13:55:13.727](0.002s) not ok 26 - activeslot slot invalidation is logged with vacuum on pg_class
[13:55:13.728](0.001s) #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_class'
#   at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 229.
[14:27:42.995](1949.267s) # poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
[14:27:42.999](0.004s) not ok 27 - confl_active_logicalslot updated
[14:27:43.000](0.001s) #   Failed test 'confl_active_logicalslot updated'
#   at C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.
Timed out waiting confl_active_logicalslot to be updated at 
C:/prog/bf/root/HEAD/pgsql/src/test/recovery/t/035_standby_logical_decoding.pl line 235.

---
035_standby_logical_decoding_standby.log:
2024-06-06 13:55:07.715 UTC [9172:7] LOG:  invalidating obsolete replication slot "row_removal_inactiveslot"
2024-06-06 13:55:07.715 UTC [9172:8] DETAIL:  The slot conflicted with xid horizon 754.
2024-06-06 13:55:07.715 UTC [9172:9] CONTEXT:  WAL redo at 0/4020A80 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 754, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 1, nunused: 0, dead: [48]; blkref #0: rel 1663/16384/2610, blk 0
2024-06-06 13:55:14.372 UTC [7532:1] [unknown] LOG:  connection received: host=127.0.0.1 port=55328
2024-06-06 13:55:14.381 UTC [7532:2] [unknown] LOG:  connection authenticated: identity="EC2AMAZ-P7KGG90\\pgrunner" method=sspi (C:/prog/bf/root/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata/pg_hba.conf:2)
2024-06-06 13:55:14.381 UTC [7532:3] [unknown] LOG:  connection authorized: user=pgrunner database=postgres application_name=035_standby_logical_decoding.pl
2024-06-06 13:55:14.443 UTC [7532:4] 035_standby_logical_decoding.pl LOG:  statement: select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
2024-06-06 13:55:14.452 UTC [7532:5] 035_standby_logical_decoding.pl LOG:  disconnection: session time: 0:00:00.090 user=pgrunner database=postgres host=127.0.0.1 port=55328
# (there is no `invalidating obsolete replication slot "row_removal_activeslot"` message)
...
2024-06-06 14:27:42.675 UTC [4032:4] 035_standby_logical_decoding.pl LOG:  statement: select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where datname = 'testdb'
2024-06-06 14:27:42.681 UTC [4032:5] 035_standby_logical_decoding.pl LOG:  disconnection: session time: 0:00:00.080 user=pgrunner database=postgres host=127.0.0.1 port=58713
2024-06-06 14:27:43.095 UTC [7892:2] FATAL:  could not receive data from WAL stream: server closed the connection 
unexpectedly
         This probably means the server terminated abnormally
         before or while processing the request.

It's hard to determine from this info why row_removal_activeslot was not
invalidated, but running this test on a slowed-down Windows VM, I (still)
get the same-looking failures, caused by a RUNNING_XACTS record appearing
just before `invalidating obsolete replication slot "row_removal_inactiveslot"`.
So I would consider this failure yet another result of bgwriter activity
and add it to the list of known failures as such...

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-06-06%2012%3A36%3A11
[2] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=drongo&dt=2024-06-05%2017%3A03%3A13&stg=misc-check

Best regards,
Alexander



Hi Alexander,

On Sat, Jun 08, 2024 at 07:00:00AM +0300, Alexander Lakhin wrote:
> Hello Bertrand and Michael,
> 
> 23.01.2024 11:07, Bertrand Drouvot wrote:
> > On Tue, Jan 23, 2024 at 02:50:06PM +0900, Michael Paquier wrote:
> > 
> > > Anyway, that's not the end of it.  What should we do for the snapshot
> > > records coming from the bgwriter?
> > What about?
> > 
> > 3) depending on how stable this test (and others that suffer from "random"
> > xl_running_xacts) is, then think about the bgwriter.
> 
> A recent buildfarm failure [1] reminds me of that remaining question.
> 
> It's hard to determine from this info why row_removal_activeslot was not
> invalidated, but running this test on a slowed-down Windows VM, I (still)
> get the same-looking failures, caused by a RUNNING_XACTS record appearing
> just before `invalidating obsolete replication slot "row_removal_inactiveslot"`.
> So I would consider this failure yet another result of bgwriter activity
> and add it to the list of known failures as such...

Thanks for the report! I think it makes sense to add it to the list of known
failures.

One way to deal with those corner cases could be to make use of injection points
around places where RUNNING_XACTS is emitted, thoughts?
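
To sketch the idea (the point name is made up, it would mean adding an
INJECTION_POINT() call in the code path emitting RUNNING_XACTS, say in
LogStandbySnapshot(), and it leaves aside whether making the bgwriter wait
there has other side effects), the test side could roughly look like:

# Assumes the injection_points module is installed and that the server
# defines a "log-standby-snapshot" injection point where RUNNING_XACTS is
# emitted.
$node_primary->safe_psql('testdb', q[CREATE EXTENSION injection_points;]);

# Make any process reaching the point wait, so no RUNNING_XACTS record can
# sneak in while the invalidation scenario runs.
$node_primary->safe_psql('testdb',
    q[SELECT injection_points_attach('log-standby-snapshot', 'wait');]);

# ... run the vacuum / slot invalidation scenario here ...

# Wake up any waiter and remove the injection point once done.
$node_primary->safe_psql('testdb',
    q[SELECT injection_points_wakeup('log-standby-snapshot');]);
$node_primary->safe_psql('testdb',
    q[SELECT injection_points_detach('log-standby-snapshot');]);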

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



On Mon, Jun 10, 2024 at 06:29:17AM +0000, Bertrand Drouvot wrote:
> Thanks for the report! I think it makes sense to add it to the list of known
> failures.
>
> One way to deal with those corner cases could be to make use of injection points
> around places where RUNNING_XACTS is emitted, thoughts?

Ah.  You mean to force a wait in the code path generating the standby
snapshots for the sake of this test?  That's interesting, I like it.
--
Michael

Attachment
Hi,

On Mon, Jun 10, 2024 at 03:39:34PM +0900, Michael Paquier wrote:
> On Mon, Jun 10, 2024 at 06:29:17AM +0000, Bertrand Drouvot wrote:
> > Thanks for the report! I think it makes sense to add it to the list of known
> > failures.
> > 
> > One way to deal with those corner cases could be to make use of injection points
> > around places where RUNNING_XACTS is emitted, thoughts?
> 
> Ah.  You mean to force a wait in the code path generating the standby
> snapshots for the sake of this test?

Yeah.

>  That's interesting, I like it.

Great, will look at it.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com