Re: Introduce XID age and inactive timeout based replication slot invalidation - Mailing list pgsql-hackers
From | Peter Smith |
---|---|
Subject | Re: Introduce XID age and inactive timeout based replication slot invalidation |
Date | |
Msg-id | CAHut+PtaLA6v9bzmH=g51HhZNNJ5ixvMiokosdnrSFpU51arEg@mail.gmail.com Whole thread Raw |
In response to | Re: Introduce XID age and inactive timeout based replication slot invalidation (Masahiko Sawada <sawada.mshk@gmail.com>) |
List | pgsql-hackers |
Hi, my previous review posts did not cover the test code. Here are my review comments for the v44-0001 test code ====== TEST CASE #1 1. +# Wait for the inactive replication slot to be invalidated. +$standby1->poll_query_until( + 'postgres', qq[ + SELECT COUNT(slot_name) = 1 FROM pg_replication_slots + WHERE slot_name = 'lsub1_sync_slot' AND + invalidation_reason = 'inactive_timeout'; +]) + or die + "Timed out while waiting for lsub1_sync_slot invalidation to be synced on standby"; + Is that comment correct? IIUC the synced slot should *already* be invalidated from the primary, so here we are not really "waiting" for it to be invalidated; Instead, we are just "confirming" that the synchronized slot is already invalidated with the correct reason as expected. ~~~ 2. +# Synced slot mustn't get invalidated on the standby even after a checkpoint, +# it must sync invalidation from the primary. So, we must not see the slot's +# invalidation message in server log. +$standby1->safe_psql('postgres', "CHECKPOINT"); +ok( !$standby1->log_contains( + "invalidating obsolete replication slot \"lsub1_sync_slot\"", + $standby1_logstart), + 'check that syned lsub1_sync_slot has not been invalidated on the standby' +); + This test case seemed bogus, for a couple of reasons: 2a. IIUC this 'lsub1_sync_slot' is the same one that is already invalid (from the primary), so nobody should be surprised that an already invalid slot doesn't get flagged as invalid again. i.e. Shouldn't your test scenario here be done using a valid synced slot? 2b. AFAICT it was only moments above this CHECKPOINT where you assigned the standby inactivity timeout to 2s. So even if there was some bug invalidating synced slots I don't think you gave it enough time to happen -- e.g. I doubt 2s has elapsed yet. ~ 3. +# Stop standby to make the standby's replication slot on the primary inactive +$standby1->stop; + +# Wait for the standby's replication slot to become inactive +wait_for_slot_invalidation($primary, 'sb1_slot', $logstart, + $inactive_timeout); This seems a bit tricky. Both these (the stop and the wait) seem to belong together, so I think maybe a single bigger explanatory comment covering both parts would help for understanding. ====== TEST CASE #2 4. +# Stop subscriber to make the replication slot on publisher inactive +$subscriber->stop; + +# Wait for the replication slot to become inactive and then invalidated due to +# timeout. +wait_for_slot_invalidation($publisher, 'lsub1_slot', $logstart, + $inactive_timeout); IIUC, this is just like comment #3 above. Both these (the stop and the wait) seem to belong together, so I think maybe a single bigger explanatory comment covering both parts would help for understanding. ~~~ 5. +# Testcase end: Invalidate logical subscriber's slot due to +# replication_slot_inactive_timeout. +# ============================================================================= IMO the rest of the comment after "Testcase end" isn't very useful. ====== sub wait_for_slot_invalidation 6. +sub wait_for_slot_invalidation +{ An explanatory header comment for this subroutine would be helpful. ~~~ 7. + # Wait for the replication slot to become inactive + $node->poll_query_until( + 'postgres', qq[ + SELECT COUNT(slot_name) = 1 FROM pg_replication_slots + WHERE slot_name = '$slot_name' AND active = 'f'; + ]) + or die + "Timed out while waiting for slot $slot_name to become inactive on node $name"; + + # Wait for the replication slot info to be updated + $node->poll_query_until( + 'postgres', qq[ + SELECT COUNT(slot_name) = 1 FROM pg_replication_slots + WHERE inactive_since IS NOT NULL + AND slot_name = '$slot_name' AND active = 'f'; + ]) + or die + "Timed out while waiting for info of slot $slot_name to be updated on node $name"; + Why are there are 2 separate poll_query_until's here? Can't those be combined into just one? ~~~ 8. + # Sleep at least $inactive_timeout duration to avoid multiple checkpoints + # for the slot to get invalidated. + sleep($inactive_timeout); + Maybe this special sleep to prevent too many CHECKPOINTs should be moved to be inside the other subroutine, which is actually doing those CHECKPOINTs. ~~~ 9. + # Wait for the inactive replication slot to be invalidated + $node->poll_query_until( + 'postgres', qq[ + SELECT COUNT(slot_name) = 1 FROM pg_replication_slots + WHERE slot_name = '$slot_name' AND + invalidation_reason = 'inactive_timeout'; + ]) + or die + "Timed out while waiting for inactive slot $slot_name to be invalidated on node $name"; + The comment seems misleading. IIUC you are not "waiting" for the invalidation here, because it is the other subroutine doing the waiting for the invalidation message in the logs. Instead, here I think you are just confirming the 'invalidation_reason' got set correctly. The comment should say what it is really doing. ====== sub check_for_slot_invalidation_in_server_log 10. +# Check for invalidation of slot in server log +sub check_for_slot_invalidation_in_server_log +{ I think the main function of this subroutine is the CHECKPOINT and the waiting for the server log to say invalidation happened. It is doing a loop of a) CHECKPOINT then b) inspecting the server log for the slot invalidation, and c) waiting for a bit. Repeat 10 times. A comment describing the logic for this subroutine would be helpful. The most important side-effect of this function is the CHECKPOINT because without that nothing will ever get invalidated due to inactivity, but this key point is not obvious from the subroutine name. IMO it would be better to name this differently to reflect what it is really doing: e.g. "CHECKPOINT_and_wait_for_slot_invalidation_in_server_log" ====== Kind Regards, Peter Smith. Fujitsu Australia
pgsql-hackers by date: