On Mon, Jun 30, 2025 at 05:51:51PM +0530, vignesh C wrote:
> Thanks, Kuroda-san. I’ve prepared a similar test that doesn’t rely on
> injection points. The issue reproduced consistently across all
> branches up to PG13. You can use the attached
> 049_slot_get_changes_wait_continously_pg17.pl script (found in the
> 049_slot_get_changes_wait_continously_pg17.zip file) to verify this.
> Just copy the script to src/test/recovery and run the test to observe
> the problem.
+# Abruptly stop the server (1 second should be enough for the checkpoint
+# to finish; it would be better).
+$node->stop('immediate');
While v3-0001 is good enough to reproduce the original issue after a
few dozen runs here, we cannot reintroduce the test without the proper
wait_for_log() logic making sure that the checkpoint is completed
before we stop the server.
I like the addition of an extra pg_logical_emit_message() in test 046
anyway, down to v17, in the test 046 for all the branches. Even if
the reproduction is sporadic, we have seen it pretty quickly in the CI
and in the buildfarm so it would not go unnoticed for a long time if
we mess up with this stuff again.
We are lucky enough that the existing test does not fail in v17, TBH,
so let's tackle all the issues on this thread step by step:
1) Fix let's add the header check.
2) Let's reintroduce a fixed version of 046 on HEAD and on v18, with
an extra pg_logical_emit_message() that travels across WAL pages
forcing a reconstruction of the record and the extra header check.
3) Let's fix recovery test 046 currently in v17, for the checkpoint
wait logic and the extra pg_logical_emit_message().
4) Let's improve the stability of 047 everywhere for the checkpoint
wait logic, as already done by Alexander.
I have been doing some tests with the patch vs HEAD (thanks for the
script, it has saved some time), and I can summarize my results by
using a message size of 819200000, hence worth 100k pages of WAL or
so. Then, taking 20 runs, and eliminating the three highest and
lowest numbers to eliminate some of the variance, I am getting an
average of runtime when re-assembling the record of:
- 695473.09us for HEAD
- 695679.18us for the patch.
So I am not going to stress more on this point.
Attached is the result of all that for HEAD. There is also one extra
patch dedicated to v17, where I have checked that the extra
pg_logical_emit_message() is enough to reproduce the problem without
the patch, and that the problem is fixed with the patch. The patch
set for v17 is simpler of course, as test 046 still exists on
REL_17_STABLE. Note that I have moved this extra contrecord to be
generated before the checkpoint is completed, leading to the same
result.
With all that said, I'll move on with this stuff once the embargo for
v18 beta2 is lifted and the tag is pushed. That should happen in 24h
or so, I guess.
--
Michael