Hi Kuroda-san,
On Mon, Apr 07, 2025 at 06:15:13AM +0000, Hayato Kuroda (Fujitsu) wrote:
> I have been debugging and found a case where VACUUM FULL also has a timing issue.
> This means that we cannot keep the testcase.
>
> PSA the reproducer for PG17. IIUC this can happen even in PG16.
> Here is what I think happens:
>
> 1. Run a CHECKPOINT and wait some time in wait_until_vacuum_can_remove().
> This ensures that a RUNNING_XACTS record can be generated and catalog_xmin can
> be advanced after the user SQLs.
> 2. Assume that another RUNNING_XACTS record is generated *WHILE* doing a VACUUM
> FULL. This can be caused by the periodic checkpoint or the reproducer.
> 3. Logical walsender detects the RUNNING_XACTS record.
> Note that this must be done before startup tries to invalidate the slot.
> 4. After some time the walsender receives the ack and advances the catalog_xmin.
> Note again that this must be done before startup tries to invalidate the slot.
> 5. Startup process detects the PRUNE_ON_ACCESS record and tries to invalidate the
> slot. However, the catalog_xmin has been advanced so that the invalidation
> cannot be done.
Thanks for the testing and explanation! I applied your repro and I'm able to
see the test failing (with an active slot). The scenario is less likely
to happen (compared to the non-VACUUM FULL cases), which is why it was not
visible in drongo's reports in [1]. So yeah, let's do as you suggested and
not make the slot active for the VACUUM FULL case either.
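For illustration only, the ordering dependency between steps 4 and 5 can be
sketched as a toy Python model. This is not PostgreSQL code: the Slot class,
the function names, and the XID values are all hypothetical, and the conflict
check is reduced to a plain integer comparison (no XID wraparound handling).

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """Hypothetical stand-in for a logical replication slot."""
    catalog_xmin: int
    invalidated: bool = False


def walsender_ack(slot: Slot, new_xmin: int) -> None:
    """Step 4: the walsender receives an ack and advances catalog_xmin."""
    slot.catalog_xmin = max(slot.catalog_xmin, new_xmin)


def startup_replay_prune(slot: Slot, conflict_horizon: int) -> None:
    """Step 5: startup replays the prune record and invalidates the slot
    only if its catalog_xmin has not moved past the conflict horizon."""
    if slot.catalog_xmin <= conflict_horizon:
        slot.invalidated = True


def run(ack_first: bool) -> bool:
    """Run both steps in the chosen order; return whether the slot ended
    up invalidated. The values 100, 105 and 110 are arbitrary."""
    slot = Slot(catalog_xmin=100)
    steps = [
        lambda: walsender_ack(slot, 110),
        lambda: startup_replay_prune(slot, 105),
    ]
    if not ack_first:
        steps.reverse()
    for step in steps:
        step()
    return slot.invalidated


if __name__ == "__main__":
    print("ack before replay, invalidated:", run(ack_first=True))
    print("replay before ack, invalidated:", run(ack_first=False))
```

In this model, the slot is invalidated only when the replay happens before the
ack. With an inactive slot nothing advances catalog_xmin, so the test outcome
no longer depends on that ordering.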
[1]: https://www.postgresql.org/message-id/386386.1737736935@sss.pgh.pa.us
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com