11.06.2025 09:00, Evgeny Voropaev wrote:
> 2) About repairing fragmentation.
>
> The original approach implemented in PG18 assumes that fragmentation
> occurs during every `prune_freeze` operation. It happens because the
> logic of the "redo"-function `heap_xlog_prune_freeze` assumes that
> fragmentation has to be done by `heap_page_prune_execute`. Attempting to
> omit fragmentation can result in page inconsistencies on the "redo"-side
> (i.e. on a secondary node, or during the recovery process on primary
> one).
No, because the patch uses a flag in the WAL record to instruct the "redo"
side to omit fragmentation as well when needed.
> So, implementation of optional repairing of fragmentation
> conflicts with the basic assumption about "necessity of fragmentation".
> In order to prevent inconsistency xid64v62 patch invokes
> `heap_page_prune_and_freeze` with `repairFragmentation` equal to true
> from everywhere in the patch code except from
> `heap_page_prepare_for_xid` which uses `repairFragmentation=false`.
>
> So, why must we perform a `heap_page_prune_execute` without a
> fragmentation during the preparation of a page for xid?
>
> What exactly would break if we did invoke `heap_page_prune_execute` with
> `repairFragmentation=true` during performing of `heap_page_prepare_for_xid`?
Short answer:
- The `repairFragmentation` parameter was added after investigating real
production issues with earlier patch versions.
Long answer:
How does SELECT work with tuples on a page?
It:
- PINS the page
- takes the CONTENT LOCK in SHARED mode
- collects HeapTuples that POINT INTO THE RAW PAGE through
t_data.t_choice.t_heap
- RELEASES the content lock
- may use those HeapTuples for an indefinitely long time, relying only on
the PIN of the page.
I.e. SELECT relies on the fact that, while a page is pinned, tuples on the
page stay at the same positions in memory.
That is why LockBufferForCleanup and ConditionalLockBufferForCleanup check
that there is only a single PIN on the page: only the backend that will
perform the cleanup is allowed to hold a PIN on the page.
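
To make the pin rule concrete, here is a toy model of it. This is only a
sketch, not the real bufmgr code: `ToyBuffer` and all `toy_*` functions are
hypothetical names invented for illustration. It shows only the idea that a
cleanup lock is granted when the caller's own pin is the sole pin.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a shared buffer; NOT PostgreSQL's real buffer descriptor. */
typedef struct ToyBuffer
{
    int  pin_count;       /* how many backends currently pin the page */
    bool content_locked;  /* true while the content lock is held exclusively */
} ToyBuffer;

static void
toy_pin(ToyBuffer *buf)
{
    buf->pin_count++;
}

static void
toy_unpin(ToyBuffer *buf)
{
    assert(buf->pin_count > 0);
    buf->pin_count--;
}

/*
 * Models the idea behind ConditionalLockBufferForCleanup: the caller already
 * holds its own pin, and the lock is granted only if that is the single pin.
 * Any other pin may belong to a reader still holding pointers into the page.
 */
static bool
toy_conditional_lock_for_cleanup(ToyBuffer *buf)
{
    if (buf->content_locked)
        return false;       /* somebody else holds the content lock */
    if (buf->pin_count != 1)
        return false;       /* another backend may still look at tuples */
    buf->content_locked = true;
    return true;
}

static void
toy_unlock(ToyBuffer *buf)
{
    buf->content_locked = false;
}
```

With two pins (a reader plus the would-be cleaner) the toy lock attempt
fails; once the reader unpins, it succeeds.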
UPDATE/INSERT/DELETE take the CONTENT LOCK in EXCLUSIVE mode because they
may add new tuples. But they are not allowed to move tuples, because
concurrent backends are allowed to read tuples from the page at exactly the
same moment.
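
The difference between adding a tuple and moving tuples can be sketched
with a toy page. Again this is an illustration under my own assumptions:
`ToyPage` and the `toy_*` functions are hypothetical, the layout is much
simpler than a real heap page, and `toy_compact` only stands in for what
repairing fragmentation does to tuple positions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 64
#define TOY_MAX_ITEMS 8

/* Toy page: tuple bytes grow downward from the end of data[]. */
typedef struct ToyPage
{
    char   data[TOY_PAGE_SIZE];
    size_t offsets[TOY_MAX_ITEMS];  /* where each tuple starts in data[] */
    size_t lengths[TOY_MAX_ITEMS];  /* 0 means the item is dead */
    int    nitems;
    size_t upper;                   /* start of the used tuple space */
} ToyPage;

static void
toy_page_init(ToyPage *page)
{
    memset(page, 0, sizeof(*page));
    page->upper = TOY_PAGE_SIZE;
}

/* Adds a tuple into free space; existing tuples are NOT moved. */
static int
toy_add_tuple(ToyPage *page, const char *bytes, size_t len)
{
    if (page->nitems >= TOY_MAX_ITEMS || len == 0 || len > page->upper)
        return -1;
    page->upper -= len;
    memcpy(page->data + page->upper, bytes, len);
    page->offsets[page->nitems] = page->upper;
    page->lengths[page->nitems] = len;
    return page->nitems++;
}

/* Returns a raw pointer into the page, like a HeapTuple's t_data. */
static const char *
toy_tuple_ptr(const ToyPage *page, int idx)
{
    assert(idx >= 0 && idx < page->nitems);
    return page->data + page->offsets[idx];
}

static void
toy_delete_tuple(ToyPage *page, int idx)
{
    assert(idx >= 0 && idx < page->nitems);
    page->lengths[idx] = 0;         /* mark dead, leave a hole */
}

/* Closes the holes by MOVING live tuples; old raw pointers become stale. */
static void
toy_compact(ToyPage *page)
{
    char   scratch[TOY_PAGE_SIZE];
    size_t upper = TOY_PAGE_SIZE;

    memcpy(scratch, page->data, TOY_PAGE_SIZE);
    for (int i = 0; i < page->nitems; i++)
    {
        if (page->lengths[i] == 0)
            continue;
        upper -= page->lengths[i];
        memcpy(page->data + upper, scratch + page->offsets[i],
               page->lengths[i]);
        page->offsets[i] = upper;
    }
    page->upper = upper;
}
```

In this model a pointer taken with `toy_tuple_ptr` keeps pointing at the
same bytes after `toy_add_tuple`, but after `toy_delete_tuple` followed by
`toy_compact` the same address may hold a different tuple. That is the
hazard for a backend that holds only a pin.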
--
regards
Yura Sokolov aka funny-falcon