Re: [WiP] B-tree page merge during vacuum to reduce index bloat - Mailing list pgsql-hackers

From Wenbo Lin
Subject Re: [WiP] B-tree page merge during vacuum to reduce index bloat
Date
Msg-id d729a05e-7eb0-4f49-825b-f115db451bdb@gmail.com
Whole thread Raw
In response to Re: [WiP] B-tree page merge during vacuum to reduce index bloat  (Madhav Madhusoodanan <madhavmadhusoodanan@gmail.com>)
List pgsql-hackers

On 3/6/26 4:45 AM, Madhav Madhusoodanan wrote:
> 
> Some parts of the merge flow that I am coming up with are as follows
> (assuming tuples from index page B are migrated rightwards into C):
> 
> 1. Leaving B's tuples as it is even after merge, to remove the
> possible risk of scans "skipping over" tuples. Essentially, the tuples
> then would be "copied" into C.
> 2. Marking pages B and C with flags similar to INCOMPLETE_SPLIT (say,
> MERGE_SRC and MERGE_DEST respectively) before the actual merge
> process, then marking the pages with another flag upon completion
> (MERGE_COMPLETE) so that other processes can handle transient merge
> states.
> 3. For example, scans that reach page B post-merge (MERGE_SRC +
> MERGE_COMPLETE) would be made to skip to the page to its right.
> 4. Updating VACUUM to handle post-merge cleanup (to remove pages such as B).
> 
> Note: The underlying assumption is that B and C both are children of a
> single parent page P. For a start I'm picturing this flow only for
> sibling nodes, not for cousin nodes (B and C having different parent
> nodes).
> 
> Would the above approach be alright? These are not exhaustive steps,
> I'm only raising these lines of thought so that I can fast-fail them
> (if there are any issues).
> I'd be happy to provide a doc of the exact flow I'm picturing (how to
> ensure TIDs are read exactly-once, how the merge process would hold up
> with other concurrent processes, etc), incase my thoughts seem
> promising.
> 
> Thanks in advance!
> 
> Madhav
> 
> 
Hi Madhav
I've been following this thread with interest and have been thinking
about some of the same problem. A few things came to mind reading your
proposal:

1. Backward scans. Your point 3 mentions "skip to right" when scans
hitting a MERGE_SRC page. I think this would work for forward scan, but
backward scans travel right-to-left, so they'd read C first, then step
to B. By the time they reach B, they've already passed C. How would the
scan know whether it already picked up B's tuples from C or not?
2. Lock gap. Building on the above, between releasing one page's lock
and acquiring the next, the merge could actually happen. So a backward
scan might read C before the merge, then find B marked as MERGE_SRC
after the merge. In that case C didn't have B's tuples when the scan
read it.so the scan need to read B. But if the merge happened earlier, C
already has B's tuples and reading B would produce duplicates. I think
it will need some detection mechanism (I wonder if some kind of per-scan
tracking of what was already returned could help distinguish these two
cases, though I haven't thought through the details yet.)
3. MERGE_COMPLETE flag. I was wondering whether two flags (MERGE_SRC and
MERGE_DEST) might be enough. if the tuple copy and flag settings are
done atomically within a single critical section, then other backends
either see the pre-merge or post-merge state, never a partial one. Or am
i missing a case where you'd need to distinguish "merge in progres" from
"merge done"

Looking forward to hearing your thoughts and anyone's else. Happy to be
corrected if I'm off base on any of these.

Regards




pgsql-hackers by date:

Previous
From: Jakub Wartak
Date:
Subject: Re: log XLogPrefetch stats at end of recovery
Next
From: Fujii Masao
Date:
Subject: Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?