Re: [WiP] B-tree page merge during vacuum to reduce index bloat - Mailing list pgsql-hackers
| From | Wenbo Lin |
|---|---|
| Subject | Re: [WiP] B-tree page merge during vacuum to reduce index bloat |
| Date | |
| Msg-id | d729a05e-7eb0-4f49-825b-f115db451bdb@gmail.com Whole thread Raw |
| In response to | Re: [WiP] B-tree page merge during vacuum to reduce index bloat (Madhav Madhusoodanan <madhavmadhusoodanan@gmail.com>) |
| List | pgsql-hackers |
On 3/6/26 4:45 AM, Madhav Madhusoodanan wrote: > > Some parts of the merge flow that I am coming up with are as follows > (assuming tuples from index page B are migrated rightwards into C): > > 1. Leaving B's tuples as it is even after merge, to remove the > possible risk of scans "skipping over" tuples. Essentially, the tuples > then would be "copied" into C. > 2. Marking pages B and C with flags similar to INCOMPLETE_SPLIT (say, > MERGE_SRC and MERGE_DEST respectively) before the actual merge > process, then marking the pages with another flag upon completion > (MERGE_COMPLETE) so that other processes can handle transient merge > states. > 3. For example, scans that reach page B post-merge (MERGE_SRC + > MERGE_COMPLETE) would be made to skip to the page to its right. > 4. Updating VACUUM to handle post-merge cleanup (to remove pages such as B). > > Note: The underlying assumption is that B and C both are children of a > single parent page P. For a start I'm picturing this flow only for > sibling nodes, not for cousin nodes (B and C having different parent > nodes). > > Would the above approach be alright? These are not exhaustive steps, > I'm only raising these lines of thought so that I can fast-fail them > (if there are any issues). > I'd be happy to provide a doc of the exact flow I'm picturing (how to > ensure TIDs are read exactly-once, how the merge process would hold up > with other concurrent processes, etc), incase my thoughts seem > promising. > > Thanks in advance! > > Madhav > > Hi Madhav I've been following this thread with interest and have been thinking about some of the same problem. A few things came to mind reading your proposal: 1. Backward scans. Your point 3 mentions "skip to right" when scans hitting a MERGE_SRC page. I think this would work for forward scan, but backward scans travel right-to-left, so they'd read C first, then step to B. By the time they reach B, they've already passed C. How would the scan know whether it already picked up B's tuples from C or not? 2. Lock gap. Building on the above, between releasing one page's lock and acquiring the next, the merge could actually happen. So a backward scan might read C before the merge, then find B marked as MERGE_SRC after the merge. In that case C didn't have B's tuples when the scan read it.so the scan need to read B. But if the merge happened earlier, C already has B's tuples and reading B would produce duplicates. I think it will need some detection mechanism (I wonder if some kind of per-scan tracking of what was already returned could help distinguish these two cases, though I haven't thought through the details yet.) 3. MERGE_COMPLETE flag. I was wondering whether two flags (MERGE_SRC and MERGE_DEST) might be enough. if the tuple copy and flag settings are done atomically within a single critical section, then other backends either see the pre-merge or post-merge state, never a partial one. Or am i missing a case where you'd need to distinguish "merge in progres" from "merge done" Looking forward to hearing your thoughts and anyone's else. Happy to be corrected if I'm off base on any of these. Regards
pgsql-hackers by date: