[GSoC 2026] - B-tree Index Bloat Reduction - Introduction - Mailing list pgsql-hackers

From Salma El-Sayed
Subject [GSoC 2026] - B-tree Index Bloat Reduction - Introduction
Date
Msg-id CANBEAPFW64QUGmMYoPekV6jGrn5Yx8kTNfeEtcTG9CCBUUBoLQ@mail.gmail.com
Whole thread
List pgsql-hackers
Hello Hackers!

I will be working on the B-tree Index Bloat Reduction (Page Merge) project for GSoC 2026 [1]: For the last 2 months, I have been actively engaging with the mentors about the project. Here is a brief introduction about myself, the project, and the work done so far.

I am being mentored by Kirk, Andreas, Pavlo, Nikolay, and Andrei. I am sending this introductory email to share my progress and to learn how to properly engage with the PostgreSQL Community.

*Brief introduction about me*
I am a final-year Computer Engineering student. I recently completed a training program at STMicroelectronics where I built various projects in C, including a custom Unix-like shell [2] and a dynamic memory manager overriding libc's malloc [3]. For the last couple of months, I have also been working on the CMU 15-445/645 database systems course alongside my university work. Through this course, I built a Buffer Pool Manager and a B+Tree Index. (I cannot provide the code publicly due to academic integrity rules, but I passed all tests for both projects [4]).

Working with my mentors, I have reviewed and defined my proposed solution to the bloat problem [5] and have already written tooling to help visualize the nbtree before and after a merge [6].*project review*
B-tree indexes can become severely bloated after heavy UPDATE/DELETE workloads - in production systems, indexes with 90%+ bloat are common. The core engine currently has no way to consolidate two sparse pages into one: VACUUM can only delete pages that are completely empty.

*proposed approach*

To solve this, I propose introducing a B-tree page merge mechanism. The core idea is to identify adjacent sparse pages that share the same parent, consolidate their tuples into a single page, and update the parent node's downlinks accordingly.

While my initial proposal outlines this specific mechanism, I intend to keep the approach completely flexible. I fully expect the design to evolve based on the community's feedback and what we learn during the analysis phase. Our immediate plan is to finalize this analysis, identify impacted areas (WAL, reverse/forward scans, RECOVERY), and ensure we have comprehensive testing strategies mapped out.

As a GSoC participant, I am willing to do the hard work and I know this might not get committed during the GSoC2026, but I will stick around if it is close, and if not, I will try to leave it in good enough shape that a more permanent Hacker can pick this up and carry it through.

-----
[1] wiki.postgresql.org/wiki/GSoC_2026#B-tree_Index_Bloat_Reduction_(Page_Merge)
[2] github.com/salmaaliia/My-Shell
[3] github.com/salmaaliia/Heap-Memory-Manager
[4] B-Tree Index, Buffer Pool Manager
[5] docs.google.com/document/d/1PIMG0N__4BIB0uDWOfcK-AruatNL8SCTlfTlBTCaoCo/edit?usp=sharing
[6] github.com/salmaaliia/postgres/tree/tooling

Best regards,
Salma El-Sayed

pgsql-hackers by date:

Previous
From: Ayush Tiwari
Date:
Subject: Re: Refactor code around GUC default_toast_compression
Next
From: Amit Kapila
Date:
Subject: Re: [PATCH] Fix stale relation close in sequence synchronization