Thread: Proposal: Global Index for PostgreSQL

Proposal: Global Index for PostgreSQL

From
Dilip Kumar
Date:
PostgreSQL’s table partitioning allows large tables to be broken into
smaller, more manageable pieces for better performance. However, a key
current limitation is the absence of global indexes, which restricts
the use of partitioned tables, especially when you need unique
constraints on columns that aren't part of the partition key.  It's
true that global indexes might seem to go against the core idea of
partitioning. However, they become essential when business needs
dictate partitioning data on one key while also enforcing uniqueness
on a different column. Without global indexes, users simply can't use
partitioning at all in such scenarios.

I've been developing a design and a POC for global indexes to address
this. I'd like to share my work, including the challenges we still
face and my proposed solutions.

Storage
=======
Since global indexes cover tuples from multiple partitions, TIDs are
not sufficient to uniquely identify a tuple. For that I have
introduced a partition identifier, a 4-byte integer; we may argue
further about whether this could be variable length to save space.  I
will describe the partition identifier in more detail later in this
email, along with the reason for choosing it over a plain relation
Oid, which might seem more straightforward.

The partition identifier is stored as part of the index tuple, right
after the last key column. Storing it there makes a lot of things work
in a very straightforward way without any special handling for the
partition identifier column. For example, btree tuples are arranged in
key order, and duplicate keys are arranged in TID order; for a global
index we want duplicate keys arranged in partition identifier order,
with ties on the partition identifier broken by TID order. By storing
the identifier after the last key column and treating that field as
part of the extended key, tuples are automatically arranged in (keys,
partition identifier, TID) order, as desired.  Similarly, suffix
truncation and deduplication become simpler with this design choice.
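
To make the intended ordering concrete, here is a minimal standalone C
sketch; the types and the single int key column are illustrative
stand-ins, not the actual PostgreSQL structures:

#include <stdint.h>

typedef uint32_t PartitionId;                        /* illustrative */
typedef struct { uint32_t block; uint16_t offset; } TupleId;

typedef struct
{
    int         keyval;     /* one int key column, for simplicity */
    PartitionId partid;     /* stored right after the last key column */
    TupleId     tid;        /* heap tuple id within the partition */
} GlobalIndexTuple;

/*
 * Treating the partition identifier as an extended key column gives
 * the desired (keys, partition identifier, TID) ordering with no
 * special cases in the comparison path.
 */
static int
global_index_tuple_cmp(const GlobalIndexTuple *a, const GlobalIndexTuple *b)
{
    if (a->keyval != b->keyval)               /* key order first */
        return (a->keyval < b->keyval) ? -1 : 1;
    if (a->partid != b->partid)               /* duplicates: partid order */
        return (a->partid < b->partid) ? -1 : 1;
    if (a->tid.block != b->tid.block)         /* same partition: TID order */
        return (a->tid.block < b->tid.block) ? -1 : 1;
    if (a->tid.offset != b->tid.offset)
        return (a->tid.offset < b->tid.offset) ? -1 : 1;
    return 0;
}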

Partition Identifier
==============
Before discussing anything about Create Index/Attach/Detach partition,
I think we need to discuss the partition identifier.  The main reason
for not using the relation Oid as the partition identifier is that
doing so would force us to remove all the tuples of a detached
partition from the global index; otherwise it would be very hard to
identify tuples that are no longer valid and should not be scanned.
For example, if the same partition is detached and then reattached, we
would not be able to distinguish the old tuples (which should be
ignored) from the new ones; likewise if the Oid got reassigned to some
other partition after the old partition was dropped and the Oid
counter wrapped around.

To solve this, we've introduced a partition identifier, a 4-byte
integer. We'll use a new catalog called pg_index_partitions to store a
mapping from (partition identifier, global index OID) to relation OID.

The mapping isn't a direct one from partition identifier to relation
ID because each partition will have a different partition identifier
for each global index it's associated with. This is crucial for
correctly handling global indexes in multi-level partition
hierarchies.
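
For illustration, a hypothetical layout of a pg_index_partitions row
could look like the following; the column names are placeholders, not
a committed catalog definition:

#include <stdint.h>

typedef uint32_t Oid;           /* stand-in for PostgreSQL's Oid */

/*
 * Hypothetical pg_index_partitions row.  The (indexoid, partid) pair
 * is the lookup key, which is what lets the same leaf partition carry
 * a different partition identifier for every global index it belongs
 * to.
 */
typedef struct FormData_pg_index_partitions
{
    Oid      indexoid;          /* OID of the global index */
    uint32_t partid;            /* identifier stored in index tuples */
    Oid      partrelid;         /* leaf partition's relation OID;
                                 * marked invalid on DETACH/DROP */
} FormData_pg_index_partitions;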

Consider this example:
- T is the top-level parent and has global index g.
- T1 and T2 are children of T. T2 is itself a partitioned table and
also has global index g2.
- T22 is a child of T2.

Suppose we assign a single partition identifier to T22 and create a
one-to-one mapping to its relation OID. What happens if we detach T2
from T? (While we haven't delved into the detach process yet, our
current thought is to invalidate the partition identifier entry so
that during a scan, when we look up the partition identifier to find
the corresponding reloid, we simply ignore it.)

However, this approach presents a problem. We can't simply invalidate
the entry for T22 entirely, because while its partition identifier
should be considered invalid when scanning global index 'g' (since T2
is detached from T), it should still be valid when scanning global
index 'g2' (which belongs to T2).

To address this, each leaf partition needs to be assigned a separate
partition identifier for every global index it's associated with.
While a separate partition identifier for each level in the partition
hierarchy could suffice, we've opted for the simpler approach of
assigning a distinct identifier per global index. We made this choice
assuming users will be judicious in creating global indexes within
their partition hierarchies, meaning we won't have to worry about
rapidly consuming a large number of partition IDs. Additionally, the
partition identifier counter can be maintained per global index rather
than globally, which avoids growing the counter rapidly.

Creating a Global Index
=================
While creating an index, if the index is marked GLOBAL, there are some
differences compared to a partitioned index:
1) An additional internal partition identifier key column is added.
2) A partition id is allocated to each leaf partition that exists
under the top-level partition on which we are creating the global
index, and the mapping is stored in the pg_index_partitions table.
3) Unlike a partitioned index, the global index is not created on each
child; it is created only on the partitioned relation on which the
user chooses to create it.
4) The index build is modified to scan through all the leaf partitions
and feed a single sort space for inserting into the global index, as
sketched below.
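
A rough sketch of that build flow, with stub functions standing in for
the table-scan and tuplesort machinery (none of these helper names
exist in PostgreSQL):

#include <stdint.h>

typedef uint32_t Oid;

/* Hypothetical hooks standing in for table_beginscan()/tuplesort_*
 * style machinery, heavily simplified. */
extern int  partition_next_key(Oid partrelid, int *keyval); /* 0 at end */
extern void sortspace_add(int keyval, uint32_t partid);
extern void sortspace_sort_and_load(void);

/*
 * Build flow: one pass over every leaf partition feeding a single
 * shared sort space, then one sorted load into the new global index.
 */
static void
global_index_build(const Oid *leaf_relids, const uint32_t *partids,
                   int nleaves)
{
    for (int i = 0; i < nleaves; i++)
    {
        int keyval;

        /* Tag each heap tuple with its partition identifier before it
         * enters the shared sort space. */
        while (partition_next_key(leaf_relids[i], &keyval))
            sortspace_add(keyval, partids[i]);
    }
    sortspace_sort_and_load(); /* (keys, partition identifier, TID) */
}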

Attach Partition
=============
While attaching a partition, if it is a leaf partition, we need to
assign a new partition id with respect to each global index present on
its parent and ancestors, and insert the mapping into the
pg_index_partitions table.  If the partition being attached is itself
a partitioned table, this step has to be done for every leaf partition
in the hierarchy under the table being attached.

Currently we reindex the global indexes present on the parent and
ancestors to which we are attaching the partition(s), but this can be
optimized: for example, we may choose to selectively insert only the
tuples of the partition being attached. In some cases that could be
costly, though, if the attached partition itself has a lot of tuples
compared to the partitions already attached.  So this could be a mixed
approach, decided based on the number of tuples already in the index
versus the number of incoming tuples; for now I am listing this as
open for suggestions.

Detach Partition
=============
When detaching or dropping a partition, we just need to invalidate the
corresponding mapping in pg_index_partitions. Specifically, for each
leaf partition being detached, we'll mark its reloid as invalid within
the (indexoid, partition id) entry. This ensures that during an index
scan, when the system attempts to convert a partition ID to its
reloid, it will recognize this entry as invalid. We'll also need to
consider a mechanism to clean up these invalidated entries at some
point.

Global Index Scan
==============
In short, the global index scan will be similar to a normal index scan.
The key difference is that we'll also need to track heapOid alongside
heapTid within BTScanPosItem. Although we store the partition ID
directly in the index tuple, we can readily convert it to a heapOid by
referencing the pg_index_partitions table. For performance, we'll
maintain a cache within the global index's Relcache entry; this cache
will simply store the mapping of partitions associated with that
specific global index.
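
A minimal sketch of the scan-side bookkeeping, assuming a simple array
cache; the struct and function names are illustrative, not the actual
patch:

#include <stdint.h>

typedef uint32_t Oid;
#define InvalidOid ((Oid) 0)

typedef struct { uint32_t block; uint16_t offset; } TupleId; /* TID stand-in */

/* Hypothetical counterpart of the extended BTScanPosItem: alongside
 * the heap TID we remember which heap (partition) the tuple belongs
 * to, resolved from the tuple's partition identifier. */
typedef struct
{
    TupleId heapTid;    /* tuple location within its partition */
    Oid     heapOid;    /* owning partition's relation OID */
} GlobalScanPosItem;

typedef struct { uint32_t partid; Oid reloid; } PartIdMapEntry;

/*
 * Resolve a partition identifier through the cache kept in the global
 * index's relcache entry.  InvalidOid means the entry was invalidated
 * by DETACH/DROP, and the scan simply skips such tuples.
 */
static Oid
partid_to_reloid(const PartIdMapEntry *cache, int ncached,
                 uint32_t partid)
{
    for (int i = 0; i < ncached; i++)
    {
        if (cache[i].partid == partid)
            return cache[i].reloid;
    }
    /* Cache miss: would consult pg_index_partitions and refill. */
    return InvalidOid;
}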

Furthermore, we must account for new index access paths when global
indexes are present. Currently, set_append_rel_size() generates index
restriction information for each append relation. Now, this
restriction information must also be generated for the partitioned
table itself. Additionally, within set_append_rel_pathlist(), we'll
need to invoke create_index_paths() for the partitioned table. This
ensures that we consider index paths for the partitioned table itself,
which will be global index scan paths.

Locking consideration
=================
Currently, when we perform DML operations on a top-level partitioned
table, we acquire a lock on the entire partition hierarchy beneath it,
which works well. However, if we're directly operating on a leaf
relation, say, inserting a tuple, we only lock that specific relation.
This isn't sufficient for global indexes.

Since a global index can reside on a parent table, inserting a tuple
into a leaf relation also necessitates an insertion into the global
index. This means we also need to lock the parent table on which the
global index is created.

However, even that isn't enough. We actually need to lock the entire
partition hierarchy under the parent table that has the global index.
Here's why: when you insert a tuple into any leaf partition, you'll
also insert it into the global index. If it's a unique index, a
conflict might arise with a key belonging to another partition. To
validate whether the tuple is live or not, we might need to access
data in other partitions.

Based on our chosen design, we identify all these necessary locks
during the planning phase. We then acquire these locks during planning
and store them within the planned statement. This way, if the
statement is prepared and executed later, these locks can be
re-acquired during AcquireExecutorLocks.
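
A minimal sketch of the plan-time step, assuming the hierarchy is
collected with the existing find_all_inheritors() helper; the wrapper
function itself is hypothetical:

#include "postgres.h"
#include "nodes/pg_list.h"
#include "catalog/pg_inherits.h"    /* find_all_inheritors() */
#include "storage/lockdefs.h"       /* LOCKMODE */

/*
 * Sketch: at plan time, lock the whole partition hierarchy under a
 * parent carrying a global index and return the OID list, so it can
 * be recorded in the planned statement and re-acquired later by
 * AcquireExecutorLocks().
 */
static List *
lock_global_index_hierarchy(Oid parent_with_global_index,
                            LOCKMODE lockmode)
{
    /* find_all_inheritors() locks every member of the hierarchy. */
    return find_all_inheritors(parent_with_global_index, lockmode, NULL);
}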

Open Problems/Need suggestions
==========================
Here is a list of the top few problems that would be good to discuss
sooner rather than later:

Vacuum for global indexes
-----------------------------------
I think this has been discussed multiple times in the past whenever we
talked about the global index. The core issue is that, by default,
global indexes are vacuumed with each individual partition. This
becomes incredibly inefficient and costly when dealing with millions
of partitions, as it leads to repeated, expensive scans of a large
global index.

Ideally, global indexes should only be vacuumed once, after all
partitions have been processed. This is because a global index keeps
data from all partitions, meaning a single vacuum operation after all
partition vacuums would be sufficient and far more efficient. This
approach would significantly reduce the overhead associated with
maintaining global indexes in large, partitioned datasets.

Our testing highlights a significant performance bottleneck when
vacuuming with global indexes. In a scenario with 1,000 partitions and
a total of 50 million tuples, vacuuming without global indexes took a
mere 15 seconds. However, after introducing global indexes, the vacuum
time skyrocketed to 45 minutes, a 180-fold increase! This dramatic
slowdown is expected, as the global index, containing data from all
partitions, is effectively vacuumed with each of the 1,000 individual
partition vacuums.

We've made a quick but significant optimization: we now vacuum each
partition and its local indexes first, deferring the global index
vacuum and the second pass of heap vacuuming. Instead, we keep the
dead-TID store (dead tuples) for each partition. Once all partitions
are done, we then vacuum the global index once and perform that second
heap pass across all partitions.
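
A simplified sketch of that reworked flow; the helper names are
hypothetical and stand in for heap_vacuum_rel() with a
skip-global-index option, the btree bulk-delete path, and the deferred
second heap pass:

#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Oid;
typedef struct DeadTidStore DeadTidStore;  /* opaque per-partition store */

extern DeadTidStore *vacuum_partition_collect_dead(Oid partrelid);
extern void vacuum_global_index_once(Oid indexoid, const Oid *partrelids,
                                     DeadTidStore **stores, int n);
extern void heap_second_pass(Oid partrelid, DeadTidStore *store);

static void
vacuum_with_global_index(Oid indexoid, const Oid *parts, int nparts)
{
    DeadTidStore **stores = malloc(nparts * sizeof(*stores));

    /* Pass 1: vacuum heaps and local indexes, deferring the global
     * index and remembering each partition's dead TIDs. */
    for (int i = 0; i < nparts; i++)
        stores[i] = vacuum_partition_collect_dead(parts[i]);

    /* One global index vacuum for the whole hierarchy ... */
    vacuum_global_index_once(indexoid, parts, stores, nparts);

    /* ... then the deferred second heap pass, per partition. */
    for (int i = 0; i < nparts; i++)
        heap_second_pass(parts[i], stores[i]);

    free(stores);
}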

This change slashed our vacuuming time from 45 minutes down to just 40
seconds! This clearly shows we have plenty of room for optimization,
and this challenge isn't a showstopper. I'll share more details on
other solutions I'm testing in an upcoming email to keep this one
concise.


Global Index rewrite
--------------------------
One particular class of things that needs careful consideration is
table-rewriting operations. We have a few of those: CLUSTER, VACUUM
FULL, ALTER TABLE, etc. When such an operation occurs, all the TIDs
potentially change, so tablecmds.c arranges to rebuild indexes. When
there’s a global index involved, we also need to rebuild global
indexes. Or, as a possible alternative strategy, we could allocate a
new partition ID and just leave the old index entries to be cleaned
out eventually, but that seems to risk a lot of bloat. A key point
here is that we don’t want to rebuild the same global indexes over and
over: if someone does a table-rewriting operation across a whole
partitioning hierarchy, we shouldn't rebuild the same global indexes
multiple times.

We also need to consider what happens when relations are truncated.
Here, the two possible strategies are: (1) assign new partition IDs
and leave the old index entries around, (2) truncate the entire index
and then reinsert entries for any untruncated partitions. As above (1)
risks bloat, but it also makes TRUNCATE fast. Of course, if the whole
partitioning hierarchy is truncated all at once, then (2) is clearly
better.

I'm aiming to submit the first WIP patch set before the July
commitfest. It won't have all the issues ironed out yet, but the main
design will be functional.

Thanks, Robert, for many of the key design ideas and regular
discussion throughout designing this. I'd also like to thank Joe,
Peter Geoghegan, Alvaro, and Masahiko Sawada for discussing the
independent issues with me offlist.

--
Regards,
Dilip Kumar
Google



Re: Proposal: Global Index for PostgreSQL

From
wenhui qiu
Date:
Hi Dilip Kumar 
   Thank you for working on this. I remember six years ago there was talk about global indexes; you can see whether this mailing list thread has any useful references: (https://www.postgresql.org/message-id/CALtqXTcurqy1PKXzP9XO%3DofLLA5wBSo77BnUnYVEZpmcA3V0ag%40mail.gmail.com)



Thanks

Re: Proposal: Global Index for PostgreSQL

From
Dilip Kumar
Date:
On Fri, Jun 6, 2025 at 1:01 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
>
> Hi Dilip Kumar
>    Thank you for working on this. I remember six years ago there was talk about global indexes; you can see whether
> this mailing list thread has any useful references:
> (https://www.postgresql.org/message-id/CALtqXTcurqy1PKXzP9XO%3DofLLA5wBSo77BnUnYVEZpmcA3V0ag%40mail.gmail.com)
>
Sure, thanks.

--
Regards,
Dilip Kumar
Google



Re: Proposal: Global Index for PostgreSQL

From
Nikita Malakhov
Date:
Hi Dilip!

Global Indexes is a very interesting functionality that has both significant advantages
and drawbacks, and the community seems not ready to accept it without very strong
motivation.
There was a more recent approach to Global index problem [1], please check it out.

I've read your proposal and have several questions:
1) The new catalog table with global index partitions would immediately affect
interaction with user tables that have global indexes, because of the
corresponding locks that must be taken for [in]validation and attach/detach
operations; this should be investigated.
2) Changing relation OIDs (by, say, VACUUM FULL) would immediately result in
index inconsistency; what do you propose to do with internal processes that
could change relation OIDs? Also this question
3) Would a single sort space be enough for the more typical case where we have
hundreds of partitions with hundreds of millions of records in each? That is a
normal production case for partitioned tables.
4) Update-heavy partitioned tables need to run vacuum frequently. A significant
vacuum slowdown would push them beyond their SLAs without corresponding
significant improvements.


Thank you!

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

Re: Proposal: Global Index for PostgreSQL

From
Dilip Kumar
Date:
On Mon, Jun 9, 2025 at 2:03 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
>
> Hi Dilip!

Thanks, Nikita, for your response and for reading my proposal.

> Global Indexes is a very interesting functionality that has both significant advantages
> and drawbacks, and the community seems not ready to accept it without very strong
> motivation.

I understand that this is a hard problem and needs changes in many
critical modules. I don't think there should be a problem with the
motivation of this work, but I believe the main issue lies in the
project's complexity.

> There was a more recent approach to Global index problem [1], please check it out.

I've reviewed the proposal, and I understand it aims to address
uniqueness on partitioned tables for non-partition key columns.
However, I'm concerned about the basic design principles' scalability.
I believe it won't scale effectively beyond a relatively small number
of partitions, and this limitation will be quite surprising to users.
Specifically, checking uniqueness during inserts/updates across all
indexes on each partition (since there's no global index) will become
a significant bottleneck.

> I've read you proposal and have several questions:

> 1) New catalog table with global index partitions would immediately affect interaction
> with user tables with global indexes because of corresponding locks that should be
> taken for [in]validation and attach/detach operations, this should be investigated;

Yeah, that's right, but attach/detach are DDL and not the most
frequent operations, so are you worried about the performance impact
due to locking?

> 2) Changing relation OIDs (by, say, vacuum full) would immediately result in index
> inconsistency, what do you suppose to do with internal processes that could change
> relation OIDs? Also this question

I want to clarify that we don't store relation OIDs directly in the
global index. Instead, the global index holds partition IDs, and the
mapping from partition ID to relation OID is managed in a new catalog
table, pg_index_partitions. It's important to note that VACUUM FULL
operations only alter the relfilenumber (the disk file OID), not the
relation OID itself, which remains constant for the lifetime of the
relation.

What does change during a VACUUM FULL are the TIDs (tuple IDs).
Because of this, all indexes, including global indexes, are reindexed.
As I mentioned in my proposal, there's a significant opportunity for
optimization here. Reindexing large global indexes is a costly
operation, and I've proposed some ideas to improve this process.

> 3) Would single sort space be enough for a more typical case when we have
> hundreds of partitions with hundreds of millions records in each? It is a normal
> production case for partitioned tables.

That's a good point for consideration. In general, users ideally
wouldn't use a global index everywhere; it really comes down to their
specific use case. I agree that global indexes shouldn't be a default
choice: they're most beneficial when a user's data access patterns
inherently prevent effective partitioning without them. In such
scenarios, the amount of data in the sort space would ideally remain
comparable to an unpartitioned table.

> 4) Update-heavy partitioned tables that should run vacuum frequently. Significant
> vacuum slowdown would result in going beyond SLAs without corresponding
> significant improvements.
>
You've got it. I'm on board with prioritizing a VACUUM optimization
solution for partitioned tables with global indexes. My initial
proposal touched on a proof-of-concept experiment, which indicated no
significant performance hit with global index after the optimization.
I'll share the detailed VACUUM optimization proposal in this thread
within the next couple of days.

--
Regards,
Dilip Kumar
Google



Re: Proposal: Global Index for PostgreSQL

From
Bruce Momjian
Date:
On Mon, Jun  9, 2025 at 03:28:38PM +0530, Dilip Kumar wrote:
> On Mon, Jun 9, 2025 at 2:03 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
> > Global Indexes is a very interesting functionality that has both significant advantages
> > and drawbacks, and the community seems not ready to accept it without very strong
> > motivation.
> 
> I understand that this is a hard problem and needs changes in many
> critical modules. I don't think there should be a problem with the
> motivation of this work, but I believe the main issue lies in the
> project's complexity.
...
> That's a good point for consideration. In general, users ideally
> wouldn't use a global index everywhere; it really comes down to their
> specific use case. I agree that global indexes shouldn't be a default
> choice: they're most beneficial when a user's data access patterns
> inherently prevent effective partitioning without them. In such
> scenarios, the amount of data in the sort space would ideally remain
> comparable to an unpartitioned table.

There are certainly use cases where this would be helpful, but I think
the big question is whether it would have so many negatives that most
people who try to use it would eventually remove it.  I have heard that
happened in other relational systems that support global indexes, so I
think we have to consider that possibility.  The problem is you might
need to actually write the patch to find out.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.



Re: Proposal: Global Index for PostgreSQL

From
Dilip Kumar
Date:
On Wed, Jun 11, 2025 at 1:08 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Mon, Jun  9, 2025 at 05:51:25PM -0400, Bruce Momjian wrote:
> > On Mon, Jun  9, 2025 at 03:28:38PM +0530, Dilip Kumar wrote:
> > There are certainly use cases where this would be helpful, but I think
> > the big question is whether it would have so many negatives that most
> > people who try to use it would eventually remove it.  I have heard that
> > happened to other relational systems who support global indexes, so I
> > think we have to consider that possibility.  The problem is you might
> > need to actually write the patch to find out.
>
> FYI, I wrote a blog about global indexes in 2020:

Oh interesting, I would be happy to have a look at it, Thanks.

--
Regards,
Dilip Kumar
Google



Re: Proposal: Global Index for PostgreSQL

From
Dilip Kumar
Date:
On Mon, Jun 9, 2025 at 3:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jun 9, 2025 at 2:03 PM Nikita Malakhov <hukutoc@gmail.com> wrote:

> > 4) Update-heavy partitioned tables that should run vacuum frequently. Significant
> > vacuum slowdown would result in going beyond SLAs without corresponding
> > significant improvements.
> >
> You've got it. I'm on board with prioritizing a VACUUM optimization
> solution for partitioned tables with global indexes. My initial
> proposal touched on a proof-of-concept experiment, which indicated no
> significant performance hit with global index after the optimization.
> I'll share the detailed VACUUM optimization proposal in this thread
> within the next couple of days.

As discussed earlier, one of the main problems is that global indexes
are vacuumed along with each partition, whereas ideally they should be
vacuumed only once, after all the partitions have been vacuumed.

So my proposal is that we make some infrastructure changes in the
vacuum APIs such that there is an option to tell heap_vacuum_rel() to
skip the global index vacuum and return the deadtid_store; the vacuum
layer will keep these deadtid_stores, hashed by reloid, until it has
vacuumed all the partitions covered by a global index.  Once that is
done, it will vacuum the global index. We also need to modify
vac_tid_reaped() so that it takes an additional reloid input, finds
the appropriate deadtid_store by looking in the hash, and then looks
up the dead TID in the respective deadtid_store.
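
A minimal standalone sketch of that reloid-aware lookup; the helper
names and the hash are hypothetical stand-ins for the POC's data
structures:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;
typedef struct { uint32_t block; uint16_t offset; } TupleId; /* TID stand-in */
typedef struct DeadTidStore DeadTidStore;

/* Hypothetical helpers: a reloid-keyed hash of per-partition dead-TID
 * stores, and a membership probe within one store. */
extern DeadTidStore *deadtid_hash_lookup(Oid reloid);
extern bool deadtid_store_contains(const DeadTidStore *store, TupleId tid);

/*
 * Sketch of the reloid-aware vac_tid_reaped(): during the single
 * global index vacuum pass, each index tuple's partition identifier
 * resolves to a reloid, which selects the right per-partition store
 * before probing for the dead TID.
 */
static bool
global_index_tid_reaped(Oid reloid, TupleId tid)
{
    DeadTidStore *store = deadtid_hash_lookup(reloid);

    if (store == NULL)
        return false;       /* partition not vacuumed in this cycle */
    return deadtid_store_contains(store, tid);
}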

We also need to do something for autovacuum, because currently
autovacuum workers scan pg_class, identify the relations that need to
be vacuumed, and vacuum them one at a time.  However, once we have
taught the vacuum machinery to first vacuum all the partitions and
then perform the global index vacuum, the autovacuum worker should
know which partitions are supposed to be vacuumed together, so that it
can pass the list of all those partitions to the vacuum machinery at
once.

I believe this enhancement can be implemented in autovacuum without
significant difficulty. Currently, autovacuum scans pg_class to
generate a list of relations requiring vacuuming. For partitioned
tables that have a global index, we can extend this process by
additionally maintaining a list of all their leaf relations within the
parent table's entry.

An autovacuum worker would then process all the leaf relations in this
complete hierarchy. To effectively prevent other workers from
attempting to vacuum the same hierarchy concurrently, the worker would
publish the top-most partitioned relation ID (which identifies the
table with the global index) as MyWorkerInfo->wi_tableoid. This
mechanism ensures that if one worker is processing a partitioned
table, the entire set of child relations under that top-level ID is
automatically skipped by other workers.

While this approach offers benefits, it's important to acknowledge
certain potential drawbacks. One concern is a possible impact on
parallelism, as currently each partition might be vacuumed by a
separate worker, but with this change, all partitions covered by the
same global index would have to be processed by a single worker.
Another significant challenge lies in effectively choosing which
partitions to vacuum; for instance, if a table with a global index has
1000 partitions and only 10 meet the vacuum threshold in a given
cycle, this method might not be very efficient. Although we wouldn't
vacuum the global index once for every partition, we could still end
up vacuuming it numerous times by the time all partitions are
eventually processed. To mitigate this, an alternative strategy could
involve proactively vacuuming partitions that haven't fully met their
vacuum threshold but are close (e.g., reached 50% of the threshold).
This would allow us to combine more partitions for a single vacuum
operation, thereby reducing the number of times the global index needs
to be vacuumed.
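
As a toy sketch of that selection rule (the helper and the 0.5 factor
are just illustrations of the example above):

#include <stdint.h>

typedef uint32_t Oid;

/* Hypothetical helper: fraction of a partition's vacuum threshold
 * already consumed by dead tuples. */
extern double vacuum_threshold_fraction(Oid partrelid);

/*
 * Select partitions for one combined cycle: take those at or past the
 * threshold, plus near-threshold ones, so more partitions share a
 * single global index vacuum.
 */
static int
select_partitions_for_cycle(const Oid *parts, int nparts, Oid *out)
{
    int n = 0;

    for (int i = 0; i < nparts; i++)
    {
        if (vacuum_threshold_fraction(parts[i]) >= 0.5)
            out[n++] = parts[i];
    }
    return n;
}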


--
Regards,
Dilip Kumar
Google



Re: Proposal: Global Index for PostgreSQL

From
Masahiko Sawada
Date:
On Sat, Jun 14, 2025 at 2:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jun 9, 2025 at 3:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Jun 9, 2025 at 2:03 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
>
> > > 4) Update-heavy partitioned tables that should run vacuum frequently. Significant
> > > vacuum slowdown would result in going beyond SLAs without corresponding
> > > significant improvements.
> > >
> > You've got it. I'm on board with prioritizing a VACUUM optimization
> > solution for partitioned tables with global indexes. My initial
> > proposal touched on a proof-of-concept experiment, which indicated no
> > significant performance hit with global index after the optimization.
> > I'll share the detailed VACUUM optimization proposal in this thread
> > within the next couple of days.
>
> As discussed earlier the one of main problems is that the global
> indexes are vacuumed along with each partition whereas logically it
> should be vacuumed only once when all the partitions are vacuum in an
> ideal world.
>
> So my proposal is that we make some infrastructure change in vacuum
> api's such that there is option to tell heap_vacuum_rel() to skip the
> global index vacuum and also return back the deadtid_store, vacuum
> layer will store these deadtid_store, hash it by the reloid until it
> vacuum all the partitions which are covered by a global index.  Once
> that is done it will vacuum the global index, we also need to modify
> the vac_tid_reaped() so that it can take additional input of reloid so
> that it can find the appropriate deadtid_store by looking into the
> hash and then look up the dead tid in respective deadtid_store.

Does it need to keep holding dead TIDs for each partition until it
completes vacuuming all partitions that are covered by the global
index? If so, it would end up holding a huge amount of memory in cases
where there are many partitions. How does maintenance_work_mem (or
autovacuum_work_mem) work in this context? Also, what if the
autovacuum worker who is processing the partitioned table with global
indexes gets cancelled? I guess that we would need to scan the
partitions again in order to collect dead TIDs to vacuum the global
index but it would be very expensive.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Proposal: Global Index for PostgreSQL

From
Dilip Kumar
Date:
On Wed, Jun 18, 2025 at 4:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Jun 14, 2025 at 2:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >

Thanks for your opinion, Sawada-San.

> Does it need to keep holding dead TIDs for each partition until it
> completes vacuuming all partitions that are covered by the global
> index? If so, it would end up holding a huge amount of memory in cases
> where there are many partitions. How does maintenance_work_mem (or
> autovacuum_work_mem) work in this context?

It will keep holding the dead TIDs until we have vacuumed all the
partitions or we run out of 'maintenance_work_mem'. Consider this
similar to the case where there is no concept of a global index and,
to support a unique constraint on a non-partition-key column, the user
has to keep one giant table without partitioning it: every time we run
out of 'maintenance_work_mem' we need to vacuum the indexes. The same
theory applies to the global index.

> Also, what if the
> autovacuum worker who is processing the partitioned table with global
> indexes gets cancelled? I guess that we would need to scan the
> partitions again in order to collect dead TIDs to vacuum the global
> index but it would be very expensive.

You're right, but that's comparable to the cost of managing a single,
unpartitioned giant table. I believe the primary value of global
indexes lies in enabling users to partition tables that otherwise
couldn't be. So, while certain maintenance tasks might incur similar
costs to a single large table, you'll gain significant advantages
during many other DML operations due to partitioning that wouldn't be
possible without global indexes.

--
Regards,
Dilip Kumar
Google



Re: Proposal: Global Index for PostgreSQL

From
Dilip Kumar
Date:
On Wed, Jun 18, 2025 at 4:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 18, 2025 at 4:38 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
Here is the first WIP version of the patch set.  The commit message in
each patch explains in detail what it does, so I'm not adding the
descriptions in the email.  The patches apply on the latest head
(732061150b004385810e522f8629f5bf91d977b7).

Open issues yet to be handled:
=========================
1) The vacuum optimization described above is still a highly unstable
patch, so it is not attached here. With this patch set, if you create
a global index on a large table with a lot of partitions, there will
be a huge regression in vacuum performance; that should be resolved by
the vacuum optimization patch, which I am planning to post this month.
2) Attaching a partition that has a different column order than the
parent and creating a global index on it does not work yet; I'm still
working on that.
3) For unique checking in btree, a global index may find a conflicting
tuple that belongs to a different partition, so the relation open is
done on the fly. Ideally that should be done in the executor and
passed down to btree, but I'm still identifying how to do that with
minimal changes to the AM interfaces, or whether it can be done
without AM changes.
4) Need to write more test cases for REINDEX TABLE and TRUNCATE TABLE,
and need to tighten up the global index rebuild so that it doesn't get
rebuilt multiple times and doesn't skip rebuilding when needed.
5) ALTER COLUMN ... SET TYPE, which triggers a global index rewrite,
is not handled properly yet. This needs some more work in the ALTER
TABLE machinery where we identify the list of indexes to rebuild.
6) Need to perform performance tests for SELECT/UPDATE/INSERT cases;
we already know the VACUUM performance.
7) global_index.sql takes a long time to execute as I have kept a high
data load, so that needs to be simplified.

Note: the patches are still WIP and might have many loose ends; for
now, make check-world passes, including the global index test cases.

Credit: I already mentioned credits when sending the first design
email but missed some people, so adding them again here.

1. Robert: for many of the key design ideas and regular discussion
throughout designing this.
2. Joe: also had regular discussions on this and made many
suggestions, especially related to vacuum.
3. Peter Geoghegan, Alvaro, and Masahiko Sawada for discussing the
independent issues with me offlist.
4. I got the idea for the syntax for creating global indexes, and for
keeping a cache of the mapping from reloid to relation descriptor
during a global index scan, from some very old threads related to
global indexes (not able to find the thread now).

--
Regards,
Dilip Kumar
Google

Attachment

Re: Proposal: Global Index for PostgreSQL

From
Amit Langote
Date:
Hi Dilip,

Happy to see you working on this.  It’s clear a lot of thought has
gone into the design.

On Tue, Jul 1, 2025 at 6:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 6) Need to perform a performance test, for SELECT/UPDATE/INSERT cases,
> we already know the VACUUM performance.

One point I want to check my understanding of is around the locking
implications of global index scans, especially in prepared statement
scenarios.

I’ve been working on improving how we handle partition locking during
execution of generic plans. Specifically, I committed a patch to defer
locking of partitions until after pruning during ExecutorStart(), so
we avoid taking locks on partitions that aren’t actually needed --
even when the plan contains scans on all partitions. That patch was
later reverted, as Tom pointed out that the plan invalidation logic
wasn't cleanly handled. But the goal remains: to avoid locking
unnecessary partitions, particularly in high-partition-count OLTP
setups that use PREPARE/EXECUTE.

The proposed global index design, IIUC, requires locking all leaf
partitions up front during planning, and I guess during
AcquireExecutorLocks() when using a cached plan, because the index
scan could return tuples from any partition. This seems to directly
undercut that effort: we'd be back to generic plans causing broad
locking regardless of actual runtime needs.

I understand that this is currently necessary, given that a global
index scan is a single node without per-partition awareness. But it
might be worth considering whether the scan could opportunistically
defer heap relation locking until it returns a tuple that actually
belongs to a particular partition -- similar to how inserts into
partitioned tables only lock the target partition at execution time.
Or did I miss that inserts also need to lock all partitions up front
when global indexes are present, due to cross-partition uniqueness
checks?

Let me know if I’ve misunderstood the design.

--
Thanks, Amit Langote



Re: Proposal: Global Index for PostgreSQL

From
Dilip Kumar
Date:
On Tue, Jul 1, 2025 at 7:12 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Hi Dilip,
>
> Happy to see you working on this.  It’s clear a lot of thought has
> gone into the design.


Thanks, Amit.  And thanks for your comment.

> On Tue, Jul 1, 2025 at 6:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > 6) Need to perform a performance test, for SELECT/UPDATE/INSERT cases,
> > we already know the VACUUM performance.
>
> One point I want to check my understanding of is around the locking
> implications of global index scans, especially in prepared statement
> scenarios.

Sure

> I’ve been working on improving how we handle partition locking during
> execution of generic plans. Specifically, I committed a patch to defer
> locking of partitions until after pruning during ExecutorStart(), so
> we avoid taking locks on partitions that aren’t actually needed --
> even when the plan contains scans on all partitions. That patch was
> later reverted, as Tom pointed out that the plan invalidation logic
> wasn't cleanly handled.

Yes, I was following that thread, and while working on locking for the
global index I had in mind that I would have to do something about the
locking once that patch was in.  Unfortunately it got reverted, and I
haven't since put any effort into reconsidering how locking is handled
for global indexes.

> But the goal remains: to avoid locking
> unnecessary partitions, particularly in high-partition-count OLTP
> setups that use PREPARE/EXECUTE.

That makes sense.

> The proposed global index design, IIUC, requires locking all leaf
> partitions up front during planning, and I guess during
> AcquireExecutorLocks() when using a cached plan, because the index
> scan could return tuples from any partition. This seems to directly
> undercut that effort: we'd be back to generic plans causing broad
> locking regardless of actual runtime needs.

Just trying to understand the locking difference with/without a global
index: let's assume we have a normal partitioned index on a
non-partition-key column. If we issue a scan on the partitioned table,
then internally expand_partitioned_rtentry() will take a lock on all
the partitions, because we cannot prune any partition.  Similarly,
with the global index, all the child partitions under the top
partition that has the global index will be locked.  So in this case
there is no difference.

> I understand that this is currently necessary, given that a global
> index scan is a single node without per-partition awareness. But it
> might be worth considering whether the scan could opportunistically
> defer heap relation locking until it returns a tuple that actually
> belongs to a particular partition -- similar to how inserts into
> partitioned tables only lock the target partition at execution time.
> Or did I miss that inserts also need to lock all partitions up front
> when global indexes are present, due to cross-partition uniqueness
> checks?
>
> Let me know if I’ve misunderstood the design.

So the difference is in the cases where we are operating directly on
the leaf table, e.g. inserting directly into the leaf relation.
Currently we just need to lock that partition, but if there is a
global index we need to lock the other siblings as well (in short, all
the leaves under the parent that has the global index), because if the
global index is unique we might need to check for unique conflicts in
the other leaves as well.  I believe when the table is partitioned,
operating directly on a leaf might not be the most preferred way, and
with a global index only this case, where we are doing DML directly on
the leaf, will be impacted.  I am not sure how much we can delay the
locking in this case, because, e.g., for an insert we will only
identify which partition has a duplicate key while inserting into the
btree.

--
Regards,
Dilip Kumar
Google



Re: Proposal: Global Index for PostgreSQL

From
Amit Langote
Date:
On Wed, Jul 2, 2025 at 1:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Jul 1, 2025 at 7:12 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > I’ve been working on improving how we handle partition locking during
> > execution of generic plans. Specifically, I committed a patch to defer
> > locking of partitions until after pruning during ExecutorStart(), so
> > we avoid taking locks on partitions that aren’t actually needed --
> > even when the plan contains scans on all partitions. That patch was
> > later reverted, as Tom pointed out that the plan invalidation logic
> > wasn't cleanly handled.
>
> Yes I was following that thread, and at times when I was working on
> locking for global index I had in mind that I would have to do
> something to the locking after that patch is in.  Unfortunately that
> got reverted and then I didn't put any effort in reconsidering how
> locking is handled for global index.
...
>  But the goal remains: to avoid locking
> > unnecessary partitions, particularly in high-partition-count OLTP
> > setups that use PREPARE/EXECUTE.
>
> That makes sense.

Ok, good to know you were keeping a tab on it.

> > The proposed global index design, IIUC, requires locking all leaf
> > partitions up front during planning, and I guess during
> > AcquireExecutorLocks() when using a cached plan, because the index
> > scan could return tuples from any partition. This seems to directly
> > undercut that effort: we'd be back to generic plans causing broad
> > locking regardless of actual runtime needs.
>
> Just trying to understand the locking difference more with/without
> global index, let's assume we have normal partitioned index on non
> partition key column, and if we issue a scan on the partitioned table
> then internally from expand_partitioned_rtentry() we will take lock on
> all the partitions, because we can not prune any partition.
> Similarly, with the global index also all the child partitions under
> the top partition on which we have global index will be locked.  So in
> this case we do not have a difference.

Just to clarify -- I was hoping that, at least for SELECTs, we
wouldn’t need to lock all leaf partitions up front.

One of the potential selling points of global indexes (compared to
partitioned indexes) is that we can avoid expand_partitioned_rtentry()
and the full scan path setup per partition, though that's admittedly
quite an undertaking.  So I was imagining we could just lock the
parent and the global index during planning, and only lock individual
heap relations at execution time -- once we know which partition the
returned tuple belongs to.

Locking isn’t cheap -- and in workloads with thousands of partitions,
it becomes a major source of overhead, especially when the query
doesn't contain pruning quals and ends up touching only a few
partitions in practice. So I think it’s worth seeing if we can avoid
planning-time locking of all partitions in at least the SELECT case,
even if INSERTs may require broader locking due to uniqueness checks,
but see below...

> > I understand that this is currently necessary, given that a global
> > index scan is a single node without per-partition awareness. But it
> > might be worth considering whether the scan could opportunistically
> > defer heap relation locking until it returns a tuple that actually
> > belongs to a particular partition -- similar to how inserts into
> > partitioned tables only lock the target partition at execution time.
> > Or did I miss that inserts also need to lock all partitions up front
> > when global indexes are present, due to cross-partition uniqueness
> > checks?
> >
> > Let me know if I’ve misunderstood the design.
>
> So there difference is in the cases, where we are directly operating
> on the leaf table, e.g. if you inserting directly on the leaf
> relation, currently we just need to lock that partition, but if there
> is global index we need to lock other siblings as well (in short all
> the leaf under the parent which has global index) because if the
> global index is unique we might need to check unique conflict in other
> leafs as well.  I believe when the table is partitioned, it might not
> be the most preferred way to operate directly on the leaf, and with
> global index only this case will be impacted where we are doing DML
> directly on the leaf.  I am not sure in this case how much delay we
> can do in locking, because e.g. for insert we will only identify which
> partition has a duplicate key while inserting in the btree.

Hmm, it’s my understanding that with a true global index, meaning a
single btree structure spanning all partitions, uniqueness conflicts
are detected by probing the index after inserting the tuple into the
heap. So unless we find a matching key in the index, there is no need
to consult any other partitions.

Even if a match is found, we only need to access the heap page for
that specific TID to check visibility, and that would involve just one
partition.

Why then do we need to lock all leaf partitions in advance? It seems
like we could defer locking until the uniqueness check identifies a
partition that actually contains a conflicting tuple, and only then
lock that one heap.
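
In rough code terms, the flow I'm imagining is something like this
(every helper name here is hypothetical):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;
typedef struct { uint32_t block; uint16_t offset; } TupleId; /* TID stand-in */

/* Hypothetical helpers: probe the global index for a matching key,
 * lock one heap relation, and check tuple visibility in it. */
extern bool global_index_probe(Oid indexoid, int keyval,
                               Oid *conflict_rel, TupleId *conflict_tid);
extern void lock_relation(Oid reloid);
extern bool tuple_is_live(Oid reloid, TupleId tid);

/*
 * Deferred-locking uniqueness check: no partition heap is locked
 * unless the index probe actually implicates one.
 */
static bool
unique_conflict_exists(Oid indexoid, int keyval)
{
    Oid     conflict_rel;
    TupleId conflict_tid;

    if (!global_index_probe(indexoid, keyval,
                            &conflict_rel, &conflict_tid))
        return false;           /* no matching key: no heap visited */

    lock_relation(conflict_rel); /* lock only the implicated heap */
    return tuple_is_live(conflict_rel, conflict_tid);
}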

I understand that in some earlier floated ideas for enforcing global
uniqueness (perhaps only briefly mentioned in past discussions), the
approach was to scan all per-partition indexes, which would have
required locking everything up front. But with a unified global index,
that overhead seems avoidable.

Is there something subtle that I am missing that still requires
locking all heaps in advance?

--
Thanks, Amit Langote