Thread: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

15 December 2023, 19:07:29

Hello, hackers!

I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY
improvements) in some lighter way.

Yes, a serious bug was (2) caused by this optimization and now it reverted.

But what about a more safe idea in that direction:
1) add new horizon which ignores PROC_IN_SAFE_IC backends and standbys queries
2) use this horizon for settings LP_DEAD bit in indexes (excluding
indexes being built of course)

Index LP_DEAD hints are not used by standby in any way (they are just
ignored), also heap scan done by index building does not use them as
well.

But, at the same time:
1) index scans will be much faster during index creation or standby
reporting queries
2) indexes can keep them fit using different optimizations
3) less WAL due to a huge amount of full pages writes (which caused by
tons of LP_DEAD in indexes)

The patch seems more-less easy to implement.
Does it worth being implemented? Or to scary?

[1]: https://postgr.es/m/20210115133858.GA18931@alvherre.pgsql
[2]: https://postgr.es/m/17485-396609c6925b982d%40postgresql.org

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Matthias van de Meent

Date:

15 December 2023, 21:11:59

On Fri, 15 Dec 2023, 20:07 Michail Nikolaev, <michail.nikolaev@gmail.com> wrote:

Hello, hackers!

I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY
improvements) in some lighter way.

Yes, a serious bug was (2) caused by this optimization and now it reverted.

But what about a more safe idea in that direction:
1) add new horizon which ignores PROC_IN_SAFE_IC backends and standbys queries
2) use this horizon for settings LP_DEAD bit in indexes (excluding
indexes being built of course)

Index LP_DEAD hints are not used by standby in any way (they are just
ignored), also heap scan done by index building does not use them as
well.

But, at the same time:
1) index scans will be much faster during index creation or standby
reporting queries
2) indexes can keep them fit using different optimizations
3) less WAL due to a huge amount of full pages writes (which caused by
tons of LP_DEAD in indexes)

The patch seems more-less easy to implement.
Does it worth being implemented? Or to scary?

I hihgly doubt this is worth the additional cognitive overhead of another liveness state, and I think there might be other issues with marking index tuples dead in indexes before the table tuple is dead that I can't think of right now.

I've thought about alternative solutions, too: how about getting a new snapshot every so often?

We don't really care about the liveness of the already-scanned data; the snapshots used for RIC are used only during the scan. C/RIC's relation's lock level means vacuum can't run to clean up dead line items, so as long as we only swap the backend's reported snapshot (thus xmin) while the scan is between pages we should be able to reduce the time C/RIC is the one backend holding back cleanup of old tuples.

Kind regards,

Matthias van de Meent

Neon (https://neon.tech)

[1]: https://postgr.es/m/20210115133858.GA18931@alvherre.pgsql
[2]: https://postgr.es/m/17485-396609c6925b982d%40postgresql.org

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

17 December 2023, 20:14:27

Hello!

> I've thought about alternative solutions, too: how about getting a new snapshot every so often?
> We don't really care about the liveness of the already-scanned data; the snapshots used for RIC
> are used only during the scan. C/RIC's relation's lock level means vacuum can't run to clean up
> dead line items, so as long as we only swap the backend's reported snapshot (thus xmin) while
> the scan is between pages we should be able to reduce the time C/RIC is the one backend
> holding back cleanup of old tuples.

Hm, it looks like an interesting idea! It may be more dangerous, but
at least it feels much more elegant than an LP_DEAD-related way.
Also, feels like we may apply this to both phases (first and the second scans).
The original patch (1) was helping only to the second one (after call
to set_indexsafe_procflags).

But for the first scan we allowed to do so only for non-unique indexes
because of:

> * The reason for doing that is to avoid
> * bogus unique-index failures due to concurrent UPDATEs (we might see
> * different versions of the same row as being valid when we pass over them,
> * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
> * does not contain any tuples added to the table while we built the index.

Also, (1) was limited to indexes without expressions and predicates
(2) because such may execute queries to other tables (sic!).
One possible solution is to add some checks to make sure no
user-defined functions are used.
But as far as I understand, it affects only CIC for now and does not
affect the ability to use the proposed technique (updating snapshot
time to time).

However, I think we need some more-less formal proof it is safe - it
is really challenging to keep all the possible cases in the head. I’ll
try to do something here.
Another possible issue may be caused by the new locking pattern - we
will be required to wait for all transaction started before the ending
of the phase to exit.

[1]: https://postgr.es/m/20210115133858.GA18931@alvherre.pgsql
[2]:
https://www.postgresql.org/message-id/flat/CAAaqYe_tq_Mtd9tdeGDsgQh%2BwMvouithAmcOXvCbLaH2PPGHvA%40mail.gmail.com#cbe3997b75c189c3713f243e25121c20

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Matthias van de Meent

Date:

17 December 2023, 23:53:34

On Sun, 17 Dec 2023, 21:14 Michail Nikolaev, <michail.nikolaev@gmail.com> wrote:
>
> Hello!
>
> > I've thought about alternative solutions, too: how about getting a new snapshot every so often?
> > We don't really care about the liveness of the already-scanned data; the snapshots used for RIC
> > are used only during the scan. C/RIC's relation's lock level means vacuum can't run to clean up
> > dead line items, so as long as we only swap the backend's reported snapshot (thus xmin) while
> > the scan is between pages we should be able to reduce the time C/RIC is the one backend
> > holding back cleanup of old tuples.
>
> Hm, it looks like an interesting idea! It may be more dangerous, but
> at least it feels much more elegant than an LP_DEAD-related way.
> Also, feels like we may apply this to both phases (first and the second scans).
> The original patch (1) was helping only to the second one (after call
> to set_indexsafe_procflags).
>
> But for the first scan we allowed to do so only for non-unique indexes
> because of:
>
> > * The reason for doing that is to avoid
> > * bogus unique-index failures due to concurrent UPDATEs (we might see
> > * different versions of the same row as being valid when we pass over them,
> > * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
> > * does not contain any tuples added to the table while we built the index.

Yes, for that we'd need an extra scan of the index that validates
uniqueness. I think there was a proposal (though it may only have been
an idea) some time ago, about turning existing non-unique indexes into
unique ones by validating the data. Such a system would likely be very
useful to enable this optimization.

> Also, (1) was limited to indexes without expressions and predicates
> (2) because such may execute queries to other tables (sic!).

Note that the use of such expressions would be a violation of the
function's definition; it would depend on data from other tables which
makes the function behave like a STABLE function, as opposed to the
IMMUTABLE that is required for index expressions. So, I don't think we
should specially care about being correct for incorrectly marked
function definitions.

> One possible solution is to add some checks to make sure no
> user-defined functions are used.
> But as far as I understand, it affects only CIC for now and does not
> affect the ability to use the proposed technique (updating snapshot
> time to time).
>
> However, I think we need some more-less formal proof it is safe - it
> is really challenging to keep all the possible cases in the head. I’ll
> try to do something here.

I just realised there is one issue with this design: We can't cheaply
reset the snapshot during the second table scan:
It is critically important that the second scan of R/CIC uses an index
contents summary (made with index_bulk_delete) that was created while
the current snapshot was already registered. If we didn't do that, the
following would occur:

1. The index is marked ready for inserts from concurrent backends, but
not yet ready for queries.
2. We get the bulkdelete data
3. A concurrent backend inserts a new tuple T on heap page P, inserts
it into the index, and commits. This tuple is not in the summary, but
has been inserted into the index.
4. R/CIC resets the snapshot, making T visible.
5. R/CIC scans page P, finds that tuple T has to be indexed but is not
present in the summary, thus inserts that tuple into the index (which
already had it inserted at 3)

This thus would be a logic bug, as indexes assume at-most-once
semantics for index tuple insertion; duplicate insertion are an error.

So, the "reset the snapshot every so often" trick cannot be applied in
phase 3 (the rescan), or we'd have to do an index_bulk_delete call
every time we reset the snapshot. Rescanning might be worth the cost
(e.g. when using BRIN), but that is very unlikely.

Alternatively, we'd need to find another way to prevent us from
inserting these duplicate entries - maybe by storing the scan's data
in a buffer to later load into the index after another
index_bulk_delete()? Counterpoint: for BRIN indexes that'd likely
require a buffer much larger than the result index would be.

Either way, for the first scan (i.e. phase 2 "build new indexes") this
is not an issue: we don't care about what transaction adds/deletes
tuples at that point.
For all we know, all tuples of the table may be deleted concurrently
before we even allow concurrent backends to start inserting tuples,
and the algorithm would still work as it does right now.

> Another possible issue may be caused by the new locking pattern - we
> will be required to wait for all transaction started before the ending
> of the phase to exit.

What do you mean by "new locking pattern"? We already keep an
ShareUpdateExclusiveLock on every heap table we're accessing during
R/CIC, and that should already prevent any concurrent VACUUM
operations, right?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

20 December 2023, 09:56:33

Hello!

> Also, feels like we may apply this to both phases (first and the second scans).
> The original patch (1) was helping only to the second one (after call
> to set_indexsafe_procflags).

Oops, I was wrong here. The original version of the patch was also applied to
both phases.

> Note that the use of such expressions would be a violation of the
> function's definition; it would depend on data from other tables which
> makes the function behave like a STABLE function, as opposed to the
> IMMUTABLE that is required for index expressions. So, I don't think we
> should specially care about being correct for incorrectly marked
> function definitions.

Yes, but such cases could probably cause crashes also...
So, I think it is better to check them for custom functions. But I
still not sure -
if such limitations still required for proposed optimization or not.

> I just realised there is one issue with this design: We can't cheaply
> reset the snapshot during the second table scan:
> It is critically important that the second scan of R/CIC uses an index
> contents summary (made with index_bulk_delete) that was created while
> the current snapshot was already registered.

> So, the "reset the snapshot every so often" trick cannot be applied in
> phase 3 (the rescan), or we'd have to do an index_bulk_delete call
> every time we reset the snapshot. Rescanning might be worth the cost
> (e.g. when using BRIN), but that is very unlikely.

Hm, I think it is still possible. We could just manually recheck the
tuples we see
to the snapshot currently used for the scan. If an "old" snapshot can see
the tuple also (HeapTupleSatisfiesHistoricMVCC) then search for it in the
index summary.

> What do you mean by "new locking pattern"? We already keep an
> ShareUpdateExclusiveLock on every heap table we're accessing during
> R/CIC, and that should already prevent any concurrent VACUUM
> operations, right?

I was thinking not about "classical" locking, but about waiting for
other backends
by WaitForLockers(heaplocktag, ShareLock, true). But I think
everything should be
fine.

Best regards,
Michail.

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Matthias van de Meent

Date:

20 December 2023, 11:14:32

On Wed, 20 Dec 2023 at 10:56, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
> > Note that the use of such expressions would be a violation of the
> > function's definition; it would depend on data from other tables which
> > makes the function behave like a STABLE function, as opposed to the
> > IMMUTABLE that is required for index expressions. So, I don't think we
> > should specially care about being correct for incorrectly marked
> > function definitions.
>
> Yes, but such cases could probably cause crashes also...
> So, I think it is better to check them for custom functions. But I
> still not sure -
> if such limitations still required for proposed optimization or not.

I think contents could be inconsistent, but not more inconsistent than
if the index was filled across multiple transactions using inserts.
Either way I don't see it breaking more things that are not already
broken in that way in other places - at most it will introduce another
path that exposes the broken state caused by mislabeled functions.

> > I just realised there is one issue with this design: We can't cheaply
> > reset the snapshot during the second table scan:
> > It is critically important that the second scan of R/CIC uses an index
> > contents summary (made with index_bulk_delete) that was created while
> > the current snapshot was already registered.
>
> > So, the "reset the snapshot every so often" trick cannot be applied in
> > phase 3 (the rescan), or we'd have to do an index_bulk_delete call
> > every time we reset the snapshot. Rescanning might be worth the cost
> > (e.g. when using BRIN), but that is very unlikely.
>
> Hm, I think it is still possible. We could just manually recheck the
> tuples we see
> to the snapshot currently used for the scan. If an "old" snapshot can see
> the tuple also (HeapTupleSatisfiesHistoricMVCC) then search for it in the
> index summary.
That's an interesting method.

How would this deal with tuples not visible to the old snapshot?
Presumably we can assume they're newer than that snapshot (the old
snapshot didn't have it, but the new one does, so it's committed after
the old snapshot, making them newer), so that backend must have
inserted it into the index already, right?

> HeapTupleSatisfiesHistoricMVCC

That function has this comment marker:
   "Only usable on tuples from catalog tables!"
Is that correct even for this?

Should this deal with any potential XID wraparound, too?
How does this behave when the newly inserted tuple's xmin gets frozen?
This would be allowed to happen during heap page pruning, afaik - no
rules that I know of which are against that - but it would create
issues where normal snapshot visibility rules would indicate it
visible to both snapshots regardless of whether it actually was
visible to the older snapshot when that snapshot was created...

Either way, "Historic snapshot" isn't something I've worked with
before, so that goes onto my "figure out how it works" pile.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

20 December 2023, 16:18:18

Hello!

> How would this deal with tuples not visible to the old snapshot?
> Presumably we can assume they're newer than that snapshot (the old
> snapshot didn't have it, but the new one does, so it's committed after
> the old snapshot, making them newer), so that backend must have
> inserted it into the index already, right?

Yes, exactly.

>> HeapTupleSatisfiesHistoricMVCC
> That function has this comment marker:
> "Only usable on tuples from catalog tables!"
> Is that correct even for this?

Yeah, we just need HeapTupleSatisfiesVisibility (which calls
HeapTupleSatisfiesMVCC) instead.

> Should this deal with any potential XID wraparound, too?

Yeah, looks like we should care about such case somehow.

Possible options here:

1) Skip vac_truncate_clog while CIC is running. In fact, I think it's
not that much worse than the current state - datfrozenxid is still
updated in the catalog and will be considered the next time
vac_update_datfrozenxid is called (the next VACCUM on any table).

2) Delay vac_truncate_clog while CIC is running.
In such a case, if it was skipped, we will need to re-run it using the
index builds backend later.

3) Wait for 64-bit xids :)

4) Any ideas?

In addition, for the first and second options, we need logic to cancel
the second phase in the case of ForceTransactionIdLimitUpdate.
But maybe I'm missing something and the tuples may be frozen, ignoring
the set datfrozenxid values (over some horizon calculated at runtime
based on the xmin backends).

> How does this behave when the newly inserted tuple's xmin gets frozen?
> This would be allowed to happen during heap page pruning, afaik - no
> rules that I know of which are against that - but it would create
> issues where normal snapshot visibility rules would indicate it
> visible to both snapshots regardless of whether it actually was
> visible to the older snapshot when that snapshot was created...

Yes, good catch.
Assuming we have somehow prevented vac_truncate_clog from occurring
during CIC, we can leave frozen and potentially frozen
(xmin<frozenXID) for the second phase.

So, first phase processing items:
* not frozen
* xmin>frozenXID (may not be frozen)
* visible by snapshot

second phase:
* frozen
* xmin>frozenXID (may be frozen)
* not in the index summary
* visible by "old" snapshot

You might also think – why is the first stage needed at all? Just use
batch processing during initial index building?

Best regards,
Mikhail.

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

20 December 2023, 16:53:27

> Yes, good catch.
> Assuming we have somehow prevented vac_truncate_clog from occurring
> during CIC, we can leave frozen and potentially frozen
> (xmin<frozenXID) for the second phase.

Just realized that we can leave this for the first stage to improve efficiency.
Since the ID is locked, anything that can be frozen will be visible in
the first stage.

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

21 December 2023, 13:14:19

Hello.

Realized my last idea is invalid (because tuples are frozen by using
dynamically calculated horizon) - so, don't waste your time on it :)

Need to think a little bit more here.

Thanks,
Mikhail.

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

25 December 2023, 14:12:41

Hello!

It seems like the idea of "old" snapshot is still a valid one.

> Should this deal with any potential XID wraparound, too?

As far as I understand in our case, we are not affected by this in any way.
Vacuum in our table is not possible because of locking, so, nothing
may be frozen (see below).
In the case of super long index building, transactional limits will
stop new connections using current
regular infrastructure because it is based on relation data (but not
actual xmin of backends).

> How does this behave when the newly inserted tuple's xmin gets frozen?
> This would be allowed to happen during heap page pruning, afaik - no
> rules that I know of which are against that - but it would create
> issues where normal snapshot visibility rules would indicate it
> visible to both snapshots regardless of whether it actually was
> visible to the older snapshot when that snapshot was created...

As I can see, heap_page_prune never freezes any tuples.
In the case of regular vacuum, it used this way: call heap_page_prune
and then call heap_prepare_freeze_tuple and then
heap_freeze_execute_prepared.

Merry Christmas,
Mikhail.

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Matthias van de Meent

Date:

04 January 2024, 11:24:04

On Mon, 25 Dec 2023 at 15:12, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
> Hello!
>
> It seems like the idea of "old" snapshot is still a valid one.
>
> > Should this deal with any potential XID wraparound, too?
>
> As far as I understand in our case, we are not affected by this in any way.
> Vacuum in our table is not possible because of locking, so, nothing
> may be frozen (see below).
> In the case of super long index building, transactional limits will
> stop new connections using current
> regular infrastructure because it is based on relation data (but not
> actual xmin of backends).
>
> > How does this behave when the newly inserted tuple's xmin gets frozen?
> > This would be allowed to happen during heap page pruning, afaik - no
> > rules that I know of which are against that - but it would create
> > issues where normal snapshot visibility rules would indicate it
> > visible to both snapshots regardless of whether it actually was
> > visible to the older snapshot when that snapshot was created...
>
> As I can see, heap_page_prune never freezes any tuples.
> In the case of regular vacuum, it used this way: call heap_page_prune
> and then call heap_prepare_freeze_tuple and then
> heap_freeze_execute_prepared.

Correct, but there are changes being discussed where we would freeze
tuples during pruning as well [0], which would invalidate that
implementation detail. And, if I had to choose between improved
opportunistic freezing and improved R/CIC, I'd probably choose
improved freezing over R/CIC.

As an alternative, we _could_ keep track of concurrent index inserts
using a dummy index (with the same predicate) which only holds the
TIDs of the inserted tuples. We'd keep it as an empty index in phase
1, and every time we reset the visibility snapshot we now only need to
scan that index to know what tuples were concurrently inserted. This
should have a significantly lower IO overhead than repeated full index
bulkdelete scans for the new index in the second table scan phase of
R/CIC. However, in a worst case it could still require another
O(tablesize) of storage.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://www.postgresql.org/message-id/CAAKRu_a+g2oe6aHJCbibFtNFiy2aib4E31X9QYJ_qKjxZmZQEg@mail.gmail.com

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

04 January 2024, 12:45:06

Hello!

> Correct, but there are changes being discussed where we would freeze
> tuples during pruning as well [0], which would invalidate that
> implementation detail. And, if I had to choose between improved
> opportunistic freezing and improved R/CIC, I'd probably choose
> improved freezing over R/CIC.

As another option, we could extract a dedicated horizon value for an
opportunistic freezing.
And use some flags in R/CIC backend to keep it at the required value.

Best regards,
Michail.

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

09 January 2024, 17:00:28

Hello, Melanie!

Sorry to interrupt you, just a quick question.

> Correct, but there are changes being discussed where we would freeze
> tuples during pruning as well [0], which would invalidate that
> implementation detail. And, if I had to choose between improved
> opportunistic freezing and improved R/CIC, I'd probably choose
> improved freezing over R/CIC.

Do you have any patches\threads related to that refactoring
(opportunistic freezing of tuples during pruning) [0]?
This may affect the idea of the current thread (latest version of it
mostly in [1]) - it may be required to disable such a feature for
particular relation temporary or affect horizon used for pruning
(without holding xmin).

Just no sure - is it reasonable to start coding right now, or wait for
some prune-freeze-related patch first?

[0] https://www.postgresql.org/message-id/CAAKRu_a+g2oe6aHJCbibFtNFiy2aib4E31X9QYJ_qKjxZmZQEg@mail.gmail.com
[1]
https://www.postgresql.org/message-id/flat/CANtu0ojRX%3DosoiXL9JJG6g6qOowXVbVYX%2BmDsN%2B2jmFVe%3DeG7w%40mail.gmail.com#a8ff53f23d0fc7edabd446b4d634e7b5

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

01 February 2024, 16:06:28

> > > I just realised there is one issue with this design: We can't cheaply
> > > reset the snapshot during the second table scan:
> > > It is critically important that the second scan of R/CIC uses an index
> > > contents summary (made with index_bulk_delete) that was created while
> > > the current snapshot was already registered.
> >
> > > So, the "reset the snapshot every so often" trick cannot be applied in
> > > phase 3 (the rescan), or we'd have to do an index_bulk_delete call
> > > every time we reset the snapshot. Rescanning might be worth the cost
> > > (e.g. when using BRIN), but that is very unlikely.
> >
> > Hm, I think it is still possible. We could just manually recheck the
> > tuples we see
> > to the snapshot currently used for the scan. If an "old" snapshot can see
> > the tuple also (HeapTupleSatisfiesHistoricMVCC) then search for it in the
> > index summary.
> That's an interesting method.
>
> How would this deal with tuples not visible to the old snapshot?
> Presumably we can assume they're newer than that snapshot (the old
> snapshot didn't have it, but the new one does, so it's committed after
> the old snapshot, making them newer), so that backend must have
> inserted it into the index already, right?

I made a draft of the patch and this idea is not working.

The problem is generally the same:

* reference snapshot sees tuple X
* reference snapshot is used to create index summary (but there is no
tuple X in the index summary)
* tuple X is updated to Y creating a HOT-chain
* we started scan with new temporary snapshot (it sees Y, X is too old for it)
* tuple X is pruned from HOT-chain because it is not protected by any snapshot
* we see tuple Y in the scan with temporary snapshot
    * it is not in the index summary - so, we need to check if
reference snapshot can see it
    * there is no way to understand if the reference snapshot was able
to see tuple X - because we need the full HOT chain (with X tuple) for
that

Best regards,
Michail.

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Matthias van de Meent

Date:

17 February 2024, 21:48:44

On Thu, 1 Feb 2024, 17:06 Michail Nikolaev, <michail.nikolaev@gmail.com> wrote:
>
> > > > I just realised there is one issue with this design: We can't cheaply
> > > > reset the snapshot during the second table scan:
> > > > It is critically important that the second scan of R/CIC uses an index
> > > > contents summary (made with index_bulk_delete) that was created while
> > > > the current snapshot was already registered.

I think the best way for this to work would be an index method that
exclusively stores TIDs, and of which we can quickly determine new
tuples, too. I was thinking about something like GIN's format, but
using (generation number, tid) instead of ([colno, colvalue], tid) as
key data for the internal trees, and would be unlogged (because the
data wouldn't have to survive a crash). Then we could do something
like this for the second table scan phase:

0. index->indisready is set
[...]
1. Empty the "changelog index", resetting storage and the generation number.
2. Take index contents snapshot of new index, store this.
3. Loop until completion:
4a. Take visibility snapshot
4b. Update generation number of the changelog index, store this.
4c. Take index snapshot of "changelog index" for data up to the
current stored generation number. Not including, because we only need
to scan that part of the index that were added before we created our
visibility snapshot, i.e. TIDs labeled with generation numbers between
the previous iteration's generation number (incl.) and this
iteration's generation (excl.).
4d. Combine the current index snapshot with that of the "changelog"
index, and save this.
    Note that this needs to take care to remove duplicates.
4e. Scan segment of table (using the combined index snapshot) until we
need to update our visibility snapshot or have scanned the whole
table.

This should give similar, if not the same, behavour as that which we
have when we RIC a table with several small indexes, without requiring
us to scan a full index of data several times.

Attemp on proving this approach's correctness:
In phase 3, after each step 4b:
All matching tuples of the table that are in the visibility snapshot:
* Were created before scan 1's snapshot, thus in the new index's snapshot, or
* Were created after scan 1's snapshot but before index->indisready,
thus not in the new index's snapshot, nor in the changelog index, or
* Were created after the index was set as indisready, and committed
before the previous iteration's visibility snapshot, thus in the
combined index snapshot, or
* Were created after the index was set as indisready, after the
previous visibility snapshot was taken, but before the current
visibility snapshot was taken, and thus definitely included in the
changelog index.

Because we hold a snapshot, no data in the table that we should see is
removed, so we don't have a chance of broken HOT chains.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

20 February 2024, 23:33:26

Hello!

> I think the best way for this to work would be an index method that
> exclusively stores TIDs, and of which we can quickly determine new
> tuples, too. I was thinking about something like GIN's format, but
> using (generation number, tid) instead of ([colno, colvalue], tid) as
> key data for the internal trees, and would be unlogged (because the
> data wouldn't have to survive a crash)

Yeah, this seems to be a reasonable approach, but there are some
doubts related to it - it needs new index type as well as unlogged
indexes to be introduced - this may make the patch too invasive to be
merged. Also, some way to remove the index from the catalog in case of
a crash may be required.

A few more thoughts:
* it is possible to go without generation number - we may provide a
way to do some kind of fast index lookup (by TID) directly during the
second table scan phase.
* one more option is to maintain a Tuplesorts (instead of an index)
with TIDs as changelog and merge with index snapshot after taking a
new visibility snapshot. But it is not clear how to share the same
Tuplesort with multiple inserting backends.
* crazy idea - what is about to do the scan in the index we are
building? We have tuple, so, we have all the data indexed in the
index. We may try to do an index scan using that data to get all
tuples and find the one with our TID :) Yes, in some cases it may be
too bad because of the huge amount of TIDs we need to scan + also
btree copies whole page despite we need single item. But some
additional index method may help - feels like something related to
uniqueness (but it is only in btree anyway).

Thanks,
Mikhail.

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

21 February 2024, 08:35:40

One more idea - is just forbid HOT prune while the second phase is
running. It is not possible anyway currently because of snapshot held.

Possible enhancements:
* we may apply restriction only to particular tables
* we may apply restrictions only to part of the tables (not yet
scanned by R/CICs).

Yes, it is not an elegant solution, limited, not reliable in terms of
architecture, but a simple one.

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Matthias van de Meent

Date:

21 February 2024, 11:01:45

On Wed, 21 Feb 2024 at 00:33, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
> Hello!
> > I think the best way for this to work would be an index method that
> > exclusively stores TIDs, and of which we can quickly determine new
> > tuples, too. I was thinking about something like GIN's format, but
> > using (generation number, tid) instead of ([colno, colvalue], tid) as
> > key data for the internal trees, and would be unlogged (because the
> > data wouldn't have to survive a crash)
>
> Yeah, this seems to be a reasonable approach, but there are some
> doubts related to it - it needs new index type as well as unlogged
> indexes to be introduced - this may make the patch too invasive to be
> merged.

I suppose so, though persistence is usually just to keep things
correct in case of crashes, and this "index" is only there to support
processes that don't expect to survive crashes.

> Also, some way to remove the index from the catalog in case of
> a crash may be required.

That's less of an issue though, we already accept that a crash during
CIC/RIC leaves unusable indexes around, so "needs more cleanup" is not
exactly a blocker.

> A few more thoughts:
> * it is possible to go without generation number - we may provide a
> way to do some kind of fast index lookup (by TID) directly during the
> second table scan phase.

While possible, I don't think this would be more performant than the
combination approach, at the cost of potentially much more random IO
when the table is aggressively being updated.

> * one more option is to maintain a Tuplesorts (instead of an index)
> with TIDs as changelog and merge with index snapshot after taking a
> new visibility snapshot. But it is not clear how to share the same
> Tuplesort with multiple inserting backends.

Tuplesort requires the leader process to wait for concurrent backends
to finish their sort before it can start consuming their runs. This
would make it a very bad alternative to the "changelog index" as the
CIC process would require on-demand actions from concurrent backends
(flush of sort state). I'm not convinced that's somehow easier.

> * crazy idea - what is about to do the scan in the index we are
> building? We have tuple, so, we have all the data indexed in the
> index. We may try to do an index scan using that data to get all
> tuples and find the one with our TID :)

We can't rely on that, because we have no guarantee we can find the
tuple quickly enough. Equality-based indexing is very much optional,
and so are TID-based checks (outside the current vacuum-related APIs),
so finding one TID can (and probably will) take O(indexsize) when the
tuple is not in the index, which is one reason for ambulkdelete() to
exist.

Kind regards,

Matthias van de Meent

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Matthias van de Meent

Date:

21 February 2024, 11:19:29

On Wed, 21 Feb 2024 at 09:35, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
> One more idea - is just forbid HOT prune while the second phase is
> running. It is not possible anyway currently because of snapshot held.
>
> Possible enhancements:
> * we may apply restriction only to particular tables
> * we may apply restrictions only to part of the tables (not yet
> scanned by R/CICs).
>
> Yes, it is not an elegant solution, limited, not reliable in terms of
> architecture, but a simple one.

How do you suppose this would work differently from a long-lived
normal snapshot, which is how it works right now?
Would it be exclusively for that relation? How would this be
integrated with e.g. heap_page_prune_opt?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

21 February 2024, 11:36:48

Hi!

> How do you suppose this would work differently from a long-lived
> normal snapshot, which is how it works right now?

Difference in the ability to take new visibility snapshot periodically
during the second phase with rechecking visibility of tuple according
to the "reference" snapshot (which is taken only once like now).
It is the approach from (1) but with a workaround for the issues
caused by heap_page_prune_opt.

> Would it be exclusively for that relation?
Yes, only for that affected relation. Other relations are unaffected.

> How would this be integrated with e.g. heap_page_prune_opt?
Probably by some flag in RelationData, but not sure here yet.

If the idea looks sane, I could try to extend my POC - it should be
not too hard, likely (I already have tests to make sure it is
correct).

(1):
https://www.postgresql.org/message-id/flat/CANtu0oijWPRGRpaRR_OvT2R5YALzscvcOTFh-%3DuZKUpNJmuZtw%40mail.gmail.com#8141eb2ea177ff560ee713b3f20de404

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Matthias van de Meent

Date:

05 March 2024, 20:08:08

On Wed, 21 Feb 2024 at 12:37, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
> Hi!
>
> > How do you suppose this would work differently from a long-lived
> > normal snapshot, which is how it works right now?
>
> Difference in the ability to take new visibility snapshot periodically
> during the second phase with rechecking visibility of tuple according
> to the "reference" snapshot (which is taken only once like now).
> It is the approach from (1) but with a workaround for the issues
> caused by heap_page_prune_opt.
>
> > Would it be exclusively for that relation?
> Yes, only for that affected relation. Other relations are unaffected.

I suppose this could work. We'd also need to be very sure that the
toast relation isn't cleaned up either: Even though that's currently
DELETE+INSERT only and can't apply HOT, it would be an issue if we
couldn't find the TOAST data of a deleted for everyone (but visible to
us) tuple.

Note that disabling cleanup for a relation will also disable cleanup
of tuple versions in that table that are not used for the R/CIC
snapshots, and that'd be an issue, too.

> > How would this be integrated with e.g. heap_page_prune_opt?
> Probably by some flag in RelationData, but not sure here yet.
>
> If the idea looks sane, I could try to extend my POC - it should be
> not too hard, likely (I already have tests to make sure it is
> correct).

I'm not a fan of this approach. Changing visibility and cleanup
semantics to only benefit R/CIC sounds like a pain to work with in
essentially all visibility-related code. I'd much rather have to deal
with another index AM, even if it takes more time: the changes in
semantics will be limited to a new plug in the index AM system and a
behaviour change in R/CIC, rather than behaviour that changes in all
visibility-checking code.

But regardless of second scan snapshots, I think we can worry about
that part at a later moment: The first scan phase is usually the most
expensive and takes the most time of all phases that hold snapshots,
and in the above discussion we agreed that we can already reduce the
time that a snapshot is held during that phase significantly. Sure, it
isn't great that we have to scan the table again with only a single
snapshot, but generally phase 2 doesn't have that much to do (except
when BRIN indexes are involved) so this is likely less of an issue.
And even if it is, we would still have reduced the number of
long-lived snapshots by half.

-Matthias

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

07 March 2024, 18:36:53

Hello!

> I'm not a fan of this approach. Changing visibility and cleanup
> semantics to only benefit R/CIC sounds like a pain to work with in
> essentially all visibility-related code. I'd much rather have to deal
> with another index AM, even if it takes more time: the changes in
> semantics will be limited to a new plug in the index AM system and a
> behaviour change in R/CIC, rather than behaviour that changes in all
> visibility-checking code.

Technically, this does not affect the visibility logic, only the
clearing semantics.
All visibility related code remains untouched.
But yes, still an inelegant and a little strange-looking option.

At the same time, perhaps it can be dressed in luxury
somehow - for example, add as a first class citizen in ComputeXidHorizonsResult
a list of blocks to clear some relations.

> But regardless of second scan snapshots, I think we can worry about
> that part at a later moment: The first scan phase is usually the most
> expensive and takes the most time of all phases that hold snapshots,
> and in the above discussion we agreed that we can already reduce the
> time that a snapshot is held during that phase significantly. Sure, it
> isn't great that we have to scan the table again with only a single
> snapshot, but generally phase 2 doesn't have that much to do (except
> when BRIN indexes are involved) so this is likely less of an issue.
> And even if it is, we would still have reduced the number of
> long-lived snapshots by half.

Hmm, but it looks like we don't have the infrastructure to "update" xmin
propagating to the horizon after the first snapshot in a transaction is taken.

One option I know of is to reuse the
d9d076222f5b94a85e0e318339cfc44b8f26022d (1) approach.
But if this is the case, then there is no point in re-taking the
snapshot again during the first
phase - just apply this "if" only for the first phase - and you're done.

Do you know any less-hacky way? Or is it a nice way to go?

[1]:
https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179R1779-R1793

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Matthias van de Meent

Date:

12 March 2024, 11:50:24

On Thu, 7 Mar 2024 at 19:37, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
> Hello!
>
> > I'm not a fan of this approach. Changing visibility and cleanup
> > semantics to only benefit R/CIC sounds like a pain to work with in
> > essentially all visibility-related code. I'd much rather have to deal
> > with another index AM, even if it takes more time: the changes in
> > semantics will be limited to a new plug in the index AM system and a
> > behaviour change in R/CIC, rather than behaviour that changes in all
> > visibility-checking code.
>
> Technically, this does not affect the visibility logic, only the
> clearing semantics.
> All visibility related code remains untouched.

Yeah, correct. But it still needs to update the table relations'
information after finishing creating the indexes, which I'd rather not
have to do.

> But yes, still an inelegant and a little strange-looking option.
>
> At the same time, perhaps it can be dressed in luxury
> somehow - for example, add as a first class citizen in ComputeXidHorizonsResult
> a list of blocks to clear some relations.

Not sure what you mean here, but I don't think
ComputeXidHorizonsResult should have anything to do with actual
relations.

> > But regardless of second scan snapshots, I think we can worry about
> > that part at a later moment: The first scan phase is usually the most
> > expensive and takes the most time of all phases that hold snapshots,
> > and in the above discussion we agreed that we can already reduce the
> > time that a snapshot is held during that phase significantly. Sure, it
> > isn't great that we have to scan the table again with only a single
> > snapshot, but generally phase 2 doesn't have that much to do (except
> > when BRIN indexes are involved) so this is likely less of an issue.
> > And even if it is, we would still have reduced the number of
> > long-lived snapshots by half.
>
> Hmm, but it looks like we don't have the infrastructure to "update" xmin
> propagating to the horizon after the first snapshot in a transaction is taken.

We can just release the current snapshot, and get a new one, right? I
mean, we don't actually use the transaction for much else than
visibility during the first scan, and I don't think there is a need
for an actual transaction ID until we're ready to mark the index entry
with indisready.

> One option I know of is to reuse the
> d9d076222f5b94a85e0e318339cfc44b8f26022d (1) approach.
> But if this is the case, then there is no point in re-taking the
> snapshot again during the first
> phase - just apply this "if" only for the first phase - and you're done.

Not a fan of that, as it is too sensitive to abuse. Note that
extensions will also have access to these tools, and I think we should
build a system here that's not easy to break, rather than one that is.

> Do you know any less-hacky way? Or is it a nice way to go?

I suppose we could be resetting the snapshot every so often? Or use
multiple successive TID range scans with a new snapshot each?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

04 May 2024, 15:51:20

Hello, Matthias!

> We can just release the current snapshot, and get a new one, right? I
> mean, we don't actually use the transaction for much else than
> visibility during the first scan, and I don't think there is a need
> for an actual transaction ID until we're ready to mark the index entry
> with indisready.

> I suppose we could be resetting the snapshot every so often? Or use
> multiple successive TID range scans with a new snapshot each?

It seems like it is not so easy in that case. Because we still need to hold catalog snapshot xmin, releasing the snapshot which used for the scan does not affect xmin propagated to the horizon.
That's why d9d076222f5b94a85e0e318339cfc44b8f26022d(1) affects only the data horizon, but not the catalog's one.

So, in such a situation, we may:

1) starts scan from scratch with some TID range multiple times. But such an approach feels too complex and error-prone for me.

2) split horizons propagated by `MyProc` to data-related xmin and catalog-related xmin. Like `xmin` and `catalogXmin`. We may just mark snapshots as affecting some of the horizons, or both. Such a change feels easy to be done but touches pretty core logic, so we need someone's approval for such a proposal, probably.

3) provide some less invasive (but less non-kludge) way: add some kind of process flag like `PROC_IN_SAFE_IC_XMIN` and function like `AdvanceIndexSafeXmin` which changes the way backend affect horizon calculation. In the case of `PROC_IN_SAFE_IC_XMIN` `ComputeXidHorizons` uses value from `proc->safeIcXmin` which is updated by `AdvanceIndexSafeXmin` while switching scan snapshots.

So, with option 2 or 3, we may avoid holding data horizon during the first phase scan by resetting the scan snapshot every so often (and, optionally, using `AdvanceIndexSafeXmin` in case of 3rd approach).

The same will be possible for the second phase (validate).

We may do the same "resetting the snapshot every so often" technique, but there is still the issue with the way we distinguish tuples which were missed by the first phase scan or were inserted into the index after the visibility snapshot was taken.

So, I see two options here:

1) approach with additional index with some custom AM proposed by you.

It looks correct and reliable but feels complex to implement and maintain. Also, it negatively affects performance of table access (because of an additional index) and validation scan (because we need to merge additional index content with visibility snapshot).

2) one more tricky approach.

We may add some boolean flag to `Relation` about information of index building in progress (`indexisbuilding`).

It may be easily calculated using `(index->indisready && !index->indisvalid)`. For a more reliable solution, we also need to somehow check if backend/transaction building the index still in progress. Also, it is better to check if index is building concurrently using the "safe_index" way.

I think there is a non too complex and expensive way to do so, probably by addition of some flag to index catalog record.

Once we have such a flag, we may "legally" prohibit `heap_page_prune_opt` affecting the relation updating `GlobalVisHorizonKindForRel` like this:

if (rel != NULL && rel->rd_indexvalid && rel->rd_indexisbuilding)
return VISHORIZON_CATALOG;

So, in common it works this way:

* backend building the index affects catalog horizon as usual, but data horizon is regularly propagated forward during the scan. So, other relations are processed by vacuum and `heap_page_prune_opt` without any restrictions

* but our relation (with CIC in progress) accessed by `heap_page_prune_opt` (or any other vacuum-like mechanics) with catalog horizon to honor CIC work. Therefore, validating scan may be sure what none of the HOT-chain will be truncated. Even regular vacuum can't affect it (but yes, it can't be anyway because of relation locking).

As a result, we may easily distinguish tuples missed by first phase scan, just by testing them against reference snapshot (which used to take visibility snapshot).

So, for me, this approach feels non-kludge enough, safe and effective and the same time.

I have a prototype of this approach and looks like it works (I have a good test catching issues with index content for CIC).

What do you think about all this?

[1]: https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179R1779-R1793

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

05 May 2024, 23:37:20

Hello, Matthias!

I just realized there is a much simpler and safe way to deal with the problem.

So, d9d076222f5b94a85e0e318339cfc44b8f26022d(1) had a bug because the scan was not protected by a snapshot. At the same time, we want this snapshot to affect not all the relations, but only a subset of them. And there is already a proper way to achieve that - different types of visibility horizons!

So, to resolve the issue, we just need to create a separated horizon value for such situation as building an index concurrently.

For now, let's name it `VISHORIZON_BUILD_INDEX_CONCURRENTLY` for example. By default, its value is equal to `VISHORIZON_DATA`. But in some cases it "stops" propagating forward while concurrent index is building, like this:

h->create_index_concurrently_oldest_nonremovable =TransactionIdOlder(h->create_index_concurrently_oldest_nonremovable, xmin);
if (!(statusFlags & PROC_IN_SAFE_IC))
h->data_oldest_nonremovable = TransactionIdOlder(h->data_oldest_nonremovable, xmin);

The `PROC_IN_SAFE_IC` marks backend xmin as ignored by `VISHORIZON_DATA` but not by `VISHORIZON_BUILD_INDEX_CONCURRENTLY`.

After, we need to use appropriate horizon for relations which are processed by `PROC_IN_SAFE_IC` backends. There are a few ways to do it, we may start prototyping with `rd_indexisbuilding` from previous message:

static inline GlobalVisHorizonKind
GlobalVisHorizonKindForRel(Relation rel)
........
if (rel != NULL && rel->rd_indexvalid && rel->rd_indexisbuilding)
return VISHORIZON_BUILD_INDEX_CONCURRENTLY;

There are few more moments need to be considered:

* Does it move the horizon backwards?

It is allowed for the horizon to move backwards (like said in `ComputeXidHorizons`) but anyway - in that case the horizon for particular relations just starts to lag behind the horizon for other relations.
Invariant is like that: `VISHORIZON_BUILD_INDEX_CONCURRENTLY` <= `VISHORIZON_DATA` <= `VISHORIZON_CATALOG` <= `VISHORIZON_SHARED`.

* What is about old cached versions of `Relation` objects without `rd_indexisbuilding` yet set?

This is not a problem because once the backend registers a new index, it waits for all transactions without that knowledge to end (`WaitForLockers`). So, new ones will also get information about new horizon for that particular relation.

* What is about TOAST?
To keep TOAST horizon aligned with relation building the index, we may do the next thing (as first implementation iteration):

else if (rel != NULL && ((rel->rd_indexvalid && rel->rd_indexisbuilding) || IsToastRelation(rel)))
return VISHORIZON_BUILD_INDEX_CONCURRENTLY;

For the normal case, `VISHORIZON_BUILD_INDEX_CONCURRENTLY` is equal to `VISHORIZON_DATA` - nothing is changed at all. But while the concurrent index is building, the TOAST horizon is guaranteed to be aligned with its parent relation. And yes, it is better to find an easy way to affect only TOAST relations related to the relation with index building in progress.

New horizon adds some complexity, but not too much, in my opinion. I am pretty sure it is worth being done because the ability to rebuild indexes without performance degradation is an extremely useful feature.

Things to be improved:
* better way to track relations with concurrent indexes being built (with mechanics to understood what index build was failed)
* better way to affect TOAST tables only related to concurrent index build
* better naming

Patch prototype in attachment.
Also, maybe it is worth committing test separately - it was based on Andrey Borodin work (2). The test fails well in the case of incorrect implementation.

[1]: https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179R1779-R1793
[2]: https://github.com/x4m/postgres_g/commit/d0651e7d0d14862d5a4dac076355

Attachment

v1-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

07 May 2024, 12:35:13

Hello, Matthias and others!

Updated WIP in attach.

Changes are:
* Renaming, now it feels better for me
* More reliable approach in `GlobalVisHorizonKindForRel` to make sure we have not missed `rd_safeindexconcurrentlybuilding` by calling `RelationGetIndexList` if required
* Optimization to avoid any additional `RelationGetIndexList` if zero of concurrently indexes are being built
* TOAST moved to TODO, since looks like it is out of scope - but not sure yet, need to dive dipper

TODO:

* TOAST

* docs and comments

* make sure non-data tables are not affected

* Per-database scope of optimization

* Handle index building errors correctly in optimization code

* More tests: create index, multiple re-indexes, multiple tables

Thanks,
Michail.

Attachment

v2-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

07 May 2024, 20:23:23

Hi again!

Made an error in `GlobalVisHorizonKindForRel` logic, and it was caught by a new test.

Fixed version in attach.

Attachment

v3-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

09 May 2024, 13:00:00

Hello, Matthias and others!

Realized new horizon was applied only during validation phase (once index is marked as ready).
Now it applied if index is not marked as valid yet.

Updated version in attach.

--------------------------------------------------

> I think the best way for this to work would be an index method that
> exclusively stores TIDs, and of which we can quickly determine new
> tuples, too. I was thinking about something like GIN's format, but
> using (generation number, tid) instead of ([colno, colvalue], tid) as
> key data for the internal trees, and would be unlogged (because the
> data wouldn't have to survive a crash). Then we could do something
> like this for the second table scan phase:

Regarding that approach to dealing with validation phase and resetting of snapshot:

I was thinking about it and realized: once we go for an additional index - we don't need the second heap scan at all!

We may do it this way:

* create target index, not marked as indisready yet
* create a temporary unlogged index with the same parameters to store tids (optionally with the indexes columns data, see below), marked as indisready (but not indisvalid)
* commit them both in a single transaction
* wait for other transaction to know about them and honor in HOT constraints and new inserts (for temporary index)
* now our temporary index is filled by the tuples inserted to the table
* start building out target index, resetting snapshot every so often (if it is "safe" index)
* finish target index building phase
* mark target index as indisready
* now, start validation of the index:
* take the reference snapshot
* take a visibility snapshot of the target index, sort it (as it done currently)
* take a visibility snapshot of our temporary index, sort it
* start merging loop using two synchronized cursors over both visibility snapshots
* if we encountered tid which is not present in target visibility snapshot
* insert it to target index
* if a temporary index contains the column's data - we may even avoid the tuple fetch
* if temporary index is tid-only - we fetch tuple from the heap, but as plus we are also skipping dead tuples from insertion to the new index (I think it is better option)
* commit everything, release reference snapshot
* wait for transactions older than reference snapshot (as it done currently)
* mark target index as indisvalid, drop temporary index
* done

So, pros:
* just a single heap scan
* snapshot is reset periodically

Cons:
* we need to maintain the additional index during the main building phase
* one more tuplesort

If the temporary index is unlogged, cheap to maintain (just append-only mechanics) this feels like a perfect tradeoff for me.

This approach will work perfectly with low amount of tuple inserts during the building phase. And looks like even in the worst case it still better than the current approach.

What do you think? Have I missed something?

Thanks,
Michail.

Attachment

v4-0001-WIP-fix-d9d076222f5b-VACUUM-ignore-indexing-opera.patch

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

11 June 2024, 08:58:05

Hello.

I did the POC (1) of the method described in the previous email, and it looks promising.

It doesn't block the VACUUM, indexes are built about 30% faster (22 mins vs 15 mins). Additional index is lightweight and does not produce any WAL.

I'll continue the more stress testing for a while. Also, I need to restructure the commits (my path was no direct) into some meaningful and reviewable patches.

[1] https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrently_approach

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Matthias van de Meent

Date:

06 August 2024, 23:40:51

On Tue, 11 Jun 2024 at 10:58, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
> Hello.
>
> I did the POC (1) of the method described in the previous email, and it looks promising.
>
> It doesn't block the VACUUM, indexes are built about 30% faster (22 mins vs 15 mins).

That's a nice improvement.

> Additional index is lightweight and does not produce any WAL.

That doesn't seem to be what I see in the current patchset:

https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrently_approach#diff-cc3cb8968cf833c4b8498ad2c561c786099c910515c4bf397ba853ae60aa2bf7R311

> I'll continue the more stress testing for a while. Also, I need to restructure the commits (my path was no direct)
intosome meaningful and reviewable patches.
 

While waiting for this, here are some initial comments on the github diffs:

- I notice you've added a new argument to
heapam_index_build_range_scan. I think this could just as well be
implemented by reading the indexInfo->ii_Concurrent field, as the
values should be equivalent, right?

- In heapam_index_build_range_scan, it seems like you're popping the
snapshot and registering a new one while holding a tuple from
heap_getnext(), thus while holding a page lock. I'm not so sure that's
OK, expecially when catalogs are also involved (specifically for
expression indexes, where functions could potentially be updated or
dropped if we re-create the visibility snapshot)

- In heapam_index_build_range_scan, you pop the snapshot before the
returned heaptuple is processed and passed to the index-provided
callback. I think that's incorrect, as it'll change the visibility of
the returned tuple before it's passed to the index's callback. I think
the snapshot manipulation is best added at the end of the loop, if we
add it at all in that function.

- The snapshot reset interval is quite high, at 500ms. Why did you
configure it that low, and didn't you make this configurable?

- You seem to be using WAL in the STIR index, while it doesn't seem
that relevant for the use case of auxiliary indexes that won't return
any data and are only used on the primary. It would imply that the
data is being sent to replicas and more data being written than
strictly necessary, which to me seems wasteful.

- The locking in stirinsert can probably be improved significantly if
we use things like atomic operations on STIR pages. We'd need an
exclusive lock only for page initialization, while share locks are
enough if the page's data is modified without WAL. That should improve
concurrent insert performance significantly, as it would further
reduce the length of the exclusively locked hot path.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

08 August 2024, 13:53:00

Hello, Matthias!

> While waiting for this, here are some initial comments on the github diffs:

Thanks for your review!
While stress testing the POC, I found some issues unrelated to the patch that need to be fixed first.
This is [1] and [2].

>> Additional index is lightweight and does not produce any WAL.
> That doesn't seem to be what I see in the current patchset:

Persistence is passed as parameter [3] and set to RELPERSISTENCE_UNLOGGED for auxiliary indexes [4].

> - I notice you've added a new argument to
> heapam_index_build_range_scan. I think this could just as well be
> implemented by reading the indexInfo->ii_Concurrent field, as the
> values should be equivalent, right?

Not always; currently, it is set by ResetSnapshotsAllowed[5].
We fall back to regular index build if there is a predicate or expression in the index (which should be considered "safe" according to [6]).
However, we may remove this check later.
Additionally, there is no sense in resetting the snapshot if we already have an xmin assigned to the backend for some reason.

> In heapam_index_build_range_scan, it seems like you're popping the
> snapshot and registering a new one while holding a tuple from
> heap_getnext(), thus while holding a page lock. I'm not so sure that's
> OK, expecially when catalogs are also involved (specifically for
> expression indexes, where functions could potentially be updated or
> dropped if we re-create the visibility snapshot)

Yeah, good catch.
Initially, I implemented a different approach by extracting the catalog xmin to a separate horizon [7]. It might be better to return to this option.

> In heapam_index_build_range_scan, you pop the snapshot before the
> returned heaptuple is processed and passed to the index-provided
> callback. I think that's incorrect, as it'll change the visibility of
> the returned tuple before it's passed to the index's callback. I think
> the snapshot manipulation is best added at the end of the loop, if we
> add it at all in that function.

Yes, this needs to be fixed as well.

> The snapshot reset interval is quite high, at 500ms. Why did you
> configure it that low, and didn't you make this configurable?

It is just a random value for testing purposes.
I don't think there is a need to make it configurable.
Getting a new snapshot is a cheap operation now, so we can do it more often if required.
Internally, I was testing it with a 0ms interval.

> You seem to be using WAL in the STIR index, while it doesn't seem
> that relevant for the use case of auxiliary indexes that won't return
> any data and are only used on the primary. It would imply that the
> data is being sent to replicas and more data being written than
> strictly necessary, which to me seems wasteful.

It just looks like an index with WAL, but as mentioned above, it is unlogged in actual usage.

> The locking in stirinsert can probably be improved significantly if
> we use things like atomic operations on STIR pages. We'd need an
> exclusive lock only for page initialization, while share locks are
> enough if the page's data is modified without WAL. That should improve
> concurrent insert performance significantly, as it would further
> reduce the length of the exclusively locked hot path.

Hm, good idea. I'll check it later.

Best regards & thanks again,
Mikhail

[1]: https://www.postgresql.org/message-id/CANtu0ohHmYXsK5bxU9Thcq1FbELLAk0S2Zap0r8AnU3OTmcCOA%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/CANtu0ojga8s9%2BJ89cAgLzn2e-bQgy3L0iQCKaCnTL%3Dppot%3Dqhw%40mail.gmail.com
[3]: https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrently_approach#diff-50abc48efcc362f0d3194aceba6969429f46fa1f07a119e555255545e6655933R93
[4]: https://github.com/michail-nikolaev/postgres/blob/e2698ca7c814a5fa5d4de8a170b7cae83034cade/src/backend/catalog/index.c#L1600
[5]: https://github.com/michail-nikolaev/postgres/blob/e2698ca7c814a5fa5d4de8a170b7cae83034cade/src/backend/catalog/index.c#L2657
[6]: https://github.com/michail-nikolaev/postgres/blob/e2698ca7c814a5fa5d4de8a170b7cae83034cade/src/backend/commands/indexcmds.c#L1129
[7]: https://github.com/postgres/postgres/commit/38b243d6cc7358a44cb1a865b919bf9633825b0c

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

02 September 2024, 00:19:00

Hello, Matthias!

Just wanted to update you with some information about the next steps in work.

> In heapam_index_build_range_scan, it seems like you're popping the
> snapshot and registering a new one while holding a tuple from
> heap_getnext(), thus while holding a page lock. I'm not so sure that's
> OK, expecially when catalogs are also involved (specifically for
> expression indexes, where functions could potentially be updated or
> dropped if we re-create the visibility snapshot)

I have returned to the solution with a dedicated catalog_xmin for backends [1].
Additionally, I have added catalog_xmin to pg_stat_activity [2].

> In heapam_index_build_range_scan, you pop the snapshot before the
> returned heaptuple is processed and passed to the index-provided
> callback. I think that's incorrect, as it'll change the visibility of
> the returned tuple before it's passed to the index's callback. I think
> the snapshot manipulation is best added at the end of the loop, if we
> add it at all in that function.

Now it's fixed, and the snapshot is reset between pages [3].

Additionally, I resolved the issue with potential duplicates in unique indexes. It looks a bit clunky, but it works for now [4].

Single commit from [5] also included, just for stable stress testing.

Full diff is available at [6].

Best regards,
Mikhail.

[1]: https://github.com/michail-nikolaev/postgres/commit/01a47623571592c52c7a367f85b1cff9d8b593c0

[2]: https://github.com/michail-nikolaev/postgres/commit/d3345d60bd51fe2e0e4a73806774b828f34ba7b6

[3]: https://github.com/michail-nikolaev/postgres/commit/7d1dd4f971e8d03f38de95f82b730635ffe09aaf

[4]: https://github.com/michail-nikolaev/postgres/commit/4ad56e14dd504d5530657069068c2bdf172e482d

[5]: https://commitfest.postgresql.org/49/5160/

[6]: https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrently_approach?diff=split&w=

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

25 December 2024, 18:14:00

Hello, Michael!

Thank you for your comments and feedback!

Yes, this patch set contains a significant amount of code, which makes it challenging to review. Some details are explained in the commit messages, but I’m doing my best to structure the patch set in a way that is as committable as possible. Once all the parts are ready, I plan to write a detailed letter explaining everything, including benchmark results and other relevant information.

Meanwhile, here’s a quick overview of the patch structure. If you have suggestions for an alternative decomposition approach, I’d be happy to hear.

The primary goals of the patch set are to:
* Enable the xmin horizon to propagate freely during concurrent index builds
* Build concurrent indexes with a single heap scan

The patch set is split into the following parts. Technically, each part could be committed separately, but all of them are required to achieve the goals.

Part 1: Stress tests

- 0001: Yes, this patch is from another thread and not directly required, it’s included here as a single commit because it’s necessary for stress testing this patch set. Without it, issues with concurrent reindexing and upserts cause failures.

- 0002: Yes, I agree these tests need to be refactored or moved into a separate task. I’ll address this later.

Part 2: During the first phase of concurrently building a index, reset the snapshot used for heap scans between pages, allowing xmin to go forward.

- 0003: Implement such snapshot resetting for non-parallel and non-unique cases

- 0004: Extends snapshot resetting to parallel builds

- 0005: Extends snapshot resetting to unique indexes

Part 3: Build concurrent indexes in a single heap scan

- 0006: Introduces the STIR (Short-Term Index Replacement) access method, a specialized method for auxiliary indexes during concurrent builds

- 0007: Implements the auxiliary index approach, enabling concurrent index builds to use a single heap scan.

In a few words, it works like this: create an empty auxiliary STIR index to track new tuples, scan heap and build new index, merge STIR tuples into new index, drop auxiliary index.

- 0008: Enhances the auxiliary index approach by resetting snapshots during the merge phase, allowing xmin to propagate

Part 4: This part depends on all three previous parts being committed to make sense (other parts are possible to apply separately).

- 0009: Remove PROC_IN_SAFE_IC logic, as it is no more required

I have a plan to add a few additional small things (optimizations) and then do some scaled stress-testing and benchmarking. I think that without it, no one is going to spend his time for such an amount of code :)

Merry Christmas,

Mikhail.

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Matthias van de Meent

Date:

04 January, 04:12:58

On Wed, 1 Jan 2025 at 17:17, Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:
>
> Hello, everyone!
>
> I’ve added several updates to the patch set:
>
> * Automatic auxiliary index removal where applicable.
> * Documentation updates to reflect recent changes.
> * Optimization for STIR indexes: skipping datum setup, as they store only TIDs.
> * Numerous assertions to ensure that MyProc->xmin is invalid where necessary.
>
> I’d like to share some initial benchmark results (see attached graphs).
> This involves building a B-tree index on (aid, abalance) in a pgbench setup with scale 2000 (with WAL), while running
aconcurrent pgbench workload. 
>
> The patched version built the index in 68 seconds, compared to 117 seconds with the master branch (mostly because of
asingle heap scan). 
> There appears to be no effect on the throughput of the concurrent pgbench.
> The maximum snapshot age remains near zero.

Thank you for continuing working on this, these are some nice results.
I'm sorry I can't spend the time I want on this every time, but I
still think it's important this can eventually get committed, so thank
you for your work.

> (mostly because of a single heap scan).

Isn't there a second heap scan, or do you consider that an index scan?

> I am going to continue to benchmark with different options: different HOT setup, unique index, different index types
andDB size (100+ GB). 
> If someone has some ideas about possible benchmark scenarios - please share.

I think a good benchmark could show how bloat is actually prevented,
i.e. through result table size comparisons on an update-heavy
workload, both with and without the patch.
I think it shouldn't be too difficult to show how such workloads
quickly regress to always extending the table as no cleanup can
happen, while patched they'd have much more leeway due to page
pruning. Presumably a table with a fillfactor <100 would show the best
results.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Mihail Nikalayeu

Date:

20 February, 17:56:20

Hello, everyone.

Just rebased.

Also, this is Discord thread: https://discordapp.com/channels/1258108670710124574/1259884843165155471/1334565506149253150

Attachment

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Michail Nikolaev

Date:

08 March, 01:58:51

Hello, everyone!

Rebased + new parallel GIN builds supported.

Best regards,

MIkhail.

Attachment

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Mihail Nikalayeu

Date:

07 April, 02:45:00

Hello, everyone!

Rebased, updated accordingly to some changes.

Best regards,

Mikhail.

Attachment

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Mihail Nikalayeu

Date:

30 April, 23:01:24

Hello, Andres!

This is a gentle ping [1] about the patch related to optimization of
RE|CREATE INDEX CONCURRENTLY. Below is an explanation of the patch
set.

QUICK INTRO
What patch set does in a few words: "CIC/RIC are 2x-3x faster and does
not prevent xmin horizon to advance, without any dirty tricks, even
with removing one of them".
How it works in a few words: "Reset snapshot between pages during the
first phase. Replaces the second phase using a special auxiliary index
to collect TIDs of tuples that need to be inserted into the target
index after the first phase".
What are drawbacks: "some additional complexity + additional auxiliary
index-like structure involved."

SOME HISTORY
In 2021 Álvaro proposed [2] and committed [3] the feature: VACUUM
ignores snapshots involved into concurrent indexing operations. This
was a great feature in PG14.
But in 2022 a bug related to the tuples missing in indexes was
detected, and a little bit later explained by Andres [4]. As a result,
feature was reverted [5] with Álvaro's comment[6]:

> Deciding to revert makes me sad, because this feature is extremely
> valuable for users. However, I understand the danger and I don't
> disagree with the rationale so I can't really object.

There were some proposals, like introducing a special horizon for
HOT-pruning or stopping it during the CIC, but after some discussions
Andres said [7]:

> I'm also doubtful it's the right approach.
> The problem here comes from needing a snapshot for the entire duration of the validation scan
> ISTM that we should work on not needing that snapshot, rather than trying to reduce the consequences of holding that
snapshot.
> I think it might be possible to avoid it.

So, given these two ideas I began the work on the patch.

STRUCTURE

Patch itself contains 4 parts, some of them may be reviewed/committed
separately. All commit messages are detailed and contain additional
explanation of changes.

To not confuse CFBot, commits are presented in the following way: part
1, 2, 3 and 4. If you want only part 3 to test/review – check the
files with "patch_" extensions. They differ a little bit, but changes
are minor.

PART 1
Test set (does not depend on anything)

This is a set of stress tests and some fixes required for those tests
to reliably succeed (even on current master branch). That part is not
required for any other parts – its goal is to make sure everything is
still working correctly while applying other parts/commits.
It includes:
- fixes related to races in case of ON CONFLICT UPDATE + REINDEX
CONCURRENTLY (issue was discovered during testing of that patch) [8]
- fixes in bt_index_parent_check (issue was discovered during testing
of that patch with enormous amount of pain – I was looking for months
for error in patch because of single fail of bt_index_parent_check but
it was an issue with bt_index_parent_check itself) [9].

PART 2
Resetting snapshots during the first phase of CIC (does not depend on anything)

It is based on Matthias' idea [10] - to just reset snapshots every so
often during a concurrent index build. It may work only during the
first scan (because we'll miss some tuples anyway during validation
scan with such approach).
Logic is simple – since the index built by the first scan already
misses a lot of tuples – we may not worry to miss a few more – the
validation phase is going to fix it anyway. Of course it is not so
simple in case of unique indexes, but still possible.

Once committed: xmin is advanced at least during the first phase of
concurrent index build.

Commits are:
- Reset snapshots periodically in non-unique non-parallel concurrent
index builds

Apply this technique to the simplest case – non-unique and
non-parallel. Snapshot is changed "between" pages.
One possible place here to worry about – to ensure xmin advanced we
need to call InvalidateCatalogSnapshot during each snapshot switch.
So, theoretically it may cause some issues, but the table is locked to
changes during the process. At least commit [3] (which ignored xmin of
CIC backend) did the same thing actually.
Another more "clear" option here - we may just extract a separate
catalog snapshot horizon (one more field near xmin specially only for
catalog snapshot), it seems to be a pretty straightforward change).

- Support snapshot resets in parallel concurrent index builds

Extend that technique to parallel builds. It is mostly about ensuring
workers have an initial snapshot restored from the leader before the
leader goes to reset it.

- Support snapshot resets in concurrent builds of unique indexes

The most tricky commit in the second part – apply that to unique
indexes. Changing of snapshots may cause issues with validation of
unique constraints. Currently validation is done during the sorting of
tuples, but that doesn't work with tuples read with different
snapshots (some of them are dead already). To deal with it:
- in case we see two identical tuples during tuplesort – ignore if
some of them are dead according to SnapshotSelf, but fail if two are
alive. It is not a required part, it is just mechanics for fail-fast
behavior and may be removed.
- to provide the guarantee – during _bt_load compare the inserted
index value with previously inserted. If they are equal – make sure
only a single SnapshotSelf alive tuple exists in the whole equal
"group" (it may include more than two tuples in general).

Theoretically it may affect performance of _bt_load because of
_bt_keep_natts(_fast) call for each tuple, but I was unable to notice
any significant difference here. Anyway it is compensated by Part 3
for sure.

PART 3
STIR-based validation phase CIC (does not depend on anything)

That part is about a way to replace the second phase of CIC in a more
effective way (and with the ability to allow horizon advance as an
additional bonus).

The role of the second phase is to find tuples which are not present
in the index built by the first scan, because:
- some of them were too new for the snapshot used during the first phase
- even if we were to use SnapshotSelf to accept all alive tuples –
some of them may be inserted in pages already visited by the scan

The main idea is:
- before starting the first scan lets prepare a special auxiliary
super-lightweight index (it is not even an index or access method,
just pretends to be) with the same columns, expressions and predicates
- that access method (Short Term Index Replacement – STIR) just
appends TID of new coming tuples, without WAL, minimum locking,
simplest append-only structure, without actual indexed data
- it remembers all new TIDs inserted to the table during the first phase
- once our main (target) index receives updates itself we may safely
clear "ready" flag on STIR
- if our first phase scan missed something – it is guaranteed to be
present in that STIR index
- so, instead of requirement to compare the whole table to the index,
we need only to compare to TIDs stored in the STIR
- as a bonus we may reset snapshots during the comparison without risk
of any issues caused by HOT pruning (the issue [4] caused revert of
[3]).

That approach provides a significant performance boost in terms of
time required to build the index. STIR itself theoretically causes
some performance impact, but I was not able to detect it. Also, some
optimizations are applied to it (see below). Details of benchmarks are
presented below as well.

Commits are:
- Add STIR access method and flags related to auxiliary indexes

This one adds STIR code and some flags to distinguish real and
auxiliary indexes.

- Add Datum storage support to tuplestore

Add ability to store Datum in tuplestore. It is used by the following
commits to leverage performance boost from prefetching of the pages
during validation phase.

- Use auxiliary indexes for concurrent index operations

The main part is here. It contains all the logic for creation of
auxiliary index, managing its lifecycle, new validation phase and so
on (including progress reporting, some documentation updates, ability
to have unlogged index for logged table, etc). At the same time it
still relies on a single referenced snapshot during the validation
phase.

- Track and drop auxiliary indexes in DROP/REINDEX

That commit adds different techniques to avoid any additional
administration requirements to deal with auxiliary indexes in case of
error during the index build (junk auxiliary indexes). It adds
dependency tracking, special logic for handling REINDEX calls and
other small things to make administrator's life a little bit easier.

- Optimize auxiliary index handling

Since the STIR index does not contain any actual data we may skip
preparation of that during tuple insert. Commit implements such
optimization.

- Refresh snapshot periodically during index validation

Adds logic to the new validation phase to reset the snapshot every so
often. Currently it does it every 4096 pages visited.

PART 4 (depends on part 2 and part 3)

Commits are:

- Remove PROC_IN_SAFE_IC optimization

This is a small part which makes sense in case both parts 2 and 3 were
applied. Once it's done – CIC does not prevent the horizon from
advancing regularly.
It makes the PROC_IN_SAFE_IC optimization [11] obsolete, because one
CIC now has no issue waiting for the xmin of the other (because it
advances regularly).

BENCHMARKS

I have spent a lot of time benchmarking the patch in different
environments (local SSD, local SSD with 1ms delay, io2 AWS) and the
results look impressive.
I can't measure any performance (or significant space usage)
degradation because of STIR index presence, but performance boost
because of new validation phases gives up to 3x -4x time boost. And
without any VACUUM-related issues during that time (so, other
operations on the databases will benefit from that easily compensating
additional STIR-related cost).

Description of benchmarks are available here [12].

Some results are here: [13] and here [14], code is here [15].

There is also a Discord thread here [16].

Feel free to ask any question and request benchmarks for some scenarios.

Best regards,
Mikhail.

[1]: https://discord.com/channels/1258108670710124574/1334565506149253150/1339368558408372264
[2]:
https://www.postgresql.org/message-id/flat/20210115142926.GA19300%40alvherre.pgsql#0988173cb0cf4b8eb710a6cdaa88fcac
[3]: https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d
[4]:
https://www.postgresql.org/message-id/flat/20220524190133.j6ee7zh4f5edt5je%40alap3.anarazel.de#1781408f40034c414ad6738140c118ef
[5]: https://github.com/postgres/postgres/commit/e28bb885196916b0a3d898ae4f2be0e38108d81b
[6]:
https://www.postgresql.org/message-id/flat/202205251643.2py5jjpaw7wy%40alvherre.pgsql#589508d30b480b905619dcb93dac8cf8
[7]:
https://www.postgresql.org/message-id/flat/20220525170821.rf6r4dnbbu4baujp%40alap3.anarazel.de#accf63164fd552b6ad68ff0a870e60c0
[8]: https://commitfest.postgresql.org/patch/5160/
[9]: https://commitfest.postgresql.org/patch/5438/
[10]:
https://www.postgresql.org/message-id/flat/CAEze2WgW6pj48xJhG_YLUE1QS%2Bn9Yv0AZQwaWeb-r%2BX%3DHAxU_g%40mail.gmail.com#b3809c158de4481bb1b29894aaa63fae
[11]: https://github.com/postgres/postgres/commit/c98763bf51bf610b3ee7e209fc76c3ff9a6b3163
[12]:
https://www.postgresql.org/message-id/flat/CANtu0ojHAputNCH73TEYN_RUtjLGYsEyW1aSXmsXyvwf%3D3U4qQ%40mail.gmail.com#b18fb8efab086bc22af1cb015e187cb7
[13]:
https://www.postgresql.org/message-id/flat/CANtu0ojiVez054rKvwZzKNhneS2R69UXLnw8N9EdwQwqfEoFdQ%40mail.gmail.com#02b4ee45a5a5c1ccac6d168052173952
[14]: https://docs.google.com/spreadsheets/d/1UYaqpsWSfYdZdQxaqY4gVo0RW6KrT0d-U1VDNJB8lVk/edit?usp=sharing
[15]: https://gist.github.com/michail-nikolaev/b33fb0ac1f35729388c89f72db234b0f
[16]: https://discord.com/channels/1258108670710124574/1259884843165155471/1334565506149253150

Attachment

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Mihail Nikalayeu

Date:

18 May, 19:09:29

Hello, Álvaro!

> I didn't understand why you have a few "v19" patches and also a separate
> series of "v19-only-part-3-" patches.  Is there duplication?  How do
> people know which series comes first?

This was explained in the previous email [0]:

> Patch itself contains 4 parts, some of them may be reviewed/committed
> separately. All commit messages are detailed and contain additional
> explanation of changes.

> To not confuse CFBot, commits are presented in the following way: part
> 1, 2, 3 and 4. If you want only part 3 to test/review – check the
> files with "patch_" extensions. They differ a little bit, but changes
> are minor.

If you have an idea of a better way to handle it, please share. Yes,
the current approach is a bit odd.

> I think it would be better to get the PDF poster in a wiki page ... in
> fact I would suggest to Andrey that he could start a wiki page with all
> the PDFs presented at the conference.  Distributing a bunch of 2 MB pdf
> via the mailing list doesn't sound too great an idea to me.  A few
> people are having trouble with email quotas in cloud services, and the
> list server gets bothered because of it.  Kindly don't do that anymore.

Oh, you're right—I just didn't think of that. My bad, sorry about that.

Best regards,
Mikhail.

[0]:
https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%40mail.gmail.com#52c97e004b8f628473124c05e3bf2da1

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Sergey Sargsyan

Date:

16 June, 19:17:33

Hey Mihail,

I've started looking at the patches today, mostly the STIR part. Seems solid, but I've got a question about validation. Why are we still grabbing tids from the main index and sorting them?

I think it's to avoid duplicate errors when adding tuples from STIP to the main index, but couldn't we just suppress that error during validation and skip the new tuple insertion if it already exists?

The main index may get huge after building, and iterating over it in a single thread and then sorting tids can be time consuming.

At least I guess one can skip it when STIP is empty. But, I think we could skip it altogether by figuring out what to do with duplicates, making concurrent and non-concurrent index creation almost identical in speed (only locking and atomicity would differ).

p.s. I noticed that `stip.c` has a lot of functions that don't follow the Postgres coding style of return type on separate line.

On Mon, Jun 16, 2025, 6:41 PM Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:

Hello, everyone!

Rebased, patch structure and comments available here [0]. Quick
introduction poster - here [1].

Best regards,
Mikhail.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%40mail.gmail.com#52c97e004b8f628473124c05e3bf2da1
[1]: https://www.postgresql.org/message-id/attachment/176651/STIR-poster.pdf

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Mihail Nikalayeu

Date:

16 June, 23:00:59

Hello, Sergey!

> I think it's to avoid duplicate errors when adding tuples from STIP to the main index,
> but couldn't we just suppress that error during validation and skip the new tuple insertion if it already exists?

In some cases, it is not possible:
– Some index types (GiST, GIN, BRIN) do not provide an easy way to
detect such duplicates.
– When we are building a unique index, we cannot simply skip
duplicates, because doing so would also skip the rows that should
prevent the unique index from being created (unless we add extra logic
for B-tree indexes to compare TIDs as well).

> The main index may get huge after building, and iterating over it in a single thread and then sorting tids can be
timeconsuming. 
My tests indicate that the overhead is minor compared with the time
spent scanning the heap and building the index itself.

> At least I guess one can skip it when STIP is empty.
Yes, that’s a good idea; I’ll add it later.

> p.s. I noticed that `stip.c` has a lot of functions that don't follow the Postgres coding style of return type on
separateline. 
Hmm... I’ll fix that as well.

Best regards,
Mikhail.

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Mihail Nikalayeu

Date:

18 June, 13:49:51

Hello, Sergey!

> In patch v20-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch, within the "StirMarkAsSkipInserts"
function,a critical section appears to be left unclosed. This resulted in an assertion failure during ANALYZE of a
tablecontaining a leftover STIR index. 
Thanks, good catch. I'll add it to batch fix with the other things.

Best regards,
Mikhail.

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Sergey Sargsyan

Date:

18 June, 19:33:07

Hi,

Today I encountered a segmentation fault caused by the patch v20-0007-Add-Datum-storage-support-to-tuplestore.patch. During the merge phase, I inserted some tuples into the table so that STIR would have data for the validation phase. The segfault occurred during a call to tuplestore_end().

The root cause is that a few functions in the tuplestore code still assume that all stored data is a pointer and thus attempt to pfree it. This assumption breaks when datumByVal is used, as the data is stored directly and not as a pointer. In particular, tuplestore_end(), tuplestore_trim(), and tuplestore_clear() incorrectly try to free such values.

When addressing this, please also ensure that context memory accounting is handled properly: we should not increment or decrement the remaining context memory size when cleaning or trimming datumByVal entries, since no actual memory was allocated for them.

Interestingly, I’m surprised you haven’t hit this segfault yourself. Are you perhaps testing on an older system where INT8OID is passed by reference? Or is your STIR always empty during the validation phase?

One more point: I noticed you modified the index_create() function signature. You added the relpersistence parameter, which seems unnecessary—this can be determined internally by checking if it’s an auxiliary index, in which case the index should be marked as unlogged. You also added an auxiliaryIndexOfOid argument (do not remember exact naming, but was used for dependency). It might be cleaner to pass this via the IndexInfo structure instead. index_create() already has dozens of mouthful arguments, and external extensions (like pg_squeeze) still rely on the old signature, so minimizing changes to the function interface would improve compatibility.

Best regards,

Sergey

On Wed, Jun 18, 2025, 1:50 PM Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:

Hello, Sergey!

> In patch v20-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch, within the "StirMarkAsSkipInserts" function, a critical section appears to be left unclosed. This resulted in an assertion failure during ANALYZE of a table containing a leftover STIR index.
Thanks, good catch. I'll add it to batch fix with the other things.

Best regards,
Mikhail.

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Mihail Nikalayeu

Date:

19 June, 00:15:28

Hello, Sergey!

> Today I encountered a segmentation fault caused by the patch v20-0007-Add-Datum-storage-support-to-tuplestore.patch.
Duringthe merge phase, I inserted some tuples into the table so that STIR would have data for the validation phase. The
segfaultoccurred during a call to tuplestore_end(). 
>
> The root cause is that a few functions in the tuplestore code still assume that all stored data is a pointer and thus
attemptto pfree it. This assumption breaks when datumByVal is used, as the data is stored directly and not as a
pointer.In particular, tuplestore_end(), tuplestore_trim(), and tuplestore_clear() incorrectly try to free such values. 
>
> When addressing this, please also ensure that context memory accounting is handled properly: we should not increment
ordecrement the remaining context memory size when cleaning or trimming datumByVal entries, since no actual memory was
allocatedfor them. 
>
> Interestingly, I’m surprised you haven’t hit this segfault yourself. Are you perhaps testing on an older system where
INT8OIDis passed by reference? Or is your STIR always empty during the validation phase? 

Thanks for pointing that out. It looks like tuplestore_trim and
tuplestore_clear are broken, while tuplestore_end seems to be correct
but fails due to previous heap corruption.
In my case, tuplestore_trim and tuplestore_clear aren't called at all
- that's why the issue wasn't detected. I'm not sure why; perhaps some
recent changes in your codebase are affecting that?

Please run a stress test (if you've already applied the in-place fix
for the tuplestore):
         ninja && meson test --suite setup && meson test
--print-errorlogs --suite pg_amcheck *006*

This will help ensure everything else is working correctly on your system.

> One more point: I noticed you modified the index_create() function signature. You added the relpersistence parameter,
whichseems unnecessary— 
> this can be determined internally by checking if it’s an auxiliary index, in which case the index should be marked as
unlogged.You also added an 
> auxiliaryIndexOfOid argument (do not remember exact naming, but was used for dependency). It might be cleaner to pass
thisvia the IndexInfo structure 
> instead. index_create() already has dozens of mouthful arguments, and external extensions
> (like pg_squeeze) still rely on the old signature, so minimizing changes to the function interface would improve
compatibility.

Yes, that’s probably a good idea. I was trying to keep it simple from
the perspective of parameters to avoid dealing with some of the tricky
internal logic.
But you're right - it’s better to stick with the old signature.

Best regards,
Mikhail.

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Mihail Nikalayeu

Date:

21 June, 23:32:44

Hello, Sergey!

I have addressed your comments:
* skip TID scan in case of empty STIR index
* fix for critical section
* formatting
* index_create signature

Rebased, patch structure and comments available here [0]. Quick
introduction poster - here [1].

Best regards,
Mikhail.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%40mail.gmail.com#52c97e004b8f628473124c05e3bf2da1
[1]: https://www.postgresql.org/message-id/attachment/176651/STIR-poster.pdf

Attachment

Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From

Mihail Nikalayeu

Date:

03 July, 03:23:09

Hello!

Rebased again, patch structure and comments available here [0]. Quick
introduction poster - here [1].

Best regards,
Mikhail.

[0]:
https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%40mail.gmail.com#52c97e004b8f628473124c05e3bf2da1
[1]: https://www.postgresql.org/message-id/attachment/176651/STIR-poster.pdf