Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements - Mailing list pgsql-hackers

From Mihail Nikalayeu
Subject Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
Date
Msg-id CADzfLwVOcZ9mg8gOG+KXWurt=MHRcqNv3XSECYoXyM3ENrxyfQ@mail.gmail.com
In response to Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements  (Mihail Nikalayeu <mihailnikalayeu@gmail.com>)
List pgsql-hackers
Hello, Andres!

This is a gentle ping [1] about the patch related to optimization of
RE|CREATE INDEX CONCURRENTLY. Below is an explanation of the patch
set.

QUICK INTRO
What the patch set does, in a few words: "CIC/RIC are 2x-3x faster and do
not prevent the xmin horizon from advancing, without any dirty tricks,
even though one of their phases is removed".
How it works, in a few words: "Reset the snapshot between pages during the
first phase. Replace the second phase with a special auxiliary index
that collects the TIDs of tuples that need to be inserted into the target
index after the first phase".
What the drawbacks are: "some additional complexity + an additional
auxiliary index-like structure involved."

SOME HISTORY
In 2021 Álvaro proposed [2] and committed [3] the feature: VACUUM
ignores snapshots involved in concurrent indexing operations. This
was a great feature in PG14.
But in 2022 a bug related to tuples missing from indexes was
detected, and a little later explained by Andres [4]. As a result,
the feature was reverted [5] with Álvaro's comment [6]:

> Deciding to revert makes me sad, because this feature is extremely
> valuable for users.  However, I understand the danger and I don't
> disagree with the rationale so I can't really object.

There were some proposals, like introducing a special horizon for
HOT-pruning or stopping it during the CIC, but after some discussions
Andres said [7]:

> I'm also doubtful it's the right approach.
> The problem here comes from needing a snapshot for the entire duration of the validation scan.
> ISTM that we should work on not needing that snapshot, rather than trying to reduce the consequences of holding that snapshot.
> I think it might be possible to avoid it.

So, given these two ideas, I began work on the patch set.

STRUCTURE

The patch set contains 4 parts; some of them may be reviewed/committed
separately. All commit messages are detailed and contain additional
explanation of the changes.

To avoid confusing CFBot, the commits are presented in order: parts
1, 2, 3 and 4. If you want to test/review only part 3 – check the
files with the "patch_" extension. They differ a little, but the
changes are minor.

PART 1
Test set (does not depend on anything)

This is a set of stress tests plus some fixes required for those tests
to reliably succeed (even on the current master branch). This part is
not required by any other part – its goal is to make sure everything
keeps working correctly while the other parts/commits are applied.
It includes:
- fixes related to races in the case of ON CONFLICT UPDATE + REINDEX
CONCURRENTLY (an issue discovered during testing of this patch) [8]
- fixes in bt_index_parent_check (an issue discovered, with an enormous
amount of pain, during testing of this patch – I spent months looking
for an error in the patch because of a single bt_index_parent_check
failure, but it turned out to be an issue in bt_index_parent_check
itself) [9].

PART 2
Resetting snapshots during the first phase of CIC (does not depend on anything)

It is based on Matthias' idea [10] – to just reset snapshots every so
often during a concurrent index build. This works only during the
first scan (with such an approach we would miss some tuples during the
validation scan anyway).
The logic is simple: since the index built by the first scan already
misses a lot of tuples, we need not worry about missing a few more –
the validation phase is going to fix that anyway. Of course, it is not
that simple in the case of unique indexes, but it is still possible.

Once committed: xmin is advanced at least during the first phase of a
concurrent index build.

Commits are:
- Reset snapshots periodically in non-unique non-parallel concurrent
index builds

Apply the technique to the simplest case – non-unique and
non-parallel builds. The snapshot is changed "between" pages.
One thing to worry about here: to ensure xmin advances, we need to
call InvalidateCatalogSnapshot during each snapshot switch.
Theoretically that could cause some issues, but the table is locked
against changes during the process. At least commit [3] (which ignored
the xmin of the CIC backend) effectively did the same thing.
Another, "cleaner" option: we could introduce a separate catalog
snapshot horizon (one more field next to xmin, specifically for the
catalog snapshot); that seems to be a pretty straightforward change.
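To illustrate the control flow only – a toy Python sketch, not PostgreSQL code: the first scan takes a fresh snapshot before each page, and tuples too new for the current snapshot are simply skipped, to be picked up later by the validation phase. Here `take_snapshot` stands in for GetTransactionSnapshot() plus the required InvalidateCatalogSnapshot() call, and visibility is reduced to a bare xid comparison:

```python
# Toy model of the first CIC heap scan with per-page snapshot resets.
# A "snapshot" is just the highest xid visible to the scan, and a
# tuple is visible when its xmin is at or below that horizon.

def build_index_with_snapshot_resets(pages, take_snapshot):
    """Scan heap pages, taking a fresh snapshot before each page."""
    index = []
    for page in pages:
        snapshot = take_snapshot()      # reset "between" pages
        for tid, xmin in page:
            if xmin <= snapshot:        # visible under current snapshot
                index.append(tid)
        # Tuples with xmin > snapshot are skipped on purpose: the
        # validation phase is responsible for inserting missed tuples.
    return index
```

Because the old snapshot is dropped at every page boundary, nothing pins the xmin horizon for the duration of the whole scan.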

- Support snapshot resets in parallel concurrent index builds

Extends the technique to parallel builds. It is mostly about ensuring
that workers have the initial snapshot restored from the leader before
the leader goes on to reset it.

- Support snapshot resets in concurrent builds of unique indexes

The trickiest commit of the second part – applying the technique to
unique indexes. Changing snapshots may cause issues with the
validation of unique constraints. Currently validation is done while
sorting the tuples, but that does not work for tuples read under
different snapshots (some of them are already dead). To deal with it:
- if we see two identical tuples during the tuplesort – ignore the
pair if one of them is dead according to SnapshotSelf, but fail if
both are alive. This is not a required part; it is just mechanics for
fail-fast behavior and may be removed.
- to provide the actual guarantee – during _bt_load, compare each
inserted index value with the previously inserted one. If they are
equal – make sure only a single SnapshotSelf-alive tuple exists in the
whole equal "group" (which in general may include more than two tuples).

Theoretically this may affect the performance of _bt_load because of
the _bt_keep_natts(_fast) call for each tuple, but I was unable to
notice any significant difference. In any case, it is more than
compensated for by Part 3.
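The equal-"group" check during _bt_load can be sketched as follows (a toy Python model, not the patch's C code; "alive" stands for visibility under SnapshotSelf, and tuples arrive already in index sort order, as they would from the tuplesort):

```python
# Toy model of the uniqueness check over sorted index tuples built
# under different snapshots. Each tuple is (key, alive). Within every
# group of equal keys, at most one tuple may be alive: a second alive
# tuple is a genuine uniqueness violation, dead duplicates are ignored.

def check_unique_sorted(tuples):
    """Fail only if two SnapshotSelf-alive tuples share one key."""
    prev_key = object()            # sentinel that equals no real key
    alive_in_group = 0
    for key, alive in tuples:
        if key != prev_key:        # a new group of equal keys begins
            prev_key, alive_in_group = key, 0
        if alive:
            alive_in_group += 1
            if alive_in_group > 1:
                raise ValueError(f"duplicate alive tuples for key {key!r}")
```

Dead versions of an updated row may legitimately share a key with the live version, which is why only the count of alive tuples per group matters.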

PART 3
STIR-based validation phase CIC (does not depend on anything)

This part replaces the second phase of CIC with a more effective
approach (with the ability to allow horizon advancement as an
additional bonus).

The role of the second phase is to find tuples that are not present
in the index built by the first scan, because:
- some of them were too new for the snapshot used during the first phase
- even if we used SnapshotSelf to accept all alive tuples – some of
them may be inserted into pages already visited by the scan

The main idea is:
- before starting the first scan, create a special auxiliary
super-lightweight index (it is not even a real index or access method,
it just pretends to be one) with the same columns, expressions and
predicates
- that access method (Short Term Index Replacement – STIR) just
appends the TIDs of newly arriving tuples: no WAL, minimal locking,
the simplest append-only structure, without any actual indexed data
- it remembers all new TIDs inserted into the table during the first phase
- once our main (target) index starts receiving updates itself, we may
safely clear the "ready" flag on the STIR
- if the first-phase scan missed something – it is guaranteed to be
present in the STIR index
- so, instead of having to compare the whole table to the index, we
only need to compare it to the TIDs stored in the STIR
- as a bonus, we may reset snapshots during the comparison without
risking any issues caused by HOT pruning (the issue [4] that caused
the revert of [3]).
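A toy Python model of that flow (all names here are illustrative, not the patch's actual API): the STIR only accumulates TIDs during the first phase, and validation then walks those TIDs instead of rescanning the whole table:

```python
# Toy model of STIR-based validation. While the first scan runs, every
# new tuple's TID is also appended to the Stir (modelling the WAL-less,
# append-only auxiliary index). Validation then only has to look at
# those TIDs, not at every page of the table.

class Stir:
    """Models the auxiliary index: an append-only TID log, no data."""
    def __init__(self):
        self.tids = []

    def insert(self, tid):
        self.tids.append(tid)      # no WAL, no ordering, just append

def validate_with_stir(stir, target_index, heap_is_alive):
    """Second-phase replacement: insert into target_index every STIR
    TID it is missing, skipping tuples already dead in the heap."""
    inserted = []
    for tid in stir.tids:
        if tid not in target_index and heap_is_alive(tid):
            target_index.add(tid)
            inserted.append(tid)
    return inserted
```

The cost of validation is now proportional to the write traffic during the build, not to the size of the table.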

This approach provides a significant performance boost in terms of the
time required to build the index. The STIR itself theoretically has
some performance impact, but I was not able to detect it. Also, some
optimizations are applied to it (see below). Details of the benchmarks
are presented below as well.

Commits are:
- Add STIR access method and flags related to auxiliary indexes

This one adds STIR code and some flags to distinguish real and
auxiliary indexes.

- Add Datum storage support to tuplestore

Adds the ability to store Datums in a tuplestore. It is used by the
following commits to get a performance boost from prefetching pages
during the validation phase.

- Use auxiliary indexes for concurrent index operations

The main part is here. It contains all the logic for creating the
auxiliary index, managing its lifecycle, the new validation phase, and
so on (including progress reporting, some documentation updates, the
ability to have an unlogged index for a logged table, etc.). At this
point it still relies on a single referenced snapshot during the
validation phase.

- Track and drop auxiliary indexes in DROP/REINDEX

This commit adds several techniques to avoid any additional
administration burden when dealing with auxiliary indexes left behind
by an error during an index build (junk auxiliary indexes). It adds
dependency tracking, special logic for handling REINDEX calls, and
other small things to make the administrator's life a little easier.

- Optimize auxiliary index handling

Since the STIR index does not contain any actual data, we may skip
preparing that data during tuple insertion. This commit implements
that optimization.

- Refresh snapshot periodically during index validation

Adds logic to the new validation phase to reset the snapshot every so
often – currently after every 4096 pages visited.
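The cadence itself is simple; a Python sketch of the gating (the counting is modelled, not copied from the patch – only the 4096-page interval comes from the text above):

```python
# Toy model of the snapshot-refresh cadence in the new validation
# phase: visit pages, and every RESET_INTERVAL pages swap in a fresh
# snapshot so the xmin horizon can keep advancing during validation.

RESET_INTERVAL = 4096   # pages between snapshot refreshes, per the text

def validation_scan(n_pages, take_snapshot):
    """Visit n_pages pages, counting how many snapshot resets occur."""
    snapshot = take_snapshot()
    resets = 0
    for page in range(n_pages):
        if page > 0 and page % RESET_INTERVAL == 0:
            snapshot = take_snapshot()   # old snapshot no longer pins xmin
            resets += 1
        # ... compare this page's STIR TIDs against the target index ...
    return resets
```

Resetting is safe here precisely because the STIR, not the snapshot, guarantees that no missed tuple is lost.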

PART 4 (depends on part 2 and part 3)

Commits are:

- Remove PROC_IN_SAFE_IC optimization

This is a small part which makes sense once both parts 2 and 3 are
applied. After that – CIC does not prevent the horizon from advancing.
It makes the PROC_IN_SAFE_IC optimization [11] obsolete, because one
CIC now has no issue waiting for the xmin of another (since it
advances regularly).

BENCHMARKS

I have spent a lot of time benchmarking the patch in different
environments (local SSD, local SSD with 1 ms delay, io2 on AWS), and
the results look impressive.
I cannot measure any performance (or significant space usage)
degradation caused by the presence of the STIR index, while the new
validation phase gives up to a 3x-4x speedup. And without any
VACUUM-related issues during that time (so other operations on the
database benefit as well, easily compensating for the additional
STIR-related cost).

A description of the benchmarks is available here [12].

Some results are here: [13] and here [14], code is here [15].

There is also a Discord thread here [16].

Feel free to ask any questions or to request benchmarks for specific scenarios.

Best regards,
Mikhail.

[1]: https://discord.com/channels/1258108670710124574/1334565506149253150/1339368558408372264
[2]: https://www.postgresql.org/message-id/flat/20210115142926.GA19300%40alvherre.pgsql#0988173cb0cf4b8eb710a6cdaa88fcac
[3]: https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d
[4]: https://www.postgresql.org/message-id/flat/20220524190133.j6ee7zh4f5edt5je%40alap3.anarazel.de#1781408f40034c414ad6738140c118ef
[5]: https://github.com/postgres/postgres/commit/e28bb885196916b0a3d898ae4f2be0e38108d81b
[6]: https://www.postgresql.org/message-id/flat/202205251643.2py5jjpaw7wy%40alvherre.pgsql#589508d30b480b905619dcb93dac8cf8
[7]: https://www.postgresql.org/message-id/flat/20220525170821.rf6r4dnbbu4baujp%40alap3.anarazel.de#accf63164fd552b6ad68ff0a870e60c0
[8]: https://commitfest.postgresql.org/patch/5160/
[9]: https://commitfest.postgresql.org/patch/5438/
[10]: https://www.postgresql.org/message-id/flat/CAEze2WgW6pj48xJhG_YLUE1QS%2Bn9Yv0AZQwaWeb-r%2BX%3DHAxU_g%40mail.gmail.com#b3809c158de4481bb1b29894aaa63fae
[11]: https://github.com/postgres/postgres/commit/c98763bf51bf610b3ee7e209fc76c3ff9a6b3163
[12]: https://www.postgresql.org/message-id/flat/CANtu0ojHAputNCH73TEYN_RUtjLGYsEyW1aSXmsXyvwf%3D3U4qQ%40mail.gmail.com#b18fb8efab086bc22af1cb015e187cb7
[13]: https://www.postgresql.org/message-id/flat/CANtu0ojiVez054rKvwZzKNhneS2R69UXLnw8N9EdwQwqfEoFdQ%40mail.gmail.com#02b4ee45a5a5c1ccac6d168052173952
[14]: https://docs.google.com/spreadsheets/d/1UYaqpsWSfYdZdQxaqY4gVo0RW6KrT0d-U1VDNJB8lVk/edit?usp=sharing
[15]: https://gist.github.com/michail-nikolaev/b33fb0ac1f35729388c89f72db234b0f
[16]: https://discord.com/channels/1258108670710124574/1259884843165155471/1334565506149253150

