Thread: WIP: parallel GiST index builds

WIP: parallel GiST index builds

From
Tomas Vondra
Date:
Hi,

After looking into parallel builds for BRIN and GIN indexes, I was
wondering if there's a way to do parallel builds for GiST too. I knew
next to nothing about how GiST works, but I gave it a shot and here's
what I have - the attached patch allows parallel GiST builds for the
"unsorted" case (i.e. when the opclass does not include sortsupport),
and does not support buffered builds.


unsorted builds only
--------------------

Addressing only the unsorted case may seem a bit weird, but I did it
this way for two reasons - parallel sort is a solved problem, and adding
this to the patch seems quite straightforward. It's what btree does, for
example. But I also was not very sure how common this is - we do have
sort for points, but I have no idea if the PostGIS indexes define
sorting etc. My guess was "no" but I've been told that's no longer true,
so I guess sorted builds are more widely applicable than I thought.

In any case, I'm not in a rush to parallelize sorted builds. It can be
added later, as an improvement, IMHO. In fact, it's a well isolated part
of the patch, which might make it a good choice for someone looking for
an idea for their first patch ...


buffered builds
---------------

The lack of support for buffered builds is a very different thing. The
basic idea is that we don't push the index entries all the way to the
leaf pages right away, but accumulate them in buffers half-way through.
This combines writes and reduces random I/O, which is nice.

Unfortunately, the way it's implemented does not work with parallel
builds at all - all the state is in private memory, and it assumes the
worker is the only possible backend that can split the page (at which
point the buffers need to be split too, etc.). But for parallel builds
this is obviously not true.

I'm not saying parallel builds can't do similar buffering, but it
requires moving the buffers into shared memory, and introducing locking
to coordinate accesses to the buffers. (Or perhaps it might be enough to
only "notify" the workers about page splits, with buffers still in
private memory?). Anyway, it seems far too complicated for v1.

In fact, I'm not sure the buffering is entirely necessary - maybe the
increase in amount of RAM makes this less of an issue? If the index can
fit into shared buffers (or at least page cache), maybe the amount of
extra I/O is not that bad? I'm sure there may be cases really affected
by this, but maybe it's OK to tell people to disable parallel builds in
those cases?


gistGetFakeLSN
--------------

One more thing - GiST disables WAL-logging during the build, and only
logs it once at the end. For serial builds this is fine, because there
are no concurrent splits, and so we don't need to rely on page LSNs to
detect these cases (in fact, it uses a bogus value).

But for parallel builds this would not work - we need page LSNs that
actually change, otherwise we'd miss page splits, and the index build
would either fail or produce a broken index. But the existing is_build
flag affects both things, so I had to introduce a new "is_parallel" flag
which only affects the page LSN part, using the gistGetFakeLSN()
function, previously used only for unlogged indexes.

This means we'll produce WAL during the index build (because
gistGetFakeLSN() writes a trivial message into WAL). Compared to the
serial builds this produces maybe 25-75% more WAL, but it's an order of
magnitude less than with "full" WAL logging (is_build=false).

For example, serial build of 5GB index needs ~5GB of WAL. A parallel
build may need ~7GB, while a parallel build with "full" logging would
use 50GB. I think this is a reasonable trade off.

There's one "strange" thing, though - the amount of WAL decreases with
the number of parallel workers. Consider for example an index on a
numeric field, where the index is ~9GB, but the amount of WAL changes
like this (0 workers means serial builds):

  parallel workers      0      1      3      5      7
          WAL (GB)    5.7    9.2    7.6    7.0    6.8

The explanation for this is fairly simple (AFAIK) - gistGetFakeLSN
determines if it needs to actually assign a new LSN (and write stuff to
WAL) by comparing the last LSN assigned (in a given worker) to the
current insert LSN. But the current insert LSN might have been updated
by some other worker, in which case we simply use that. Which means that
multiple workers may use the same fake LSN, and the likelihood increases
with the number of workers - and this is consistent with the observed
behavior of the WAL decreasing as the number of workers increases
(because more workers use the same LSN).

I'm not entirely sure if this is OK or a problem. I was worried two
workers might end up using the same LSN for the same page, leading to
other workers not noticing the split. But after a week of pretty
intensive stress testing, I haven't seen a single such failure ...

If this turns out to be a problem, the fix is IMHO quite simple - it
should be enough to force gistGetFakeLSN() to produce a new fake LSN
every time when is_parallel=true.
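
To illustrate, here's a rough sketch of what that fix might look like (the
is_parallel argument comes from the patch; the body is a simplified version
of the permanent-relation branch of gistGetFakeLSN, with the temp/unlogged
branches omitted):

    XLogRecPtr
    gistGetFakeLSN(Relation rel, bool is_parallel)
    {
        static XLogRecPtr lastlsn = InvalidXLogRecPtr;
        XLogRecPtr  currlsn = GetXLogInsertRecPtr();

        /*
         * For parallel builds always emit the dummy WAL record; otherwise
         * only when the insert LSN hasn't advanced since our last call.
         * Every call then returns a new, distinct LSN - two workers must
         * never stamp pages with the same value.
         */
        if (is_parallel ||
            (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn))
            currlsn = gistXLogAssignLSN();

        lastlsn = currlsn;
        return currlsn;
    }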


performance
-----------

Obviously, the primary goal of the patch is to speed up the builds, so
does it actually do that? For indexes of different sizes I got these
timings (in seconds):

   scale      type           0         1        3        5        7
  ------------------------------------------------------------------
   small      inet          13         7        4        4        2
              numeric      239       122       67       46       36
              oid           15         8        5        3        2
              text          71        35       19       13       10
   medium     inet         207       111       59       42       32
              numeric     3409      1714      885      618      490
              oid          214       114       60       43       33
              text         940       479      247      180      134
   large      inet        2167      1459      865      632      468
              numeric    38125     20256    10454     7487     5846
              oid         2647      1490      808      594      475
              text       10987      6298     3376     2462     1961

Here small is ~100-200MB index, medium is 1-2GB and large 10-20GB index,
depending on the data type.

The raw duration is not particularly easy to interpret, so let's look at
the "actual speedup" which is calculated as

   (serial duration) / (parallel duration)

and the table looks like this:

   scale        type         1          3          5          7
  --------------------------------------------------------------
   small        inet       1.9        3.3        3.3        6.5
                numeric    2.0        3.6        5.2        6.6
                oid        1.9        3.0        5.0        7.5
                text       2.0        3.7        5.5        7.1
   medium       inet       1.9        3.5        4.9        6.5
                numeric    2.0        3.9        5.5        7.0
                oid        1.9        3.6        5.0        6.5
                text       2.0        3.8        5.2        7.0
   large        inet       1.5        2.5        3.4        4.6
                numeric    1.9        3.6        5.1        6.5
                oid        1.8        3.3        4.5        5.6
                text       1.7        3.3        4.5        5.6

Ideally (if the build scaled linearly with the number of workers), we'd
get the number of workers + 1 (because the leader participates too).
Obviously, it's not that great - for example for text with 3 workers we
get 3.3 instead of 4.0, and 5.6 vs. 8 with 7 workers.

But I think those numbers are actually pretty good - I'd definitely not
complain if my index builds got 5x faster.

But those are synthetic tests on random data, using the btree_gist
opclasses. It'd be interesting if people could do their own testing on
real-world data sets.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WIP: parallel GiST index builds

From
Tomas Vondra
Date:
Hi,

I've done a number of experiments with the GiST parallel builds, both
with the sorted and unsorted cases, so let me share some of the results
and conclusions from that.

In the first post I did some benchmarks using btree_gist, but that
seemed not very realistic - there certainly are much more widely used
GiST indexes in the GIS world. So this time I used OpenStreetMap, loaded
using osm2pgsql, with two dataset sizes:

- small - "north america" (121GB without indexes)
- large - "planet" (688GB without indexes)

And then I created indexes using either gist_geometry_ops_2d (with sort)
or gist_geometry_ops_nd (no sorting).


On 6/7/24 19:41, Tomas Vondra wrote:
> Hi,
> 
> After looking into parallel builds for BRIN and GIN indexes, I was
> wondering if there's a way to do parallel builds for GiST too. I knew
> next to nothing about how GiST works, but I gave it a shot and here's
> what I have - the attached patch allows parallel GiST builds for the
> "unsorted" case (i.e. when the opclass does not include sortsupport),
> and does not support buffered builds.
> 
> 
> unsorted builds only
> --------------------
> 
> Addressing only the unsorted case may seem a bit weird, but I did it
> this way for two reasons - parallel sort is a solved problem, and adding
> this to the patch seems quite straightforward. It's what btree does, for
> example. But I also was not very sure how common this is - we do have
> sort for points, but I have no idea if the PostGIS indexes define
> sorting etc. My guess was "no" but I've been told that's no longer true,
> so I guess sorted builds are more widely applicable than I thought.


For sorted builds, I made the claim that parallelizing sorted builds is
a "solved problem" because we can use a parallel tuplesort. I was thinking
that maybe it'd be better to do that in the initial patch, and only then
introduce the more complex stuff in the unsorted case, so I gave this a
try, and it turned out to be rather pointless.

Yes, parallel tuplesort does improve the duration, but it's not a very
significant improvement - maybe 10% or so. Most of the build time is
spent in gist_indexsortbuild(), so this is the part that would need to
be parallelized for any substantial improvement. Only then is it useful
to improve the tuplesort, I think.

And parallelizing gist_indexsortbuild() is not trivial - most of the
time is spent in gist_indexsortbuild_levelstate_flush() / gistSplit(),
so ISTM a successful parallel implementation would need to divide this
work between multiple workers. I don't have a clear idea how, though.

I do have a PoC/WIP patch doing the parallel tuplesort in my github
branch at [1] (and then also some ugly experiments on top of that), but
I'm not going to attach it here because of the reasons I just explained.
It'd be just a pointless distraction.

> In any case, I'm not in a rush to parallelize sorted builds. It can be
> added later, as an improvement, IMHO. In fact, it's a well isolated part
> of the patch, which might make it a good choice for someone looking for
> an idea for their first patch ...
> 

I still think this assessment is correct - it's fine to not parallelize
sorted builds. It can be improved in the future, or even not at all.

> 
> buffered builds
> ---------------
> 
> The lack of support for buffered builds is a very different thing. The
> basic idea is that we don't push the index entries all the way to the
> leaf pages right away, but accumulate them in buffers half-way through.
> This combines writes and reduces random I/O, which is nice.
> 
> Unfortunately, the way it's implemented does not work with parallel
> builds at all - all the state is in private memory, and it assumes the
> worker is the only possible backend that can split the page (at which
> point the buffers need to be split too, etc.). But for parallel builds
> this is obviously not true.
> 
> I'm not saying parallel builds can't do similar buffering, but it
> requires moving the buffers into shared memory, and introducing locking
> to coordinate accesses to the buffers. (Or perhaps it might be enough to
> only "notify" the workers about page splits, with buffers still in
> private memory?). Anyway, it seems far too complicated for v1.
> 
> In fact, I'm not sure the buffering is entirely necessary - maybe the
> increase in amount of RAM makes this less of an issue? If the index can
> fit into shared buffers (or at least page cache), maybe the amount of
> extra I/O is not that bad? I'm sure there may be cases really affected
> by this, but maybe it's OK to tell people to disable parallel builds in
> those cases?
> 

For unsorted builds, here's the results from one of the machines for
duration of CREATE INDEX with the requested number of workers (0 means
serial build) for different tables in the OSM databases:

  db         type   size (MB)  |     0      1      2      3      4
  -----------------------------|----------------------------------
  small      line        4889  |   811    429    294    223    186
            point        2625  |   485    262    179    141    125
          polygon        7644  |  1230    623    418    318    261
            roads         273  |    40     22     16     14     12
  -----------------------------|----------------------------------
  large      line       20592  |  3916   2157   1479   1137    948
            point       13080  |  2636   1442    981    770    667
          polygon       50598  | 10990   5648   3860   2991   2504
            roads        1322  |   228    123     85     67     56

The table also shows the size of each index. If we calculate the speedup
compared to the serial build, we get this:

  db         type  |    1        2        3        4
  -----------------|--------------------------------
  small      line  |  1.9      2.8      3.6      4.4
            point  |  1.9      2.7      3.4      3.9
          polygon  |  2.0      2.9      3.9      4.7
            roads  |  1.8      2.5      2.9      3.3
  -----------------|--------------------------------
  large      line  |  1.8      2.6      3.4      4.1
            point  |  1.8      2.7      3.4      4.0
          polygon  |  1.9      2.8      3.7      4.4
            roads  |  1.9      2.7      3.4      4.1

Remember, the leader participates in the build, so K workers means K+1
processes are doing the work. And the speedup is pretty close to the
ideal speedup.

There's the question about buffering, though - as mentioned in the first
patch, the parallel builds do not support buffering, so the question is
how bad the impact of that is. Clearly, the duration improves a lot, so
that's good, but maybe it did write out far more buffers and the NVMe
drive handled it well?

So I used pg_stat_statements to track the number of buffer writes
(shared_blks_written) for the CREATE INDEX, and for the large data set
it looks like this (this is in MBs written):

             size |       0        1        2        3        4
  ----------------|--------------------------------------------
     line   20592 |   43577    47580    49574    50388    50734
    point   13080 |   23331    25721    26399    26745    26889
  polygon   50598 |  113108   125095   129599   130170   131249
    roads    1322 |    1322     1310     1305     1300     1295

The serial builds (0 workers) are buffered, but the buffering only
applies to indexes that exceed effective_cache_size (4GB). Which means
the "roads" index is too small to activate buffering, and there should
be very little difference - which is the case (but the index also fits
into shared buffers in this case).
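
For reference, in serial builds the switch to buffering in the "auto" mode
happens roughly like this (a simplified sketch of the check in the build
callback in gistbuild.c, with some details omitted):

    if (buildstate->buildMode == GIST_BUFFERING_AUTO &&
        buildstate->indtuples % BUFFERING_MODE_SWITCH_CHECK_STEP == 0 &&
        effective_cache_size < smgrnblocks(RelationGetSmgr(index),
                                           MAIN_FORKNUM))
    {
        /* the index no longer fits into the effective cache */
        gistInitBuffering(buildstate);
    }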

The other indexes do activate buffering, so the question is how many
more buffers get written out compared to serial builds (with buffering).
And the comparison looks like this:

                   1       2       3       4
  ------------------------------------------
     line       109%    114%    116%    116%
    point       110%    113%    115%    115%
  polygon       111%    115%    115%    116%
    roads        99%     99%     98%     98%

So it writes about 15-20% more buffers during the index build, which is
not that much IMHO. I was wondering if this might change with smaller
shared buffers, so I tried building indexes on the smaller data set with
128MB shared buffers, but the difference remained at ~15-20%.

My conclusion from this is that it's OK to have parallel builds without
buffering, and then maybe improve that later. The thing I'm not sure
about is how this should interact with the "buffering" option. Right now
we just ignore that entirely if we decide to do a parallel build. But
maybe it'd be better to disable parallel builds when the user specifies
"buffering=on" (and only allow parallel builds with off/auto)?

I did check how parallelism affects the amount of WAL produced, but
that's pretty much exactly how I described that in the initial message,
including the "strange" decrease with more workers due to reusing the
fake LSN etc.


regards


[1] https://github.com/tvondra/postgres/tree/parallel-gist-20240625

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WIP: parallel GiST index builds

From
"Andrey M. Borodin"
Date:
Hi Tomas!

> On 7 Jun 2024, at 20:41, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> After looking into parallel builds for BRIN and GIN indexes, I was
> wondering if there's a way to do parallel builds for GiST too. I knew
> next to nothing about how GiST works, but I gave it a shot and here's
> what I have - the attached patch allows parallel GiST builds for the
> "unsorted" case (i.e. when the opclass does not include sortsupport),
> and does not support buffered builds.

I think this totally makes sense. I've taken a look into tuple partitioning (for sorted builds) in your Github and I see
that it's a complicated feature. So, probably, we can do it later.
I'm trying to review the patch as it is now. Currently I have some questions about the code.

1. Do I get it right that the is_parallel argument for gistGetFakeLSN() is only needed for an assertion? And this assertion
can be ensured just by inspecting the code. Is it really necessary?
2. gistBuildParallelCallback() updates indtuplesSize, but it seems to be unused. AFAIK it's only needed for
buffered builds.
3. I think we need a test that reliably triggers parallel and serial builds.

As far as I know, there's a well-known trick to build a better GiST over PostGIS data: randomize the input. I think a parallel
scan is just what is needed, it will shuffle the tuples enough...

Thanks for working on this!


Best regards, Andrey Borodin.


Re: WIP: parallel GiST index builds

From
Tomas Vondra
Date:
On 7/21/24 21:31, Andrey M. Borodin wrote:
> Hi Tomas!
> 
>> On 7 Jun 2024, at 20:41, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>> 
>> After looking into parallel builds for BRIN and GIN indexes, I was 
>> wondering if there's a way to do parallel builds for GiST too. I
>> knew next to nothing about how GiST works, but I gave it a shot and
>> here's what I have - the attached patch allows parallel GiST builds
>> for the "unsorted" case (i.e. when the opclass does not include
>> sortsupport), and does not support buffered builds.
> 
> I think this totally makes sense. I've took a look into tuples
> partitioning (for sorted build) in your Github and I see that it's
> complicated feature. So, probably, we can do it later. I'm trying to
> review the patch as it is now. Currently I have some questions about
> code.

OK. I'm not even sure partitioning is the right approach for sorted
builds. Or how to do it, exactly.

> 
> 1. Do I get it right that is_parallel argument for gistGetFakeLSN()
> is only needed for assertion? And this assertion can be ensured just
> by inspecting code. Is it really necessary?

Yes, in the patch it's only used for an assert. But it's actually
incorrect - just as I speculated in my initial message (in the section
about gistGetFakeLSN), it sometimes fails to detect a page split. I
noticed that while testing the patch adding GiST to amcheck, and I
reported that in that thread [1]. But I apparently forgot to post an
updated version of this patch :-(

I'll post a new version tomorrow, but in short it needs to update the
fake LSN even if (lastlsn != currlsn) when is_parallel=true. It's a bit
annoying that this means we generate a new fake LSN on every call, and I'm
not sure that's actually necessary. But I've been unable to come up with
a better condition for when to generate a new LSN.

[1]
https://www.postgresql.org/message-id/79622955-6d1a-4439-b358-ec2b6a9e7cbf%40enterprisedb.com

> 2. gistBuildParallelCallback() updates indtuplesSize, but it seems to be
> not used anywhere. AFAIK it's only needed to buffered build. 3. I
> think we need a test that reliably triggers parallel and serial
> builds.
> 

Yeah, it's possible the variable is unused. Agreed on the testing.

> As far as I know, there's a well known trick to build better GiST
> over PostGIS data: randomize input. I think parallel scan is just
> what is needed, it will shuffle tuples enough...
> 

I'm not sure I understand this comment. What do you mean by "better
GiST" or what does that mean for this patch?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: parallel GiST index builds

From
"Andrey M. Borodin"
Date:

> On 21 Jul 2024, at 23:42, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>>
>> 1. Do I get it right that is_parallel argument for gistGetFakeLSN()
>> is only needed for assertion? And this assertion can be ensured just
>> by inspecting code. Is it really necessary?
>
> Yes, in the patch it's only used for an assert. But it's actually
> incorrect - just as I speculated in my initial message (in the section
> about gistGetFakeLSN), it sometimes fails to detect a page split. I
> noticed that while testing the patch adding GiST to amcheck, and I
> reported that in that thread [1]. But I apparently forgot to post an
> updated version of this patch :-(

Oops, I just thought it was a version with the FakeLSN problem solved.

>
> I'll post a new version tomorrow, but in short it needs to update the
> fake LSN even if (lastlsn != currlsn) if is_parallel=true. It's a bit
> annoying this means we generate a new fake LSN on every call, and I'm
> not sure that's actually necessary. But I've been unable to come up with
> a better condition when to generate a new LSN.

Why don't we just use an atomic counter within the shared build state?

>
> [1]
> https://www.postgresql.org/message-id/79622955-6d1a-4439-b358-ec2b6a9e7cbf%40enterprisedb.com
Yes, I'll be back in that thread soon. I'm still on vacation and it's hard to get continuous uninterrupted time here.
You did a great review, and I want to address all issues there holistically. Thank you!

>> 2. gistBuildParallelCallback() updates indtuplesSize, but it seems to be
>> not used anywhere. AFAIK it's only needed to buffered build. 3. I
>> think we need a test that reliably triggers parallel and serial
>> builds.
>>
>
> Yeah, it's possible the variable is unused. Agreed on the testing.
>
>> As far as I know, there's a well known trick to build better GiST
>> over PostGIS data: randomize input. I think parallel scan is just
>> what is needed, it will shuffle tuples enough...
>>
>
> I'm not sure I understand this comment. What do you mean by "better
> GiST" or what does that mean for this patch?

I think parallel-built indexes will have faster IndexScans.


Best regards, Andrey Borodin.


Re: WIP: parallel GiST index builds

From
Tomas Vondra
Date:
On 7/22/24 11:00, Andrey M. Borodin wrote:
> 
> 
>> On 21 Jul 2024, at 23:42, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>>>
>>> 1. Do I get it right that is_parallel argument for gistGetFakeLSN()
>>> is only needed for assertion? And this assertion can be ensured just
>>> by inspecting code. Is it really necessary?
>>
>> Yes, in the patch it's only used for an assert. But it's actually
>> incorrect - just as I speculated in my initial message (in the section
>> about gistGetFakeLSN), it sometimes fails to detect a page split. I
>> noticed that while testing the patch adding GiST to amcheck, and I
>> reported that in that thread [1]. But I apparently forgot to post an
>> updated version of this patch :-(
> 
> Oops, I just thought it was a version with the FakeLSN problem solved.
> 
>>
>> I'll post a new version tomorrow, but in short it needs to update the
>> fake LSN even if (lastlsn != currlsn) if is_parallel=true. It's a bit
>> annoying this means we generate a new fake LSN on every call, and I'm
>> not sure that's actually necessary. But I've been unable to come up with
>> a better condition when to generate a new LSN.
> 
> Why don't we just use an atomic counter withtin shared build state? 
> 

I don't understand how that would solve the problem, can you elaborate?
Which of the values are you suggesting should be replaced with the
shared counter? lastlsn?

>>
>> [1]
>> https://www.postgresql.org/message-id/79622955-6d1a-4439-b358-ec2b6a9e7cbf%40enterprisedb.com
> Yes, I'll be back in that thread soon. I'm still on vacation and it's hard to get continuous uninterrupted time here.
You did a great review, and I want to address all issues there holistically. Thank you!
 
> 
>>> 2. gistBuildParallelCallback() updates indtuplesSize, but it seems to be
>>> not used anywhere. AFAIK it's only needed to buffered build. 3. I
>>> think we need a test that reliably triggers parallel and serial
>>> builds.
>>>
>>
>> Yeah, it's possible the variable is unused. Agreed on the testing.
>>
>>> As far as I know, there's a well known trick to build better GiST
>>> over PostGIS data: randomize input. I think parallel scan is just
>>> what is needed, it will shuffle tuples enough...
>>>
>>
>> I'm not sure I understand this comment. What do you mean by "better
>> GiST" or what does that mean for this patch?
> 
> I think parallel build indexes will have faster IndexScans.
> 

That's interesting. I haven't thought about measuring stuff like that
(and it hasn't occurred to me it might have this benefit, or why that
would be the case).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: parallel GiST index builds

From
"Andrey M. Borodin"
Date:

> On 22 Jul 2024, at 12:26, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> I don't understand how would that solve the problem, can you elaborate?
> Which of the values are you suggesting should be replaced with the
> shared counter? lastlsn?

I think during the build we should consider the index unlogged and always use GetFakeLSNForUnloggedRel() or something similar.
Anyway we will call log_newpage_range(RelationGetNumberOfBlocks(index)) at the end.


Best regards, Andrey Borodin.


Re: WIP: parallel GiST index builds

From
Tomas Vondra
Date:

On 7/22/24 13:08, Andrey M. Borodin wrote:
> 
> 
>> On 22 Jul 2024, at 12:26, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>> 
>> I don't understand how would that solve the problem, can you
>> elaborate? Which of the values are you suggesting should be
>> replaced with the shared counter? lastlsn?
> 
> I think during build we should consider index unlogged and always use
> GetFakeLSNForUnloggedRel() or something similar. Anyway we will
> call log_newpage_range(RelationGetNumberOfBlocks(index)) at the end.
> 

But that doesn't update the page LSN, which GiST uses to detect
concurrent splits, no?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: parallel GiST index builds

From
"Andrey M. Borodin"
Date:

> On 22 Jul 2024, at 14:53, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 7/22/24 13:08, Andrey M. Borodin wrote:
>>
>>
>>> On 22 Jul 2024, at 12:26, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> I don't understand how would that solve the problem, can you
>>> elaborate? Which of the values are you suggesting should be
>>> replaced with the shared counter? lastlsn?
>>
>> I think during build we should consider index unlogged and always use
>> GetFakeLSNForUnloggedRel() or something similar. Anyway we will
>> call log_newpage_range(RelationGetNumberOfBlocks(index)) at the end.
>>
>
> But that doesn't update the page LSN, which GiST uses to detect
> concurrent splits, no?

While inserting tuples we need an NSN on the page. For the NSN we can use just a counter, generated by gistGetFakeLSN() which in
turn will call GetFakeLSNForUnloggedRel(). Or any other shared counter.
After inserting the tuples we call log_newpage_range() to actually WAL-log the pages.
All NSNs used during the build must be less than the LSNs used to insert new tuples after the index is built.
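
For context, this is roughly what the end of a build does already (a sketch
based on gistbuild.c):

    /* WAL-log all pages of the newly built index in one go */
    if (RelationNeedsWAL(index))
        log_newpage_range(index, MAIN_FORKNUM,
                          0, RelationGetNumberOfBlocks(index),
                          true);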


Best regards, Andrey Borodin.


Re: WIP: parallel GiST index builds

From
Andreas Karlsson
Date:
On 7/22/24 2:08 PM, Andrey M. Borodin wrote:
> During inserting tuples we need NSN on page. For NSN we can use just a counter, generated by gistGetFakeLSN() which
in turn will call GetFakeLSNForUnloggedRel(). Or any other shared counter.
 
> After inserting tuples we call log_newpage_range() to actually WAL-log pages.
> All NSNs used during build must be less than LSNs used to insert new tuples after index is built.

I feel the tricky part about doing that is that we need to make sure the
fake LSNs are all less than the current real LSN when the index build
completes. While that normally should be the case, we will have an
almost never exercised code path for when the fake LSN becomes bigger
than the real LSN, which may contain bugs. Is that really worth
optimizing?

But if we are going to use a fake LSN: since the index being built is not
visible to any scans, we do not have to use GetFakeLSNForUnloggedRel()
but could use our own counter in shared memory in the GISTShared struct
for this specific index, starting at FirstNormalUnloggedLSN. This
would give us slightly less contention, plus decrease the risk (for good
and bad) of the fake LSN being larger than the real LSN.
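
A rough sketch of what that could look like (GISTShared comes from the
patch; the field name and the usage below are only illustrative):

    /* per-build counter in the shared state */
    pg_atomic_uint64 fakelsn;       /* next fake LSN / NSN for this build */

    /* the leader would initialize it once ... */
    pg_atomic_init_u64(&gistshared->fakelsn, FirstNormalUnloggedLSN);

    /* ... and each participant obtains a distinct value with */
    XLogRecPtr lsn = pg_atomic_fetch_add_u64(&gistshared->fakelsn, 1);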

Andreas



Re: WIP: parallel GiST index builds

From
"Andrey M. Borodin"
Date:

> On 26 Jul 2024, at 14:30, Andreas Karlsson <andreas@proxel.se> wrote:
>
> I feel the tricky part about doing that is that we need to make sure the fake LSNs are all less than the current real
LSN when the index build completes and while that normally should be the case we will have a almost never exercised code
path for when the fake LSN becomes bigger than the real LSN which may contain bugs. Is that really worth it to optimize.
>
> But if we are going to use fake LSN: since the index being built is not visible to any scans we do not have to use
GetFakeLSNForUnloggedRel() but could use an own counter in shared memory in the GISTShared struct for this specific
index which starts at FirstNormalUnloggedLSN. This would give us slightly less contention plus decrease the risk (for
good and bad) of the fake LSN being larger than the real LSN.

+1 for atomic counter in GISTShared.
BTW we can just reset LSNs to GistBuildLSN just before doing log_newpage_range().


Best regards, Andrey Borodin.


Re: WIP: parallel GiST index builds

From
Tomas Vondra
Date:

On 7/26/24 14:13, Andrey M. Borodin wrote:
> 
> 
>> On 26 Jul 2024, at 14:30, Andreas Karlsson <andreas@proxel.se> wrote:
>>
>> I feel the tricky part about doing that is that we need to make sure the fake LSNs are all less than the current
real LSN when the index build completes and while that normally should be the case we will have a almost never exercised
code path for when the fake LSN becomes bigger than the real LSN which may contain bugs. Is that really worth it to
optimize.
>>
>> But if we are going to use fake LSN: since the index being built is not visible to any scans we do not have to use
GetFakeLSNForUnloggedRel() but could use an own counter in shared memory in the GISTShared struct for this specific
index which starts at FirstNormalUnloggedLSN. This would give us slightly less contention plus decrease the risk (for
good and bad) of the fake LSN being larger than the real LSN.
 
> 
> +1 for atomic counter in GISTShared.

I tried implementing this, see the attached 0002 patch that replaces the
fake LSN with an atomic counter in shared memory. It seems to work (more
testing needed), but I can't say I'm very happy with the code :-(

The way it passes the shared counter to places that actually need it is
pretty ugly. The thing is - the counter needs to be in shared memory,
but places like gistplacetopage() have no idea/need of that. I chose to
simply pass a pg_atomic_uint64 pointer, but that's ... not pretty. Is
there a better way to do this?

I thought maybe we could simply increment the counter before each call
and pass the LSN value - 64bits should be enough, not sure about the
overhead. But gistplacetopage() also uses the LSN twice, and I'm not
sure it'd be legal to use the same value twice.

Any better ideas?


> BTW we can just reset LSNs to GistBuildLSN just before doing log_newpage_range().
> 

Why would the reset be necessary? Doesn't log_newpage_range() set the page
LSN to the current insert LSN? So why reset that?

I'm not sure about the discussion about NSN and the need to handle the
case when NSN / fake LSN values get ahead of LSN. Is that really a
problem? If the values generated from the counter are private to the
index build, and log_newpage_range() replaces them with current LSN, do
we still need to worry about it?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WIP: parallel GiST index builds

From
"Andrey M. Borodin"
Date:

> On 30 Jul 2024, at 14:05, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 7/26/24 14:13, Andrey M. Borodin wrote:
>>
>>
>>> On 26 Jul 2024, at 14:30, Andreas Karlsson <andreas@proxel.se> wrote:
>>>
>>> I feel the tricky part about doing that is that we need to make sure the fake LSNs are all less than the current
real LSN when the index build completes and while that normally should be the case we will have a almost never exercised
code path for when the fake LSN becomes bigger than the real LSN which may contain bugs. Is that really worth it to
optimize.
>>>
>>> But if we are going to use fake LSN: since the index being built is not visible to any scans we do not have to use
GetFakeLSNForUnloggedRel() but could use an own counter in shared memory in the GISTShared struct for this specific
index which starts at FirstNormalUnloggedLSN. This would give us slightly less contention plus decrease the risk (for
good and bad) of the fake LSN being larger than the real LSN.
>>
>> +1 for atomic counter in GISTShared.
>
> I tried implementing this, see the attached 0002 patch that replaces the
> fake LSN with an atomic counter in shared memory. It seems to work (more
> testing needed), but I can't say I'm very happy with the code :-(

I agree. Passing this pointer everywhere seems ugly.

>
> The way it passes the shared counter to places that actually need it is
> pretty ugly. The thing is - the counter needs to be in shared memory,
> but places like gistplacetopage() have no idea/need of that. I chose to
> simply pass a pg_atomic_uint64 pointer, but that's ... not pretty. Is
> there's a better way to do this?
>
> I thought maybe we could simply increment the counter before each call
> and pass the LSN value - 64bits should be enough, not sure about the
> overhead. But gistplacetopage() also uses the LSN twice, and I'm not
> sure it'd be legal to use the same value twice.
>
> Any better ideas?

How about a global pointer to the fake LSN?
Just set it to the correct pointer when in a parallel build, or NULL otherwise.

>> BTW we can just reset LSNs to GistBuildLSN just before doing log_newpage_range().
>>
>
> Why would the reset be necessary? Doesn't log_newpage_range() set page
> LSN to current insert LSN? So why would reset that?
>
> I'm not sure about the discussion about NSN and the need to handle the
> case when NSN / fake LSN values get ahead of LSN. Is that really a
> problem? If the values generated from the counter are private to the
> index build, and log_newpage_range() replaces them with current LSN, do
> we still need to worry about it?

Stamping pages with a new real LSN will do the trick. I didn't know that log_newpage_range() already does so.

How do we synchronize the shared fake LSN with the global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if necessary?
Perhaps consider using GetFakeLSNForUnloggedRel() instead of the shared counter? As long as we do not care about
FakeLSN > RealLSN.


Best regards, Andrey Borodin.


Re: WIP: parallel GiST index builds

From
Tomas Vondra
Date:

On 7/30/24 11:39, Andrey M. Borodin wrote:
> 
> 
>> On 30 Jul 2024, at 14:05, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>>
>>
>> On 7/26/24 14:13, Andrey M. Borodin wrote:
>>>
>>>
>>>> On 26 Jul 2024, at 14:30, Andreas Karlsson <andreas@proxel.se> wrote:
>>>>
>>>> I feel the tricky part about doing that is that we need to make sure the fake LSNs are all less than the current
real LSN when the index build completes and while that normally should be the case we will have a almost never exercised
code path for when the fake LSN becomes bigger than the real LSN which may contain bugs. Is that really worth it to
optimize.
>>>>
>>>> But if we are going to use fake LSN: since the index being built is not visible to any scans we do not have to use
GetFakeLSNForUnloggedRel() but could use an own counter in shared memory in the GISTShared struct for this specific
index which starts at FirstNormalUnloggedLSN. This would give us slightly less contention plus decrease the risk (for
good and bad) of the fake LSN being larger than the real LSN.
 
>>>
>>> +1 for atomic counter in GISTShared.
>>
>> I tried implementing this, see the attached 0002 patch that replaces the
>> fake LSN with an atomic counter in shared memory. It seems to work (more
>> testing needed), but I can't say I'm very happy with the code :-(
> 
> I agree. Passing this pointer everywhere seems ugly.
> 

Yeah.

>>
>> The way it passes the shared counter to places that actually need it is
>> pretty ugly. The thing is - the counter needs to be in shared memory,
>> but places like gistplacetopage() have no idea/need of that. I chose to
>> simply pass a pg_atomic_uint64 pointer, but that's ... not pretty. Is
>> there's a better way to do this?
>>
>> I thought maybe we could simply increment the counter before each call
>> and pass the LSN value - 64bits should be enough, not sure about the
>> overhead. But gistplacetopage() also uses the LSN twice, and I'm not
>> sure it'd be legal to use the same value twice.
>>
>> Any better ideas?
> 
> How about global pointer to fake LSN?
> Just set it to correct pointer when in parallel build, or NULL either way.
> 

I'm not sure adding a global variable is pretty either. What if there's
some error, for example? Will it get reset to NULL?

>>> BTW we can just reset LSNs to GistBuildLSN just before doing log_newpage_range().
>>>
>>
>> Why would the reset be necessary? Doesn't log_newpage_range() set page
>> LSN to current insert LSN? So why would reset that?
>>
>> I'm not sure about the discussion about NSN and the need to handle the
>> case when NSN / fake LSN values get ahead of LSN. Is that really a
>> problem? If the values generated from the counter are private to the
>> index build, and log_newpage_range() replaces them with current LSN, do
>> we still need to worry about it?
> 
> Stamping pages with new real LSN will do the trick. I didn’t know that log_newpage_range() is already doing so.
> 

I believe it does, or at least that's what I believe this code at the
end is meant to do:

    recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);

    for (i = 0; i < nbufs; i++)
    {
        PageSetLSN(BufferGetPage(bufpack[i]), recptr);
        UnlockReleaseBuffer(bufpack[i]);
    }

Unless I misunderstood what this does.

> How do we synchronize Shared Fake LSN with global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if necessary?
> Perhaps, consider using GetFakeLSNForUnloggedRel() instead of shared counter? As long as we do not care about
FakeLSN>RealLSN.
> 

I'm confused. How is this related to unloggedLSN at all?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: parallel GiST index builds

From
"Andrey M. Borodin"
Date:

> On 30 Jul 2024, at 14:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>>
>> How do we synchronize Shared Fake LSN with global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if necessary?
>> Perhaps, consider using GetFakeLSNForUnloggedRel() instead of shared counter? As long as we do not care about
FakeLSN>RealLSN.
>>
>
> I'm confused. How is this related to unloggedLSN at all?

Parallel build should work for both logged and unlogged indexes.
If we use a fake LSN in shared memory, we have to make sure that FakeLSN < XLogCtl->unloggedLSN after the build.
Either way, we can just use XLogCtl->unloggedLSN instead of a fake LSN in shared memory.

In other words, I propose to use GetFakeLSNForUnloggedRel() instead of "pg_atomic_uint64 *fakelsn;".


Best regards, Andrey Borodin.


Re: WIP: parallel GiST index builds

From
Tomas Vondra
Date:

On 7/30/24 13:31, Andrey M. Borodin wrote:
> 
> 
>> On 30 Jul 2024, at 14:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>>>
>>> How do we synchronize Shared Fake LSN with global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if
necessary?
>>> Perhaps, consider using GetFakeLSNForUnloggedRel() instead of shared counter? As long as we do not care about
FakeLSN>RealLSN.
>>>
>>
>> I'm confused. How is this related to unloggedLSN at all?
> 
> Parallel build should work for both logged and unlogged indexes.

Agreed, no argument here.

> If we use fake LSN in shared memory, we have to make sure that FakeLSN < XLogCtl->unloggedLSN after build.
> Either way we can just use XLogCtl->unloggedLSN instead of FakeLSN in shared memory.
> 

Ah, right. For unlogged relations we won't invoke log_newpage_range(),
so we'd end up with the bogus page LSNs ...

> In other words I propose to use GetFakeLSNForUnloggedRel() instead of "pg_atomic_uint64 *fakelsn;”.
> 

Interesting idea. IIRC you suggested this earlier, but I didn't realize
it has the benefit of already using an atomic counter - which solves the
"ugliness" of my patch.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: parallel GiST index builds

From
Andreas Karlsson
Date:
On 7/30/24 1:31 PM, Andrey M. Borodin wrote:
>> On 30 Jul 2024, at 14:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>>>
>>> How do we synchronize Shared Fake LSN with global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if
necessary?
>>> Perhaps, consider using GetFakeLSNForUnloggedRel() instead of shared counter? As long as we do not care about
FakeLSN>RealLSN.
>>>
>>
>> I'm confused. How is this related to unloggedLSN at all?
> 
> Parallel build should work for both logged and unlogged indexes.
> If we use fake LSN in shared memory, we have to make sure that FakeLSN < XLogCtl->unloggedLSN after build.
> Either way we can just use XLogCtl->unloggedLSN instead of FakeLSN in shared memory.
> 
> In other words I propose to use GetFakeLSNForUnloggedRel() instead of "pg_atomic_uint64 *fakelsn;”.

Yeah,

Great point; given the ugliness of passing around the fakelsn, we might
as well just use GetFakeLSNForUnloggedRel().

Andreas



Re: WIP: parallel GiST index builds

From
Tomas Vondra
Date:
Hi,

Here's an updated patch using GetFakeLSNForUnloggedRel() instead of the
atomic counter. I think this looks much nicer and less invasive, as it
simply uses XLogCtl shared memory (instead of having to pass a new
pointer everywhere).

We still need to pass the is_parallel flag, though. I wonder if we could
get rid of that too, and just use GetFakeLSNForUnloggedRel() for both
parallel and non-parallel builds? Why wouldn't that work?

I've spent quite a bit of time testing this, but mostly for correctness.
I haven't redone the benchmarks, that's on my TODO.


regards

-- 
Tomas Vondra

Re: WIP: parallel GiST index builds

From
"Andrey M. Borodin"
Date:

> On 8 Oct 2024, at 17:05, Tomas Vondra <tomas@vondra.me> wrote:
>
> Here's an updated patch adding the queryid.

I've taken another round of looking through the patch.
Everything I see seems correct to me. It just occurred to me that we will have: buffered builds, parallel builds, sorted
builds. All 3 aim to speed things up. I really hope that we will find a way to parallelize sorted builds, because it will
be the fastest for sure.


Currently we have some instances of such code...or similar... or related code.

if (is_build)
{
  if (is_parallel)
    recptr = GetFakeLSNForUnloggedRel();
  else
    recptr = GistBuildLSN;
}
else
{
  if (RelationNeedsWAL(rel))
  {
    recptr = actuallyWriteWAL();
  }
  else
    recptr = gistGetFakeLSN(rel);
}
// Use recptr

In the previous review I was a proponent of not adding arguments to gistGetFakeLSN(). But now I see that the logic of
choosing the LSN source is sprawling across various places...
Now I do not have a strong point of view on this. Do you think something like the following would be clearer?
if (!is_build && RelationNeedsWAL(rel))
{
    recptr = actuallyWriteWAL();
}
else
    recptr = gistGetFakeLSN(rel, is_build, is_parallel);

Just as an idea.

I'm mostly looking at the GiST-specific parts of the patch, while things around entering parallel mode seem much more
complicated. But as far as I can see, large portions of this code are taken from B-tree/BRIN.


Best regards, Andrey Borodin.


Re: WIP: parallel GiST index builds

From
Tomas Vondra
Date:
On 10/31/24 19:05, Andrey M. Borodin wrote:
> 
> 
>> On 8 Oct 2024, at 17:05, Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Here's an updated patch adding the queryid.
> 
> I've took another round of looking through the patch.
> Everything I see seems correct to me. It just occurred to me that we
> will have: buffered build, parallel build, sorting build. All 3 aiming
> to speed things up. I really hope that we will find a way to parallel
> sorting build, because it will be fastest for sure.
> 

The number of different ways to build a GiST index worries me. When we
had just buffered vs. sorted builds, that was a pretty easy decision,
because buffered builds are much faster 99% of the time.

But for parallel builds it's not that clear - it can easily happen that
we use much more CPU, without speeding anything up. We just start as
many parallel workers as possible, given the maintenance_work_mem value,
and hope for the best.

Maybe with parallel buffered builds it'd be clearer ... but I'm not
sufficiently familiar with that code, and I don't have time to study
that at the moment because of other patches. Someone else would have to
take a stab at that ...

> 
> Currently we have some instances of such code...or similar... or
> related code.
> 
> if (is_build)
> {
>   if (is_parallel)
>     recptr = GetFakeLSNForUnloggedRel();
>   else
>     recptr = GistBuildLSN;
> }
> else
> {
>   if (RelationNeedsWAL(rel))
>   {
>     recptr = actuallyWriteWAL();
>   }
>   else
>     recptr = gistGetFakeLSN(rel);
> }
> // Use recptr
> 
> In previous review I was proponent of not adding arguments to 
> gistGetFakeLSN(). But now I see that now logic of choosing LSN
> source is sprawling across various places...
>
> Now I do not have a strong point of view on this. Do you think if
> something like following would be clearer?
> if (!is_build && RelationNeedsWAL(rel))
>   {
>     recptr = actuallyWriteWAL();
>   }
>   else
>     recptr = gistGetFakeLSN(rel, is_build, is_parallel);
> 
> Just as an idea.
> 
> I'm mostly looking on GiST-specific parts of the patch, while things
> around entering parallel mode seems much more complicated. But as far as
> I can see, large portions of this code are taken from B-tree\BRIN.
> 

I agree the repeated code is pretty tedious, and it's also easy to miss
one of the places when changing the logic, so I think wrapping that in
some function makes sense.
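
For example, a hypothetical helper along the lines you sketched (the name
and the actuallyWriteWAL() placeholder are illustrative only):

    static XLogRecPtr
    gistBuildGetRecPtr(Relation rel, bool is_build, bool is_parallel)
    {
        if (is_build)
            return is_parallel ? GetFakeLSNForUnloggedRel() : GistBuildLSN;

        if (RelationNeedsWAL(rel))
            return actuallyWriteWAL();

        return gistGetFakeLSN(rel);
    }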


regards

-- 
Tomas Vondra