Re: WIP: parallel GiST index builds - Mailing list pgsql-hackers

From Andrey M. Borodin
Subject Re: WIP: parallel GiST index builds
Date
Msg-id CACAE478-9276-461D-970C-6A04D414BC8B@yandex-team.ru
Whole thread Raw
In response to Re: WIP: parallel GiST index builds  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: WIP: parallel GiST index builds
List pgsql-hackers

> On 21 Jul 2024, at 23:42, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>>
>> 1. Do I get it right that is_parallel argument for gistGetFakeLSN()
>> is only needed for assertion? And this assertion can be ensured just
>> by inspecting code. Is it really necessary?
>
> Yes, in the patch it's only used for an assert. But it's actually
> incorrect - just as I speculated in my initial message (in the section
> about gistGetFakeLSN), it sometimes fails to detect a page split. I
> noticed that while testing the patch adding GiST to amcheck, and I
> reported that in that thread [1]. But I apparently forgot to post an
> updated version of this patch :-(

Oops, I just though that it's a version with solved FakeLSN problem.

>
> I'll post a new version tomorrow, but in short it needs to update the
> fake LSN even if (lastlsn != currlsn) if is_parallel=true. It's a bit
> annoying this means we generate a new fake LSN on every call, and I'm
> not sure that's actually necessary. But I've been unable to come up with
> a better condition when to generate a new LSN.

Why don't we just use an atomic counter withtin shared build state?

>
> [1]
> https://www.postgresql.org/message-id/79622955-6d1a-4439-b358-ec2b6a9e7cbf%40enterprisedb.com
Yes, I'll be back in that thread soon. I'm still on vacation and it's hard to get continuous uninterrupted time here.
Youdid a great review, and I want to address all issues there wholistically. Thank you! 

>> 2. gistBuildParallelCallback() updates indtuplesSize, but it seems to be
>> not used anywhere. AFAIK it's only needed to buffered build. 3. I
>> think we need a test that reliably triggers parallel and serial
>> builds.
>>
>
> Yeah, it's possible the variable is unused. Agreed on the testing.
>
>> As far as I know, there's a well known trick to build better GiST
>> over PostGIS data: randomize input. I think parallel scan is just
>> what is needed, it will shuffle tuples enough...
>>
>
> I'm not sure I understand this comment. What do you mean by "better
> GiST" or what does that mean for this patch?

I think parallel build indexes will have faster IndexScans.


Best regards, Andrey Borodin.


pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: proposal: schema variables
Next
From: Nazir Bilal Yavuz
Date:
Subject: Re: Use read streams in CREATE DATABASE command when the strategy is wal_log