Re: WIP: parallel GiST index builds - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: WIP: parallel GiST index builds
Date
Msg-id 67bbf244-63f8-4de0-bad7-562b5c979fbd@enterprisedb.com
Whole thread Raw
In response to Re: WIP: parallel GiST index builds  ("Andrey M. Borodin" <x4mmm@yandex-team.ru>)
Responses Re: WIP: parallel GiST index builds
List pgsql-hackers
On 7/22/24 11:00, Andrey M. Borodin wrote:
> 
> 
>> On 21 Jul 2024, at 23:42, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>>>
>>> 1. Do I get it right that is_parallel argument for gistGetFakeLSN()
>>> is only needed for assertion? And this assertion can be ensured just
>>> by inspecting code. Is it really necessary?
>>
>> Yes, in the patch it's only used for an assert. But it's actually
>> incorrect - just as I speculated in my initial message (in the section
>> about gistGetFakeLSN), it sometimes fails to detect a page split. I
>> noticed that while testing the patch adding GiST to amcheck, and I
>> reported that in that thread [1]. But I apparently forgot to post an
>> updated version of this patch :-(
> 
> Oops, I just though that it's a version with solved FakeLSN problem.
> 
>>
>> I'll post a new version tomorrow, but in short it needs to update the
>> fake LSN even if (lastlsn != currlsn) if is_parallel=true. It's a bit
>> annoying this means we generate a new fake LSN on every call, and I'm
>> not sure that's actually necessary. But I've been unable to come up with
>> a better condition when to generate a new LSN.
> 
> Why don't we just use an atomic counter withtin shared build state? 
> 

I don't understand how would that solve the problem, can you elaborate?
Which of the values are you suggesting should be replaced with the
shared counter? lastlsn?

>>
>> [1]
>> https://www.postgresql.org/message-id/79622955-6d1a-4439-b358-ec2b6a9e7cbf%40enterprisedb.com
> Yes, I'll be back in that thread soon. I'm still on vacation and it's hard to get continuous uninterrupted time here.
Youdid a great review, and I want to address all issues there wholistically. Thank you!
 
> 
>>> 2. gistBuildParallelCallback() updates indtuplesSize, but it seems to be
>>> not used anywhere. AFAIK it's only needed to buffered build. 3. I
>>> think we need a test that reliably triggers parallel and serial
>>> builds.
>>>
>>
>> Yeah, it's possible the variable is unused. Agreed on the testing.
>>
>>> As far as I know, there's a well known trick to build better GiST
>>> over PostGIS data: randomize input. I think parallel scan is just
>>> what is needed, it will shuffle tuples enough...
>>>
>>
>> I'm not sure I understand this comment. What do you mean by "better
>> GiST" or what does that mean for this patch?
> 
> I think parallel build indexes will have faster IndexScans.
> 

That's interesting. I haven't thought about measuring stuff like that
(and it hasn't occurred to me it might have this benefit, or why would
that be the case).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Slow catchup of 2PC (twophase) transactions on replica in LR
Next
From: Tomas Vondra
Date:
Subject: Re: Make reorder buffer max_changes_in_memory adjustable?