Re: WIP: parallel GiST index builds - Mailing list pgsql-hackers
From | Kirill Reshke |
---|---|
Subject | Re: WIP: parallel GiST index builds |
Date | |
Msg-id | CALdSSPi5HLxOdopuOtCA8ENGfnZDjrpu4t7VRGpD+MJA=KnxCQ@mail.gmail.com Whole thread Raw |
In response to | Re: WIP: parallel GiST index builds (Tomas Vondra <tomas@vondra.me>) |
List | pgsql-hackers |
On Fri, 8 Nov 2024 at 19:53, Tomas Vondra <tomas@vondra.me> wrote: > > On 10/31/24 19:05, Andrey M. Borodin wrote: > > > > > >> On 8 Oct 2024, at 17:05, Tomas Vondra <tomas@vondra.me> wrote: > >> > >> Here's an updated patch adding the queryid. > > > > I've took another round of looking through the patch. > > Everything I see seems correct to me. It just occurred to me that we > > will have: buffered build, parallel build, sorting build. All 3 aiming > > to speed things up. I really hope that we will find a way to parallel > > sorting build, because it will be fastest for sure. > > > > The number of different ways to build a GiST index worries me. When we > had just buffered vs. sorted builds, that was pretty easy decision, > because buffered builds are much faster 99% of the time. > > But for parallel builds it's not that clear - it can easily happen that > we use much more CPU, without speeding anything up. We just start as > many parallel workers as possible, given the maintenance_work_mem value, > and hope for the best. > > Maybe with parallel buffered builds it's be clearer ... but I'm not > sufficiently familiar with that code, and I don't have time to study > that at the moment because of other patches. Someone else would have to > take a stab at that ... > > > > > Currently we have some instances of such code...or similar... or > > related code. > > > > if (is_build) > > { > > if (is_parallel) > > recptr = GetFakeLSNForUnloggedRel(); > > else > > recptr = GistBuildLSN; > > } > > else > > { > > if (RelationNeedsWAL(rel)) > > { > > recptr = actuallyWriteWAL(); > > } > > else > > recptr = gistGetFakeLSN(rel); > > } > > // Use recptr > > > > In previous review I was proponent of not adding arguments to > > gistGetFakeLSN(). But now I see that now logic of choosing LSN > > source is sprawling across various places... > > > > Now I do not have a strong point of view on this. Do you think if > > something like following would be clearer? > > if (!is_build && RelationNeedsWAL(rel)) > > { > > recptr = actuallyWriteWAL(); > > } > > else > > recptr = gistGetFakeLSN(rel, is_build, is_parallel); > > > > Just as an idea. > > > > I'm mostly looking on GiST-specific parts of the patch, while things > > around entering parallel mode seems much more complicated. But as far as > > I can see, large portions of this code are taken from B-tree\BRIN. > > > > I agree the repeated code is pretty tedious, and it's also easy to miss > one of the places when changing the logic, so I think wrapping that in > some function makes sense. > > > regards > > -- > Tomas Vondra Hi! Here is a rebased version of your patch. I have tested this patch on a very biased dataset. My test was: yes 2 | awk '{print $1","$1}'| head -400000000 > data.dat create table t (p point); copy t from '....data.dat'; and then build GiST index on this. I did get 3x time speedup for buffered build: create index on t using gist(p) with (buffering=on); But this was still longer than the sorted build. Should we consider speeding up sorted builds with parallel sorting? I also checked GiST indexes with amcheck from [0]. It did not complain. [0] https://www.postgresql.org/message-id/CALdSSPhs4sCd5S_euKxTufk8sOA739ydBwhoGFYQenya7YZyiA%40mail.gmail.com -- Best regards, Kirill Reshke
Attachment
pgsql-hackers by date: