Thread: WIP: parallel GiST index builds
Hi,

After looking into parallel builds for BRIN and GIN indexes, I was
wondering if there's a way to do parallel builds for GiST too. I knew
next to nothing about how GiST works, but I gave it a shot and here's
what I have - the attached patch allows parallel GiST builds for the
"unsorted" case (i.e. when the opclass does not include sortsupport),
and does not support buffered builds.


unsorted builds only
--------------------

Addressing only the unsorted case may seem a bit weird, but I did it
this way for two reasons - parallel sort is a solved problem, and adding
this to the patch seems quite straightforward. It's what btree does, for
example. But I also was not very sure how common this is - we do have
sort for points, but I have no idea if the PostGIS indexes define
sorting etc. My guess was "no" but I've been told that's no longer true,
so I guess sorted builds are more widely applicable than I thought.

In any case, I'm not in a rush to parallelize sorted builds. It can be
added later, as an improvement, IMHO. In fact, it's a well isolated part
of the patch, which might make it a good choice for someone looking for
an idea for their first patch ...


buffered builds
---------------

The lack of support for buffered builds is a very different thing. The
basic idea is that we don't push the index entries all the way to the
leaf pages right away, but accumulate them in buffers half-way through.
This combines writes and reduces random I/O, which is nice.

Unfortunately, the way it's implemented does not work with parallel
builds at all - all the state is in private memory, and it assumes the
worker is the only possible backend that can split the page (at which
point the buffers need to be split too, etc.). But for parallel builds
this is obviously not true.

I'm not saying parallel builds can't do similar buffering, but it
requires moving the buffers into shared memory, and introducing locking
to coordinate accesses to the buffers.
(Or perhaps it might be enough to only "notify" the workers about page
splits, with the buffers still in private memory?) Anyway, it seems far
too complicated for v1.

In fact, I'm not sure the buffering is entirely necessary - maybe the
increase in the amount of RAM makes this less of an issue? If the index
can fit into shared buffers (or at least the page cache), maybe the
amount of extra I/O is not that bad? I'm sure there may be cases really
affected by this, but maybe it's OK to tell people to disable parallel
builds in those cases?


gistGetFakeLSN
--------------

One more thing - GiST disables WAL-logging during the build, and only
logs the index once at the end. For serial builds this is fine, because
there are no concurrent splits, and so we don't need to rely on page
LSNs to detect these cases (in fact, it uses a bogus value). But for
parallel builds this would not work - we need page LSNs that actually
change, otherwise we'd miss page splits, and the index build would
either fail or produce a broken index.

But the existing is_build flag affects both things, so I had to
introduce a new "is_parallel" flag which only affects the page LSN part,
using the gistGetFakeLSN() function, previously used only for unlogged
indexes. This means we'll produce WAL during the index build (because
gistGetFakeLSN() writes a trivial message into WAL). Compared to serial
builds this produces maybe 25-75% more WAL, but it's an order of
magnitude less than with "full" WAL logging (is_build=false). For
example, a serial build of a 5GB index needs ~5GB of WAL. A parallel
build may need ~7GB, while a parallel build with "full" logging would
use 50GB. I think this is a reasonable trade-off.

There's one "strange" thing, though - the amount of WAL decreases with
the number of parallel workers.
Consider for example an index on a numeric field, where the index is
~9GB, but the amount of WAL changes like this (0 workers means a serial
build):

  parallel workers      0      1      3      5      7
  WAL (GB)            5.7    9.2    7.6    7.0    6.8

The explanation for this is fairly simple (AFAIK) - gistGetFakeLSN
determines if it needs to actually assign a new LSN (and write stuff to
WAL) by comparing the last LSN assigned (in a given worker) to the
current insert LSN. But the current insert LSN might have been updated
by some other worker, in which case we simply use that. Which means that
multiple workers may use the same fake LSN, and the likelihood increases
with the number of workers - and this is consistent with the observed
behavior of the WAL decreasing as the number of workers increases
(because more workers use the same LSN).

I'm not entirely sure if this is OK or a problem. I was worried two
workers might end up using the same LSN for the same page, leading to
other workers not noticing the split. But after a week of pretty
intensive stress testing, I haven't seen a single such failure ...

If this turns out to be a problem, the fix is IMHO quite simple - it
should be enough to force gistGetFakeLSN() to produce a new fake LSN
every time when is_parallel=true.


performance
-----------

Obviously, the primary goal of the patch is to speed up the builds, so
does it actually do that? For indexes of different sizes I got these
timings (in seconds):

  scale    type          0      1      3      5      7
  ------------------------------------------------------------------
  small    inet         13      7      4      4      2
           numeric     239    122     67     46     36
           oid          15      8      5      3      2
           text         71     35     19     13     10
  medium   inet        207    111     59     42     32
           numeric    3409   1714    885    618    490
           oid         214    114     60     43     33
           text        940    479    247    180    134
  large    inet       2167   1459    865    632    468
           numeric   38125  20256  10454   7487   5846
           oid        2647   1490    808    594    475
           text      10987   6298   3376   2462   1961

Here "small" is a ~100-200MB index, "medium" is 1-2GB and "large" a
10-20GB index, depending on the data type.
The raw durations are not particularly easy to interpret, so let's look
at the "actual speedup", calculated as

  (serial duration) / (parallel duration)

which gives this table:

  scale    type        1      3      5      7
  --------------------------------------------------------------
  small    inet       1.9    3.3    3.3    6.5
           numeric    2.0    3.6    5.2    6.6
           oid        1.9    3.0    5.0    7.5
           text       2.0    3.7    5.5    7.1
  medium   inet       1.9    3.5    4.9    6.5
           numeric    2.0    3.9    5.5    7.0
           oid        1.9    3.6    5.0    6.5
           text       2.0    3.8    5.2    7.0
  large    inet       1.5    2.5    3.4    4.6
           numeric    1.9    3.6    5.1    6.5
           oid        1.8    3.3    4.5    5.6
           text       1.7    3.3    4.5    5.6

Ideally (if the build scaled linearly with the number of workers), we'd
get the number of workers + 1 (because the leader participates too).
Obviously, it's not that great - for example for text with 3 workers we
get 3.3 instead of 4.0, and 5.6 vs. 8.0 with 7 workers. But I think
those numbers are actually pretty good - I'd definitely not complain if
my index builds got 5x faster.

But those are synthetic tests on random data, using the btree_gist
opclasses. It'd be interesting if people could do their own testing on
real-world data sets.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
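To make the gistGetFakeLSN() idea discussed above concrete, here is a minimal standalone sketch of the control flow - not the actual patch. The simulated insert position and the names `get_fake_lsn()` / `assign_new_lsn()` are stand-ins for `GetXLogInsertRecPtr()` and the real no-op WAL record; it only models the condition "reuse the insert LSN if it advanced, otherwise (or always, in a parallel build) force a new one":

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Simulated current WAL insert position; stands in for GetXLogInsertRecPtr(). */
static XLogRecPtr insert_lsn = 100;

/*
 * Stands in for WAL-logging the trivial "assign LSN" record, which
 * advances the insert position and returns the new value.
 */
static XLogRecPtr
assign_new_lsn(void)
{
	return ++insert_lsn;
}

/*
 * Simplified model of gistGetFakeLSN() with the proposed fix: a serial
 * build only assigns a new LSN when the insert position has not moved
 * since the last call, while a parallel worker forces a fresh LSN on
 * every call, so a page split can never be stamped with a stale value.
 */
static XLogRecPtr
get_fake_lsn(bool is_parallel)
{
	static XLogRecPtr lastlsn = 0;

	XLogRecPtr	currlsn = insert_lsn;

	if (is_parallel || lastlsn == currlsn)
		currlsn = assign_new_lsn();

	lastlsn = currlsn;
	return currlsn;
}
```

With is_parallel=true every call returns a strictly larger value, which is exactly the "produce a new fake LSN every time" behavior described above, at the cost of one trivial WAL record per call.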
Hi,

I've done a number of experiments with the GiST parallel builds, both
with the sorted and unsorted cases, so let me share some of the results
and conclusions from that.

In the first post I did some benchmarks using btree_gist, but that
seemed not very realistic - there certainly are much more widely used
GiST indexes in the GIS world. So this time I used OpenStreetMap, loaded
using osm2pgsql, with two dataset sizes:

- small - "north america" (121GB without indexes)
- large - "planet" (688GB without indexes)

And then I created indexes using either gist_geometry_ops_2d (with sort)
or gist_geometry_ops_nd (no sorting).

On 6/7/24 19:41, Tomas Vondra wrote:
> Hi,
>
> After looking into parallel builds for BRIN and GIN indexes, I was
> wondering if there's a way to do parallel builds for GiST too. I knew
> next to nothing about how GiST works, but I gave it a shot and here's
> what I have - the attached patch allows parallel GiST builds for the
> "unsorted" case (i.e. when the opclass does not include sortsupport),
> and does not support buffered builds.
>
>
> unsorted builds only
> --------------------
>
> Addressing only the unsorted case may seem a bit weird, but I did it
> this way for two reasons - parallel sort is a solved problem, and adding
> this to the patch seems quite straightforward. It's what btree does, for
> example. But I also was not very sure how common this is - we do have
> sort for points, but I have no idea if the PostGIS indexes define
> sorting etc. My guess was "no" but I've been told that's no longer true,
> so I guess sorted builds are more widely applicable than I thought.

For sorted builds, I made the claim that parallelizing sorted builds is
a "solved problem" because we can use a parallel tuplesort. I was
thinking that maybe it'd be better to do that in the initial patch, and
only then introduce the more complex stuff in the unsorted case, so I
gave this a try, and it turned out to be rather pointless.
Yes, parallel tuplesort does improve the duration, but it's not a very
significant improvement - maybe 10% or so. Most of the build time is
spent in gist_indexsortbuild(), so this is the part that would need to
be parallelized for any substantial improvement. Only then is it useful
to improve the tuplesort, I think.

And parallelizing gist_indexsortbuild() is not trivial - most of the
time is spent in gist_indexsortbuild_levelstate_flush() / gistSplit(),
so ISTM a successful parallel implementation would need to divide this
work between multiple workers. I don't have a clear idea how, though.

I do have a PoC/WIP patch doing the parallel tuplesort in my github
branch at [1] (and then also some ugly experiments on top of that), but
I'm not going to attach it here because of the reasons I just explained.
It'd be just a pointless distraction.

> In any case, I'm not in a rush to parallelize sorted builds. It can be
> added later, as an improvement, IMHO. In fact, it's a well isolated part
> of the patch, which might make it a good choice for someone looking for
> an idea for their first patch ...
>

I still think this assessment is correct - it's fine to not parallelize
sorted builds. It can be improved in the future, or even not at all.

>
> buffered builds
> ---------------
>
> The lack of support for buffered builds is a very different thing. The
> basic idea is that we don't push the index entries all the way to the
> leaf pages right away, but accumulate them in buffers half-way through.
> This combines writes and reduces random I/O, which is nice.
>
> Unfortunately, the way it's implemented does not work with parallel
> builds at all - all the state is in private memory, and it assumes the
> worker is the only possible backend that can split the page (at which
> point the buffers need to be split too, etc.). But for parallel builds
> this is obviously not true.
>
> I'm not saying parallel builds can't do similar buffering, but it
> requires moving the buffers into shared memory, and introducing locking
> to coordinate accesses to the buffers. (Or perhaps it might be enough to
> only "notify" the workers about page splits, with buffers still in
> private memory?). Anyway, it seems far too complicated for v1.
>
> In fact, I'm not sure the buffering is entirely necessary - maybe the
> increase in amount of RAM makes this less of an issue? If the index can
> fit into shared buffers (or at least page cache), maybe the amount of
> extra I/O is not that bad? I'm sure there may be cases really affected
> by this, but maybe it's OK to tell people to disable parallel builds in
> those cases?
>

For unsorted builds, here are the results from one of the machines for
the duration of CREATE INDEX with the requested number of workers (0
means a serial build) for different tables in the OSM databases, along
with the size of each index:

  db      type     size (MB) |     0      1      2      3      4
  ----------------------------|----------------------------------
  small   line        4889   |   811    429    294    223    186
          point       2625   |   485    262    179    141    125
          polygon     7644   |  1230    623    418    318    261
          roads        273   |    40     22     16     14     12
  ----------------------------|----------------------------------
  large   line       20592   |  3916   2157   1479   1137    948
          point      13080   |  2636   1442    981    770    667
          polygon    50598   | 10990   5648   3860   2991   2504
          roads       1322   |   228    123     85     67     56

If we calculate the speedup compared to the serial build, we get this:

  db      type     |     1      2      3      4
  ------------------|--------------------------------
  small   line     |   1.9    2.8    3.6    4.4
          point    |   1.9    2.7    3.4    3.9
          polygon  |   2.0    2.9    3.9    4.7
          roads    |   1.8    2.5    2.9    3.3
  ------------------|--------------------------------
  large   line     |   1.8    2.6    3.4    4.1
          point    |   1.8    2.7    3.4    4.0
          polygon  |   1.9    2.8    3.7    4.4
          roads    |   1.9    2.7    3.4    4.1

Remember, the leader participates in the build, so K workers means K+1
processes are doing the work. With that in mind, the speedup is pretty
close to ideal.
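For reference, the speedup and per-process efficiency used in these tables boil down to a one-line calculation each; here's a trivial standalone sketch (the "large / roads" row serves as example input):

```c
/*
 * Speedup of a parallel build over a serial one, and efficiency per
 * participating process (the leader also works, hence nworkers + 1).
 */
static double
speedup(double serial_sec, double parallel_sec)
{
	return serial_sec / parallel_sec;
}

static double
efficiency(double serial_sec, double parallel_sec, int nworkers)
{
	return speedup(serial_sec, parallel_sec) / (nworkers + 1);
}
```

For "large / roads" (228s serial, 56s with 4 workers) this gives a speedup of ~4.1 with 5 processes doing the work, i.e. roughly 80% efficiency.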
There's the question about buffering, though - as mentioned in the first
message, the parallel builds do not support buffering, so the question
is how bad the impact of that is. Clearly, the duration improves a lot,
so that's good, but maybe the build wrote out far more buffers and the
NVMe drive simply handled it well?

So I used pg_stat_statements to track the number of buffer writes
(shared_blks_written) for the CREATE INDEX, and for the large data set
it looks like this (in MBs written):

  type     size (MB) |      0       1       2       3       4
  --------------------|--------------------------------------------
  line       20592   |  43577   47580   49574   50388   50734
  point      13080   |  23331   25721   26399   26745   26889
  polygon    50598   | 113108  125095  129599  130170  131249
  roads       1322   |   1322    1310    1305    1300    1295

The serial builds (0 workers) are buffered, but the buffering only
applies to indexes that exceed effective_cache_size (4GB). Which means
the "roads" index is too small to activate buffering, and there should
be very little difference - which is the case (but the index also fits
into shared buffers in this case).

The other indexes do activate buffering, so the question is how many
more buffers get written out compared to serial builds (with buffering).
The comparison looks like this:

  type         1      2      3      4
  ------------------------------------------
  line       109%   114%   116%   116%
  point      110%   113%   115%   115%
  polygon    111%   115%   115%   116%
  roads       99%    99%    98%    98%

So it writes about 15-20% more buffers during the index build, which is
not that much IMHO. I was wondering if this might change with smaller
shared buffers, so I tried building indexes on the smaller data set with
128MB shared buffers, but the difference remained ~15-20%.

My conclusion from this is that it's OK to have parallel builds without
buffering, and then maybe improve that later. The thing I'm not sure
about is how this should interact with the "buffering" option. Right now
we just ignore that entirely if we decide to do a parallel build.
But maybe it'd be better to disable parallel builds when the user
specifies "buffering=on" (and only allow parallel builds with
off/auto)?

I did check how parallelism affects the amount of WAL produced, but
that's pretty much exactly how I described it in the initial message,
including the "strange" decrease with more workers due to reusing the
fake LSN etc.

regards


[1] https://github.com/tvondra/postgres/tree/parallel-gist-20240625

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Tomas!

> On 7 Jun 2024, at 20:41, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> After looking into parallel builds for BRIN and GIN indexes, I was
> wondering if there's a way to do parallel builds for GiST too. I knew
> next to nothing about how GiST works, but I gave it a shot and here's
> what I have - the attached patch allows parallel GiST builds for the
> "unsorted" case (i.e. when the opclass does not include sortsupport),
> and does not support buffered builds.

I think this totally makes sense. I took a look into tuple partitioning
(for sorted builds) in your Github and I see that it's a complicated
feature. So, probably, we can do it later. I'm trying to review the
patch as it is now. Currently I have some questions about the code.

1. Do I get it right that the is_parallel argument for gistGetFakeLSN()
is only needed for an assertion? And this assertion can be ensured just
by inspecting the code. Is it really necessary?

2. gistBuildParallelCallback() updates indtuplesSize, but it seems to be
not used anywhere. AFAIK it's only needed for buffered builds.

3. I think we need a test that reliably triggers parallel and serial
builds.

As far as I know, there's a well known trick to build a better GiST over
PostGIS data: randomize the input. I think a parallel scan is just what
is needed, it will shuffle the tuples enough...

Thanks for working on this!


Best regards, Andrey Borodin.
On 7/21/24 21:31, Andrey M. Borodin wrote:
> Hi Tomas!
>
>> On 7 Jun 2024, at 20:41, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>
>> After looking into parallel builds for BRIN and GIN indexes, I was
>> wondering if there's a way to do parallel builds for GiST too. I
>> knew next to nothing about how GiST works, but I gave it a shot and
>> here's what I have - the attached patch allows parallel GiST builds
>> for the "unsorted" case (i.e. when the opclass does not include
>> sortsupport), and does not support buffered builds.
>
> I think this totally makes sense. I've took a look into tuples
> partitioning (for sorted build) in your Github and I see that it's
> complicated feature. So, probably, we can do it later. I'm trying to
> review the patch as it is now. Currently I have some questions about
> code.

OK. I'm not even sure partitioning is the right approach for sorted
builds. Or how to do it, exactly.

>
> 1. Do I get it right that is_parallel argument for gistGetFakeLSN()
> is only needed for assertion? And this assertion can be ensured just
> by inspecting code. Is it really necessary?

Yes, in the patch it's only used for an assert. But it's actually
incorrect - just as I speculated in my initial message (in the section
about gistGetFakeLSN), it sometimes fails to detect a page split. I
noticed that while testing the patch adding GiST to amcheck, and I
reported that in that thread [1]. But I apparently forgot to post an
updated version of this patch :-(

I'll post a new version tomorrow, but in short it needs to update the
fake LSN even if (lastlsn != currlsn) when is_parallel=true. It's a bit
annoying this means we generate a new fake LSN on every call, and I'm
not sure that's actually necessary. But I've been unable to come up with
a better condition for when to generate a new LSN.

[1]
https://www.postgresql.org/message-id/79622955-6d1a-4439-b358-ec2b6a9e7cbf%40enterprisedb.com

> 2.
> gistBuildParallelCallback() updates indtuplesSize, but it seems to be
> not used anywhere. AFAIK it's only needed to buffered build.
>
> 3. I think we need a test that reliably triggers parallel and serial
> builds.
>

Yeah, it's possible the variable is unused. Agreed on the testing.

> As far as I know, there's a well known trick to build better GiST
> over PostGIS data: randomize input. I think parallel scan is just
> what is needed, it will shuffle tuples enough...
>

I'm not sure I understand this comment. What do you mean by "better
GiST" or what does that mean for this patch?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On 21 Jul 2024, at 23:42, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>>
>> 1. Do I get it right that is_parallel argument for gistGetFakeLSN()
>> is only needed for assertion? And this assertion can be ensured just
>> by inspecting code. Is it really necessary?
>
> Yes, in the patch it's only used for an assert. But it's actually
> incorrect - just as I speculated in my initial message (in the section
> about gistGetFakeLSN), it sometimes fails to detect a page split. I
> noticed that while testing the patch adding GiST to amcheck, and I
> reported that in that thread [1]. But I apparently forgot to post an
> updated version of this patch :-(

Oops, I just thought that it was a version with the FakeLSN problem
solved.

>
> I'll post a new version tomorrow, but in short it needs to update the
> fake LSN even if (lastlsn != currlsn) if is_parallel=true. It's a bit
> annoying this means we generate a new fake LSN on every call, and I'm
> not sure that's actually necessary. But I've been unable to come up with
> a better condition when to generate a new LSN.

Why don't we just use an atomic counter within the shared build state?

>
> [1]
> https://www.postgresql.org/message-id/79622955-6d1a-4439-b358-ec2b6a9e7cbf%40enterprisedb.com

Yes, I'll be back in that thread soon. I'm still on vacation and it's
hard to get continuous uninterrupted time here. You did a great review,
and I want to address all the issues there holistically. Thank you!

>> 2. gistBuildParallelCallback() updates indtuplesSize, but it seems to be
>> not used anywhere. AFAIK it's only needed to buffered build. 3. I
>> think we need a test that reliably triggers parallel and serial
>> builds.
>>
>
> Yeah, it's possible the variable is unused. Agreed on the testing.
>
>> As far as I know, there's a well known trick to build better GiST
>> over PostGIS data: randomize input. I think parallel scan is just
>> what is needed, it will shuffle tuples enough...
>>
>
> I'm not sure I understand this comment.
> What do you mean by "better
> GiST" or what does that mean for this patch?

I think indexes built in parallel will have faster IndexScans.


Best regards, Andrey Borodin.
On 7/22/24 11:00, Andrey M. Borodin wrote:
>
>
>> On 21 Jul 2024, at 23:42, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>>>
>>> 1. Do I get it right that is_parallel argument for gistGetFakeLSN()
>>> is only needed for assertion? And this assertion can be ensured just
>>> by inspecting code. Is it really necessary?
>>
>> Yes, in the patch it's only used for an assert. But it's actually
>> incorrect - just as I speculated in my initial message (in the section
>> about gistGetFakeLSN), it sometimes fails to detect a page split. I
>> noticed that while testing the patch adding GiST to amcheck, and I
>> reported that in that thread [1]. But I apparently forgot to post an
>> updated version of this patch :-(
>
> Oops, I just though that it's a version with solved FakeLSN problem.
>
>>
>> I'll post a new version tomorrow, but in short it needs to update the
>> fake LSN even if (lastlsn != currlsn) if is_parallel=true. It's a bit
>> annoying this means we generate a new fake LSN on every call, and I'm
>> not sure that's actually necessary. But I've been unable to come up with
>> a better condition when to generate a new LSN.
>
> Why don't we just use an atomic counter within shared build state?
>

I don't understand how that would solve the problem, can you elaborate?
Which of the values are you suggesting should be replaced with the
shared counter? lastlsn?

>>
>> [1]
>> https://www.postgresql.org/message-id/79622955-6d1a-4439-b358-ec2b6a9e7cbf%40enterprisedb.com
>
> Yes, I'll be back in that thread soon. I'm still on vacation and it's
> hard to get continuous uninterrupted time here. You did a great review,
> and I want to address all issues there wholistically. Thank you!
>
>>> 2. gistBuildParallelCallback() updates indtuplesSize, but it seems to be
>>> not used anywhere. AFAIK it's only needed to buffered build. 3. I
>>> think we need a test that reliably triggers parallel and serial
>>> builds.
>>>
>>
>> Yeah, it's possible the variable is unused.
>> Agreed on the testing.
>>
>>> As far as I know, there's a well known trick to build better GiST
>>> over PostGIS data: randomize input. I think parallel scan is just
>>> what is needed, it will shuffle tuples enough...
>>>
>>
>> I'm not sure I understand this comment. What do you mean by "better
>> GiST" or what does that mean for this patch?
>
> I think parallel build indexes will have faster IndexScans.
>

That's interesting. I haven't thought about measuring stuff like that
(and it hasn't occurred to me it might have this benefit, or why that
would be the case).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On 22 Jul 2024, at 12:26, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> I don't understand how would that solve the problem, can you elaborate?
> Which of the values are you suggesting should be replaced with the
> shared counter? lastlsn?

I think during the build we should consider the index unlogged and
always use GetFakeLSNForUnloggedRel() or something similar. Anyway, we
will call log_newpage_range(RelationGetNumberOfBlocks(index)) at the
end.


Best regards, Andrey Borodin.
On 7/22/24 13:08, Andrey M. Borodin wrote:
>
>
>> On 22 Jul 2024, at 12:26, Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I don't understand how would that solve the problem, can you
>> elaborate? Which of the values are you suggesting should be
>> replaced with the shared counter? lastlsn?
>
> I think during build we should consider index unlogged and always use
> GetFakeLSNForUnloggedRel() or something similar. Anyway we will
> call log_newpage_range(RelationGetNumberOfBlocks(index)) at the end.
>

But that doesn't update the page LSN, which GiST uses to detect
concurrent splits, no?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On 22 Jul 2024, at 14:53, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 7/22/24 13:08, Andrey M. Borodin wrote:
>>
>>> On 22 Jul 2024, at 12:26, Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> I don't understand how would that solve the problem, can you
>>> elaborate? Which of the values are you suggesting should be
>>> replaced with the shared counter? lastlsn?
>>
>> I think during build we should consider index unlogged and always use
>> GetFakeLSNForUnloggedRel() or something similar. Anyway we will
>> call log_newpage_range(RelationGetNumberOfBlocks(index)) at the end.
>>
>
> But that doesn't update the page LSN, which GiST uses to detect
> concurrent splits, no?

During the insertion of tuples we need an NSN on the page. For the NSN
we can use just a counter, generated by gistGetFakeLSN(), which in turn
will call GetFakeLSNForUnloggedRel(). Or any other shared counter.

After inserting the tuples we call log_newpage_range() to actually
WAL-log the pages.

All NSNs used during the build must be less than the LSNs used to
insert new tuples after the index is built.


Best regards, Andrey Borodin.
On 7/22/24 2:08 PM, Andrey M. Borodin wrote:
> During inserting tuples we need NSN on page. For NSN we can use just a
> counter, generated by gistGetFakeLSN() which in turn will call
> GetFakeLSNForUnloggedRel(). Or any other shared counter.
> After inserting tuples we call log_newpage_range() to actually WAL-log
> pages.
> All NSNs used during build must be less than LSNs used to insert new
> tuples after index is built.

I feel the tricky part about doing that is that we need to make sure the
fake LSNs are all less than the current real LSN when the index build
completes, and while that normally should be the case we will have an
almost never exercised code path for when the fake LSN becomes bigger
than the real LSN, which may contain bugs. Is that really worth it to
optimize?

But if we are going to use a fake LSN: since the index being built is
not visible to any scans, we do not have to use
GetFakeLSNForUnloggedRel() but could use our own counter in shared
memory in the GISTShared struct for this specific index, which starts at
FirstNormalUnloggedLSN. This would give us slightly less contention plus
decrease the risk (for good and bad) of the fake LSN being larger than
the real LSN.

Andreas
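For what it's worth, the per-index counter idea can be sketched in a standalone way like this. C11 atomics stand in for PostgreSQL's pg_atomic_uint64, and the GISTShared layout here is purely hypothetical - in the actual patch the struct lives in the DSM segment shared by the build workers:

```c
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* First LSN value handed out for unlogged relations. */
#define FirstNormalUnloggedLSN ((XLogRecPtr) 1000)

/*
 * Hypothetical shared state for a parallel GiST build; C11 atomics
 * stand in for pg_atomic_uint64 here.
 */
typedef struct GISTShared
{
	_Atomic XLogRecPtr nextfakelsn;	/* per-index fake LSN counter */
} GISTShared;

static void
gist_shared_init(GISTShared *shared)
{
	atomic_init(&shared->nextfakelsn, FirstNormalUnloggedLSN);
}

/*
 * Return a fresh fake LSN. fetch_add keeps the counter strictly
 * monotonic even when multiple workers call this concurrently, so no
 * two page splits can ever be stamped with the same NSN.
 */
static XLogRecPtr
gist_get_fake_lsn(GISTShared *shared)
{
	return atomic_fetch_add(&shared->nextfakelsn, 1);
}
```

Because the counter is private to this one index build (the index is invisible to scans until it completes), there is no contention with other backends, unlike the global counter behind GetFakeLSNForUnloggedRel().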
> On 26 Jul 2024, at 14:30, Andreas Karlsson <andreas@proxel.se> wrote:
>
> I feel the tricky part about doing that is that we need to make sure
> the fake LSNs are all less than the current real LSN when the index
> build completes and while that normally should be the case we will have
> a almost never exercised code path for when the fake LSN becomes bigger
> than the real LSN which may contain bugs. Is that really worth it to
> optimize.
>
> But if we are going to use fake LSN: since the index being built is not
> visible to any scans we do not have to use GetFakeLSNForUnloggedRel()
> but could use an own counter in shared memory in the GISTShared struct
> for this specific index which starts at FirstNormalUnloggedLSN. This
> would give us slightly less contention plus decrease the risk (for good
> and bad) of the fake LSN being larger than the real LSN.

+1 for an atomic counter in GISTShared.

BTW we can just reset the LSNs to GistBuildLSN just before doing
log_newpage_range().


Best regards, Andrey Borodin.
On 7/26/24 14:13, Andrey M. Borodin wrote:
>
>
>> On 26 Jul 2024, at 14:30, Andreas Karlsson <andreas@proxel.se> wrote:
>>
>> I feel the tricky part about doing that is that we need to make sure
>> the fake LSNs are all less than the current real LSN when the index
>> build completes and while that normally should be the case we will
>> have a almost never exercised code path for when the fake LSN becomes
>> bigger than the real LSN which may contain bugs. Is that really worth
>> it to optimize.
>>
>> But if we are going to use fake LSN: since the index being built is
>> not visible to any scans we do not have to use
>> GetFakeLSNForUnloggedRel() but could use an own counter in shared
>> memory in the GISTShared struct for this specific index which starts
>> at FirstNormalUnloggedLSN. This would give us slightly less contention
>> plus decrease the risk (for good and bad) of the fake LSN being larger
>> than the real LSN.
>
> +1 for atomic counter in GISTShared.

I tried implementing this, see the attached 0002 patch that replaces the
fake LSN with an atomic counter in shared memory. It seems to work (more
testing needed), but I can't say I'm very happy with the code :-(

The way it passes the shared counter to places that actually need it is
pretty ugly. The thing is - the counter needs to be in shared memory,
but places like gistplacetopage() have no idea/need of that. I chose to
simply pass a pg_atomic_uint64 pointer, but that's ... not pretty. Is
there a better way to do this?

I thought maybe we could simply increment the counter before each call
and pass the LSN value - 64 bits should be enough, though I'm not sure
about the overhead. But gistplacetopage() also uses the LSN twice, and
I'm not sure it'd be legal to use the same value twice.

Any better ideas?

> BTW we can just reset LSNs to GistBuildLSN just before doing
> log_newpage_range().
>

Why would the reset be necessary? Doesn't log_newpage_range() set the
page LSN to the current insert LSN? So why reset it?
I'm not sure about the discussion about NSN and the need to handle the
case when the NSN / fake LSN values get ahead of the LSN. Is that really
a problem? If the values generated from the counter are private to the
index build, and log_newpage_range() replaces them with the current LSN,
do we still need to worry about it?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On 30 Jul 2024, at 14:05, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 7/26/24 14:13, Andrey M. Borodin wrote:
>>
>>> On 26 Jul 2024, at 14:30, Andreas Karlsson <andreas@proxel.se> wrote:
>>>
>>> I feel the tricky part about doing that is that we need to make sure
>>> the fake LSNs are all less than the current real LSN when the index
>>> build completes and while that normally should be the case we will
>>> have a almost never exercised code path for when the fake LSN becomes
>>> bigger than the real LSN which may contain bugs. Is that really worth
>>> it to optimize.
>>>
>>> But if we are going to use fake LSN: since the index being built is
>>> not visible to any scans we do not have to use
>>> GetFakeLSNForUnloggedRel() but could use an own counter in shared
>>> memory in the GISTShared struct for this specific index which starts
>>> at FirstNormalUnloggedLSN. This would give us slightly less
>>> contention plus decrease the risk (for good and bad) of the fake LSN
>>> being larger than the real LSN.
>>
>> +1 for atomic counter in GISTShared.
>
> I tried implementing this, see the attached 0002 patch that replaces the
> fake LSN with an atomic counter in shared memory. It seems to work (more
> testing needed), but I can't say I'm very happy with the code :-(

I agree. Passing this pointer everywhere seems ugly.

>
> The way it passes the shared counter to places that actually need it is
> pretty ugly. The thing is - the counter needs to be in shared memory,
> but places like gistplacetopage() have no idea/need of that. I chose to
> simply pass a pg_atomic_uint64 pointer, but that's ... not pretty. Is
> there's a better way to do this?
>
> I thought maybe we could simply increment the counter before each call
> and pass the LSN value - 64bits should be enough, not sure about the
> overhead. But gistplacetopage() also uses the LSN twice, and I'm not
> sure it'd be legal to use the same value twice.
>
> Any better ideas?

How about a global pointer to the fake LSN?
Just set it to the correct pointer when in a parallel build, or NULL otherwise.

>> BTW we can just reset LSNs to GistBuildLSN just before doing log_newpage_range().
>>
>
> Why would the reset be necessary? Doesn't log_newpage_range() set page
> LSN to the current insert LSN? So why reset that?
>
> I'm not sure about the discussion about NSN and the need to handle the
> case when NSN / fake LSN values get ahead of LSN. Is that really a
> problem? If the values generated from the counter are private to the
> index build, and log_newpage_range() replaces them with the current LSN, do
> we still need to worry about it?

Stamping pages with a new real LSN will do the trick. I didn't know that log_newpage_range() is already doing so.

How do we synchronize the shared fake LSN with the global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if necessary?
Perhaps consider using GetFakeLSNForUnloggedRel() instead of the shared counter? As long as we do not care about FakeLSN > RealLSN.

Best regards, Andrey Borodin.
On 7/30/24 11:39, Andrey M. Borodin wrote:
>
>> On 30 Jul 2024, at 14:05, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 7/26/24 14:13, Andrey M. Borodin wrote:
>>>
>>>> On 26 Jul 2024, at 14:30, Andreas Karlsson <andreas@proxel.se> wrote:
>>>>
>>>> I feel the tricky part about doing that is that we need to make sure the fake LSNs are all less than the current real LSN when the index build completes and while that normally should be the case we will have an almost never exercised codepath for when the fake LSN becomes bigger than the real LSN which may contain bugs. Is that really worth it to optimize.
>>>>
>>>> But if we are going to use fake LSN: since the index being built is not visible to any scans we do not have to use GetFakeLSNForUnloggedRel() but could use an own counter in shared memory in the GISTShared struct for this specific index which starts at FirstNormalUnloggedLSN. This would give us slightly less contention plus decrease the risk (for good and bad) of the fake LSN being larger than the real LSN.
>>>
>>> +1 for atomic counter in GISTShared.
>>
>> I tried implementing this, see the attached 0002 patch that replaces the
>> fake LSN with an atomic counter in shared memory. It seems to work (more
>> testing needed), but I can't say I'm very happy with the code :-(
>
> I agree. Passing this pointer everywhere seems ugly.

Yeah.

>> The way it passes the shared counter to places that actually need it is
>> pretty ugly. The thing is - the counter needs to be in shared memory,
>> but places like gistplacetopage() have no idea/need of that. I chose to
>> simply pass a pg_atomic_uint64 pointer, but that's ... not pretty. Is
>> there a better way to do this?
>>
>> I thought maybe we could simply increment the counter before each call
>> and pass the LSN value - 64 bits should be enough, not sure about the
>> overhead. But gistplacetopage() also uses the LSN twice, and I'm not
>> sure it'd be legal to use the same value twice.
>>
>> Any better ideas?
>
> How about a global pointer to the fake LSN?
> Just set it to the correct pointer when in a parallel build, or NULL otherwise.

I'm not sure adding a global variable is pretty either. What if there's some error, for example? Will it get reset to NULL?

>>> BTW we can just reset LSNs to GistBuildLSN just before doing log_newpage_range().
>>>
>>
>> Why would the reset be necessary? Doesn't log_newpage_range() set page
>> LSN to the current insert LSN? So why reset that?
>>
>> I'm not sure about the discussion about NSN and the need to handle the
>> case when NSN / fake LSN values get ahead of LSN. Is that really a
>> problem? If the values generated from the counter are private to the
>> index build, and log_newpage_range() replaces them with the current LSN, do
>> we still need to worry about it?
>
> Stamping pages with a new real LSN will do the trick. I didn't know that log_newpage_range() is already doing so.

I believe it does, or at least that's what I believe this code at the end is meant to do:

    recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);

    for (i = 0; i < nbufs; i++)
    {
        PageSetLSN(BufferGetPage(bufpack[i]), recptr);
        UnlockReleaseBuffer(bufpack[i]);
    }

Unless I misunderstood what this does.

> How do we synchronize the shared fake LSN with the global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if necessary?
> Perhaps consider using GetFakeLSNForUnloggedRel() instead of the shared counter? As long as we do not care about FakeLSN > RealLSN.

I'm confused. How is this related to unloggedLSN at all?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On 30 Jul 2024, at 14:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>> How do we synchronize the shared fake LSN with the global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if necessary?
>> Perhaps consider using GetFakeLSNForUnloggedRel() instead of the shared counter? As long as we do not care about FakeLSN > RealLSN.
>
> I'm confused. How is this related to unloggedLSN at all?

Parallel build should work for both logged and unlogged indexes. If we use a fake LSN in shared memory, we have to make sure that FakeLSN < XLogCtl->unloggedLSN after the build. Either way, we can just use XLogCtl->unloggedLSN instead of the fake LSN in shared memory.

In other words, I propose to use GetFakeLSNForUnloggedRel() instead of "pg_atomic_uint64 *fakelsn;".

Best regards, Andrey Borodin.
On 7/30/24 13:31, Andrey M. Borodin wrote: > > >> On 30 Jul 2024, at 14:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: >> >>> >>> How do we synchronize Shared Fake LSN with global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if necessary? >>> Perhaps, consider using GetFakeLSNForUnloggedRel() instead of shared counter? As long as we do not care about FakeLSN>RealLSN. >>> >> >> I'm confused. How is this related to unloggedLSN at all? > > Parallel build should work for both logged and unlogged indexes. Agreed, no argument here. > If we use fake LSN in shared memory, we have to make sure that FakeLSN < XLogCtl->unloggedLSN after build. > Either way we can just use XLogCtl->unloggedLSN instead of FakeLSN in shared memory. > Ah, right. For unlogged relations we won't invoke log_newpage_range(), so we'd end up with the bogus page LSNs ... > In other words I propose to use GetFakeLSNForUnloggedRel() instead of "pg_atomic_uint64 *fakelsn;”. > Interesting idea. IIRC you suggested this earlier, but I didn't realize it has the benefit of already using an atomic counter - which solves the "ugliness" of my patch. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 7/30/24 1:31 PM, Andrey M. Borodin wrote:
>> On 30 Jul 2024, at 14:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>>> How do we synchronize the shared fake LSN with the global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if necessary?
>>> Perhaps consider using GetFakeLSNForUnloggedRel() instead of the shared counter? As long as we do not care about FakeLSN > RealLSN.
>>
>> I'm confused. How is this related to unloggedLSN at all?
>
> Parallel build should work for both logged and unlogged indexes.
> If we use a fake LSN in shared memory, we have to make sure that FakeLSN < XLogCtl->unloggedLSN after the build.
> Either way, we can just use XLogCtl->unloggedLSN instead of the fake LSN in shared memory.
>
> In other words, I propose to use GetFakeLSNForUnloggedRel() instead of "pg_atomic_uint64 *fakelsn;".

Yeah, great point. Given the ugliness of passing around the fakelsn, we might as well just use GetFakeLSNForUnloggedRel().

Andreas
Hi, Here's an updated patch using GetFakeLSNForUnloggedRel() instead of the atomic counter. I think this looks much nicer and less invasive, as it simply uses XLogCtl shared memory (instead of having to pass a new pointer everywhere). We still need to pass the is_parallel flag, though. I wonder if we could get rid of that too, and just use GetFakeLSNForUnloggedRel() for both parallel and non-parallel builds? Why wouldn't that work? I've spent quite a bit of time testing this, but mostly for correctness. I haven't redone the benchmarks, that's on my TODO. regards -- Tomas Vondra
> On 8 Oct 2024, at 17:05, Tomas Vondra <tomas@vondra.me> wrote:
>
> Here's an updated patch adding the queryid.

I've taken another round of looking through the patch. Everything I see seems correct to me.

It just occurred to me that we will have: buffered build, parallel build, sorting build. All 3 aiming to speed things up. I really hope that we will find a way to parallelize the sorting build, because it will be fastest for sure.

Currently we have some instances of such code... or similar... or related code.

    if (is_build)
    {
        if (is_parallel)
            recptr = GetFakeLSNForUnloggedRel();
        else
            recptr = GistBuildLSN;
    }
    else
    {
        if (RelationNeedsWAL(rel))
        {
            recptr = actuallyWriteWAL();
        }
        else
            recptr = gistGetFakeLSN(rel);
    }
    // Use recptr

In the previous review I was a proponent of not adding arguments to gistGetFakeLSN(). But now I see that the logic of choosing the LSN source is sprawling across various places... Now I do not have a strong point of view on this. Do you think something like the following would be clearer?

    if (!is_build && RelationNeedsWAL(rel))
    {
        recptr = actuallyWriteWAL();
    }
    else
        recptr = gistGetFakeLSN(rel, is_build, is_parallel);

Just as an idea.

I'm mostly looking at the GiST-specific parts of the patch, while the things around entering parallel mode seem much more complicated. But as far as I can see, large portions of this code are taken from B-tree/BRIN.

Best regards, Andrey Borodin.
On 10/31/24 19:05, Andrey M. Borodin wrote:
>
>> On 8 Oct 2024, at 17:05, Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Here's an updated patch adding the queryid.
>
> I've taken another round of looking through the patch.
> Everything I see seems correct to me. It just occurred to me that we
> will have: buffered build, parallel build, sorting build. All 3 aiming
> to speed things up. I really hope that we will find a way to parallelize
> the sorting build, because it will be fastest for sure.

The number of different ways to build a GiST index worries me. When we had just buffered vs. sorted builds, that was a pretty easy decision, because buffered builds are much faster 99% of the time. But for parallel builds it's not that clear - it can easily happen that we use much more CPU without speeding anything up. We just start as many parallel workers as possible, given the maintenance_work_mem value, and hope for the best.

Maybe with parallel buffered builds it'd be clearer ... but I'm not sufficiently familiar with that code, and I don't have time to study that at the moment because of other patches. Someone else would have to take a stab at that ...

> Currently we have some instances of such code... or similar... or
> related code.
>
>     if (is_build)
>     {
>         if (is_parallel)
>             recptr = GetFakeLSNForUnloggedRel();
>         else
>             recptr = GistBuildLSN;
>     }
>     else
>     {
>         if (RelationNeedsWAL(rel))
>         {
>             recptr = actuallyWriteWAL();
>         }
>         else
>             recptr = gistGetFakeLSN(rel);
>     }
>     // Use recptr
>
> In the previous review I was a proponent of not adding arguments to
> gistGetFakeLSN(). But now I see that the logic of choosing the LSN
> source is sprawling across various places...
>
> Now I do not have a strong point of view on this. Do you think
> something like the following would be clearer?
>
>     if (!is_build && RelationNeedsWAL(rel))
>     {
>         recptr = actuallyWriteWAL();
>     }
>     else
>         recptr = gistGetFakeLSN(rel, is_build, is_parallel);
>
> Just as an idea.
> > I'm mostly looking on GiST-specific parts of the patch, while things > around entering parallel mode seems much more complicated. But as far as > I can see, large portions of this code are taken from B-tree\BRIN. > I agree the repeated code is pretty tedious, and it's also easy to miss one of the places when changing the logic, so I think wrapping that in some function makes sense. regards -- Tomas Vondra