Thread: PoC: prefetching index leaf pages (for inserts)
Hi,

Some time ago I started a thread about prefetching heap pages during index scans [1]. That however only helps when reading rows, not when inserting them.

Imagine inserting a row into a table with many indexes - we insert the row into the table, and then into the indexes one by one, synchronously. We walk the index, determine the appropriate leaf page, read it from disk into memory, insert the index tuple, and then do the same thing for the next index. If there are many indexes, and there's not much correlation with the table, this may easily result in I/O happening synchronously with queue depth 1. Hard to make it even slower ...

This can be a problem even with a modest number of indexes - imagine bulk-loading data into a table using COPY (say, in 100-row batches). Inserting the rows into the heap happens in bulk, but the indexes are still modified in a loop, as if for single-row inserts. Not great. With multiple connections, concurrent I/O may be generated that way, but for low-concurrency workloads (e.g. batch jobs) that may not really work.

I had an idea what we might do about this - we can walk the index almost as if we're inserting the index tuple, but visit only the "inner" non-leaf pages. And instead of descending to the leaf page, we just prefetch it. The non-leaf pages are typically <1% of the index, and hot, so likely already cached and not worth prefetching.

The attached patch does a PoC of this. It adds a new AM function "amprefetch", with an implementation for btree indexes, mimicking the index lookup, except that it only prefetches the leaf page as explained a bit earlier. In the executor, this is wrapped in ExecInsertPrefetchIndexes(), which gets called in various places right before ExecInsertIndexTuples(). I thought about doing the prefetching in ExecInsertIndexTuples() directly, but that would not work for COPY, where we want to issue the prefetches for the whole batch, not for individual tuples.
This may need various improvements - the prefetch duplicates a couple of steps that could be expensive (e.g. evaluation of index predicates, forming index tuples, and so on). It would be nice to improve this, but it's good enough for a PoC I think.

Another gap is the lack of incremental prefetch (ramp-up). We just prefetch all the indexes, for all tuples. But I think that's OK. We know we'll need those pages, and the number is fairly limited.

There's a GUC enable_insert_prefetch that can be used to enable this insert prefetching.

I did a simple test on two machines - one with a SATA SSD RAID, one with an NVMe SSD. In both cases the data (table + indexes) is an order of magnitude larger than RAM. The indexes are on UUID columns, so access is pretty random and there's no correlation. Then batches of 100, 1000 and 10000 rows are inserted, with/without the prefetching. With 5 indexes, the results look like this:

SATA SSD RAID
-------------
  rows    no prefetch       prefetch
   100     176.872 ms      70.910 ms
  1000    1035.056 ms     590.495 ms
 10000    8494.836 ms    3216.206 ms

NVMe
----
  rows    no prefetch       prefetch
   100     133.365 ms      72.899 ms
  1000    1572.379 ms     829.298 ms
 10000   11889.143 ms    3621.981 ms

Not bad, I guess. Cutting the time to ~30% is nice. The fewer the indexes, the smaller the difference (with 1 index there is almost no difference), of course.

regards

[1] https://commitfest.postgresql.org/45/4351/

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

I had a chance to discuss this patch with Andres off-list a couple days ago. He said he had tried sorting the (index) tuples before inserting them, and that yielded significant benefits, possibly comparable to the prefetching. I was somewhat skeptical that the sorting might be very beneficial, but I decided to do some more tests comparing the benefits.

The sorting only really works for bulk inserts (e.g. from COPY), when we have multiple index tuples for each index. But I didn't have time to rework the code like that, so I took a simpler approach - do the sort in the INSERT, so that the inserted tuples are sorted too. So, something like:

  WITH data AS (SELECT md5(random()::text)::uuid
                  FROM generate_series(1,100) ORDER BY 1)
  INSERT INTO t SELECT * FROM data;

Obviously, this can only sort the rows in one way - if there are multiple indexes, then only one of them will see sorted input, limiting the possible benefit of the optimization. In the COPY code we could do a separate sort for each index, so the tuples would be sorted for all indexes (which also has a cost, of course). But as I said, I decided to do the simple SQL-level sort. The test uses multiple indexes on the same column, so it's a bit as if we sorted the tuples for each index independently.

Anyway, the test inserts batches of 100, ..., 100k rows into tables of different sizes (10M - 1B rows), with/without prefetching, and possibly sorts the batches before the insert. See the attached index.sql script.

The PDF shows the results, and also compares the different combinations. For example, the first 5 lines are results without and with prefetching, followed by the speedup, where green = faster and red = slower. For example, 50% means prefetching makes it 2x as fast. Similarly, the first column has timings without/with sorting, with the speedup right under it. Diagonally, we have the speedup for enabling both sorting and prefetching. I did that with different table sizes, where 10M easily fits into RAM while 1B certainly exceeds it.
And I did that on the usual two machines with different amounts of RAM and storage (SATA SSD vs. NVMe). The results are mostly similar on both, I think:

* On 10M tables (fits into RAM, including indexes), prefetching doesn't really help (assuming the data is not evicted from memory for other reasons), and actually hurts a bit (~20%). Sorting does help, depending on the number of indexes - it can be 10-40% faster.

* On 1B tables (exceeds RAM), prefetching is a clear winner. Sorting does not make any difference except for a couple of rare "blips".

* On 100M tables it's a mix/transition of those two cases.

So maybe we should try doing both, perhaps with some heuristics to only do the prefetching on sufficiently large/random indexes, and sorting only on smaller ones? Another option would be to make the prefetching smarter, so that we don't prefetch data that is already in memory (either in shared buffers or in the page cache). That would reduce the overhead/slowdown visible in the results on the 10M table.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Seems cfbot was not entirely happy about the patch, for two reasons:

1) the enable_insert_prefetching definition was inconsistent (different boot/default values, missing in .conf, and so on)

2) a stupid bug in execReplication, inserting index entries twice

The attached v3 should fix all of that, I believe.

As for the path forward, I think the prefetching is demonstrably beneficial, although there are cases where it can't help or even harms performance. I think the success depends on three areas:

(a) reducing the costs of the prefetching - For example, right now we build the index tuples twice (once for the prefetch, once for the insert), but maybe there's a way to do that only once? There are also predicate indexes, and so on.

(b) being smarter about when to prefetch - For example, if we only have one "prefetchable" index, it's somewhat pointless to prefetch (for single-row cases). And so on.

(c) not prefetching when already cached - This is somewhat related to the previous case, but perhaps it'd be cheaper to first check whether the data is already cached. For shared buffers it should not be difficult; for the page cache we could use preadv2 with the RWF_NOWAIT flag. The question is whether this is cheap enough to be cheaper than just doing posix_fadvise (which however only deals with the page cache).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On 06/11/2023 19:05, Tomas Vondra wrote:
> As for the path forward, I think the prefetching is demonstrably
> beneficial. There are cases where it can't help or even harms
> performance. I think the success depends on three areas:
>
> (a) reducing the costs of the prefetching - For example right now we
> build the index tuples twice (once for prefetch, once for the insert),
> but maybe there's a way to do that only once? There are also predicate
> indexes, and so on.
>
> (b) being smarter about when to prefetch - For example if we only have
> one "prefetchable" index, it's somewhat pointless to prefetch (for
> single-row cases). And so on.
>
> (c) not prefetching when already cached - This is somewhat related to
> the previous case, but perhaps it'd be cheaper to first check if the
> data is already cached. For shared buffers it should not be difficult,
> for page cache we could use preadv2 with RWF_NOWAIT flag. The question
> is if this is cheap enough to be cheaper than just doing posix_fadvise
> (which however only deals with the page cache).

I don't like this approach. It duplicates the tree-descend code, and it also duplicates the work of descending the tree at runtime. And it only addresses index insertion; there are a lot of places that could benefit from prefetching or async execution like this.

I think we should think of this as async execution rather than prefetching. We don't have the general infrastructure for writing async code, but if we did, this would be much simpler. In an async programming model, like you have in many other languages like Rust, python or javascript, there would be no separate prefetching function. Instead, aminsert() would return a future that can pause execution if it needs to do I/O.
Something like this:

    aminsert_futures = NIL;
    /* create a future for each index insert */
    for (<all indexes>)
    {
        aminsert_futures = lappend(aminsert_futures, aminsert(...));
    }
    /* wait for all the futures to finish */
    await aminsert_futures;

The async-aware aminsert function would run to completion quickly if all the pages are already in cache. If you get a cache miss, it would start an async I/O read for the page, and yield to the other insertions until the I/O completes.

We already support async execution of FDWs now, with the ForeignAsyncRequest() and ForeignAsyncConfigureWait() callbacks. Can we generalize that?

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On 11/23/23 14:26, Heikki Linnakangas wrote:
> On 06/11/2023 19:05, Tomas Vondra wrote:
>> As for the path forward, I think the prefetching is demonstrably
>> beneficial. There are cases where it can't help or even harms
>> performance. I think the success depends on three areas:
>>
>> (a) reducing the costs of the prefetching - For example right now we
>> build the index tuples twice (once for prefetch, once for the insert),
>> but maybe there's a way to do that only once? There are also predicate
>> indexes, and so on.
>>
>> (b) being smarter about when to prefetch - For example if we only have
>> one "prefetchable" index, it's somewhat pointless to prefetch (for
>> single-row cases). And so on.
>>
>> (c) not prefetching when already cached - This is somewhat related to
>> the previous case, but perhaps it'd be cheaper to first check if the
>> data is already cached. For shared buffers it should not be difficult,
>> for page cache we could use preadv2 with RWF_NOWAIT flag. The question
>> is if this is cheap enough to be cheaper than just doing posix_fadvise
>> (which however only deals with the page cache).
>
> I don't like this approach. It duplicates the tree-descend code, and it
> also duplicates the work of descending the tree at runtime. And it only
> addresses index insertion; there are a lot of places that could benefit
> from prefetching or async execution like this.

Yeah, I think that's a fair assessment, although I think the amount of duplicate code is pretty small (and perhaps it could be refactored into a common function, which I chose not to do in the PoC patch).

> I think we should think of this as async execution rather than
> prefetching. We don't have the general infrastructure for writing async
> code, but if we did, this would be much simpler. In an async programming
> model, like you have in many other languages like Rust, python or
> javascript, there would be no separate prefetching function.
> Instead, aminsert() would return a future that can pause execution if
> it needs to do I/O. Something like this:
>
>     aminsert_futures = NIL;
>     /* create a future for each index insert */
>     for (<all indexes>)
>     {
>         aminsert_futures = lappend(aminsert_futures, aminsert(...));
>     }
>     /* wait for all the futures to finish */
>     await aminsert_futures;
>
> The async-aware aminsert function would run to completion quickly if all
> the pages are already in cache. If you get a cache miss, it would start
> an async I/O read for the page, and yield to the other insertions until
> the I/O completes.
>
> We already support async execution of FDWs now, with the
> ForeignAsyncRequest() and ForeignAsyncConfigureWait() callbacks. Can we
> generalize that?

Interesting idea. I kinda like it in principle. However, I'm not very familiar with how our async execution works (and perhaps even with async implementations in general), so I can't quite say how difficult it would be to do something like that in an AM (instead of the executor). Where exactly would the boundary be between who "creates" and who "executes" the requests, and what would the flow of execution be?

For the FDW it seems fairly straightforward, because the boundary is local/remote, and the async request is executed in a separate process. But here everything would happen locally, so how would that work?

Imagine we're inserting a tuple into two indexes. There's a bunch of index pages we may need to read for each index - first some internal pages, then some leaf pages. Presumably we'd want to make each page read asynchronous, and allow transfer of control to the other index.

IIUC the async-aware aminsert() would execute an async request for the first page it needs to read, with a callback that essentially does the next step of the index descent - reads the page, determines the next page to read, and then issues another async request. Then it'd sleep, which would allow transfer of control to the aminsert() on the other index.
And then we'd do a similar thing for the leaf pages. Or do I imagine things wrong?

The thing I like about this async approach is that it would allow prefetching all index pages, while my PoC patch simply assumes all internal pages are in cache and prefetches only the first leaf page. That's much simpler in terms of control flow, but has clear limits.

I do wonder if there are concurrency issues, however. Imagine there's a COPY, i.e. we're inserting a batch of tuples. Can you run the aminsert() for all the tuples concurrently? Won't that have issues with the different "async threads" modifying the index under the other threads?

If those concurrent "insert threads" would be an issue, maybe we could make amprefetch() async-aware instead ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 6 Nov 2023 at 22:36, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Seems cfbot was not entirely happy about the patch, for two reasons:
>
> 1) enable_insert_prefetching definition was inconsistent (different
> boot/default values, missing in .conf and so on)
>
> 2) stupid bug in execReplication, inserting index entries twice
>
> The attached v3 should fix all of that, I believe.
>
> As for the path forward, I think the prefetching is demonstrably
> beneficial. There are cases where it can't help or even harms
> performance. I think the success depends on three areas:
>
> (a) reducing the costs of the prefetching - For example right now we
> build the index tuples twice (once for prefetch, once for the insert),
> but maybe there's a way to do that only once? There are also predicate
> indexes, and so on.
>
> (b) being smarter about when to prefetch - For example if we only have
> one "prefetchable" index, it's somewhat pointless to prefetch (for
> single-row cases). And so on.
>
> (c) not prefetching when already cached - This is somewhat related to
> the previous case, but perhaps it'd be cheaper to first check if the
> data is already cached. For shared buffers it should not be difficult,
> for page cache we could use preadv2 with RWF_NOWAIT flag. The question
> is if this is cheap enough to be cheaper than just doing posix_fadvise
> (which however only deals with the page cache).

CFBot shows that the patch does not apply anymore, as in [1]:

=== Applying patches on top of PostgreSQL commit ID 7014c9a4bba2d1b67d60687afb5b2091c1d07f73 ===
=== applying patch ./0001-insert-prefetch-v3.patch
patching file src/backend/access/brin/brin.c
Hunk #1 FAILED at 117.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/access/brin/brin.c.rej
patching file src/backend/access/gin/ginutil.c
Hunk #1 FAILED at 64.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/access/gin/ginutil.c.rej
patching file src/backend/access/gist/gist.c
Hunk #1 FAILED at 86.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/access/gist/gist.c.rej
patching file src/backend/access/hash/hash.c
Hunk #1 FAILED at 83.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/access/hash/hash.c.rej

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_4622.log

Regards,
Vignesh