Re: index prefetching - Mailing list pgsql-hackers

From:            Tomas Vondra
Subject:         Re: index prefetching
Msg-id:          efac3238-6f34-41ea-a393-26cc0441b506@vondra.me
In response to:  Re: index prefetching (Tomas Vondra <tomas@vondra.me>)
List:            pgsql-hackers
Hi,

here's an improved (rebased + updated) version of the patch series, with some significant fixes and changes. The patch adds infrastructure and modifies btree indexes to do prefetching - and AFAIK it passes all tests (no crashes, correct results). There's still a fair amount of work to be done, of course - the btree changes are not very polished, more time needs to be spent on profiling and optimization, etc. And I'm sure that while the patch passes tests, there certainly are bugs.

Compared to the last patch version [1] shared on list (in November), there's a number of significant design changes, much of it based on off-list discussions I had with Peter Geoghegan, which were very helpful. Let me try to sum up the main conclusions and changes:

1) patch now relies on read_stream

The November patch still relied on sync I/O and PrefetchBuffer(). At some point I added a commit switching it to read_stream - which turned out non-trivial, especially for index-only scans. But it works, and for a while I kept it separate - with PrefetchBuffer first, and a switch to read_stream later. But then I realized it does not make much sense to keep the first part - why would we introduce a custom fadvise-based prefetch, only to immediately rip it out and replace it with read_stream code of comparable complexity, right? So I squashed these two parts, and the patch now does read_stream (for the table reads) from the beginning.

2) two new index AM callbacks - amgetbatch + amfreebatch

The [1] patch introduced a new callback for reading a "batch" (essentially a leaf page) from the index. But it had the limitation of allowing only a single batch at a time, which was causing trouble with prefetch distance, read_stream stalls at the end of the batch, etc.

Based on the discussions with Peter I decided to make this a bit more ambitious, and move the whole batch management from the index AM up to the indexam.c level. So now there are two callbacks - amgetbatch and amfreebatch - and it's up to indexam.c to manage the batches: decide how many batches to allow, when to free them, etc. The index AM is responsible merely for loading the next batch; it does not decide when to load or free a batch, how many to keep in memory, and so on. There's a section in indexam.c with a more detailed description of the design, so I'm not going to repeat all the details here.

In a way, this design is a compromise between the initial AM-level approach I presented as a PoC at pgconf.dev 2023, and the executor-level approach I shared a couple months back. Each of those "extreme" cases had its issues with happening either "too deep" or "too high" - being too integrated in the AM, or not having enough info about the AM. I think indexam.c is a sensible layer for this.

I was hoping that doing this at the "executor level" would mean no AM code changes would be needed, but that turned out not to be possible - the AM clearly needs to know about the batch boundaries, so that it can e.g. do killtuples, etc. That's why we need the two callbacks (not just the "amgetbatch" one). At least this way it's "hidden" by the indexam.c API, like index_getnext_slot().

(You could argue indexam.c is "executor" and maybe it is - I don't know where exactly to draw the line. I don't think it matters, really. The "hidden in indexam API" is the important bit.)
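To make the division of labor concrete, here's a rough sketch of what the two callbacks could look like - the struct, fields and signatures below are invented for illustration, the actual patch API may well differ:

#include "postgres.h"
#include "access/relscan.h"		/* IndexScanDesc */
#include "access/sdir.h"		/* ScanDirection */
#include "storage/itemptr.h"	/* ItemPointerData */

/*
 * Hypothetical batch - roughly one leaf page worth of TIDs. Once the
 * AM returns it, indexam.c owns it and decides when to free it.
 */
typedef struct IndexScanBatch
{
	BlockNumber		leafblk;	/* leaf page the batch was read from */
	int				nitems;		/* number of TIDs in the batch */
	ItemPointerData *items;		/* TIDs, in scan order */
	void		   *opaque;		/* AM-private state (kill bits, ...) */
} IndexScanBatch;

/*
 * amgetbatch - load the next batch, or NULL when the scan is done.
 * The AM does not decide when this is called, or how many batches are
 * kept in memory - that's up to indexam.c.
 */
typedef IndexScanBatch *(*amgetbatch_function) (IndexScanDesc scan,
												ScanDirection direction);

/*
 * amfreebatch - indexam.c is done with the batch; the AM may now apply
 * killed tuples to the leaf page, release pins, free memory, etc.
 */
typedef void (*amfreebatch_function) (IndexScanDesc scan,
									  IndexScanBatch *batch);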
3) btree prefetch

The patch implements the new callbacks only for btree indexes, and it's not very pretty / clean - it's mostly a massaged version of the old code backing amgettuple(). This needs cleanup/improvements, and maybe refactoring to allow reusing more of the code, etc.

Or maybe we should even rip out amgettuple() entirely, and only support one of the two APIs for each AM? That's what Peter suggested, but I'm not convinced we should do that. For now it was very useful to be able to flip between the APIs by setting a GUC, and I left prefetching disabled in some places (e.g. when accessing catalogs, ...) that are unlikely to benefit.

But more importantly, I'm not 100% sure we want to require index AMs to support prefetching in all cases - if we did, a single "can't prefetch" case would mean we can't prefetch anything for that AM. In particular, I'm thinking about GiST / SP-GiST and indexes ordered by distance, which don't return items straight from leaf pages but sort them through a binary heap. Maybe we can do prefetch for that, but if we can't, it would be silly if that meant we can't do prefetch for any other SP-GiST queries.

Anyway, the current patch only implements prefetch for btree. I expect it won't be difficult to do this for other index AMs, considering how similar their designs usually are to btree. This is one of the next things on my TODO - I want to validate that the design works for multiple AMs, not just btree.

4) duplicate blocks

While working on the patch, I realized the old index_fetch_heap code skips reads for duplicate blocks - if the TID matches the immediately preceding block, ReleaseAndReadBuffer() skips most of the work. But read_stream doesn't do that - if the callback returns the same block again, it starts a new read for it, pins it, etc. That can be quite expensive, and I've seen a couple cases where the impact was not negligible (correlated index, fits in memory, ...).

I've speculated that maybe read_stream_next_buffer() should detect and handle these cases better - not unlike how it detects sequential reads. It might even keep a small cache of already requested reads, so that it can handle a wider range of workloads, not just perfect duplicates. But it does not do that, and I'm not sure if/when that will happen.

So for now I simply reproduced the "skip duplicate blocks" behavior. It's not as simple with read_stream, because the logic needs to happen in two places - in the callback (when generating reads), and then also when reading the blocks from the stream - and if those two places get "out of sync", the stream won't return the blocks expected by the reader. But it does work, and it's not that complex. But there's an issue with prefetch distance - more on that in (5) below.
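First, for illustration, a minimal sketch of what the callback side of the dedup logic can look like - get_next_tid_block() and the StreamPos struct are invented stand-ins for however the patch walks the current batch, and the reader consuming the stream has to apply exactly the same skip rule, reusing the buffer it already holds for a duplicate TID:

#include "postgres.h"
#include "storage/block.h"
#include "storage/read_stream.h"

/* hypothetical callback state, invented for this sketch */
typedef struct StreamPos
{
	BlockNumber	last_block;		/* last block handed to the stream,
								 * initially InvalidBlockNumber */
	/* ... position in the current batch, etc. ... */
} StreamPos;

/* hypothetical helper: heap block of the next TID, in index order */
extern BlockNumber get_next_tid_block(StreamPos *pos);

/*
 * read_stream callback: return the next heap block to read, skipping
 * blocks equal to the immediately preceding one.
 */
static BlockNumber
next_block_cb(ReadStream *stream, void *callback_private_data,
			  void *per_buffer_data)
{
	StreamPos  *pos = (StreamPos *) callback_private_data;
	BlockNumber blkno;

	while ((blkno = get_next_tid_block(pos)) != InvalidBlockNumber)
	{
		/* duplicate - the reader will reuse its pinned buffer */
		if (blkno == pos->last_block)
			continue;

		pos->last_block = blkno;
		return blkno;			/* schedule a read for this block */
	}

	return InvalidBlockNumber;	/* no more blocks, stream ends */
}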
5) prefetch distance

Traditionally, we measure prefetch distance in "tuples" - e.g. in bitmap heap scan we make sure we've prefetched pages for X tuples ahead. But that's not how read_stream prefetching works - it works with pages. That can cause various issues.

Consider for example the "skip duplicate blocks" optimization described in (4), and imagine a perfectly correlated index with ~200 items per leaf page. The heap tuples are likely wider, so let's say we have 50 of them per heap page. That means each leaf page maps to only ~4 heap blocks. With effective_io_concurrency=16 the read_stream will try to prefetch 16 heap pages, that's 3200 index entries. Is that what we want? I'm not quite sure, maybe it's OK? It sure is not quite what I expected.

But now imagine an index-only scan on a nearly all-visible table. If the fraction of index entries that don't pass the visibility check is very low, we can quickly get into a situation where the read_stream has to read a lot of leaf pages just to find the next block number. Sure, we'd need to read that block eventually, but finding it this early means we have to keep the batch (leaf page) around - a lot of batches, actually. Essentially, pick a number and I can construct an IOS that needs to keep more batches than that.

I think this is a consequence of read_stream having an internal idea of how far ahead to prefetch, based on the number of requests it got so far, measured in heap blocks. It has no idea about the context (how that maps to index entries, batches we need to keep in memory, ...).

Ideally, we'd be able to give this feedback to read_stream somehow, say by "pausing" it when we get too far ahead in the index. But we don't have that - the only thing we can do is return InvalidBlockNumber to the stream, so that it stops. And then we need to "reset" the stream and let it continue - but only after we've consumed all the scheduled reads. In principle it's very similar to the "pause/resume" I mentioned, except that it requires completely draining the queue - a pipeline stall.

That's not great, but hopefully it's not very common, and more importantly - it only happens when only a tiny fraction of the index items requires a heap block. So that's what the patch does. I think it's acceptable, but some optimizations may be necessary (see the next section).
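Sketched out, the stall/resume cycle might look like this - too_many_batches() and the stalled flag are invented for illustration, while read_stream_reset() is the existing function for letting a finished stream continue:

#include "postgres.h"
#include "storage/read_stream.h"

/* invented state, extending the previous sketch with a stall flag */
typedef struct StreamPos
{
	bool	stalled;			/* did we stop the stream on purpose? */
	/* ... last_block, position in the current batch, ... */
} StreamPos;

/* hypothetical helpers, as in the previous sketch */
extern bool too_many_batches(StreamPos *pos);
extern BlockNumber get_next_tid_block(StreamPos *pos);

static BlockNumber
next_block_cb(ReadStream *stream, void *callback_private_data,
			  void *per_buffer_data)
{
	StreamPos  *pos = (StreamPos *) callback_private_data;

	/* too far ahead in the index? drain the stream (pipeline stall) */
	if (too_many_batches(pos))
	{
		pos->stalled = true;
		return InvalidBlockNumber;	/* stream ends after queued reads */
	}

	return get_next_tid_block(pos);
}

/*
 * Reader side, called when read_stream_next_buffer() returns
 * InvalidBuffer: if we stalled on purpose, reset the stream and keep
 * going; otherwise the scan is really done.
 */
static bool
resume_after_stall(StreamPos *pos, ReadStream *stream)
{
	if (!pos->stalled)
		return false;			/* genuinely exhausted */

	pos->stalled = false;
	read_stream_reset(stream);	/* stream may generate reads again */
	return true;
}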
6) performance and optimization

It's not difficult to construct cases where the prefetching is a huge improvement - a 5-10x speedup for a query is common, depending on the hardware, dataset, etc. But there are also cases where it doesn't (and can't) help very much, for example fully-cached data, or index-only scans of all-visible tables.

I've done some basic benchmarking (I'll share the results in the coming days), and in various cases I see a consistent regression in the 10-20% range. The queries are very short (~1ms) and there's a fair amount of noise, but it seems fairly consistent. I haven't figured out the root cause(s) yet, but I believe there's a couple of contributing factors:

(a) read_stream adds a bit of complexity/overhead, but these cases worked great with just the sync API, and can't benefit from the prefetching anyway.

(b) There are inefficiencies in how I integrated read_stream into the btree AM. For example, every batch allocates the same buffer as btbeginscan did, which turned out to be an issue before [2] - and now we do that for every batch, not just once per scan - that's not great.

(c) Possibly the prefetch distance issue from (5) might matter too.

regards

[1] https://www.postgresql.org/message-id/accd03eb-0379-416d-9936-41a4de3c47ef%40vondra.me
[2] https://www.postgresql.org/message-id/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com

--
Tomas Vondra