Re: index prefetching - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: index prefetching |
| Date | |
| Msg-id | 95b6c6ef-4ba1-44f6-b1ac-eef0bbb5327a@vondra.me |
| In response to | Re: index prefetching (Konstantin Knizhnik <knizhnik@garret.ru>) |
| List | pgsql-hackers |
On 12/28/25 21:30, Konstantin Knizhnik wrote:
>
> On 28/12/2025 8:08 PM, Tomas Vondra wrote:
>> On 12/25/25 16:39, Konstantin Knizhnik wrote:
>>> On 21/12/2025 7:55 PM, Peter Geoghegan wrote:
>>>> On Wed, Dec 10, 2025 at 9:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
>>>>> Attached is v4.
>>>> Attached is v5. Changes from v4:
>>>>
>>>> * Simplified and optimized index-only scans, with a particular
>>>> emphasis on avoiding regressions with nested loop joins with an inner
>>>> index-only scan.
>>>>
>>>> There were quite a number of small problems/dead code related to
>>>> index-only scans fixed by this new v5. Overall, I'm quite a bit
>>>> happier with the state of index-only scans, which I'd not paid too
>>>> much attention to before now.
>>>>
>>>> * Added Valgrind instrumentation to the hash index patch, which was
>>>> required to fix some false positives.
>>>>
>>>> The generic indexam_util_batch_unlock routine had Valgrind
>>>> instrumentation in earlier versions, just to keep nbtree's buffer
>>>> locking checks from generating similar false positives. Some time
>>>> later, when I added the hashgetbatch patch, there were new Valgrind
>>>> false positives during hash index scans -- which I missed at first.
>>>> This new v5 revision adds similar Valgrind checks to hash itself
>>>> (more or less a direct port of the instrumentation added to nbtree
>>>> by commit 4a70f829), which fixes the false positives and is
>>>> independently useful.
>>>>
>>>> The rule for amgetbatch-based index AMs is that they must have similar
>>>> buffer locking instrumentation. That seems like a good thing.
>>>>
>>>> --
>>>> Peter Geoghegan
>>> In the previous mail I shared results of my experiments with different
>>> prefetch distances.
>>> I think we should start prefetching heap tuples not from the second
>>> batch, but after some number of processed TIDs.
>>>
>>> Attached please find a patch which implements this approach.
>>> And below are updated results:
>>>
>>> limit\prefetch     on     off  always     inc  threshold
>>>    1            12074   12765    3146    3282    12394
>>>    2             5912    6198    2463    2438     6124
>>>    4             2919    3047    1334    1964     2910
>>>    8             1554    1496    1166    1409     1588
>>>   16              815     775     947     940      600
>>>   32              424     403     687     695      478
>>>   64              223     208     446     453      358
>>>  128              115     106     258     270      232
>>>  256               68      53     138     149      131
>>>  512               43      27      72      78       71
>>> 1024               28      13      38      40       38
>>>
>>> The last column is the result of prefetching with read_stream_threshold=10.
>>>
>> That's great, but it only works for cases that can (and do) benefit from
>> the prefetching. Try running the benchmark with a data set that fits
>> into shared buffers (or RAM), which makes prefetching useless.
>>
>> I tried that with your test, comparing master, v5 and v5 + your
>> read_stream_threshold patch. See the attached run.sh script, and the PDF
>> summarizing the results. The last two column groups are comparisons to
>> master, with green=improvement, red=regression. There are no actual
>> improvements (1% delta is just noise). But the read_stream_threshold
>> results have a clear pattern of pretty massive (20-30%) regressions.
>>
>> The difference between v5 and v5-threshold is pretty clear.
>>
>> IIRC cases like this are *exactly* why we ended up with the current
>> heuristics, enabling prefetching only from the second batch. This
>> removes the risk of an expensive read_stream init for very fast queries
>> that wouldn't benefit from it. Of course, prefetching may be useless for
>> later batches too (e.g. if all the data is cached), but the query will
>> be expensive enough for the read_stream init cost to be negligible.
>>
>> To put this differently, the more aggressive the heuristic is (enabling
>> prefetching in more cases), the more likely it is to cause regressions.
>> We've chosen to be more defensive, i.e. to sacrifice some possible gains
>> in order to not regress plausible workloads. I hope we agree queries on
>> fully cached "hot" data are pretty common / important.
>>
>> We can probably do better in the future. But we'll never know for sure
>> whether a given scan benefits from prefetching. It's not just about the
>> number of items in the batch, but also about how many heap pages that
>> translates to, what the I/O pattern is (random vs. sequential), and how
>> many of those pages are already cached. For some queries we don't even
>> know how many items we'll actually need. We can't check all that at the
>> very beginning, because it's simply prohibitively expensive.
>
>
> I tried to reproduce your results, but on a Mac I do not see any
> noticeable difference for 250k records, fillfactor=10, and 4GB shared
> buffers between `enable_indexscan_prefetch=false` and
> `enable_indexscan_prefetch=true`.
> I can't believe that just adding these checks in `heap_batch_advance_pos`
> can cause a 75% performance degradation (because for limit < 10, no read
> stream is initialized, yet we still lose 25% somewhere).
>
> I just commented out this fragment of code in heapam_handler.c:
>
> #if 0
>     proceed_items = ScanDirectionIsForward(direction)
>         ? pos->item - batch->firstItem
>         : batch->lastItem - pos->item;
>
>     /* Delay initializing the stream until enough items were processed */
>     if (proceed_items >= read_stream_threshold
>         && !scan->xs_heapfetch->rs
>         && !scan->batchqueue->disabled
>         && !scan->xs_want_itup /* XXX prefetching disabled for IoS, for now */
>         && enable_indexscan_prefetch)
>     {
>         scan->xs_heapfetch->rs =
>             read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
>                                        scan->heapRelation, MAIN_FORKNUM,
>                                        scan->heapRelation->rd_tableam->index_getnext_stream,
>                                        scan, 0);
>     }
> #endif
>
> ... and see no difference.
>
> I can understand why initializing the read stream earlier (not at the
> second batch, but after 10 processed items) may have a negative impact
> on performance when all data is present in shared buffers, for LIMIT >= 10.
> But how can that happen with LIMIT 1 and the fragment above commented
> out? There is nothing else in my patch except adding the GUC.
> So I think it is some "external" factor, and I wonder if you can
> reproduce these results (just the first line).
>
It seems this is due to sending an extra SET (for the new GUC) in the
pgbench script, which is recognized only by the v5+threshold build.
That's a thinko on my side; I should have realized the extra command
might affect the results. It doesn't actually change the behavior,
because 10 is the default value for read_stream_threshold anyway. I've
fixed the script and will check fresh results tomorrow.
I think most of what I said about the heuristics for when to initialize
the read stream, and the risk/benefit tradeoff, still applies.
regards
--
Tomas Vondra