Re: index prefetching - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: index prefetching
Msg-id 673bc085-4c2d-428c-8541-06863f4327fb@vondra.me
In response to Re: index prefetching  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers

On 8/28/25 21:52, Andres Freund wrote:
> Hi,
> 
> On 2025-08-28 19:08:40 +0200, Tomas Vondra wrote:
>> On 8/28/25 18:16, Andres Freund wrote:
>>>> So I think the IPC overhead with "worker" can be quite significant,
>>>> especially for cases with distance=1. I don't think it's a major issue
>>>> for PG18, because seq/bitmap scans are unlikely to collapse the distance
>>>> like this. And with larger distances the cost amortizes. It's much
>>>> bigger issue for the index prefetching, it seems.
>>>
>>> I couldn't keep up with all the discussion, but is there actually valid I/O
>>> bound cases (i.e. not ones were we erroneously keep the distance short) where
>>> index scans end can't have a higher distance?
>>>
>>
>> I don't know, really.
>>
>> Is the presented example really a case of an "erroneously short
>> distance"?
> 
> I think the query isn't actually measuring something particularly useful in
> the general case. You're benchmarking something where the results are never
> looked at - which means the time between two index fetches is unrealistically
> short. That means any tiny latency increase matters a lot more than with
> realistic queries.
> 

Sure, it is a "microbenchmark" focusing on index scans.

The point of not looking at the result is to isolate the index scan, and
it's definitely true that if the query did some processing (e.g. feeding
it into an aggregate or something), the relative difference would be
smaller. But the absolute difference would likely remain about the same.
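To put some entirely made-up numbers on that (the constants below are
just for illustration, not measurements):

```python
# Made-up numbers, purely to illustrate relative vs. absolute overhead.
scan_ms = 100.0        # time spent in the index scan itself
overhead_ms = 25.0     # hypothetical IPC overhead added on top
processing_ms = 200.0  # extra per-query work a realistic query might do

micro = overhead_ms / scan_ms                        # isolated scan
realistic = overhead_ms / (scan_ms + processing_ms)  # scan + processing

print(f"microbenchmark: +{micro:.0%}, realistic query: +{realistic:.0%}")
# -> microbenchmark: +25%, realistic query: +8%
```

The relative regression shrinks once the query does real work, but the
25 ms of overhead is still there either way.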

I don't think submitting the I/O and then not waiting long enough
before actually reading the block is a significant factor here. The
regression shows up even in the "warm" runs (which do no actual I/O),
and most of the difference seems to match the IPC cost. And AFAICS that
cost does not change as the delay increases, we still need to send two
signals.
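For a rough feel for what such a two-signal round trip costs, here's a
standalone sketch (not PostgreSQL code, just an analogy for the
"worker" io_method at distance ~1): a forked "worker" is woken with
SIGUSR1 and signals completion back, and we time the round trip.

```python
# Measure the latency of a two-signal IPC round trip between two
# processes - a rough stand-in for waking an I/O worker and being
# woken back for each I/O when the prefetch distance collapses.
import os
import signal
import time

ROUNDS = 1000
SIG = signal.SIGUSR1

# Block SIGUSR1 before forking so both processes can use sigwait()
# race-free (a signal arriving early just stays pending).
signal.pthread_sigmask(signal.SIG_BLOCK, {SIG})

parent = os.getpid()
pid = os.fork()
if pid == 0:
    # "I/O worker": wait to be woken, then signal completion back.
    while True:
        signal.sigwait({SIG})
        os.kill(parent, SIG)

start = time.monotonic()
for _ in range(ROUNDS):
    os.kill(pid, SIG)      # "submission": wake the worker
    signal.sigwait({SIG})  # "completion": wait for its reply
elapsed = time.monotonic() - start

os.kill(pid, signal.SIGKILL)
os.waitpid(pid, 0)
print(f"{elapsed / ROUNDS * 1e6:.1f} us per two-signal round trip")
```

A few microseconds per round trip is cheap in absolute terms, but it's
paid once per I/O, which adds up when there's almost no work between
two heap fetches.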

> And this is, IIUC, on a local SSD. I'd bet that on cloud latencies AIO would
> still be a huge win.
> 

True, but only for cold runs that actually do I/O. The results for the
warm runs show regressions too, although smaller ones. And that would
affect any kind of storage (with buffered I/O).

Also, I'm not sure "on slow storage it does not regress" is a very
strong argument ;-)

> 
>> From the 2x regression (compared to master) it might seem like that, but
>> even with the increased distance it's still slower than master (by 25%). So
>> maybe the "error" is to use AIO in these cases, instead of just switching to
>> I/O done by the backend.
> 
> If it's slower at a higher distance, we're missing something.
> 

There's one weird thing I just realized - I don't think I ever saw more
than a single I/O worker consuming CPU (in top), even with the higher
distance. I'm not 100% sure about it, need to check tomorrow.

IIRC the CPU utilization with "collapsed" distance ~2.0 was about

  backend: 60%
  ioworker: 40%

and with the patches increasing the distance it was more like

  backend: 100%
  ioworker: 50%

But I think it was still just one ioworker. I wonder if that's OK,
intentional, or if it might be an issue ...

> 
>> It may be a bit worse for non-btree indexes, e.g. for ordered scans
>> on gist indexes (getting the next tuple may require reading many leaf
>> pages, so maybe we can't look too far ahead?). Or for indexes with
>> naturally "fat" tuples, which limits how many tuples we see ahead.
> 
> I am not worried at all about those cases. If you have to read a lot of index
> leaf pages to get a heap fetch, a distance of even just 2 will be fine,
> because the IPC overhead is a negligible cost compared to the index
> processing. Similarly, if you have to do very deep index traversals due to
> wide index tuples, there's going to be more time between two table fetches.
> 

Most likely, yes.
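A toy model (constants entirely made up) of that amortization: the IPC
cost of waking an I/O worker is roughly fixed per I/O, so its relative
weight depends on how much index processing happens between two heap
fetches.

```python
# Toy model, made-up constants: a fixed per-I/O IPC cost matters a lot
# for cheap index tuples and is noise for expensive traversals.
ipc_us = 20.0  # hypothetical cost of the two-signal round trip

for index_work_us in (5.0, 50.0, 500.0):
    share = ipc_us / (ipc_us + index_work_us)
    print(f"{index_work_us:5.0f} us of index work -> IPC is {share:.0%} of the total")
# ->     5 us of index work -> IPC is 80% of the total
#       50 us of index work -> IPC is 29% of the total
#      500 us of index work -> IPC is 4% of the total
```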

> 
>>> Obviously you can construct cases with a low distance by having indexes point
>>> to a lot of tiny tuples pointing to perfectly correlated pages, but in that
>>> case IO can't be a significant factor.
>>>
>>
>> It's definitely true the examples the script finds are "adversarial",
>> but also not entirely unrealistic.
> 
> I think doing index scans where the results are just thrown out are entirely
> unrealistic...
> 

True, it's a microbenchmark focused on a specific operation. But I
don't think that makes it unrealistic, even though the impact on
real-world queries will be smaller. But I know what you mean.

> 
>> I suppose there will be such cases for any heuristics we come up with.
> 
> Agreed.
> 
> 
>> There's probably more cases like this, where we end up with many hits.
>> Say, a merge join may visit index tuples repeatedly, and so on. But then
>> it's likely in shared buffers, so there won't be any IPC.
> 
> Yea, I'd not expect a meaningful impact of any of this in a workload like
> that.
> 


regards

-- 
Tomas Vondra



