Re: index prefetching - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: index prefetching
Date
Msg-id 6d59c277-c440-4d1f-a46e-157958c06a5f@vondra.me
In response to Re: index prefetching  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 8/28/25 18:16, Andres Freund wrote:
> Hi,
> 
> On 2025-08-28 14:45:24 +0200, Tomas Vondra wrote:
>> On 8/26/25 17:06, Tomas Vondra wrote:
>> I kept thinking about this, and in the end I decided to try to measure
>> this IPC overhead. The backend/ioworker communicate by sending signals,
>> so I wrote a simple C program that does "signal echo" with two processes
>> (one fork). It works like this:
>>
>> 1) fork a child process
>> 2) send a signal to the child
>> 3) child notices the signal, sends a response signal back
>> 4) after receiving response, go back to (2)
> 
> Nice!
> 
> I think this might under-estimate the IPC cost a bit, because typically the
> parent and child process do not want to run at the same time, probably leading
> to them often being scheduled on the same core. Whereas a shallow IO queue
> will lead to some concurrent activity, just not enough to hide the IPC
> latency...   But I don't think this matters in the grand scheme of things.
> 

Right. I thought about measuring this stuff (different cores, different
NUMA nodes, maybe adding some sleeps to simulate "idle"), but I chose to
keep it simple for now.

> 
>> So I think the IPC overhead with "worker" can be quite significant,
>> especially for cases with distance=1. I don't think it's a major issue
>> for PG18, because seq/bitmap scans are unlikely to collapse the distance
>> like this. And with larger distances the cost amortizes. It's a much
>> bigger issue for the index prefetching, it seems.
> 
> I couldn't keep up with all the discussion, but are there actually valid I/O
> bound cases (i.e. not ones where we erroneously keep the distance short) where
> index scans can't have a higher distance?
> 

I don't know, really.

Is the presented example really a case of an "erroneously short
distance"? Judging by the 2x regression (compared to master) it might
seem like that, but even with the increased distance it's still slower
than master (by 25%). So maybe the "error" is using AIO in these cases
at all, instead of just switching to synchronous I/O done by the backend.

It may be a bit worse for non-btree indexes, e.g. for ordered scans
on gist indexes (getting the next tuple may require reading many leaf
pages, so maybe we can't look too far ahead?). Or for indexes with
naturally "fat" tuples, which limit how many tuples we can see ahead.

> Obviously you can construct cases with a low distance by having indexes point
> to a lot of tiny tuples pointing to perfectly correlated pages, but in that
> case IO can't be a significant factor.
> 

It's definitely true the examples the script finds are adversarial, but
also not entirely unrealistic. I suppose there will be such cases for
any heuristic we come up with.

There are probably more cases like this, where we end up with many hits.
Say, a merge join may visit index tuples repeatedly, and so on. But then
the data is likely in shared buffers, so there won't be any IPC.


regards

-- 
Tomas Vondra



