Re: index prefetching - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: index prefetching
Date
Msg-id 6d59c277-c440-4d1f-a46e-157958c06a5f@vondra.me
In response to Re: index prefetching  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 8/28/25 18:16, Andres Freund wrote:
> Hi,
> 
> On 2025-08-28 14:45:24 +0200, Tomas Vondra wrote:
>> On 8/26/25 17:06, Tomas Vondra wrote:
>> I kept thinking about this, and in the end I decided to try to measure
>> this IPC overhead. The backend/ioworker communicate by sending signals,
>> so I wrote a simple C program that does "signal echo" with two processes
>> (one fork). It works like this:
>>
>> 1) fork a child process
>> 2) send a signal to the child
>> 3) child notices the signal, sends a response signal back
>> 4) after receiving response, go back to (2)
> 
> Nice!
> 
> I think this might under-estimate the IPC cost a bit, because typically the
> parent and child process do not want to run at the same time, probably leading
> to them often being scheduled on the same core. Whereas a shallow IO queue
> will lead to some concurrent activity, just not enough to hide the IPC
> latency...   But I don't think this matters in the grand scheme of things.
> 

Right. I thought about measuring this stuff (different cores, different
NUMA nodes, maybe adding some sleeps to simulate "idle"), but I chose to
keep it simple for now.

> 
>> So I think the IPC overhead with "worker" can be quite significant,
>> especially for cases with distance=1. I don't think it's a major issue
>> for PG18, because seq/bitmap scans are unlikely to collapse the distance
>> like this. And with larger distances the cost amortizes. It's a much
>> bigger issue for the index prefetching, it seems.
> 
> I couldn't keep up with all the discussion, but are there actually valid I/O
> bound cases (i.e. not ones where we erroneously keep the distance short) where
> index scans can't have a higher distance?
> 

I don't know, really.

Is the presented example really a case of an "erroneously short
distance"? Judging by the 2x regression (compared to master) it might
seem like that, but even with the increased distance it's still slower
than master (by 25%). So maybe the "error" is using AIO in these cases
at all, instead of just switching to synchronous I/O done by the backend.

It may be a bit worse for non-btree indexes, e.g. for ordered scans
on gist indexes (getting the next tuple may require reading many leaf
pages, so maybe we can't look too far ahead?). Or for indexes with
naturally "fat" tuples, which limit how many tuples we can see ahead.

> Obviously you can construct cases with a low distance by having indexes point
> to a lot of tiny tuples pointing to perfectly correlated pages, but in that
> case IO can't be a significant factor.
> 

It's definitely true the examples the script finds are adversarial, but
also not entirely unrealistic. I suppose there will be such cases for
any heuristic we come up with.

There are probably more cases like this, where we end up with many hits.
Say, a merge join may visit index tuples repeatedly, and so on. But then
the data is likely in shared buffers, so there won't be any IPC.


regards

-- 
Tomas Vondra



