Re: index prefetching - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: 6d59c277-c440-4d1f-a46e-157958c06a5f@vondra.me
In response to: Re: index prefetching (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On 8/28/25 18:16, Andres Freund wrote:
> Hi,
>
> On 2025-08-28 14:45:24 +0200, Tomas Vondra wrote:
>> On 8/26/25 17:06, Tomas Vondra wrote:
>>
>> I kept thinking about this, and in the end I decided to try to measure
>> this IPC overhead. The backend/ioworker communicate by sending signals,
>> so I wrote a simple C program that does a "signal echo" with two
>> processes (one fork). It works like this:
>>
>> 1) fork a child process
>> 2) send a signal to the child
>> 3) child notices the signal, sends a response signal back
>> 4) after receiving the response, go back to (2)
>
> Nice!
>
> I think this might under-estimate the IPC cost a bit, because typically
> the parent and child process do not want to run at the same time,
> probably leading to them often being scheduled on the same core, whereas
> a shallow IO queue will lead to some concurrent activity, just not
> enough to hide the IPC latency... But I don't think this matters in the
> grand scheme of things.

Right. I thought about measuring this (different cores, different NUMA
nodes, maybe adding some sleeps to simulate "idle"), but I chose to keep
it simple for now.

>> So I think the IPC overhead with "worker" can be quite significant,
>> especially for cases with distance=1. I don't think it's a major issue
>> for PG18, because seq/bitmap scans are unlikely to collapse the
>> distance like this. And with larger distances the cost amortizes. It
>> seems a much bigger issue for index prefetching.
>
> I couldn't keep up with all the discussion, but are there actually valid
> I/O-bound cases (i.e. not ones where we erroneously keep the distance
> short) where index scans can't have a higher distance?

I don't know, really. Is the presented example really a case of an
"erroneously short distance"? From the 2x regression (compared to
master) it might seem like that, but even with the increased distance
it's still slower than master (by 25%).
So maybe the "error" is to use AIO in these cases, instead of just
switching to I/O done by the backend. It may be a bit worse for
non-btree indexes, e.g. for ordered scans on GiST indexes (getting the
next tuple may require reading many leaf pages, so maybe we can't look
too far ahead?). Or for indexes with naturally "fat" tuples, which limit
how many tuples we can see ahead.

> Obviously you can construct cases with a low distance by having indexes
> point to a lot of tiny tuples pointing to perfectly correlated pages,
> but in that case IO can't be a significant factor.

It's definitely true the examples the script finds are adversarial, but
also not entirely unrealistic. I suppose there will be such cases for
any heuristic we come up with. There are probably more cases like this,
where we end up with many hits. Say, a merge join may visit index tuples
repeatedly, and so on. But then the data is likely in shared buffers, so
there won't be any IPC.

regards

--
Tomas Vondra