Re: index prefetching - Mailing list pgsql-hackers
| From | Alexandre Felipe |
|---|---|
| Subject | Re: index prefetching |
| Msg-id | CAE8JnxPqNGReQWc+-fyprdOXwqODymgoxcxvNrC0W+AJRXpRfQ@mail.gmail.com |
| In response to | Re: index prefetching (Tomas Vondra <tomas@vondra.me>) |
| List | pgsql-hackers |
Hi all,
I did some experiments on a few questions that came up in this thread, in the hope that they are useful.
DISTANCE CONTROL
I tested different strategies for increasing the distance: 2*d, 2*d+1, d+2, d+4, and so on. Intuitively, d + io_combine_limit would make sense, but in the end 2*d gives the best results across different patterns, e.g. (h{200}m{200}), which seems a more realistic pattern since previous scans would have loaded blocks in runs. These strategies are fundamentally the same: as I posted earlier with a Markov model, the limit will be something like max_distance * sigmoid(h * (p - p0)); what changes is the transient when we go in and out of a cached region.
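As a rough illustration, here is a toy model (my own construction, not the patch's code) of how the average distance behaves under those growth rules, assuming the distance grows on a miss, decays by one on a hit (roughly read_stream's back-off), and is capped:

```python
# Toy model: compare distance growth rules on a repeating pattern such
# as (h{200}m{200}), where 'h' is a cache hit and 'm' a miss.
# Assumptions (mine): growth on a miss, decay by one on a hit, and a
# hard cap at MAX_DISTANCE.

MAX_DISTANCE = 128

def step(d, miss, grow):
    # Grow the distance on a miss, back off by one on a hit.
    return min(grow(d), MAX_DISTANCE) if miss else max(d - 1, 1)

def average_distance(grow, pattern, steps=2000):
    d, total = 1, 0
    for i in range(steps):
        d = step(d, pattern[i % len(pattern)] == 'm', grow)
        total += d
    return total / steps

pattern = 'h' * 200 + 'm' * 200
for name, grow in [('2*d', lambda d: 2 * d), ('2*d+1', lambda d: 2 * d + 1),
                   ('d+2', lambda d: d + 2), ('d+4', lambda d: d + 4)]:
    print(f'{name:6s} {average_distance(grow, pattern):6.1f}')
```

Under this model the multiplicative rules saturate within a few misses, while the additive ones spend most of the miss run still ramping up, which matches 2*d winning on long miss runs.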
LIMITING PREFETCH
To avoid prefetch waste with a Limit node, wouldn't it make sense for the executor to send down an estimate of how many rows will still be required? A strict lower bound would be limit (+ offset) - returned, but a selectivity factor becomes important when most rows are removed; a good start would be (limit - filtered) * selectivity. A simple model for the selectivity would be (unfiltered + 1) / (filtered + 1): the +1 accounts for the uncertainty about whether the next row gets filtered, avoids a division by zero for the first row, and naturally caps the distance to the limit when the query starts.
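A sketch of that formula (my own reading of it, taking "filtered" as rows that survived the qual so far and "unfiltered" as all rows scanned; these names and the rounding are assumptions, not executor code):

```python
# Hedged sketch of the proposed LIMIT-aware prefetch budget.
# filtered   -- rows that passed the filter so far (assumed meaning)
# unfiltered -- rows scanned so far, before filtering (assumed meaning)

def prefetch_budget(limit, filtered, unfiltered):
    # (unfiltered + 1) / (filtered + 1): scanned rows per surviving row;
    # the +1 models the uncertainty about the next row and avoids a
    # division by zero on the first row.
    selectivity = (unfiltered + 1) / (filtered + 1)
    return max(0, round((limit - filtered) * selectivity))

# At query start the budget is exactly the limit:
print(prefetch_budget(10, 0, 0))    # 10
# Half the rows survive the filter -> budget roughly doubles the remainder:
print(prefetch_budget(10, 4, 8))    # 11
```

Once `filtered` reaches the limit the budget drops to zero, so the prefetcher stops issuing IOs the Limit node will never consume.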
I/O REORDERING
I did an experiment reordering the heap accesses, following a zig-zag pattern, using constant memory, like this:

fill heap-up with W blocks
direction = up
read-block = pop from heap-up
for each new-block:
    if new-block > read-block:
        push new-block into heap-up
    else:
        push new-block into heap-down
    if heap-up is empty:
        direction = down
    if heap-down is empty:
        direction = up
    if direction = up:
        read-block = pop from heap-up
    else:
        read-block = pop from heap-down
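The pseudocode can be modeled stand-alone in Python with two heaps (my sketch, not executor code; heap-down is a max-heap via negated keys, and I added a drain phase for when the input runs out):

```python
import heapq

# Stand-alone model of the zig-zag reorder: heap-up is a min-heap for
# blocks ahead of the cursor, heap-down a max-heap (negated keys) for
# blocks behind it; together they hold at most W elements.

def zigzag_reorder(blocks, W):
    blocks = iter(blocks)
    up, down = [], []                 # min-heap / negated max-heap
    for b in blocks:                  # fill heap-up with W blocks
        heapq.heappush(up, b)
        if len(up) == W:
            break
    if not up:
        return
    going_up = True
    read = heapq.heappop(up)
    yield read
    for b in blocks:
        if b > read:
            heapq.heappush(up, b)
        else:
            heapq.heappush(down, -b)
        if not up:                    # nothing ahead: turn around
            going_up = False
        if not down:                  # nothing behind: sweep up again
            going_up = True
        read = heapq.heappop(up) if going_up else -heapq.heappop(down)
        yield read
    while up or down:                 # drain once the input is exhausted
        if not up:
            going_up = False
        if not down:
            going_up = True
        read = heapq.heappop(up) if going_up else -heapq.heappop(down)
        yield read

print(list(zigzag_reorder([5, 1, 9, 3, 7, 2, 8, 4, 6, 0], 4)))
# [1, 3, 5, 7, 8, 9, 6, 4, 2, 0]
```

The output is a permutation of the input that sweeps upward while blocks keep landing ahead of the cursor, then turns around and sweeps back down.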
The heap-up and heap-down will contain at most W elements, so they can be stored in complementary regions of the same array.
With a small adjustment (moving one element to the heap behind when the new-block is ahead) we ensure that the delay added to any block does not exceed the buffer length, and we can restore index order with another heap. This gives an algorithm with O(W) memory and O(W * log(W)) complexity, asymptotically constant in the table size, and it will reduce the total seek distance to roughly 2/W of the original.
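A back-of-the-envelope check of the seek-distance claim, using my own one-directional simplification of the zig-zag: classic replacement selection, which emits ascending runs of expected length about 2W from a W-element heap (the zig-zag alternates direction instead of jumping back between runs, so it should do somewhat better):

```python
import heapq
import random

# Compare the total seek distance of a random block sequence against
# the same sequence reordered with a W-element heap (replacement
# selection). This is a simplification I chose for illustration, not
# the exact two-heap zig-zag.

def seek_distance(seq):
    return sum(abs(b - a) for a, b in zip(seq, seq[1:]))

def run_reorder(blocks, W):
    cur, nxt, out = [], [], []
    it = iter(blocks)
    for b in it:                      # fill the heap with W blocks
        heapq.heappush(cur, b)
        if len(cur) == W:
            break
    for b in it:
        last = heapq.heappop(cur)     # emit the smallest buffered block
        out.append(last)
        if b >= last:                 # still fits in the current run
            heapq.heappush(cur, b)
        else:                         # too small: defer to the next run
            heapq.heappush(nxt, b)
        if not cur:
            cur, nxt = nxt, []
    while cur or nxt:                 # drain the remaining blocks
        if not cur:
            cur, nxt = nxt, []
        out.append(heapq.heappop(cur))
    return out

random.seed(1)
blocks = [random.randrange(100_000) for _ in range(5_000)]
print(seek_distance(blocks), seek_distance(run_reorder(blocks, 64)))
```

On uniform random input this cuts the total seek distance by more than an order of magnitude for W = 64, in the same ballpark as the 2/W estimate.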
Creating a ~400MB table
CREATE TABLE zigzag_test (
id serial, random_key int, data text,
pad1 char(2000), pad2 char(2000), pad3 char(2000)
);
SELECT setseed(0.42);
INSERT INTO zigzag_test (random_key, data, pad1, pad2, pad3)
SELECT (random() * 100000)::int, 'data_' || g, 'x', 'x', 'x'
FROM generate_series(1, 5000) g;
Then accessing 5% of it using an index scan (or the zigzag order, which breaks the index order):
SELECT sum(id), sum(abs(blk - prev_blk)) AS seek_dist FROM (
  SELECT id, (ctid::text::point)[0] AS blk,
         lag((ctid::text::point)[0]) OVER () AS prev_blk
  FROM zigzag_test WHERE random_key < 50000
) t;
Measuring time +/- std-dev, using the I(ndex) order access and the Z(igzag) order (not exactly the algorithm described above):
| Disk | Prefetch | I order (ms) | Z order (ms) | Diff % | Seek (I/Z) |
|---|---|---|---|---|---|
| HDD | on | 250.4 +/- 227.8 | 168.7 +/- 83.0 | -32.6% | 78281/1242 |
| HDD | off | 185.8 +/- 269.9 | 141.9 +/- 77.4 | -23.6% | 78281/1242 |
| SSD | on | 148.2 +/- 69.7 | 117.6 +/- 39.6 | -20.7% | 78281/1242 |
| SSD | off | 66.8 +/- 34.1 | 121.5 +/- 39.7 | +81.8% | 78281/1242 |
The tricky datapoint (and the one with the highest statistical significance) here is SSD without prefetch in index order.
On small rows and not-so-sparse indexes this might also help with pinning/unpinning overhead (??), but for that case we already have bitmap scans.
Regards,
Alexandre
On Wed, Feb 18, 2026 at 3:39 PM Tomas Vondra <tomas@vondra.me> wrote:
On 2/18/26 05:21, Andres Freund wrote:
> Hi,
>
> On 2026-02-17 22:36:53 +0100, Tomas Vondra wrote:
>> On 2/17/26 21:16, Peter Geoghegan wrote:
>>> On Tue, Feb 17, 2026 at 2:27 PM Andres Freund <andres@anarazel.de> wrote:
>>>> On 2026-02-17 12:16:23 -0500, Peter Geoghegan wrote:
>>>>> On Mon, Feb 16, 2026 at 11:48 AM Andres Freund <andres@anarazel.de> wrote:
>>>>> I agree that the current heuristics (which were invented recently) are
>>>>> too conservative. I overfit the heuristics to my current set of
>>>>> adversarial queries, as a stopgap measure.
>>>>
>>>> Are you doing any testing on higher latency storage? I found it to be quite
>>>> valuable to use dm_delay to have a disk with reproducible (i.e. not cloud)
>>>> higher latency (i.e. not just a local SSD).
>>>
>>> I sometimes use dm_delay (with the minimum 1ms delay) when testing,
>>> but don't do so regularly. Just because it's inconvenient to do so
>>> (perhaps not a great reason).
>>>
>>>> Low latency NVMe can reduce the
>>>> penalty of not enough readahead so much that it's hard to spot problems...
>>>
>>> I'll keep that in mind.
>>>
>>
>> So, what counts as "higher latency" in this context? What delays should
>> we consider practical/relevant for testing?
>
> 0.5-4ms is the range I've seen in various clouds across their reasonable
> storage products (i.e. not spinning disks or other ver bulk oriented things).
>
> Unfortunately dm_delay doesn't support < 1ms delays, but it's still much
> better than nothing.
>
> I've been wondering about teaching AIO to delay IOs (by adding a sleep to
> workers and linking a IORING_OP_TIMEOUT submission with the actually intended
> IO) to allow testing smaller delays.
>
Could be useful testing facility, if it's done in a way that does not
limit the IO concurrency (i.e. the delay should probably be when
consuming the IO, depending on the timestamp of the IO start).
>
>>> That would make sense. You can already tell when that's happened by
>>> comparing the details shown by EXPLAIN ANALYZE against the same query
>>> execution on master, but that approach is inconvenient. Automating my
>>> microbenchmarks has proven to be important with this project. There's
>>> quite a few competing considerations, and it's too easy to improve one
>>> query at the cost of regressing another.
>>>
>>
>> What counts as "unconsumed IO"? The IOs the stream already started, but
>> then did not consume? That shouldn't be hard, I think.
>
> Yes, the number of IOs that were started but not consumed. Or, even better,
> the number of IOs that completed but were not consumed - but that'd be harder
> to get right now.
>
> I agree that started-but-not-consumed should be pretty easy.
>
I'll try to add it to the EXPLAIN.
regards
--
Tomas Vondra