Re: index prefetching - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: index prefetching
Date
Msg-id cffa6e3e-f9c4-468f-b218-a3a5e172c226@vondra.me
Whole thread Raw
In response to Re: index prefetching  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 7/23/25 17:09, Andres Freund wrote:
> Hi,
> 
> On 2025-07-23 14:50:15 +0200, Tomas Vondra wrote:
>> On 7/23/25 02:59, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2025-07-23 02:50:04 +0200, Tomas Vondra wrote:
>>>> But I don't see why would this have any effect on the prefetch distance,
>>>> queue depth etc. Or why decreasing INDEX_SCAN_MAX_BATCHES should improve
>>>> that. I'd have expected exactly the opposite behavior.
>>>>
>>>> Could be bug, of course. But it'd be helpful to see the dataset/query.
>>>
>>> Pgbench scale 500, with the simpler query from my message.
>>>
>>
>> I tried to reproduce this, but I'm not seeing behavior. I'm not sure how
>> you monitor the queue depth (presumably iostat?)
> 
> Yes, iostat, since I was looking at what the "actually required" lookahead
> distance is.
> 
> Do you actually get the query to be entirely CPU bound? What amount of IO
> waiting do you see EXPLAIN (ANALYZE, TIMING OFF) with track_io_timing=on
> report?
> 

No, it definitely needs to wait for I/O (FWIW it's on the xeon, with a
single NVMe SSD).

> Ah - I was using a very high effective_io_concurrency. With a high
> effective_io_concurrency value I see a lot of stalls, even at
> INDEX_SCAN_MAX_BATCHES = 64. And a lower prefetch distance, which seems
> somewhat odd.
> 

I think that's a bug in the explain patch. The counters were updated at
the beginning of _next_buffer(), but that's wrong - a single call to
_next_buffer() can prefetch multiple blocks. This skewed the stats, as
the prefetches are not counted with "distance=0". With higher eic this
happens sooner, so the average distance seemed to decrease.

The attached patch does the updates in _get_block(), which I think is
better. And "stall" now means (distance == 1), which I think detects
requests without prefetching.

I also added a separate "Count" for the actual number of prefetched
blocks, and "Skipped" for duplicate blocks skipped (which the read
stream never even sees, because it's skipped in the callback).

> 
> FWIW, in my tests I was just evicting lineitem from shared buffers, since I
> wanted to test the heap prefetching, without stalls induced by blocking on
> index reads. But what I described happens with either.
> 
> ;SET effective_io_concurrency = 256;SELECT pg_buffercache_evict_relation('pgbench_accounts'); explain (analyze, costs
off,timing off) SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000);
 
> ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> │                                                QUERY PLAN                                                │
> ├──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ Aggregate (actual rows=1.00 loops=1)                                                                     │
> │   Buffers: shared hit=27369 read=164191                                                                  │
> │   I/O Timings: shared read=358.795                                                                       │
> │   ->  Limit (actual rows=10000000.00 loops=1)                                                            │
> │         Buffers: shared hit=27369 read=164191                                                            │
> │         I/O Timings: shared read=358.795                                                                 │
> │         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) │
> │               Index Searches: 1                                                                          │
> │               Prefetch Distance: 256.989                                                                 │
> │               Prefetch Stalls: 3                                                                         │
> │               Prefetch Resets: 3                                                                         │
> │               Buffers: shared hit=27369 read=164191                                                      │
> │               I/O Timings: shared read=358.795                                                           │
> │ Planning Time: 0.086 ms                                                                                  │
> │ Execution Time: 4194.845 ms                                                                              │
> └──────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> 
> ;SET effective_io_concurrency = 512;SELECT pg_buffercache_evict_relation('pgbench_accounts'); explain (analyze, costs
off,timing off) SELECT max(abalance) FROM (SELECT * FROM pgbench_accounts ORDER BY aid LIMIT 10000000);
 
> ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> │                                                QUERY PLAN                                                │
> ├──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ Aggregate (actual rows=1.00 loops=1)                                                                     │
> │   Buffers: shared hit=27368 read=164190                                                                  │
> │   I/O Timings: shared read=832.515                                                                       │
> │   ->  Limit (actual rows=10000000.00 loops=1)                                                            │
> │         Buffers: shared hit=27368 read=164190                                                            │
> │         I/O Timings: shared read=832.515                                                                 │
> │         ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual rows=10000000.00 loops=1) │
> │               Index Searches: 1                                                                          │
> │               Prefetch Distance: 56.778                                                                  │
> │               Prefetch Stalls: 160569                                                                    │
> │               Prefetch Resets: 423                                                                       │
> │               Buffers: shared hit=27368 read=164190                                                      │
> │               I/O Timings: shared read=832.515                                                           │
> │ Planning Time: 0.084 ms                                                                                  │
> │ Execution Time: 4413.058 ms                                                                              │
> └──────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> 
> Greetings,
> 

The attached v2 explain patch should fix that. I'm also attaching logs
from my explain, for 64 and 16 batches. I think the output makes much
more sense now.


cheers

-- 
Tomas Vondra

Attachment

pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: trivial grammar refactor
Next
From: Nathan Bossart
Date:
Subject: Re: Verify predefined LWLocks tranches have entries in wait_event_names.txt