Re: index prefetching - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: b739c179-25ca-4371-b528-cf80ea111d94@vondra.me
In response to: Re: index prefetching (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers

On 8/12/25 23:52, Tomas Vondra wrote:
> 
> On 8/12/25 23:22, Peter Geoghegan wrote:
>> ...
>>
>> It looks like the patch does significantly better with the forwards scan,
>> compared to the backwards scan (though both are improved by a lot).  But that's
>> not the main thing about these results that I find interesting.
>>
>> The really odd thing is that we get "shared hit=6619 read=49933" for the
>> forwards scan, and "shared hit=10350 read=49933" for the backwards scan.  The
>> latter matches master (regardless of the scan direction used on master), while
>> the former just looks wrong.  What explains the "missing buffer hits" seen with
>> the forwards scan?
>>
>> Discrepancies
>> -------------
>>
>> All 4 query executions agree that "rows=1048576.00", so the patch doesn't appear
>> to simply be broken/giving wrong answers.  Might it be that the "Buffers"
>> instrumentation is broken?
>>
> 
> I think a bug in the prefetch patch is more likely. I tried with a patch
> that adds various prefetch-related counters to explain, and I see this:
> 
> 
> test=# EXPLAIN (ANALYZE, VERBOSE, COSTS OFF) SELECT * FROM t WHERE a
> BETWEEN 16336 AND 49103 ORDER BY a;
> 
>                                 QUERY PLAN
> ------------------------------------------------------------------------
>  Index Scan using idx on public.t (actual time=0.682..527.055
> rows=1048576.00 loops=1)
>    Output: a, b
>    Index Cond: ((t.a >= 16336) AND (t.a <= 49103))
>    Index Searches: 1
>    Prefetch Distance: 271.263
>    Prefetch Count: 60888
>    Prefetch Stalls: 1
>    Prefetch Skips: 991211
>    Prefetch Resets: 3
>    Prefetch Histogram: [2,4) => 2, [4,8) => 8, [8,16) => 17, [16,32) =>
> 24, [32,64) => 34, [64,128) => 52, [128,256) => 82, [256,512) => 60669
>    Buffers: shared hit=5027 read=50872
>    I/O Timings: shared read=33.528
>  Planning:
>    Buffers: shared hit=78 read=23
>    I/O Timings: shared read=2.349
>  Planning Time: 3.686 ms
>  Execution Time: 559.659 ms
> (17 rows)
> 
> 
> test=# EXPLAIN (ANALYZE, VERBOSE, COSTS OFF) SELECT * FROM t WHERE a
> BETWEEN 16336 AND 49103 ORDER BY a DESC;
>                                 QUERY PLAN
> ------------------------------------------------------------------------
>  Index Scan Backward using idx on public.t (actual time=1.110..4116.201
> rows=1048576.00 loops=1)
>    Output: a, b
>    Index Cond: ((t.a >= 16336) AND (t.a <= 49103))
>    Index Searches: 1
>    Prefetch Distance: 271.061
>    Prefetch Count: 118806
>    Prefetch Stalls: 1
>    Prefetch Skips: 962515
>    Prefetch Resets: 3
>    Prefetch Histogram: [2,4) => 2, [4,8) => 7, [8,16) => 12, [16,32) =>
> 17, [32,64) => 24, [64,128) => 3, [128,256) => 4, [256,512) => 118737
>    Buffers: shared hit=30024 read=50872
>    I/O Timings: shared read=581.353
>  Planning:
>    Buffers: shared hit=82 read=23
>    I/O Timings: shared read=3.168
>  Planning Time: 4.289 ms
>  Execution Time: 4185.407 ms
> (17 rows)
> 
> These two parts are interesting:
> 
>    Prefetch Count: 60888
>    Prefetch Skips: 991211
> 
>    Prefetch Count: 118806
>    Prefetch Skips: 962515
> 
> It looks like the backwards scan skips fewer blocks. This is based on
> the lastBlock optimization, i.e. looking for runs of the same block
> number. I don't quite see why it would affect just the backwards scan,
> though. Seems weird.
> 

Actually, this might be a consequence of how backwards scans work (at
least in btree). I logged the block in index_scan_stream_read_next, and
this is what I see in the forward scan (at the beginning):

    index_scan_stream_read_next: block 24891
    index_scan_stream_read_next: block 24892
    index_scan_stream_read_next: block 24893
    index_scan_stream_read_next: block 24892
    index_scan_stream_read_next: block 24893
    index_scan_stream_read_next: block 24894
    index_scan_stream_read_next: block 24895
    index_scan_stream_read_next: block 24896
    index_scan_stream_read_next: block 24895
    index_scan_stream_read_next: block 24896
    index_scan_stream_read_next: block 24897
    index_scan_stream_read_next: block 24898
    index_scan_stream_read_next: block 24899
    index_scan_stream_read_next: block 24900
    index_scan_stream_read_next: block 24901
    index_scan_stream_read_next: block 24902
    index_scan_stream_read_next: block 24903
    index_scan_stream_read_next: block 24904
    index_scan_stream_read_next: block 24905
    index_scan_stream_read_next: block 24906
    index_scan_stream_read_next: block 24907
    index_scan_stream_read_next: block 24908
    index_scan_stream_read_next: block 24909
    index_scan_stream_read_next: block 24910

while in the backwards scan (at the end) I see this:

    index_scan_stream_read_next: block 24910
    index_scan_stream_read_next: block 24911
    index_scan_stream_read_next: block 24908
    index_scan_stream_read_next: block 24909
    index_scan_stream_read_next: block 24906
    index_scan_stream_read_next: block 24907
    index_scan_stream_read_next: block 24908
    index_scan_stream_read_next: block 24905
    index_scan_stream_read_next: block 24906
    index_scan_stream_read_next: block 24903
    index_scan_stream_read_next: block 24904
    index_scan_stream_read_next: block 24905
    index_scan_stream_read_next: block 24902
    index_scan_stream_read_next: block 24903
    index_scan_stream_read_next: block 24900
    index_scan_stream_read_next: block 24901
    index_scan_stream_read_next: block 24902
    index_scan_stream_read_next: block 24899
    index_scan_stream_read_next: block 24900
    index_scan_stream_read_next: block 24897
    index_scan_stream_read_next: block 24898
    index_scan_stream_read_next: block 24899
    index_scan_stream_read_next: block 24895
    index_scan_stream_read_next: block 24896
    index_scan_stream_read_next: block 24897
    index_scan_stream_read_next: block 24894
    index_scan_stream_read_next: block 24895
    index_scan_stream_read_next: block 24896
    index_scan_stream_read_next: block 24892
    index_scan_stream_read_next: block 24893
    index_scan_stream_read_next: block 24894
    index_scan_stream_read_next: block 24891
    index_scan_stream_read_next: block 24892
    index_scan_stream_read_next: block 24893

These are only the blocks that ended up passed to the read stream, not
the skipped ones. And you can immediately see that the backwards scan
requests more blocks for (roughly) the same part of the scan - the
min/max blocks roughly match.
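
For reference, the logging was essentially just an elog() in the
read-stream callback, right before the block number is returned. A rough
sketch (the callback shape follows ReadStreamBlockNumberCB, but the
made-up SketchScanState struct and the way the next block is obtained
here are purely illustrative, not the actual patch code):

    #include "postgres.h"
    #include "storage/block.h"
    #include "storage/read_stream.h"

    /* hypothetical per-scan state, only to make the sketch self-contained */
    typedef struct SketchScanState
    {
        BlockNumber next_block;     /* next heap block to hand to the stream */
    } SketchScanState;

    static BlockNumber
    index_scan_stream_read_next(ReadStream *stream,
                                void *callback_private_data,
                                void *per_buffer_data)
    {
        SketchScanState *state = (SketchScanState *) callback_private_data;
        BlockNumber block = state->next_block;

        /* log every block actually passed to the read stream */
        if (block != InvalidBlockNumber)
            elog(LOG, "index_scan_stream_read_next: block %u", block);

        return block;
    }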

The reason is pretty simple - the table is very correlated, and the
forward scan requests blocks mostly in the right order. Only rarely does
it have to jump "back" when progressing to the next value, so the
lastBlock optimization works nicely.
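
To spell out what I mean by the lastBlock check, here's a minimal sketch
(the IndexPrefetchSketch struct and function name are made up, this is
not the actual patch code): remember the block we last handed to the read
stream, and skip TIDs that point to the same block - those are what the
"Prefetch Skips" counter above counts.

    #include "postgres.h"
    #include "storage/block.h"
    #include "storage/itemptr.h"

    /* hypothetical state, only to make the sketch self-contained */
    typedef struct IndexPrefetchSketch
    {
        BlockNumber lastBlock;      /* last block handed to the read stream */
    } IndexPrefetchSketch;

    /*
     * Return the heap block of the next TID, or InvalidBlockNumber when it
     * matches the block we just requested (a "Prefetch Skip").
     */
    static BlockNumber
    sketch_next_block(IndexPrefetchSketch *prefetch, ItemPointer tid)
    {
        BlockNumber block = ItemPointerGetBlockNumber(tid);

        if (block == prefetch->lastBlock)
            return InvalidBlockNumber;

        prefetch->lastBlock = block;
        return block;
    }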

But with the backwards scan we apparently walk the values backwards,
while the blocks for each value are still accessed in the forward
direction. So we go a couple of blocks "forward" and then jump to the
preceding value - which is a couple of blocks *back*. And that breaks
the lastBlock check.
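
To make that concrete, here's a toy model (standalone C, with made-up
block numbers, nothing to do with the actual patch code) of a correlated
table where each indexed value spans a few heap blocks overlapping with
the neighboring value. Running the "same block as last time" filter over
the forward and the backward orderings shows why the backward scan ends
up passing more blocks to the read stream for the same part of the table:

    #include <stdio.h>

    #define NVALUES   5
    #define PERVALUE  3

    /* apply the "same block as last time" filter and count actual requests */
    static int
    count_requests(int blocks[][PERVALUE], int value_step)
    {
        int         last = -1;
        int         requests = 0;
        int         v = (value_step > 0) ? 0 : NVALUES - 1;

        for (int i = 0; i < NVALUES; i++, v += value_step)
        {
            /* within one value, heap blocks are always visited in forward order */
            for (int j = 0; j < PERVALUE; j++)
            {
                if (blocks[v][j] != last)
                {
                    requests++;
                    last = blocks[v][j];
                }
            }
        }

        return requests;
    }

    int
    main(void)
    {
        /* heap blocks of the TIDs for five consecutive values, correlated table */
        int         blocks[NVALUES][PERVALUE] = {
            {10, 10, 11},
            {11, 11, 12},
            {12, 12, 13},
            {13, 13, 14},
            {14, 14, 15},
        };

        printf("forward scan requests:  %d\n", count_requests(blocks, +1));
        printf("backward scan requests: %d\n", count_requests(blocks, -1));

        return 0;
    }

With these made-up numbers the forward order collapses to 6 requests,
while the backward order needs 10, even though the set of blocks is
identical - the same kind of gap as between the two Prefetch Count
values above.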

I believe this applies to both master and the prefetching patch, except
that master doesn't have the read stream - it only does sync I/O. Could
that hide the extra buffer accesses, somehow?

Anyway, this access pattern in backwards scans seems a bit unfortunate.


regards

-- 
Tomas Vondra



