Re: index prefetching - Mailing list pgsql-hackers

From: Peter Geoghegan
Subject: Re: index prefetching
Msg-id: DC1PBQW5HBEJ.3W4SMGI4LW3VI@bowt.ie
In response to: Re: index prefetching (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers
On Wed Aug 13, 2025 at 5:19 PM EDT, Tomas Vondra wrote:
> I did investigate this, and I don't think there's anything broken in
> read_stream. It happens because ReadStream has a concept of "ungetting"
> a block, which can happen after hitting some I/O limits.
>
> In that case we "remember" the last block (in read_stream_look_ahead
> calls read_stream_unget_block), and we return it again. It may seem as
> if read_stream_get_block() produced the same block twice, but it's
> really just the block from the last round.

I instrumented this for myself, and I agree: the backwards and forwards scan
cases are being fed the same block numbers, just in precisely reversed order, as
expected. The only real difference is that the forwards scan case seems to be
passed InvalidBlockNumber quite a bit more often. You were right: I was confused
about the read_stream_unget_block thing.

However, the magnitude of the difference that I see between the forwards and
backwards scan cases just doesn't pass the smell test -- I stand by that part.
I was able to confirm this intuition by performing a simple experiment.

I asked myself a fairly obvious question: if the backwards scan in question
takes about 2.5x as long, just because each group of TIDs for each index value
appears in descending order, then what happens if the order is made random?
Where does that leave the forwards scan case, and where does it leave the
backwards scan case?

I first made the order of the table random, except among groups of index tuples
that have exactly the same value. Those will still point to the same 1 or 2 heap
blocks in virtually all cases, so we have "heap clustering without any heap
correlation" in the newly rewritten table. To set things up this way, I created
another index (on a hash of the key), and then clustered the table using that
new index:

pg@regression:5432 [2476413]=# create index on t (hashint8(a));
CREATE INDEX
pg@regression:5432 [2476413]=# cluster t using t_hashint8_idx ;
CLUSTER
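
(As a sanity check of the "no heap correlation" part, the optimizer statistics
should now show a near-zero correlation for the key column. Something along
these lines, after a fresh ANALYZE, is enough to confirm it -- output omitted:)

ANALYZE t;
-- correlation for "a" should now be close to 0, versus ~1.0 before the CLUSTER
SELECT attname, correlation FROM pg_stats
WHERE tablename = 't' AND attname = 'a';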

Next, I reran the queries in the obvious way (same procedure as yesterday,
though with a very different result):

pg@regression:5432 [2476413]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx');
***SNIP***
pg@regression:5432 [2476413]=# EXPLAIN (ANALYZE, costs off, timing off) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103
ORDER BY a;
┌────────────────────────────────────────────────────────────┐
│                         QUERY PLAN                         │
├────────────────────────────────────────────────────────────┤
│ Index Scan using idx on t (actual rows=1048576.00 loops=1) │
│   Index Cond: ((a >= 16336) AND (a <= 49103))              │
│   Index Searches: 1                                        │
│   Buffers: shared hit=6082 read=77813                      │
│   I/O Timings: shared read=153.672                         │
│ Planning Time: 0.057 ms                                    │
│ Execution Time: 402.735 ms                                 │
└────────────────────────────────────────────────────────────┘
(7 rows)

pg@regression:5432 [2476413]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx');
***SNIP***
pg@regression:5432 [2476413]=# EXPLAIN (ANALYZE, costs off, timing off) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103
ORDER BY a desc;
┌─────────────────────────────────────────────────────────────────────┐
│                             QUERY PLAN                              │
├─────────────────────────────────────────────────────────────────────┤
│ Index Scan Backward using idx on t (actual rows=1048576.00 loops=1) │
│   Index Cond: ((a >= 16336) AND (a <= 49103))                       │
│   Index Searches: 1                                                 │
│   Buffers: shared hit=6082 read=77813                               │
│   I/O Timings: shared read=324.305                                  │
│ Planning Time: 0.071 ms                                             │
│ Execution Time: 616.268 ms                                          │
└─────────────────────────────────────────────────────────────────────┘
(7 rows)

Comparing against yesterday's numbers: apparently random I/O is twice as fast as
sequential I/O performed in descending order! In fact, this test case even
creates the appearance of random I/O being at least slightly faster than
sequential I/O for pages read in _ascending_ order!

Obviously something doesn't add up here.  I'm no closer to explaining what the
underlying problem is than I was yesterday, but I find it _very_ hard to believe
that the inconsistency in performance has anything to do with SSD firmware/OS
implementation details.  It just looks wonky to me.

Also possibly worth noting: I'm pretty sure that "shared hit=6082" is wrong,
though at least now it's wrong in the same way for both variants.
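
(For what it's worth, assuming the pg_buffercache extension is available, a
rough cross-check of that number is to count how many buffers of the prewarmed
index are actually resident right after the pg_prewarm call -- just a sketch, I
haven't shown its output here:)

-- cached buffers belonging to idx in the current database; if the "shared hit"
-- count really was all index buffers, it should be in this ballpark
SELECT count(*) AS cached_idx_buffers
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('idx')
  AND reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database());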

Just for context, I'll show what the TIDs for 3 randomly chosen
adjacent-in-index values look like after CLUSTER runs (in case it was unclear
what I meant about "heap clustering without any heap correlation" earlier):

pg@regression:5432 [2476413]=# SELECT ctid, a FROM t WHERE a BETWEEN 20_000 AND 20_002 ORDER BY a;
┌─────────────┬────────┐
│    ctid     │   a    │
├─────────────┼────────┤
│ (142534,3)  │ 20,000 │
│ (142534,4)  │ 20,000 │
│ (142534,5)  │ 20,000 │
│ (142534,6)  │ 20,000 │
│ (142534,7)  │ 20,000 │
│ (142534,8)  │ 20,000 │
│ (142534,9)  │ 20,000 │
│ (142534,10) │ 20,000 │
│ (142534,11) │ 20,000 │
│ (142534,12) │ 20,000 │
│ (142534,13) │ 20,000 │
│ (142534,14) │ 20,000 │
│ (142534,15) │ 20,000 │
│ (142534,16) │ 20,000 │
│ (142534,17) │ 20,000 │
│ (142534,18) │ 20,000 │
│ (142534,19) │ 20,000 │
│ (142534,20) │ 20,000 │
│ (142534,21) │ 20,000 │
│ (142535,1)  │ 20,000 │
│ (142535,2)  │ 20,000 │
│ (142535,3)  │ 20,000 │
│ (142535,4)  │ 20,000 │
│ (142535,5)  │ 20,000 │
│ (142535,6)  │ 20,000 │
│ (142535,7)  │ 20,000 │
│ (142535,8)  │ 20,000 │
│ (142535,9)  │ 20,000 │
│ (142535,10) │ 20,000 │
│ (142535,11) │ 20,000 │
│ (142535,12) │ 20,000 │
│ (142535,13) │ 20,000 │
│ (216406,19) │ 20,001 │
│ (216406,20) │ 20,001 │
│ (216406,21) │ 20,001 │
│ (216407,1)  │ 20,001 │
│ (216407,2)  │ 20,001 │
│ (216407,3)  │ 20,001 │
│ (216407,4)  │ 20,001 │
│ (216407,5)  │ 20,001 │
│ (216407,6)  │ 20,001 │
│ (216407,7)  │ 20,001 │
│ (216407,8)  │ 20,001 │
│ (216407,9)  │ 20,001 │
│ (216407,10) │ 20,001 │
│ (216407,11) │ 20,001 │
│ (216407,12) │ 20,001 │
│ (216407,13) │ 20,001 │
│ (216407,14) │ 20,001 │
│ (216407,15) │ 20,001 │
│ (216407,16) │ 20,001 │
│ (216407,17) │ 20,001 │
│ (216407,18) │ 20,001 │
│ (216407,19) │ 20,001 │
│ (216407,20) │ 20,001 │
│ (216407,21) │ 20,001 │
│ (216408,1)  │ 20,001 │
│ (216408,2)  │ 20,001 │
│ (216408,3)  │ 20,001 │
│ (216408,4)  │ 20,001 │
│ (216408,5)  │ 20,001 │
│ (216408,6)  │ 20,001 │
│ (216408,7)  │ 20,001 │
│ (216408,8)  │ 20,001 │
│ (260993,12) │ 20,002 │
│ (260993,13) │ 20,002 │
│ (260993,14) │ 20,002 │
│ (260993,15) │ 20,002 │
│ (260993,16) │ 20,002 │
│ (260993,17) │ 20,002 │
│ (260993,18) │ 20,002 │
│ (260993,19) │ 20,002 │
│ (260993,20) │ 20,002 │
│ (260993,21) │ 20,002 │
│ (260994,1)  │ 20,002 │
│ (260994,2)  │ 20,002 │
│ (260994,3)  │ 20,002 │
│ (260994,4)  │ 20,002 │
│ (260994,5)  │ 20,002 │
│ (260994,6)  │ 20,002 │
│ (260994,7)  │ 20,002 │
│ (260994,8)  │ 20,002 │
│ (260994,9)  │ 20,002 │
│ (260994,10) │ 20,002 │
│ (260994,11) │ 20,002 │
│ (260994,12) │ 20,002 │
│ (260994,13) │ 20,002 │
│ (260994,14) │ 20,002 │
│ (260994,15) │ 20,002 │
│ (260994,16) │ 20,002 │
│ (260994,17) │ 20,002 │
│ (260994,18) │ 20,002 │
│ (260994,19) │ 20,002 │
│ (260994,20) │ 20,002 │
│ (260994,21) │ 20,002 │
│ (260995,1)  │ 20,002 │
└─────────────┴────────┘
(96 rows)
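
(The same thing is visible in aggregate, too. A query along these lines -- just
an illustration, output omitted -- counts the distinct heap blocks behind each
key value, which should come out as 1 or 2 for nearly every value:)

SELECT a, count(DISTINCT (ctid::text::point)[0]) AS heap_blocks
FROM t
WHERE a BETWEEN 20_000 AND 20_002
GROUP BY a
ORDER BY a;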

--
Peter Geoghegan


