Re: index prefetching - Mailing list pgsql-hackers
From | Peter Geoghegan
Subject | Re: index prefetching
Date |
Msg-id | DC1PBQW5HBEJ.3W4SMGI4LW3VI@bowt.ie
In response to | Re: index prefetching (Tomas Vondra <tomas@vondra.me>)
Responses | Re: index prefetching
          | Re: index prefetching
          | Re: index prefetching
List | pgsql-hackers
On Wed Aug 13, 2025 at 5:19 PM EDT, Tomas Vondra wrote:
> I did investigate this, and I don't think there's anything broken in
> read_stream. It happens because ReadStream has a concept of "ungetting"
> a block, which can happen after hitting some I/O limits.
>
> In that case we "remember" the last block (in read_stream_look_ahead
> calls read_stream_unget_block), and we return it again. It may seem as
> if read_stream_get_block() produced the same block twice, but it's
> really just the block from the last round.

I instrumented this for myself, and I agree: the backwards and forwards
scan cases are being fed the same block numbers, as expected (it's just
that the order is precisely backwards, as expected). The only real
difference is that the forwards scan case seems to be passed
InvalidBlockNumber quite a bit more often.

You were right: I was confused about the read_stream_unget_block thing.
However, the magnitude of the difference that I see between the forwards
and backwards scan cases just doesn't pass the smell test -- I stand by
that part.

I was able to confirm this intuition by performing a simple experiment.
I asked myself a fairly obvious question: if the backwards scan in
question takes about 2.5x as long, just because each group of TIDs for
each index value appears in descending order, then what happens if the
order is made random? Where does that leave the forwards scan case, and
where does it leave the backwards scan case?

I first made the order of the table random, except among groups of index
tuples that have exactly the same value. Those will still point to the
same 1 or 2 heap blocks in virtually all cases, so we have "heap
clustering without any heap correlation" in the newly rewritten table.
To set things up this way, I first made another index, and then
clustered the table using that new index:

pg@regression:5432 [2476413]=# create index on t (hashint8(a));
CREATE INDEX
pg@regression:5432 [2476413]=# cluster t using t_hashint8_idx ;
CLUSTER

Next, I reran the queries in the obvious way (same procedure as
yesterday, though with a very different result):

pg@regression:5432 [2476413]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx');
***SNIP***
pg@regression:5432 [2476413]=# EXPLAIN (ANALYZE ,costs off, timing off) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a;
┌────────────────────────────────────────────────────────────┐
│                         QUERY PLAN                          │
├────────────────────────────────────────────────────────────┤
│ Index Scan using idx on t (actual rows=1048576.00 loops=1) │
│   Index Cond: ((a >= 16336) AND (a <= 49103))               │
│   Index Searches: 1                                         │
│   Buffers: shared hit=6082 read=77813                       │
│   I/O Timings: shared read=153.672                          │
│ Planning Time: 0.057 ms                                     │
│ Execution Time: 402.735 ms                                  │
└────────────────────────────────────────────────────────────┘
(7 rows)

pg@regression:5432 [2476413]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx');
***SNIP***
pg@regression:5432 [2476413]=# EXPLAIN (ANALYZE ,costs off, timing off) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a desc;
┌─────────────────────────────────────────────────────────────────────┐
│                              QUERY PLAN                              │
├─────────────────────────────────────────────────────────────────────┤
│ Index Scan Backward using idx on t (actual rows=1048576.00 loops=1) │
│   Index Cond: ((a >= 16336) AND (a <= 49103))                        │
│   Index Searches: 1                                                  │
│   Buffers: shared hit=6082 read=77813                                │
│   I/O Timings: shared read=324.305                                   │
│ Planning Time: 0.071 ms                                              │
│ Execution Time: 616.268 ms                                           │
└─────────────────────────────────────────────────────────────────────┘
(7 rows)

Apparently random I/O is twice as fast as sequential I/O in descending
order! In fact, this test case creates the appearance of random I/O
being at least slightly faster than sequential I/O for pages read in
_ascending_ order!

Obviously something doesn't add up here. I'm no closer to explaining
what the underlying problem is than I was yesterday, but I find it
_very_ hard to believe that the inconsistency in performance has
anything to do with SSD firmware/OS implementation details. It just
looks wonky to me.

Also possibly worth noting: I'm pretty sure that "shared hit=6082" is
wrong. Though now it's wrong in the same way with both variants.
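One way to sanity check that figure would be to look at how many of t's
buffers are actually resident right after the
pg_buffercache_evict_relation() call, immediately before the scan
starts. This is just an untested sketch (it assumes the pg_buffercache
extension is installed in the test database), but if it reports roughly
zero resident blocks at that point, then leftover buffers can't be where
those 6082 hits are supposed to be coming from:

-- untested sanity check: how many of t's blocks are resident right now?
-- (assumes the pg_buffercache extension is installed)
SELECT count(*) AS resident_blocks
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('t')
  AND reldatabase = (SELECT oid FROM pg_database
                     WHERE datname = current_database());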
Just for context, I'll show what the TIDs for 3 randomly chosen
adjacent-in-index values look like after CLUSTER runs (in case it was
unclear what I meant about "heap clustering without any heap
correlation" earlier):

pg@regression:5432 [2476413]=# SELECT ctid, a FROM t WHERE a BETWEEN 20_000 AND 20_002 ORDER BY a;
┌─────────────┬────────┐
│    ctid     │   a    │
├─────────────┼────────┤
│ (142534,3)  │ 20,000 │
│ (142534,4)  │ 20,000 │
│ (142534,5)  │ 20,000 │
│ (142534,6)  │ 20,000 │
│ (142534,7)  │ 20,000 │
│ (142534,8)  │ 20,000 │
│ (142534,9)  │ 20,000 │
│ (142534,10) │ 20,000 │
│ (142534,11) │ 20,000 │
│ (142534,12) │ 20,000 │
│ (142534,13) │ 20,000 │
│ (142534,14) │ 20,000 │
│ (142534,15) │ 20,000 │
│ (142534,16) │ 20,000 │
│ (142534,17) │ 20,000 │
│ (142534,18) │ 20,000 │
│ (142534,19) │ 20,000 │
│ (142534,20) │ 20,000 │
│ (142534,21) │ 20,000 │
│ (142535,1)  │ 20,000 │
│ (142535,2)  │ 20,000 │
│ (142535,3)  │ 20,000 │
│ (142535,4)  │ 20,000 │
│ (142535,5)  │ 20,000 │
│ (142535,6)  │ 20,000 │
│ (142535,7)  │ 20,000 │
│ (142535,8)  │ 20,000 │
│ (142535,9)  │ 20,000 │
│ (142535,10) │ 20,000 │
│ (142535,11) │ 20,000 │
│ (142535,12) │ 20,000 │
│ (142535,13) │ 20,000 │
│ (216406,19) │ 20,001 │
│ (216406,20) │ 20,001 │
│ (216406,21) │ 20,001 │
│ (216407,1)  │ 20,001 │
│ (216407,2)  │ 20,001 │
│ (216407,3)  │ 20,001 │
│ (216407,4)  │ 20,001 │
│ (216407,5)  │ 20,001 │
│ (216407,6)  │ 20,001 │
│ (216407,7)  │ 20,001 │
│ (216407,8)  │ 20,001 │
│ (216407,9)  │ 20,001 │
│ (216407,10) │ 20,001 │
│ (216407,11) │ 20,001 │
│ (216407,12) │ 20,001 │
│ (216407,13) │ 20,001 │
│ (216407,14) │ 20,001 │
│ (216407,15) │ 20,001 │
│ (216407,16) │ 20,001 │
│ (216407,17) │ 20,001 │
│ (216407,18) │ 20,001 │
│ (216407,19) │ 20,001 │
│ (216407,20) │ 20,001 │
│ (216407,21) │ 20,001 │
│ (216408,1)  │ 20,001 │
│ (216408,2)  │ 20,001 │
│ (216408,3)  │ 20,001 │
│ (216408,4)  │ 20,001 │
│ (216408,5)  │ 20,001 │
│ (216408,6)  │ 20,001 │
│ (216408,7)  │ 20,001 │
│ (216408,8)  │ 20,001 │
│ (260993,12) │ 20,002 │
│ (260993,13) │ 20,002 │
│ (260993,14) │ 20,002 │
│ (260993,15) │ 20,002 │
│ (260993,16) │ 20,002 │
│ (260993,17) │ 20,002 │
│ (260993,18) │ 20,002 │
│ (260993,19) │ 20,002 │
│ (260993,20) │ 20,002 │
│ (260993,21) │ 20,002 │
│ (260994,1)  │ 20,002 │
│ (260994,2)  │ 20,002 │
│ (260994,3)  │ 20,002 │
│ (260994,4)  │ 20,002 │
│ (260994,5)  │ 20,002 │
│ (260994,6)  │ 20,002 │
│ (260994,7)  │ 20,002 │
│ (260994,8)  │ 20,002 │
│ (260994,9)  │ 20,002 │
│ (260994,10) │ 20,002 │
│ (260994,11) │ 20,002 │
│ (260994,12) │ 20,002 │
│ (260994,13) │ 20,002 │
│ (260994,14) │ 20,002 │
│ (260994,15) │ 20,002 │
│ (260994,16) │ 20,002 │
│ (260994,17) │ 20,002 │
│ (260994,18) │ 20,002 │
│ (260994,19) │ 20,002 │
│ (260994,20) │ 20,002 │
│ (260994,21) │ 20,002 │
│ (260995,1)  │ 20,002 │
└─────────────┴────────┘
(96 rows)
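For anyone who'd rather not eyeball TIDs, something along these lines
should summarize the same pattern across the whole scanned range. It's
untested, and going through text/point to pull the block number out of
the ctid is just a quick hack, but I'd expect it to show nearly every
value landing on only 1 or 2 distinct heap blocks:

-- untested sketch: how many distinct heap blocks does each value's
-- group of TIDs touch?  (the ctid -> text -> point cast is just an
-- ad hoc hack to get at the block number)
SELECT heap_blocks, count(*) AS n_values
FROM (SELECT a, count(DISTINCT (ctid::text::point)[0]) AS heap_blocks
      FROM t
      WHERE a BETWEEN 16336 AND 49103
      GROUP BY a) AS per_value
GROUP BY heap_blocks
ORDER BY heap_blocks;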
--
Peter Geoghegan