Re: index prefetching - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: 7c308720-5cd4-4508-8cc0-6e55a6352731@vondra.me
In response to: Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers

On 8/14/25 01:50, Peter Geoghegan wrote:
> On Wed Aug 13, 2025 at 5:19 PM EDT, Tomas Vondra wrote:
>> I did investigate this, and I don't think there's anything broken in
>> read_stream. It happens because ReadStream has a concept of "ungetting"
>> a block, which can happen after hitting some I/O limits.
>>
>> In that case we "remember" the last block (in read_stream_look_ahead
>> calls read_stream_unget_block), and we return it again. It may seem as
>> if read_stream_get_block() produced the same block twice, but it's
>> really just the block from the last round.
>
> I instrumented this for myself, and I agree: backwards and forwards scan cases
> are being fed the same block numbers, as expected (it's just that the order is
> precisely backwards, as expected). The only real difference is that the forwards
> scan case seems to be passed InvalidBlockNumber quite a bit more often. You were
> right: I was confused about the read_stream_unget_block thing.
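(As an aside, since the "ungetting" tripped us both up, the concept is
roughly the simplified sketch below. This is illustrative only, not the
actual read_stream.c code -- the struct, the names, and the callback
shape are all made up:)

#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct SketchStream
{
    /* block "ungot" by the look-ahead code, if any */
    BlockNumber buffered_block;
    /* callback producing the next block to read */
    BlockNumber (*next_block) (void *state);
    void       *next_block_state;
} SketchStream;

BlockNumber
sketch_get_block(SketchStream *stream)
{
    /* return the remembered block first, if there is one */
    if (stream->buffered_block != InvalidBlockNumber)
    {
        BlockNumber blknum = stream->buffered_block;

        stream->buffered_block = InvalidBlockNumber;
        return blknum;
    }
    return stream->next_block(stream->next_block_state);
}

void
sketch_unget_block(SketchStream *stream, BlockNumber blknum)
{
    /*
     * We hit an I/O limit before we could start reading "blknum", so
     * remember it -- the next sketch_get_block() call returns it again.
     * This is why the same block can appear to be produced twice.
     */
    stream->buffered_block = blknum;
}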
> However, the magnitude of the difference that I see between the forwards and
> backwards scan cases just doesn't pass the smell test -- I stand by that part.
> I was able to confirm this intuition by performing a simple experiment.
>
> I asked myself a fairly obvious question: if the backwards scan in question
> takes about 2.5x as long, just because each group of TIDs for each index value
> appears in descending order, then what happens if the order is made random?
> Where does that leave the forwards scan case, and where does it leave the
> backwards scan case?
>
> I first made the order of the table random, except among groups of index tuples
> that have exactly the same value. Those will still point to the same 1 or 2 heap
> blocks in virtually all cases, so we have "heap clustering without any heap
> correlation" in the newly rewritten table. To set things up this way, I first
> made another index, and then clustered the table using that new index:
>
> pg@regression:5432 [2476413]=# create index on t (hashint8(a));
> CREATE INDEX
> pg@regression:5432 [2476413]=# cluster t using t_hashint8_idx ;
> CLUSTER
>
> Next, I reran the queries in the obvious way (same procedure as yesterday,
> though with a very different result):
>
> pg@regression:5432 [2476413]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx');
> ***SNIP***
> pg@regression:5432 [2476413]=# EXPLAIN (ANALYZE, costs off, timing off) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a;
> ┌────────────────────────────────────────────────────────────┐
> │                         QUERY PLAN                         │
> ├────────────────────────────────────────────────────────────┤
> │ Index Scan using idx on t (actual rows=1048576.00 loops=1) │
> │   Index Cond: ((a >= 16336) AND (a <= 49103))              │
> │   Index Searches: 1                                        │
> │   Buffers: shared hit=6082 read=77813                      │
> │   I/O Timings: shared read=153.672                         │
> │ Planning Time: 0.057 ms                                    │
> │ Execution Time: 402.735 ms                                 │
> └────────────────────────────────────────────────────────────┘
> (7 rows)
>
> pg@regression:5432 [2476413]=# select pg_buffercache_evict_relation('t'); select pg_prewarm('idx');
> ***SNIP***
> pg@regression:5432 [2476413]=# EXPLAIN (ANALYZE, costs off, timing off) SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a desc;
> ┌─────────────────────────────────────────────────────────────────────┐
> │                             QUERY PLAN                              │
> ├─────────────────────────────────────────────────────────────────────┤
> │ Index Scan Backward using idx on t (actual rows=1048576.00 loops=1) │
> │   Index Cond: ((a >= 16336) AND (a <= 49103))                       │
> │   Index Searches: 1                                                 │
> │   Buffers: shared hit=6082 read=77813                               │
> │   I/O Timings: shared read=324.305                                  │
> │ Planning Time: 0.071 ms                                             │
> │ Execution Time: 616.268 ms                                          │
> └─────────────────────────────────────────────────────────────────────┘
> (7 rows)
>
> Apparently random I/O is twice as fast as sequential I/O in descending order! In
> fact, this test case creates the appearance of random I/O being at least
> slightly faster than sequential I/O for pages read in _ascending_ order!
>
> Obviously something doesn't add up here. I'm no closer to explaining what the
> underlying problem is than I was yesterday, but I find it _very_ hard to believe
> that the inconsistency in performance has anything to do with SSD firmware/OS
> implementation details. It just looks wonky to me.
>
> Also possibly worth noting: I'm pretty sure that "shared hit=6082" is wrong.
> Though now it's wrong in the same way with both variants.
>
> Just for context, I'll show what the TIDs for 3 randomly chosen
> adjacent-in-index values look like after CLUSTER runs (in case it was unclear
> what I meant about "heap clustering without any heap correlation" earlier):
>

Interesting. It's really surprising that random I/O beats sequential
I/O here.

I investigated this from a different angle, by tracing the I/O requests
generated, using perf trace. And the patterns are massively different.
What I did is roughly this:

1) restart the instance (with direct I/O)
2) perf trace record -m 128M -a -o $(date +%s).trace
3) run the query, and find the I/O worker PIDs using pgrep 'io worker'
4) stop the trace
5) extract the pread64 events for the I/O workers from the trace

I get these event counts:

  Q1 ASC  -  5395
  Q1 DESC - 49969
  Q2 ASC  - 32804
  Q2 DESC - 49958

It's interesting that the DESC queries get to do almost exactly the
same number of pread calls.
Anyway, small samples of the trace look like this:

Q1 ASC

pread64(fd: 7, buf: 0x7f6011b7f000, count: 81920, pos: 475193344)
pread64(fd: 24, buf: 0x7f6011b95000, count: 131072, pos: 475275264)
pread64(fd: 7, buf: 0x7f6011bb7000, count: 131072, pos: 475406336)
pread64(fd: 24, buf: 0x7f6011bd9000, count: 131072, pos: 475537408)
pread64(fd: 7, buf: 0x7f6011bfb000, count: 81920, pos: 475668480)
pread64(fd: 24, buf: 0x7f6011c0f000, count: 24576, pos: 475750400)
pread64(fd: 24, buf: 0x7f6011c15000, count: 24576, pos: 475774976)
pread64(fd: 24, buf: 0x7f6011c1d000, count: 131072, pos: 475799552)
pread64(fd: 7, buf: 0x7f6011c3f000, count: 106496, pos: 475930624)
pread64(fd: 24, buf: 0x7f6011c59000, count: 24576, pos: 476037120)
pread64(fd: 24, buf: 0x7f6011c61000, count: 131072, pos: 476061696)
pread64(fd: 7, buf: 0x7f6011c83000, count: 131072, pos: 476192768)
pread64(fd: 24, buf: 0x7f6011ca3000, count: 24576, pos: 476323840)
pread64(fd: 24, buf: 0x7f6011ca9000, count: 24576, pos: 476348416)
pread64(fd: 24, buf: 0x7f6011cb1000, count: 131072, pos: 476372992)
pread64(fd: 7, buf: 0x7f6011cd1000, count: 57344, pos: 476504064)

Q1 DESC

pread64(fd: 24, buf: 0x7fa8c1735000, count: 8192, pos: 230883328)
pread64(fd: 7, buf: 0x7fa8c1737000, count: 8192, pos: 230875136)
pread64(fd: 6, buf: 0x7fa8c173b000, count: 8192, pos: 230866944)
pread64(fd: 24, buf: 0x7fa8c173d000, count: 8192, pos: 230858752)
pread64(fd: 7, buf: 0x7fa8c173f000, count: 8192, pos: 230850560)
pread64(fd: 6, buf: 0x7fa8c1741000, count: 8192, pos: 230842368)
pread64(fd: 24, buf: 0x7fa8c1743000, count: 8192, pos: 230834176)
pread64(fd: 7, buf: 0x7fa8c1745000, count: 8192, pos: 230825984)
pread64(fd: 24, buf: 0x7fa8c1747000, count: 8192, pos: 230817792)
pread64(fd: 6, buf: 0x7fa8c1749000, count: 8192, pos: 230809600)
pread64(fd: 7, buf: 0x7fa8c174b000, count: 8192, pos: 230801408)
pread64(fd: 24, buf: 0x7fa8c174d000, count: 8192, pos: 230793216)
pread64(fd: 6, buf: 0x7fa8c174f000, count: 8192, pos: 230785024)
pread64(fd: 7, buf: 0x7fa8c1751000, count: 8192, pos: 230776832)
pread64(fd: 24, buf: 0x7fa8c1753000, count: 8192, pos: 230768640)
pread64(fd: 7, buf: 0x7fa8c1755000, count: 8192, pos: 230760448)
pread64(fd: 6, buf: 0x7fa8c1757000, count: 8192, pos: 230752256)

Q2 ASC

pread64(fd: 7, buf: 0x7fb8bbf27000, count: 8192, pos: 258695168)
pread64(fd: 24, buf: 0x7fb8bbf29000, count: 16384, pos: 258678784)
pread64(fd: 7, buf: 0x7fb8bbf2d000, count: 8192, pos: 258670592)
pread64(fd: 24, buf: 0x7fb8bbf2f000, count: 16384, pos: 258654208)
pread64(fd: 7, buf: 0x7fb8bbf33000, count: 8192, pos: 258646016)
pread64(fd: 24, buf: 0x7fb8bbf35000, count: 16384, pos: 258629632)
pread64(fd: 7, buf: 0x7fb8bbf39000, count: 8192, pos: 258621440)
pread64(fd: 24, buf: 0x7fb8bbf3d000, count: 16384, pos: 258605056)
pread64(fd: 7, buf: 0x7fb8bbf41000, count: 8192, pos: 258596864)
pread64(fd: 24, buf: 0x7fb8bbf43000, count: 16384, pos: 258580480)
pread64(fd: 7, buf: 0x7fb8bbf47000, count: 8192, pos: 258572288)
pread64(fd: 24, buf: 0x7fb8bbf49000, count: 16384, pos: 258555904)
pread64(fd: 7, buf: 0x7fb8bbf4d000, count: 8192, pos: 258547712)
pread64(fd: 24, buf: 0x7fb8bbf4f000, count: 16384, pos: 258531328)
pread64(fd: 7, buf: 0x7fb8bbf53000, count: 16384, pos: 258514944)
pread64(fd: 24, buf: 0x7fb8bbf57000, count: 8192, pos: 258506752)
pread64(fd: 7, buf: 0x7fb8bbf59000, count: 8192, pos: 258498560)
pread64(fd: 24, buf: 0x7fb8bbf5b000, count: 16384, pos: 258482176)

Q2 DESC

pread64(fd: 24, buf: 0x7fdcf0451000, count: 8192, pos: 598974464)
pread64(fd: 7, buf: 0x7fdcf0453000, count: 8192, pos: 598999040)
pread64(fd: 6, buf: 0x7fdcf0455000, count: 8192, pos: 598990848)
pread64(fd: 24, buf: 0x7fdcf0459000, count: 8192, pos: 599007232)
pread64(fd: 7, buf: 0x7fdcf045b000, count: 8192, pos: 599023616)
pread64(fd: 6, buf: 0x7fdcf045d000, count: 8192, pos: 599015424)
pread64(fd: 24, buf: 0x7fdcf045f000, count: 8192, pos: 599031808)
pread64(fd: 7, buf: 0x7fdcf0461000, count: 8192, pos: 599048192)
pread64(fd: 6, buf: 0x7fdcf0463000, count: 8192, pos: 599040000)
pread64(fd: 24, buf: 0x7fdcf0465000, count: 8192, pos: 599056384)
pread64(fd: 7, buf: 0x7fdcf0467000, count: 8192, pos: 599072768)
pread64(fd: 6, buf: 0x7fdcf0469000, count: 8192, pos: 599064576)
pread64(fd: 24, buf: 0x7fdcf046b000, count: 8192, pos: 599080960)
pread64(fd: 7, buf: 0x7fdcf046d000, count: 8192, pos: 599097344)
pread64(fd: 6, buf: 0x7fdcf046f000, count: 8192, pos: 599089152)
pread64(fd: 24, buf: 0x7fdcf0471000, count: 8192, pos: 599105536)
pread64(fd: 7, buf: 0x7fdcf0473000, count: 8192, pos: 599121920)
pread64(fd: 6, buf: 0x7fdcf0475000, count: 8192, pos: 599113728)

So, Q1 ASC gets to combine the I/O into nice large chunks. But the DESC
queries end up doing a stream of 8KB requests. Q2 ASC gets to do 16KB
reads in about half the cases, but the rest is still 8KB.
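To illustrate why the ASC pattern combines while DESC does not, here is
a trivial simulation of the combining rule. This is only a sketch, not
the actual read_stream.c logic -- it assumes a read is extended only
when the next block immediately follows the end of the pending read, up
to a 16-block (128KB) combine limit:

#include <stdio.h>

#define COMBINE_LIMIT 16    /* 16 x 8KB blocks = 128KB */

/*
 * Count the pread-style calls needed for a sequence of block numbers,
 * merging a block into the pending read only if it immediately follows
 * the end of that read (a forward-only combining rule).
 */
int
count_reads(const int *blocks, int nblocks)
{
    int nreads = 0;
    int start = -1;
    int len = 0;

    for (int i = 0; i < nblocks; i++)
    {
        if (len > 0 && blocks[i] == start + len && len < COMBINE_LIMIT)
        {
            len++;          /* extends the pending read */
            continue;
        }
        if (len > 0)
            nreads++;       /* flush the pending read */
        start = blocks[i];
        len = 1;
    }
    if (len > 0)
        nreads++;           /* flush the final read */
    return nreads;
}

int
main(void)
{
    int asc[64];
    int desc[64];

    for (int i = 0; i < 64; i++)
    {
        asc[i] = i;         /* 0, 1, 2, ... */
        desc[i] = 63 - i;   /* 63, 62, 61, ... */
    }

    printf("ASC:  %d reads\n", count_reads(asc, 64));   /* 4 x 128KB */
    printf("DESC: %d reads\n", count_reads(desc, 64));  /* 64 x 8KB */
    return 0;
}

The descending sequence never satisfies "next == start + len", so every
block becomes its own 8KB read -- matching the flood of "count: 8192"
requests in the DESC traces, while ASC collapses into a handful of
"count: 131072" requests.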
FWIW I believe this is what Thomas Munro meant in [1]:

    You'll probably see a flood of uncombined 8KB IOs in the pg_aios
    view while travelling up the heap with cache misses today.

It wasn't quite this obvious in pg_aios, though. I've usually seen only
a single event there, so it was hard to draw any conclusions. The trace
makes it pretty obvious, though. We don't combine the I/O, and we also
know Linux in fact does not do any readahead for backwards scans.

regards

[1] https://www.postgresql.org/message-id/CA%2BhUKGKMaZLmNQHaa_DZMw9MJJKGegjrqnTY3KOZB-_nvFa3wQ%40mail.gmail.com

--
Tomas Vondra