Re: index prefetching - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: b739c179-25ca-4371-b528-cf80ea111d94@vondra.me
In response to: Re: index prefetching (Tomas Vondra <tomas@vondra.me>)
Responses: Re: index prefetching
List: pgsql-hackers
On 8/12/25 23:52, Tomas Vondra wrote:
>
> On 8/12/25 23:22, Peter Geoghegan wrote:
>> ...
>>
>> It looks like the patch does significantly better with the forwards scan,
>> compared to the backwards scan (though both are improved by a lot). But that's
>> not the main thing about these results that I find interesting.
>>
>> The really odd thing is that we get "shared hit=6619 read=49933" for the
>> forwards scan, and "shared hit=10350 read=49933" for the backwards scan. The
>> latter matches master (regardless of the scan direction used on master), while
>> the former just looks wrong. What explains the "missing buffer hits" seen with
>> the forwards scan?
>>
>> Discrepancies
>> -------------
>>
>> All 4 query executions agree that "rows=1048576.00", so the patch doesn't appear
>> to simply be broken/giving wrong answers. Might it be that the "Buffers"
>> instrumentation is broken?
>>
>
> I think a bug in the prefetch patch is more likely. I tried with a patch
> that adds various prefetch-related counters to explain, and I see this:
>
> test=# EXPLAIN (ANALYZE, VERBOSE, COSTS OFF) SELECT * FROM t WHERE a
> BETWEEN 16336 AND 49103 ORDER BY a;
>
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Index Scan using idx on public.t (actual time=0.682..527.055
>  rows=1048576.00 loops=1)
>    Output: a, b
>    Index Cond: ((t.a >= 16336) AND (t.a <= 49103))
>    Index Searches: 1
>    Prefetch Distance: 271.263
>    Prefetch Count: 60888
>    Prefetch Stalls: 1
>    Prefetch Skips: 991211
>    Prefetch Resets: 3
>    Prefetch Histogram: [2,4) => 2, [4,8) => 8, [8,16) => 17,
>    [16,32) => 24, [32,64) => 34, [64,128) => 52, [128,256) => 82,
>    [256,512) => 60669
>    Buffers: shared hit=5027 read=50872
>    I/O Timings: shared read=33.528
>  Planning:
>    Buffers: shared hit=78 read=23
>    I/O Timings: shared read=2.349
>  Planning Time: 3.686 ms
>  Execution Time: 559.659 ms
> (17 rows)
>
> test=# EXPLAIN (ANALYZE, VERBOSE, COSTS OFF) SELECT * FROM t WHERE a
> BETWEEN 16336 AND 49103 ORDER BY a DESC;
>
>                                QUERY PLAN
> ------------------------------------------------------------------------
>  Index Scan Backward using idx on public.t (actual time=1.110..4116.201
>  rows=1048576.00 loops=1)
>    Output: a, b
>    Index Cond: ((t.a >= 16336) AND (t.a <= 49103))
>    Index Searches: 1
>    Prefetch Distance: 271.061
>    Prefetch Count: 118806
>    Prefetch Stalls: 1
>    Prefetch Skips: 962515
>    Prefetch Resets: 3
>    Prefetch Histogram: [2,4) => 2, [4,8) => 7, [8,16) => 12,
>    [16,32) => 17, [32,64) => 24, [64,128) => 3, [128,256) => 4,
>    [256,512) => 118737
>    Buffers: shared hit=30024 read=50872
>    I/O Timings: shared read=581.353
>  Planning:
>    Buffers: shared hit=82 read=23
>    I/O Timings: shared read=3.168
>  Planning Time: 4.289 ms
>  Execution Time: 4185.407 ms
> (17 rows)
>
> These two parts are interesting:
>
>   Prefetch Count: 60888
>   Prefetch Skips: 991211
>
>   Prefetch Count: 118806
>   Prefetch Skips: 962515
>
> It looks like the backwards scan skips fewer blocks. This is based on
> the lastBlock optimization, i.e. looking for runs of the same block
> number. I don't quite see why it would affect just the backwards scan,
> though. Seems weird.

Actually, this might be a consequence of how backwards scans work (at
least in btree).
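For anyone not following the patch closely: by "lastBlock optimization"
I mean a simple one-block deduplication in the callback feeding the
read stream. Conceptually it's something like the sketch below -
simplified, with illustrative names (IndexPrefetchState, get_next_tid,
lastBlock), not the actual patch code:

/*
 * Sketch of the lastBlock check: when a run of TIDs points to the
 * same heap block, only the first request reaches the read stream
 * (counted as "Prefetch Count"); the rest are "Prefetch Skips".
 */
static BlockNumber
read_next_sketch(IndexPrefetchState *state)
{
    ItemPointer tid;

    while ((tid = get_next_tid(state)) != NULL)
    {
        BlockNumber block = ItemPointerGetBlockNumber(tid);

        if (block == state->lastBlock)
            continue;           /* consecutive duplicate => skip */

        state->lastBlock = block;
        return block;           /* passed to the read stream */
    }

    return InvalidBlockNumber;  /* end of scan */
}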
I logged the block in index_scan_stream_read_next, and this is what I
see in the forward scan (at the beginning):

index_scan_stream_read_next: block 24891
index_scan_stream_read_next: block 24892
index_scan_stream_read_next: block 24893
index_scan_stream_read_next: block 24892
index_scan_stream_read_next: block 24893
index_scan_stream_read_next: block 24894
index_scan_stream_read_next: block 24895
index_scan_stream_read_next: block 24896
index_scan_stream_read_next: block 24895
index_scan_stream_read_next: block 24896
index_scan_stream_read_next: block 24897
index_scan_stream_read_next: block 24898
index_scan_stream_read_next: block 24899
index_scan_stream_read_next: block 24900
index_scan_stream_read_next: block 24901
index_scan_stream_read_next: block 24902
index_scan_stream_read_next: block 24903
index_scan_stream_read_next: block 24904
index_scan_stream_read_next: block 24905
index_scan_stream_read_next: block 24906
index_scan_stream_read_next: block 24907
index_scan_stream_read_next: block 24908
index_scan_stream_read_next: block 24909
index_scan_stream_read_next: block 24910

while in the backwards scan (at the end) I see this:

index_scan_stream_read_next: block 24910
index_scan_stream_read_next: block 24911
index_scan_stream_read_next: block 24908
index_scan_stream_read_next: block 24909
index_scan_stream_read_next: block 24906
index_scan_stream_read_next: block 24907
index_scan_stream_read_next: block 24908
index_scan_stream_read_next: block 24905
index_scan_stream_read_next: block 24906
index_scan_stream_read_next: block 24903
index_scan_stream_read_next: block 24904
index_scan_stream_read_next: block 24905
index_scan_stream_read_next: block 24902
index_scan_stream_read_next: block 24903
index_scan_stream_read_next: block 24900
index_scan_stream_read_next: block 24901
index_scan_stream_read_next: block 24902
index_scan_stream_read_next: block 24899
index_scan_stream_read_next: block 24900
index_scan_stream_read_next: block 24897
index_scan_stream_read_next: block 24898
index_scan_stream_read_next: block 24899
index_scan_stream_read_next: block 24895
index_scan_stream_read_next: block 24896
index_scan_stream_read_next: block 24897
index_scan_stream_read_next: block 24894
index_scan_stream_read_next: block 24895
index_scan_stream_read_next: block 24896
index_scan_stream_read_next: block 24892
index_scan_stream_read_next: block 24893
index_scan_stream_read_next: block 24894
index_scan_stream_read_next: block 24891
index_scan_stream_read_next: block 24892
index_scan_stream_read_next: block 24893

These are only the blocks that ended up passed to the read stream, not
the skipped ones. And you can immediately see that the backward scan
requests more blocks for (roughly) the same part of the scan - the
min/max blocks roughly match.

The reason is pretty simple - the table is very correlated, and the
forward scan requests blocks mostly in the right order. Only rarely
does it have to jump "back" when progressing to the next value, and so
the lastBlock optimization works nicely.

But with the backwards scan we scan the values backwards, while the
blocks for each value are still accessed in the forward direction. So
we do a couple blocks "forward" and then jump to the preceding value -
but that's a couple blocks *back*. And that breaks the lastBlock check.

I believe this applies to both master and the prefetching patch, except
that master doesn't have the read stream - it only does sync I/O. Could
that hide the extra buffer accesses, somehow?
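To make the effect concrete, here's a tiny standalone toy model (made-up
parameters, not the actual patch or table): each index value's tuples
span two adjacent heap blocks, a few TIDs per block, and within a value
the blocks are always visited in forward order. We count how many block
requests survive a "same as previous block" filter in each direction:

/*
 * Toy model of the lastBlock check vs. scan direction. Value v's
 * tuples live on heap blocks v and v+1, TIDS_PER_BLK tuples each.
 * Within a value the blocks are emitted in forward order even when
 * the values themselves are scanned backwards.
 */
#include <stdio.h>

#define NVALUES      1000
#define TIDS_PER_BLK 3
#define NTIDS        (NVALUES * 2 * TIDS_PER_BLK)

/* count requests that differ from the immediately preceding block */
static int
dedup_passes(const int *blocks, int n)
{
    int last = -1, passes = 0;

    for (int i = 0; i < n; i++)
    {
        if (blocks[i] != last)
        {
            passes++;
            last = blocks[i];
        }
    }
    return passes;
}

/* emit the block numbers for one value, always in forward order */
static void
emit_value(int *blocks, int *n, int v)
{
    for (int b = v; b <= v + 1; b++)
        for (int t = 0; t < TIDS_PER_BLK; t++)
            blocks[(*n)++] = b;
}

int
main(void)
{
    static int fwd[NTIDS], bwd[NTIDS];
    int nf = 0, nb = 0;

    for (int v = 0; v < NVALUES; v++)       /* forward scan */
        emit_value(fwd, &nf, v);

    for (int v = NVALUES - 1; v >= 0; v--)  /* backwards scan */
        emit_value(bwd, &nb, v);

    printf("forward:  %d of %d pass\n", dedup_passes(fwd, nf), nf);
    printf("backward: %d of %d pass\n", dedup_passes(bwd, nb), nb);
    return 0;
}

With these parameters the forward scan passes roughly one request per
distinct block (NVALUES + 1), while the backwards scan passes about two
per value (2 * NVALUES) - the same shape as the Prefetch Count gap and
the extra buffer hits above.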
Anyway, this access pattern in backwards scans seems a bit unfortunate.

regards

--
Tomas Vondra