Re: index prefetching - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: index prefetching
Msg-id: CAH2-Wz=L7h-koDKa3_NEg39Faw7MrOkSVOsodvQ4toSQahvWjQ@mail.gmail.com
In response to: Re: index prefetching (Andres Freund <andres@anarazel.de>)
Responses: Re: index prefetching
List: pgsql-hackers
On Wed, Aug 13, 2025 at 11:28 AM Andres Freund <andres@anarazel.de> wrote:
> > With "sync" I always get this (after a restart):
> >
> > Buffers: shared hit=7435 read=52801
> >
> > while with "worker" I get this:
> >
> > Buffers: shared hit=4879 read=52801
> > Buffers: shared hit=5151 read=52801
> > Buffers: shared hit=4978 read=52801
> >
> > So not only does it change run to run, it also does not add up to 60236.
>
> This is reproducible on master? If so, how?

AFAIK it is *not* reproducible on master.

> If this happens only with the prefetching patch applied, is it possible
> that what happens here is that we occasionally re-request buffers that
> are already in the process of being read in? That would only happen with
> a read stream and io_method != sync (since with sync we won't read
> ahead). If we have to start reading in a buffer that's already
> undergoing IO, we wait for the IO to complete and count that access as a
> hit:

This theory seems quite plausible to me. Though it is a bit surprising
that I see incorrect buffer hit counts on the "good" forwards scan case,
rather than on the "bad" backwards scan case.
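For anyone following along without the tree in front of them: the code
path Andres is talking about is in WaitReadBuffers(), in bufmgr.c. From
memory it looks roughly like this (a paraphrased sketch, not an exact
excerpt -- details may differ):

    else
    {
        /*
         * Somebody else (possibly an earlier read started by this very
         * stream) already started the I/O for this buffer, so all we do
         * is wait for it to complete.  The access is then tracked as a
         * 'hit', even though it started out as a miss in
         * PinBufferForBlock() -- the backend that actually started the
         * I/O is the one that counts it as a 'read'.
         */
        if (persistence == RELPERSISTENCE_TEMP)
            pgBufferUsage.local_blks_hit += 1;
        else
            pgBufferUsage.shared_blks_hit += 1;
    }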
Here's what I mean by things being broken on the read stream side (at
least with certain backwards scan cases):

When I add instrumentation to the read stream side, by adding elog debug
calls that show the blocknum seen by read_stream_get_block, I see
out-of-order and repeated blocknums with the "bad" backwards scan case
("SELECT * FROM t WHERE a BETWEEN 16336 AND 49103 ORDER BY a desc"):

...
NOTICE:  index_scan_stream_read_next: index 1163 TID (25052,21)
WARNING:  prior lastBlock is 25053 for batchno 2856, new one: 25052
WARNING:  blocknum: 25052, 0x55614810efb0
WARNING:  blocknum: 25052, 0x55614810efb0
NOTICE:  index_scan_stream_read_next: index 1161 TID (25053,3)
WARNING:  prior lastBlock is 25052 for batchno 2856, new one: 25053
WARNING:  blocknum: 25053, 0x55614810efb0
NOTICE:  index_scan_stream_read_next: index 1160 TID (25052,19)
WARNING:  prior lastBlock is 25053 for batchno 2856, new one: 25052
WARNING:  blocknum: 25052, 0x55614810efb0
WARNING:  blocknum: 25052, 0x55614810efb0
NOTICE:  index_scan_stream_read_next: index 1141 TID (25051,21)
WARNING:  prior lastBlock is 25052 for batchno 2856, new one: 25051
WARNING:  blocknum: 25051, 0x55614810efb0
...

Notice that we see the same blocknum twice in close succession. Also
notice that we're passed 25052 and then subsequently passed 25053, only
to be passed 25052 once more.

OTOH, when I run the equivalent "good" forwards scan ("SELECT * FROM t
WHERE a BETWEEN 16336 AND 49103 ORDER BY a"), the output looks just
about perfect. I have to look around quite a bit longer before I can
find repeated blocknums within successive read_stream_get_block calls:

...
NOTICE:  index_scan_stream_read_next: index 303 TID (74783,1)
WARNING:  prior lastBlock is 74782 for batchno 2862, new one: 74783
WARNING:  blocknum: 74783, 0x55614810efb0
NOTICE:  index_scan_stream_read_next: index 323 TID (74784,1)
WARNING:  prior lastBlock is 74783 for batchno 2862, new one: 74784
WARNING:  blocknum: 74784, 0x55614810efb0
NOTICE:  index_scan_stream_read_next: index 324 TID (74783,21)
WARNING:  prior lastBlock is 74784 for batchno 2862, new one: 74783
WARNING:  blocknum: 74783, 0x55614810efb0
NOTICE:  index_scan_stream_read_next: index 325 TID (74784,2)
WARNING:  prior lastBlock is 74783 for batchno 2862, new one: 74784
WARNING:  blocknum: 74784, 0x55614810efb0
...

These out-of-order repeat requests are much rarer. And I *never* see
identical requests in *immediate* succession, whereas those are common
with the backwards scan case.

I believe that the out-of-order repeat requests shown here are a
legitimate consequence of the TIDs being slightly out of order in
relatively few places (so the forwards scan case may well already be
behaving exactly as I expect):

pg@regression:5432 [2470184]=# select ctid, a from t where ctid between '(74783,1)' and '(74784,1)';
┌────────────┬────────┐
│    ctid    │   a    │
├────────────┼────────┤
│ (74783,1)  │ 49,077 │
│ (74783,2)  │ 49,077 │
│ (74783,3)  │ 49,077 │
│ (74783,4)  │ 49,077 │
│ (74783,5)  │ 49,077 │
│ (74783,6)  │ 49,077 │
│ (74783,7)  │ 49,077 │
│ (74783,8)  │ 49,077 │
│ (74783,9)  │ 49,077 │
│ (74783,10) │ 49,077 │
│ (74783,11) │ 49,077 │
│ (74783,12) │ 49,077 │
│ (74783,13) │ 49,077 │
│ (74783,14) │ 49,077 │
│ (74783,15) │ 49,077 │
│ (74783,16) │ 49,077 │
│ (74783,17) │ 49,077 │
│ (74783,18) │ 49,077 │
│ (74783,19) │ 49,077 │
│ (74783,20) │ 49,077 │
│ (74783,21) │ 49,078 │
│ (74784,1)  │ 49,077 │
└────────────┴────────┘
(22 rows)

Bear in mind that EXPLAIN ANALYZE shows *identical* "Buffers:" details
for each query on master. So I believe that I am completely justified in
expecting the calls to read_stream_get_block for the backwards scan to
use identical blocknums to the ones for the equivalent/good forwards
scan (except that they should be in the exact opposite order). And yet
that's not what I see.

Maybe this is something to do with the read position and the stream
position becoming mixed up? I find it odd that the relevant read stream
callback, index_scan_stream_read_next, says "If the stream position is
undefined, just use the read position". That's just a guess, though.

This issue is tricky to debug. I'm not yet used to debugging problems
such as these (though I'll probably become an expert on it in the months
ahead).

--
Peter Geoghegan