Thread: Streaming relation data out of order
Hi,

This idea is due to Robert Haas, who complained that he feared that the streaming I/O API already worked like this. It doesn't, but it could! Here is a concept patch to try it out.

Normally, read_stream_next_buffer() spits out buffers in the order that the user's callback generated block numbers. This option says that any order would be OK.

I had been assuming that this sort of thing would come with real asynchronous I/O: if I/Os complete out of order and the caller explicitly said she doesn't care about block order, we can stream them as the completion events arrive. But even with synchronous I/O, we could stream already-in-cache blocks before ones that require I/O. Letting the caller chew on blocks that are already available maximises the time between fadvise() and preadv() for misses, which minimises the likelihood that the process will have to go to "D" sleep.

The patch is pretty trivial: if started with the READ_STREAM_OUT_OF_ORDER flag, "hit" buffers are allowed to jump in front of "miss" buffers in the queue. The attached coding may not be optimal, it's just a proof of concept.

ANALYZE benefits from this, for example, with certain kinds of partially cached initial states and fast disks (?). I'm not sure how generally useful it is though. I'm posting it because I wonder if it could be interesting for the streaming bitmap heapscan project, and I wonder how many other things don't care about the order.
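[Editor's note: the "hits jump in front of misses" behaviour can be illustrated with a toy model. This is a standalone sketch, not the actual read_stream.c queue logic; deliver() and its parameters are invented names for illustration only.]

```c
/*
 * Toy model of the hypothetical READ_STREAM_OUT_OF_ORDER behaviour
 * described above: among the blocks currently in the lookahead window,
 * cache hits may be delivered before an earlier block that needs I/O,
 * letting the consumer work while the misses' I/O is in flight.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Deliver 'n' blocks into 'out'.  is_hit[i] says whether blocks[i] was
 * found in cache.  In-order mode preserves callback order; out-of-order
 * mode delivers all hits first (already available), then the misses
 * (imagined to arrive as their I/O completes).  Returns count written.
 */
size_t
deliver(const int *blocks, const bool *is_hit, size_t n,
		bool out_of_order, int *out)
{
	size_t		k = 0;

	if (!out_of_order)
	{
		for (size_t i = 0; i < n; i++)
			out[k++] = blocks[i];
		return k;
	}

	/* hits jump the queue: they are available immediately */
	for (size_t i = 0; i < n; i++)
		if (is_hit[i])
			out[k++] = blocks[i];

	/* misses follow once their (simulated) I/O completes */
	for (size_t i = 0; i < n; i++)
		if (!is_hit[i])
			out[k++] = blocks[i];
	return k;
}
```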
On Tue, Apr 9, 2024 at 12:19 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> This idea is due to Robert Haas, who complained that he feared that
> the streaming I/O API already worked like this. It doesn't, but it
> could! Here is a concept patch to try it out.

Oh, no, it's all my fault! My favorite part of this patch is the comment added to read_stream.h, which IMHO makes things a whole lot clearer. I have no opinion on the rest of it; I don't understand this code.

--
Robert Haas
EDB: http://www.enterprisedb.com
Here is a more fleshed out version of this concept patch, now that we have lots of streams wired up to try it with. Bitmap Heap Scan seemed to be a good candidate.

postgres=# create table t (a int unique, b int unique);
CREATE TABLE
postgres=# insert into t select generate_series(1, 100000), generate_series(1, 100000);
INSERT 0 100000
postgres=# select ctid from t where (a between 1 and 8000 or b between 1 and 8000) and ctid::text like '%,1)';
  ctid
--------
 (0,1)
 (1,1)
 (2,1)
 ... pages in order ...
 (33,1)
 (34,1)
 (35,1)
(36 rows)

postgres=# select pg_buffercache_evict(bufferid) from pg_buffercache where relfilenode = pg_relation_filenode('t'::regclass) and relblocknumber % 3 != 2;
... some pages being kicked out to create a mixed hit/miss scan ...

postgres=# select ctid from t where (a between 1 and 8000 or b between 1 and 8000) and ctid::text like '%,1)';
  ctid
--------
 (0,1)
 (1,1)
 (2,1)
 ... order during early distance ramp-up ...
 (12,1)
 (20,1)    <-- chaos reigns
 (23,1)
 (17,1)
 (26,1)
 (13,1)
 (29,1)
 ...
 (34,1)
(36 rows)

One weird issue, not just with reordering, is that read_stream.c's distance cools off too easily with some simple test patterns. Start at 1, double on miss, subtract one on hit, repeat, and you can be trapped alternating between 1 and 2 when you'd certainly benefit from IO concurrency and also reordering. It may need a longer memory. That seemed like too artificial a problem to worry about for v18, but it's why I used relblocknumber % 3 != 2 and not e.g. relblocknumber % 2 != 1 above.
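[Editor's note: the trap in the last paragraph can be reproduced with a toy simulation of the heuristic as described (double on miss, subtract one on hit, floor of 1). This is a standalone sketch, not read_stream.c's actual code; simulate_max_distance and max_distance are invented names.]

```c
/*
 * Toy simulation of the lookahead distance heuristic described above:
 * distance starts at 1, doubles on a miss (clamped to max_distance),
 * and drops by one on a hit (never below 1).  With a strictly
 * alternating hit/miss pattern it never escapes the 1..2 range.
 */
#include <assert.h>
#include <stdbool.h>

/* Run the heuristic over a hit/miss pattern; return the largest distance seen. */
int
simulate_max_distance(const bool *is_hit, int nblocks, int max_distance)
{
	int			distance = 1;
	int			max_seen = 1;

	for (int i = 0; i < nblocks; i++)
	{
		if (is_hit[i])
		{
			if (distance > 1)
				distance--;		/* cool off on a hit */
		}
		else
		{
			distance *= 2;		/* heat up on a miss */
			if (distance > max_distance)
				distance = max_distance;
		}
		if (distance > max_seen)
			max_seen = distance;
	}
	return max_seen;
}
```

With the relblocknumber % 2 != 1 eviction pattern (miss, hit, miss, hit, ...) the distance is trapped at 2, while the % 3 != 2 pattern (miss, miss, hit, ...) ramps up.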
Hi,

On 2025-04-09 19:17:00 +1200, Thomas Munro wrote:
> One weird issue, not just with reordering, is that read_stream.c's
> distance cools off too easily with some simple test patterns. Start
> at 1, double on miss, subtract one on hit, repeat, and you can be
> trapped alternating between 1 and 2 when you'd certainly benefit from
> IO concurrency and also reordering. It may need a longer memory.

Just having a hit in itself really is no reason to reduce the distance IMO. It's completely normal to have a few hits when doing readahead, but a few hits in a row doesn't mean the IO latency of the subsequent miss doesn't matter anymore. But if our entire lookahead consists only of hits, we actually are in a pattern with no misses for the moment, and can reduce the distance.

Another issue is that we, afaict, overweigh hits rather substantially due to IO combining. A single one-block hit (hits are never bigger) decreases the distance, but a read of size N only affects the distance once, not N times.

I.e. what I am suggesting is something like:

1) Increase distance by * 2 + read_size on a miss

   This would help us to increase with distance = 1, where * 2 only increases distance by 1, just as -1 obviously reduces it by 1.

   It'd also somewhat address the over-weighing of hits compared to misses.

2) Only reduce distance if we have no outstanding IOs

> Subject: [PATCH v2 1/4] Move read stream modulo arithmetic into functions.
>
> Several places have open-coded circular index arithmetic. Make some
> common functions for better readability and consistent assertion
> checking. This avoids adding yet more open-coding in later patches.
> ---
>  src/backend/storage/aio/read_stream.c | 93 +++++++++++++++++++++------
>  1 file changed, 74 insertions(+), 19 deletions(-)
>
> diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
> index 0e7f5557f5c..0d3dd65bfed 100644
> --- a/src/backend/storage/aio/read_stream.c
> +++ b/src/backend/storage/aio/read_stream.c
> @@ -216,6 +216,67 @@ read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
>  	stream->buffered_blocknum = blocknum;
>  }
>
> +/*
> + * Increment buffer index, wrapping around at queue size.
> + */
> +static inline void
> +read_stream_advance_buffer(ReadStream *stream, int16 *index)
> +{
> +	Assert(*index >= 0);
> +	Assert(*index < stream->queue_size);
> +
> +	*index += 1;
> +	if (*index == stream->queue_size)
> +		*index = 0;
> +}
> +
> +/*
> + * Increment buffer index by n, wrapping around at queue size.
> + */
> +static inline void
> +read_stream_advance_buffer_n(ReadStream *stream, int16 *index, int16 n)
> +{
> +	Assert(*index >= 0);
> +	Assert(*index < stream->queue_size);
> +	Assert(n <= stream->io_combine_limit);
> +
> +	*index += n;
> +	if (*index >= stream->queue_size)
> +		*index -= stream->queue_size;
> +
> +	Assert(*index >= 0);
> +	Assert(*index < stream->queue_size);
> +}

Not sure if it's possible to have a large enough queue_size, but this doesn't look like it's safe against *index + n > PG_INT16_MAX. The value would wrap into the negative.

I think all these variables really ought to be unsigned. None of them ever could be negative afaict. Afaict that would lead to the correct behaviour when wrapping around.

Greetings,

Andres Freund
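[Editor's note: rule 1 proposed above can be sketched as follows. This is a toy model of the suggested update rule, not committed code; grow_on_miss and its clamp are illustrative names.]

```c
/*
 * Sketch of rule 1 above: on a miss, grow the lookahead distance by
 * "* 2 + read_size" instead of a plain "* 2".  That lets a distance of 1
 * grow by more than the single step that 1 * 2 gives (symmetric with the
 * -1 on a hit), and makes a combined read of N blocks count N times.
 */
#include <assert.h>

int
grow_on_miss(int distance, int read_size, int max_distance)
{
	distance = distance * 2 + read_size;
	if (distance > max_distance)
		distance = max_distance;	/* clamp, as the real code must */
	return distance;
}
```

From distance = 1 with a single-block read this yields 3 rather than 2, escaping the 1-2 oscillation; larger combined reads push it up faster still.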
On Thu, Apr 10, 2025 at 7:35 AM Andres Freund <andres@anarazel.de> wrote:
> 1) Increase distance by * 2 + read_size on a miss
>
>    This would help us to increase with distance = 1, where * 2 only increases
>    distance by 1, just as -1 obviously reduces it by 1.
>
>    It'd also somewhat address the over-weighing of hits compared to misses.
>
> 2) Only reduce distance if we have no outstanding IOs

Interesting. That's similar in some ways to one of my own attempts: I had an extra counter called sustain, which had to be counted down to zero before distance started decrementing. I didn't want the distance itself to get higher than it already did by doubling once per (combined) IO, I just wanted to defer the decay. (Names from an earlier abandoned attempt to use synthesiser terminology: attack, decay, sustain, release; cute but useless.) I tried a few different recipes, but just adding read sizes to sustain works out pretty close to your pleasingly simple rule 2: sustain until we can't see any IOs in the lookahead window. But I also tried adding a bit more for longer sustain, sustain += read_size * 2 or whatever, so you don't give up your window so easily on a run of hits, and it also achieves escape velocity from the gravity of 1. IDK, I hadn't formed any strong opinions yet and need to test more stuff, hence the lack of concrete patches. Thanks, this was helpful, I'll play around with it!

> > +/*
> > + * Increment buffer index by n, wrapping around at queue size.
> > + */
> > +static inline void
> > +read_stream_advance_buffer_n(ReadStream *stream, int16 *index, int16 n)
> > +{
> > +	Assert(*index >= 0);
> > +	Assert(*index < stream->queue_size);
> > +	Assert(n <= stream->io_combine_limit);
> > +
> > +	*index += n;
> > +	if (*index >= stream->queue_size)
> > +		*index -= stream->queue_size;
> > +
> > +	Assert(*index >= 0);
> > +	Assert(*index < stream->queue_size);
> > +}
>
> Not sure if it's possible to have a large enough queue_size, but this
> doesn't look like it's safe against *index + n > PG_INT16_MAX. The value
> would wrap into the negative.

No, because the queue is sized such that indexes into this overflow space also fit under INT16_MAX:

	/*
	 * If starting a multi-block I/O near the end of the queue, we might
	 * temporarily need extra space for overflowing buffers before they are
	 * moved to regular circular position.  This is the maximum extra space we
	 * could need.
	 */
	queue_overflow = io_combine_limit - 1;

The patch clearly lacks a comment to explain that mystery. Will add. Thanks!

> I think all these variables really ought to be unsigned. None of them ever
> could be negative afaict.

Maybe... I tend to default to signed variables when there isn't a good reason to prefer unsigned. But I have also been mulling a change here since:

commit 06fb5612c970b3af95aca3db5a955669b07537ca
Author: Thomas Munro <tmunro@postgresql.org>
Date:   Wed Mar 19 15:23:12 2025 +1300

    Increase io_combine_limit range to 1MB.

Unfortunately uint16 still needs to worry about UINT16_MAX (effective_io_concurrency = 1000 times io_combine_limit = 128 gives 128,000 buffers + a few, still one bit short; before the above commit the cap was 32,000 buffers + a few, which is why int16 was enough when the code was written). Maybe we should just go directly to int, or uint32 if you insist, and stop worrying about overflow.
I think we agreed that 128,000 buffers = 1000MB of in-progress I/O seemed unrealistic and 32K ought to be enough for anybody™, but really we're talking about saving a handful of bytes in struct ReadStream in exchange for a risk of bugs. Will write a patch to try that soon...
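[Editor's note: the sizing arithmetic in the last two messages can be checked mechanically. This is a standalone sketch; max_queue_entries is an illustrative name, not a real function, and it models the queue as in-flight buffers plus the io_combine_limit - 1 overflow slots mentioned above.]

```c
/*
 * Back-of-envelope check: the queue must hold up to
 * effective_io_concurrency * io_combine_limit buffers, plus
 * io_combine_limit - 1 overflow slots for a multi-block I/O started
 * near the end of the circular queue.  With the old 256KB cap
 * (io_combine_limit = 32) that fit in int16; with the new 1MB cap
 * (io_combine_limit = 128) it no longer fits even in uint16.
 */
#include <assert.h>
#include <stdint.h>

int32_t
max_queue_entries(int32_t eic, int32_t io_combine_limit)
{
	return eic * io_combine_limit + (io_combine_limit - 1);
}
```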