From: Thomas Munro
Subject: Re: Streaming I/O, vectored I/O (WIP)
Msg-id: CA+hUKG+2mMiyJEDV-rNb5q__x9VJW5p8oJpGn9ZpovbyoVe1VA@mail.gmail.com
In response to: Re: Streaming I/O, vectored I/O (WIP) (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 12/03/2024 15:02, Thomas Munro wrote:
> > src/backend/storage/aio/streaming_read.c
> > src/include/storage/streaming_read.h
>
> Standard file header comments missing.

Fixed.

> It would be nice to have a comment at the top of streaming_read.c,
> explaining at a high level how the circular buffer, lookahead and all
> that works. Maybe even some diagrams.

Done.

> For example, what is head and what is tail? Before reading the code, I
> assumed that 'head' was the next block range to return in
> pg_streaming_read_buffer_get_next(). But I think it's actually the
> other way round?

Yeah. People seem to have different natural intuitions about head vs
tail in this sort of thing, so I've switched to descriptive names:
stream->{oldest,next}_buffer_index (see also below).

> > /*
> >  * Create a new streaming read object that can be used to perform the
> >  * equivalent of a series of ReadBuffer() calls for one fork of one relation.
> >  * Internally, it generates larger vectored reads where possible by looking
> >  * ahead.
> >  */
> > PgStreamingRead *
> > pg_streaming_read_buffer_alloc(int flags,
> >                                void *pgsr_private,
> >                                size_t per_buffer_data_size,
> >                                BufferAccessStrategy strategy,
> >                                BufferManagerRelation bmr,
> >                                ForkNumber forknum,
> >                                PgStreamingReadBufferCB next_block_cb)
>
> I'm not a fan of the name, especially the 'alloc' part. Yeah, most of
> the work it does is memory allocation. But I'd suggest something like
> 'pg_streaming_read_begin' instead.

I like it. Done.

> Do we really need the pg_ prefix in these?

Good question. My understanding of our convention is that pg_ is
needed for local replacements/variants/extensions of things with well
known names (pg_locale_t, pg_strdup(), yada yada), and perhaps also in
a few places where the word is very common/short and we want to avoid
collisions and make sure it's obviously ours (pg_popcount?), and I
guess places that reflect the name of a SQL identifier with a prefix,
but this doesn't seem to qualify for any of those things. It's a new
thing, our own thing entirely, and sufficiently distinctive and
unconfusable with standard stuff. So, prefix removed.

Lots of other patches on top of this one are using "pgsr" as a
variable name, ie containing that prefix; perhaps they would use "sr"
or "streaming_read" or "stream". I used "stream" in a few places in
this version.

Other names improved in this version IMHO: pgsr_private ->
callback_private. I find it clearer, as a way to indicate that the
provider of the callback "owns" it. I also reordered the arguments:
now it's streaming_read_buffer_begin(..., callback, callback_private,
per_buffer_data_size), to keep those three things together.

> > Buffer
> > pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
>
> Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next',
> for a shorter name.

Hmm. The idea of 'buffer' appearing in a couple of names is that there
are conceptually other kinds of I/O that we might want to stream, like
raw files or buffers other than the buffer pool, maybe even sockets,
so this would be part of a family of similar interfaces. I think it
needs to be clear that this variant gives you buffers. I'm OK with
removing "get" but I guess it would be better to keep the words in the
same order across the three functions? What about these?

streaming_read_buffer_begin();
streaming_read_buffer_next();
streaming_read_buffer_end();

Tried like that in this version.
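To make that concrete, a caller would look roughly like the sketch
below. This is only an illustration: the StreamingRead type name, the
flag handling, the argument order beyond the callback/callback_private/
per_buffer_data_size grouping mentioned above, the callback's exact
parameters and MyScanState are my shorthand here, not the patch.

/* Sketch only -- names and argument details are illustrative. */
static BlockNumber
my_next_block_cb(StreamingRead *stream,
                 void *callback_private,
                 void *per_buffer_data)
{
    MyScanState *scan = (MyScanState *) callback_private;  /* caller's own state */

    if (scan->next_block >= scan->nblocks)
        return InvalidBlockNumber;      /* end of stream */
    return scan->next_block++;
}

void
my_scan(Relation rel, MyScanState *scan)
{
    StreamingRead *stream;
    Buffer      buf;

    stream = streaming_read_buffer_begin(0,     /* flags, eg "whole relation" hint */
                                         NULL,  /* BufferAccessStrategy */
                                         BMR_REL(rel),
                                         MAIN_FORKNUM,
                                         my_next_block_cb,
                                         scan,  /* callback_private */
                                         0);    /* per_buffer_data_size */

    /* Look-ahead and vectored reads happen behind this interface. */
    while ((buf = streaming_read_buffer_next(stream, NULL)) != InvalidBuffer)
    {
        /* ... do something with the pinned buffer ... */
        ReleaseBuffer(buf);
    }

    streaming_read_buffer_end(stream);
}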
Other ideas would be to make "stream" the main noun,
buffered_read_stream_begin() or something. Ideas welcome.

It's also a bit grammatically weird to say StartReadBuffers() and
WaitReadBuffers() in the bufmgr API... Hmm. Perhaps we should just
call it ReadBuffers() and WaitForBufferIO()? Maybe surprising because
the former isn't just like ReadBuffer() ... but on the other hand no
one said it has to be, and sometimes it even is (when it gets a hit).
I suppose there could even be a flag READ_BUFFERS_WAIT or the opposite
to make the asynchrony optional or explicit if someone has a problem
with that. (Hmm, that'd be a bit like the Windows native file API,
where ReadFile() is synchronous or asynchronous depending on flags.)

> > /*
> >  * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
> >  * When it is full, there is an empty element between head and tail.  Head
> >  * can also be empty (nblocks == 0), therefore we need two extra elements
> >  * for non-occupied ranges, on top of max_pinned_buffers to allow for the
> >  * maxmimum possible number of occupied ranges of the smallest possible
> >  * size of one.
> >  */
> > size = max_pinned_buffers + 2;
>
> I didn't understand this explanation for why it's + 2.

I think the logic was right but the explanation was incomplete, sorry.
It needed one gap between head and tail because head == tail means
empty, and another because the tail item was still 'live' for one
extra call until you examined it one more time to notice that all the
buffers had been extracted. In any case I have now deleted all that.
In this new version I don't need extra space between head and tail at
all, because empty is detected with stream->pinned_buffers == 0, and
tail moves forwards immediately when you consume buffers.

> > /*
> >  * Skip the initial ramp-up phase if the caller says we're going to be
> >  * reading the whole relation.  This way we start out doing full-sized
> >  * reads.
> >  */
> > if (flags & PGSR_FLAG_FULL)
> >     pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
> > else
> >     pgsr->distance = 1;
>
> Should this be "Max(MAX_BUFFERS_PER_TRANSFER,
> pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than
> MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So
> perhaps just 'pgsr->distance = pgsr->max_pinned_buffers' ?

Right, done.

Here are some other changes:

* I'm fairly happy with the ABC adaptive distance algorithm so far, I
think, but I spent more time tidying up the way it is implemented. I
didn't like the way each 'range' had buffer[MAX_BUFFERS_PER_TRANSFER],
so I created a new dense array stream->buffers that behaved as a
second circular queue.

* The above also made it trivial for MAX_BUFFERS_PER_TRANSFER to
become the GUC that it always wanted to be: buffer_io_size defaulting
to 128kB. Seems like a reasonable thing to have? Could also influence
things like bulk write? (The main problem I have with the GUC
currently is choosing a category, async resources is wrong....)

* By analogy, it started to look a bit funny that each range had room
for a ReadBuffersOperation, and we had enough ranges for
max_pinned_buffers * 1 block range. So I booted that out to another
dense array, of size max_ios.
* At the same time, Bilal and Andres had been complaining privately
about 'range' management overheads showing up in perf and creating a
regression against master on fully cached scans that do nothing else
(eg pg_prewarm, where we lookup, pin, unpin every page and do no I/O
and no CPU work with the page, a somewhat extreme case but a
reasonable way to isolate the management costs); having made the above
change, it suddenly seemed obvious that I should make the buffers
array the 'main' circular queue, pointing off to another place for
information required for dealing with misses. In this version, there
are no more range objects. This feels better and occupies and touches
less memory. See pictures below.

* The 'head range' is replaced by the pending_read_{blocknum,nblocks}
variables, which seems easier to understand. Essentially the
callback's block numbers are run-length compressed into there until
they can't be, at which point we start a read and start forming a new
pending read.

* A micro-optimisation arranges for the zero slots to be reused over
and over again if we're doing distance = 1, to avoid rotating through
memory for no benefit; I don't yet know if that pays for itself, it's
just a couple of lines...

* Various indexes and sizes that couldn't quite fit in uint8_t but
couldn't possibly exceed a few thousand because they are bounded by
numbers deriving from range-limited GUCs are now int16_t (while I was
looking for low hanging opportunities to reduce memory usage...)

In pictures, the circular queue arrangement changed from
max_pinned_buffers * fat range objects like this, where only the
per-buffer data was outside the range object (because its size is
variable), and in the worst case all-hit case there were a lot of
ranges with only one buffer in them:

           ranges                          buf/data

        +--------+--------------+----+      +-----+
        |        |              |    |      |     |
        +--------+--------------+----+      +-----+
tail -> | 10..10 | buffers[MAX] | op +----->|  ?  |
        +--------+--------------+----+      +-----+
        | 42..44 | buffers[MAX] | op +----->|  ?  |
        +--------+--------------+----+      +-----+
        | 60..60 | buffers[MAX] | op +--+   |  ?  |
        +--------+--------------+----+  |   +-----+
head -> |        |              |    |  |   |  ?  |
        +--------+--------------+----+  |   +-----+
        |        |              |    |  +-->|  ?  |
        +--------+--------------+----+      +-----+
        |        |              |    |      |     |
        +--------+--------------+----+      +-----+

... to something that you might call a "columnar" layout, where ops
are kicked out to their own array of size max_ios (a much smaller
number), and the buffers, per-buffer data and indexes pointing to
optional I/O objects are in parallel arrays of size
max_pinned_buffers, like this:

                       buffers buf/data buf/io       ios (= ops)

                       +----+  +-----+  +---+        +--------+
                       |    |  |     |  |   |  +---->| 42..44 |
                       +----+  +-----+  +---+  |     +--------+
oldest_buffer_index -> | 10 |  |  ?  |  |   |  | +-->| 60..60 |
                       +----+  +-----+  +---+  | |   +--------+
                       | 42 |  |  ?  |  | 0 +--+ |   |        |
                       +----+  +-----+  +---+    |   +--------+
                       | 43 |  |  ?  |  |   |    |   |        |
                       +----+  +-----+  +---+    |   +--------+
                       | 44 |  |  ?  |  |   |    |   |        |
                       +----+  +-----+  +---+    |   +--------+
                       | 60 |  |  ?  |  | 1 +----+
                       +----+  +-----+  +---+
  next_buffer_index -> |    |  |     |  |   |
                       +----+  +-----+  +---+

In other words, there is essentially no waste/padding now, and in the
all-hit case we stay in the zero'th elements of those arrays so they
can stay red hot.

Still working on validating this refactoring with other patches and
test scenarios. I hope it's easier to understand, and does a better
job of explaining itself.
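Or, for anyone who prefers code to pictures, roughly like the
following. This is only a paraphrase of the description above: the
field and helper names, the -1 sentinel and max_blocks_per_read are my
shorthand for this email, not lifted from the patch.

/* Sketch of the "columnar" arrangement described above. */
typedef struct StreamingRead
{
    int16       max_pinned_buffers;
    int16       pinned_buffers;         /* queue is empty when this is 0 */
    int16       oldest_buffer_index;    /* next buffer to hand out */
    int16       next_buffer_index;      /* where new buffers are appended */
    int16       distance;               /* current look-ahead distance */
    int16       max_blocks_per_read;    /* derived from the buffer_io_size GUC */

    /* The callback's block numbers are run-length compressed in here. */
    BlockNumber pending_read_blocknum;
    int16       pending_read_nblocks;

    /* Parallel circular arrays of size max_pinned_buffers. */
    Buffer     *buffers;
    int16      *buffer_io_indexes;      /* say -1 for hits, else index into ios */
    char       *per_buffer_data;        /* per_buffer_data_size bytes per slot */

    /* Dense array of size max_ios, used only for misses. */
    ReadBuffersOperation *ios;
} StreamingRead;

/* Issues the pending read: pins buffers, maybe starts I/O, appends to the
 * buffers/buffer_io_indexes queues. */
static void streaming_read_start_pending_read(StreamingRead *stream);

/*
 * Take one block number from the callback and either extend the pending
 * read or start it and begin forming a new one.
 */
static void
streaming_read_look_ahead_one(StreamingRead *stream, BlockNumber blocknum)
{
    if (stream->pending_read_nblocks > 0 &&
        stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum &&
        stream->pending_read_nblocks < stream->max_blocks_per_read)
    {
        /* Contiguous with the pending read: just extend it. */
        stream->pending_read_nblocks++;
        return;
    }

    /* Can't combine: issue what we have accumulated so far... */
    streaming_read_start_pending_read(stream);

    /* ... and start accumulating a new pending read. */
    stream->pending_read_blocknum = blocknum;
    stream->pending_read_nblocks = 1;
}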
I'm also still processing a bunch of performance-related fixups mostly for bufmgr.c sent by Andres off-list (things like: StartReadBuffer() argument list is too wide, some things need inline, we should only initialise the op if it will be needed, oh I squashed that last one into the patch already), after he and Bilal studied some regressions in cases with no I/O. And thinking about Bilal's earlier message (extra read even when we're going to zero, oops, he's quite right about that) and a patch he sent me for that. More on those soon.