Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers
From:            Thomas Munro
Subject:         Re: Streaming I/O, vectored I/O (WIP)
Date:
Msg-id:          CA+hUKGJTwrS7F=uJPx3SeigMiQiW+LJaOkjGyZdCntwyMR=uAw@mail.gmail.com
In response to:  Re: Streaming I/O, vectored I/O (WIP) (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses:       Re: Streaming I/O, vectored I/O (WIP)
List:            pgsql-hackers
On Wed, Mar 27, 2024 at 1:40 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> Is int16 enough though? It seems so, because:
>
>     max_pinned_buffers = Max(max_ios * 4, buffer_io_size);
>
> and max_ios is constrained by the GUC's maximum MAX_IO_CONCURRENCY, and
> buffer_io_size is constrained by MAX_BUFFER_IO_SIZE == PG_IOV_MAX == 32.
>
> If someone changes those constants though, int16 might overflow and fail
> in weird ways. I'd suggest being more careful here and explicitly clamp
> max_pinned_buffers at PG_INT16_MAX or have a static assertion or
> something. (I think it needs to be somewhat less than PG_INT16_MAX,
> because of the extra "overflow buffers" stuff and some other places
> where you do arithmetic.)

Clamp added.

> > /*
> >  * We gave a contiguous range of buffer space to StartReadBuffers(), but
> >  * we want it to wrap around at max_pinned_buffers.  Move values that
> >  * overflowed into the extra space.  At the same time, put -1 in the I/O
> >  * slots for the rest of the buffers to indicate no I/O.  They are covered
> >  * by the head buffer's I/O, if there is one.  We avoid a % operator.
> >  */
> > overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
> > if (overflow > 0)
> > {
> >     memmove(&stream->buffers[0],
> >             &stream->buffers[stream->max_pinned_buffers],
> >             sizeof(stream->buffers[0]) * overflow);
> >     for (int i = 0; i < overflow; ++i)
> >         stream->buffer_io_indexes[i] = -1;
> >     for (int i = 1; i < nblocks - overflow; ++i)
> >         stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
> > }
> > else
> > {
> >     for (int i = 1; i < nblocks; ++i)
> >         stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
> > }
>
> Instead of clearing buffer_io_indexes here, it might be cheaper/simpler
> to initialize the array to -1 in streaming_read_buffer_begin(), and
> reset buffer_io_indexes[io_index] = -1 in streaming_read_buffer_next(),
> after the WaitReadBuffers() call. In other words, except when an I/O is
> in progress, keep all the elements at -1, even the elements that are not
> currently in use.

Yeah that wasn't nice and I had already got as far as doing exactly that
↑ on my own, but your second idea ↓ is better!

> Alternatively, you could remember the first buffer that the I/O applies
> to in the 'ios' array. In other words, instead of pointing from buffer
> to the I/O that it depends on, point from the I/O to the buffer that
> depends on it. The last attached patch implements that approach. I'm not
> wedded to it, but it feels a little simpler.

Yeah, nice improvement.

> > if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
> > {
> >     /* Distance ramps up fast (behavior C). */
> >     ...
> > }
> > else
> > {
> >     /* No advice; move towards full I/O size (behavior B). */
> >     ...
> > }
>
> The comment on ReadBuffersOperation says "Declared in public header only
> to allow inclusion in other structs, but contents should not be
> accessed", but here you access the 'flags' field.
>
> You also mentioned that the StartReadBuffers() argument list is too
> long. Perhaps the solution is to redefine ReadBuffersOperation so that
> it consists of two parts: 1st part is filled in by the caller, and
> contains the arguments, and 2nd part is private to bufmgr.c. The
> signature for StartReadBuffers() would then be just:
>
>     bool StartReadBuffers(ReadBuffersOperation *operation);

Yeah. I had already got as far as doing this on the regression hunting
expedition, but I kept some arguments for frequently changing things, eg
blocknum.
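Concretely, I mean something of this shape (a sketch only; the field
layout and the surviving argument list here are illustrative, not lifted
verbatim from the patch):

    typedef struct ReadBuffersOperation
    {
        /* Filled in by the caller once; stable across calls. */
        Relation    rel;
        ForkNumber  forknum;
        BufferAccessStrategy strategy;
        int         flags;

        /* Private to bufmgr.c while a read is in progress. */
        Buffer     *buffers;
        BlockNumber blocknum;
        int16       nblocks;
    } ReadBuffersOperation;

    /* Per-call values such as blocknum stay as plain arguments. */
    bool StartReadBuffers(ReadBuffersOperation *operation,
                          Buffer *buffers,
                          BlockNumber blocknum,
                          int *nblocks);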
It means that the stuff that never changes is in there, and the stuff
that changes each time doesn't have to be written to memory at all.

> That would make it OK to read the 'flags' field. It would also allow
> reusing the same ReadBuffersOperation struct for multiple I/Os for the
> same relation; you only need to change the changing parts of the struct
> on each operation.

Right. Done.

> In the attached patch set, the first three patches are your v9 with no
> changes. The last patch refactors away 'buffer_io_indexes' like I
> mentioned above. The others are fixes for some other trivial things that
> caught my eye.

Thanks, all squashed into the patch.

In an offline chat with Robert and Andres, we searched for a better name
for the GUC. We came up with "io_combine_limit". It's easier to document
a general purpose limit than to explain what "buffer_io_size" does
(particularly since the set of affected features will grow over time but
starts so small). I'll feel better about using it to control bulk writes
too, with that name.

I collapsed the various memory allocations into one palloc. The buffer
array is now a buffers[FLEXIBLE_ARRAY_MEMBER].

I got rid of "finished" (now represented by distance == 0, I was removing
branches and variables). I got rid of "started", which can now be deduced
(used for suppressing advice when you're calling _next() because you need
a block and we need to read it immediately), see the function argument
suppress_advice.

Here is a new proposal for the names, updated in v10:

    read_stream_begin_relation()
    read_stream_next_buffer()
    void read_stream_end()

I think we'll finish up with different 'constructor' functions for
different kinds of streams. For example I already want one that can
provide a multi-relation callback for use by recovery (shown in v1).
Others might exist for raw file access, etc. The defining characteristic
of this one is that it accesses one specific relation/fork. Well,
_relfork() might be more accurate but less easy on the eye. I won't be
surprised if people not following this thread have ideas after commit;
it's certainly happened before something gets renamed in beta and I won't
mind a bit if that happens...
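To illustrate the proposed interface, here is roughly how I picture a
caller driving it (a sketch only: MyScanState, my_next_block and
scan_all_blocks are made-up names, and the exact
read_stream_begin_relation() argument list is a guess rather than a quote
from v10):

    typedef struct MyScanState
    {
        BlockNumber next_block;
        BlockNumber nblocks;
    } MyScanState;

    /* Callback supplies block numbers; InvalidBlockNumber ends the stream. */
    static BlockNumber
    my_next_block(ReadStream *stream, void *callback_private_data,
                  void *per_buffer_data)
    {
        MyScanState *scan = (MyScanState *) callback_private_data;

        if (scan->next_block >= scan->nblocks)
            return InvalidBlockNumber;
        return scan->next_block++;
    }

    static void
    scan_all_blocks(Relation rel, MyScanState *scan)
    {
        ReadStream *stream;
        Buffer      buf;

        stream = read_stream_begin_relation(0 /* flags */,
                                            NULL /* strategy */,
                                            rel, MAIN_FORKNUM,
                                            my_next_block, scan,
                                            0 /* no per-buffer data */);
        while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
        {
            /* buf is pinned here; use it, then release it as usual */
            ReleaseBuffer(buf);
        }
        read_stream_end(stream);
    }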
I fixed a thinko in the new ReadBuffer() implementation mentioned before
(thanks to Bilal for pointing this out): it didn't handle the
RBM_ZERO_XXX flags properly. Well, it did work and the tests passed, but
it performed a useless read first. I probably got mixed up when I removed
the extended interface, which was capable of streaming zeroed buffers;
this simple one isn't, and it wasn't suppressing the read as it would
need to. Later I'll propose to add that back in for recovery.

I fixed a recently added thinko in the circular queue: I mistakenly
thought I didn't need a spare gap between head and tail anymore because
now we never compare them, since we track the number of pinned buffers
instead, but after read_stream_next_buffer() returns, users of per-buffer
data need that data to remain valid until the next call. So the recent
refactoring work didn't survive contact with Melanie's BHS work, which
uses that. We need to "wipe" the previous (spare) one, which can't
possibly be in use yet, and if anyone ever sees 0x7f in their per-buffer
data, it will mean that they illegally accessed the older value after a
new call to read_stream_next_buffer(). Fixed, with comments to clarify.

Retesting with Melanie's latest BHS patch set and my random.sql
(upthread) gives the same system call trace as before. The intention of
that was to demonstrate what exact sequences effective_io_concurrency
values give. Now you can also run that with different values of the new
io_combine_limit. If you run it with io_combine_limit = '8kB', it looks
almost exactly like master, ie no big reads allowed; the only difference
is that read_stream.c refuses to issue advice for strictly sequential
reads:

effective_io_concurrency = 1, range size = 2
unpatched                              patched
==============================================================================
pread(93,...,8192,0x58000) = 8192      pread(84,...,8192,0x58000) = 8192
posix_fadvise(93,0x5a000,0x2000,...)   pread(84,...,8192,0x5a000) = 8192
pread(93,...,8192,0x5a000) = 8192      posix_fadvise(84,0xb0000,0x2000,...)
posix_fadvise(93,0xb0000,0x2000,...)   pread(84,...,8192,0xb0000) = 8192
pread(93,...,8192,0xb0000) = 8192      pread(84,...,8192,0xb2000) = 8192
posix_fadvise(93,0xb2000,0x2000,...)   posix_fadvise(84,0x108000,0x2000,...)
pread(93,...,8192,0xb2000) = 8192      pread(84,...,8192,0x108000) = 8192
posix_fadvise(93,0x108000,0x2000,...)  pread(84,...,8192,0x10a000) = 8192

You wouldn't normally see that though, as the default io_combine_limit
would just merge those adjacent reads after the first one triggers
ramp-up towards behaviour C:

effective_io_concurrency = 1, range size = 2
unpatched                              patched
==============================================================================
pread(93,...,8192,0x58000) = 8192      pread(80,...,8192,0x58000) = 8192
posix_fadvise(93,0x5a000,0x2000,...)   pread(80,...,8192,0x5a000) = 8192
pread(93,...,8192,0x5a000) = 8192      posix_fadvise(80,0xb0000,0x4000,...)
posix_fadvise(93,0xb0000,0x2000,...)   preadv(80,...,2,0xb0000) = 16384
pread(93,...,8192,0xb0000) = 8192      posix_fadvise(80,0x108000,0x4000,...)
posix_fadvise(93,0xb2000,0x2000,...)   preadv(80,...,2,0x108000) = 16384
pread(93,...,8192,0xb2000) = 8192      posix_fadvise(80,0x160000,0x4000,...)

I spent most of the past few days trying to regain some lost performance.
Thanks to Andres for some key observations and help! That began with
reports from Bilal and Melanie (possibly related to things Tomas had seen
too, not sure) of regressions in all-cached workloads, which I already
improved a bit with the ABC algorithm that minimised pinning for this
case. That is, if there's no recent I/O so we reach what I call behaviour
A, it should try to do as little magic as possible. But it turns out that
wasn't enough! It is very hard to beat a tight loop that just does
ReadBuffer(), ReleaseBuffer() over millions of already-cached blocks, if
you have to do exactly the same work AND extra instructions for
management. There were two layers to the solution in this new version:

First, I now have a special case in read_stream_next_buffer(), a sort of
open-coded specialisation for behaviour A with no per-buffer data, and
that got rid of most of the regression, but some remained.

Next, Andres pointed out that ReadBuffer() itself, even though it is now
implemented on top of StartReadBuffers(nblocks = 1), was still beating my
special case code that calls StartReadBuffers(nblocks = 1), even though
it looks about the same, because bufmgr.c was able to inline and
specialise the latter for one block. To give streaming_read.c that power
from its home inside another translation unit, we needed to export a
special case singular StartReadBuffer() (investigation and patch by
Andres, added as co-author). It just calls the plural function with
nblocks = 1, but it gets inlined.
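Roughly this shape inside bufmgr.c (a sketch of the technique only; the
shared helper name StartReadBuffersImpl and the argument lists are
invented for illustration, not copied from the patch):

    /* Shared implementation; always inlined into both entry points. */
    static pg_attribute_always_inline bool
    StartReadBuffersImpl(ReadBuffersOperation *operation,
                         Buffer *buffers,
                         BlockNumber blocknum,
                         int *nblocks)
    {
        /* ... the general multi-block read logic ... */
    }

    /* General entry point: start reading up to *nblocks consecutive blocks. */
    bool
    StartReadBuffers(ReadBuffersOperation *operation,
                     Buffer *buffers,
                     BlockNumber blocknum,
                     int *nblocks)
    {
        return StartReadBuffersImpl(operation, buffers, blocknum, nblocks);
    }

    /*
     * Singular entry point: passing a constant nblocks = 1 into the inlined
     * body lets the compiler specialise it for the one-block case, which is
     * what the behaviour-A fast path calls.
     */
    bool
    StartReadBuffer(ReadBuffersOperation *operation,
                    Buffer *buffer,
                    BlockNumber blocknum)
    {
        int         nblocks = 1;

        return StartReadBuffersImpl(operation, buffer, blocknum, &nblocks);
    }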
So now the special case for behaviour A drills through both layers, and
hopefully now there is no measurable regression... need to test a bit
more. Of course we can't *beat* the old code in this case, yet, but...

(We speculate that a future tree-based buffer mapping table might allow
efficient lookup for a range of block numbers in one go, and then it
could be worth paying the book-keeping costs to find ranges. Perhaps
behaviour A and the associated special case code could then be deleted,
as you'd probably want to use multi-block magic all the time, both for
I/O and for mapping table lookups. Or something like that?)