Thread: Streaming I/O, vectored I/O (WIP)
Hi,

Currently PostgreSQL reads (and writes) data files 8KB at a time. That's because we call ReadBuffer() one block at a time, with no opportunity for lower layers to do better than that. This thread is about a model where you say which block you'll want next with a callback, and then you pull the buffers out of a "stream". That way, the streaming infrastructure can look as far into the future as it wants, and then:

* systematically issue POSIX_FADV_WILLNEED for random access, replacing patchy ad hoc advice
* build larger vectored I/Os; eg one preadv() call can replace 16 pread() calls

That's more efficient, and it goes faster. It's better even on systems without 'advice' and/or vectored I/O support, because some I/Os can be merged into wider simple pread/pwrite calls, and various other small efficiencies come from batching.

The real goal, though, is to make it easier for later work to replace the I/O subsystem with true asynchronous and concurrent I/O, as required to get decent performance with direct I/O (and, at a wild guess, the magic network smgr replacements that many of our colleagues on this list work on). Client code such as access methods wouldn't need to change again to benefit from that, as it would be fully insulated by the streaming abstraction.

There are more kinds of streaming I/O that would be useful, such as raw unbuffered files, and of course writes, and I've attached some early incomplete demo code for writes (just for fun), but the main idea I want to share in this thread is the idea of replacing lots of ReadBuffer() calls with the streaming model. That's the thing with the most potential users throughout the source tree and AMs, and I've attached some work-in-progress examples of half a dozen use cases.

=== 1. Vectored I/O through the layers ===

* Provide vectored variants of FileRead() and FileWrite().
* Provide vectored variants of smgrread() and smgrwrite().
* Provide vectored variant of ReadBuffer().
* Provide multi-block smgrprefetch().

=== 2. Streaming read API ===

* Give SMgrRelation pointers a well-defined lifetime.
* Provide basic streaming read API.

=== 3. Example users of streaming read API ===

* Use streaming reads in pg_prewarm. [TM]
* WIP: Use streaming reads in heapam scans. [AF]
* WIP: Use streaming reads in vacuum. [AF]
* WIP: Use streaming reads in nbtree vacuum scan. [AF]
* WIP: Use streaming reads in bitmap heapscan. [MP]
* WIP: Use streaming reads in recovery. [TM]

=== 4. Some less developed work on vectored writes ===

* WIP: Provide vectored variant of FlushBuffer().
* WIP: Use vectored writes in checkpointer.

All of these are WIP; those marked WIP above are double-WIP. But there's enough to demo the concept and discuss. Here are some assorted notes:

* probably need to split block-count and I/O-count in stats system?
* streaming needs to "ramp up", instead of going straight to big reads
* the buffer pin limit is somewhat primitive
* more study of buffer pool correctness required
* 16 block/128KB size limit is not exactly arbitrary but not well researched (by me at least)
* various TODOs in user patches

A bit about where this code came from and how it relates to the "AIO" project[1]: The idea and terminology 'streaming I/O' are due to Andres Freund. This implementation of it is mine, and to keep this mailing list fun, he hasn't reviewed it yet.
The example user patches are by Andres, Melanie Plageman and myself, and were cherry picked from the AIO branch, where they originally ran on top of Andres's truly asynchronous 'streaming read', which is completely different code. It has (or will have) exactly the same API, but it does much more, with much more infrastructure. But the AIO branch is far too much to propose at once.

We might have been a little influenced by a recent discussion on pgsql-performance[2] that I could summarise as "why do you guys need to do all this fancy AIO stuff, just give me bigger reads!". That was actually a bit of a special case, I think (something is wrong with btrfs's prefetch heuristics?), but in conversation we realised that converting parts of PostgreSQL over to a stream-oriented model could be done independently of AIO, and could offer some nice incremental benefits already. So I worked on producing this code with an identical API that just maps on to old fashioned synchronous I/O calls, except bigger and better.

The "example user" patches would be proposed separately in their own threads after some more work, but I wanted to demonstrate the wide applicability of this style of API in this preview. Some of these make use of the ability to attach a bit of extra data to each buffer -- see Melanie's bitmap heapscan patch, for example. In later revisions I'll probably just pick one or two examples to work with for a smaller core patch set, and then the rest can be developed separately. (We thought about btree scans too as a nice high value area to tackle, but Tomas Vondra is hacking in that area and we didn't want to step on his toes.)

[1] https://wiki.postgresql.org/wiki/AIO
[2] https://www.postgresql.org/message-id/flat/218fa2e0-bc58-e469-35dd-c5cb35906064%40gmx.net
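To make the shape of the proposal concrete, here is a minimal sketch of what a streaming-read user could look like, pieced together from the callback typedef and extern declarations quoted later in this thread. The constructor name, its arguments and the MyScan state are invented for illustration and are not taken from the patches.

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/streaming_read.h"     /* header added by the 0006 patch */

    typedef struct MyScan               /* hypothetical caller state */
    {
        Relation    rel;
        BlockNumber next;
        BlockNumber nblocks;
    } MyScan;

    /* Callback: tell the stream which block to read next; false = end of stream. */
    static bool
    my_next_cb(PgStreamingRead *pgsr, uintptr_t pgsr_private,
               void *per_io_private,
               BufferManagerRelation *bmr, ForkNumber *forkNum,
               BlockNumber *blockNum, ReadBufferMode *mode)
    {
        MyScan     *scan = (MyScan *) pgsr_private;

        if (scan->next >= scan->nblocks)
            return false;
        *bmr = BMR_REL(scan->rel);
        *forkNum = MAIN_FORKNUM;
        *blockNum = scan->next++;
        *mode = RBM_NORMAL;
        return true;
    }

    static void
    my_scan(MyScan *scan)
    {
        /* Hypothetical constructor; the real one takes more arguments. */
        PgStreamingRead *pgsr =
            pg_streaming_read_buffer_alloc((uintptr_t) scan, my_next_cb);
        Buffer      buf;

        /* Buffers come out of the stream; I/O is combined and issued behind it. */
        while ((buf = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
        {
            /* ... examine the page ... */
            ReleaseBuffer(buf);
        }
        pg_streaming_read_free(pgsr);
    }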
Attachment
- v1-0001-Provide-vectored-variants-of-FileRead-and-FileWri.patch
- v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch
- v1-0003-Provide-vectored-variant-of-ReadBuffer.patch
- v1-0004-Provide-multi-block-smgrprefetch.patch
- v1-0005-Give-SMgrRelation-pointers-a-well-defined-lifetim.patch
- v1-0006-Provide-basic-streaming-read-API.patch
- v1-0007-Use-streaming-reads-in-pg_prewarm.patch
- v1-0008-WIP-Use-streaming-reads-in-heapam-scans.patch
- v1-0009-WIP-Use-streaming-reads-in-vacuum.patch
- v1-0010-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patch
- v1-0011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch
- v1-0012-WIP-Use-streaming-reads-in-recovery.patch
- v1-0013-WIP-Provide-vectored-variant-of-FlushBuffer.patch
- v1-0014-WIP-Use-vector-writes-in-checkpointer.patch
On 31/08/2023 07:00, Thomas Munro wrote:
> Currently PostgreSQL reads (and writes) data files 8KB at a time.
> That's because we call ReadBuffer() one block at a time, with no
> opportunity for lower layers to do better than that. This thread is
> about a model where you say which block you'll want next with a
> callback, and then you pull the buffers out of a "stream".

I love this idea! Makes it a lot easier to perform prefetch, as evidenced by the v1-0011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch:

13 files changed, 289 insertions(+), 637 deletions(-)

I'm a bit disappointed and surprised by v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:

4 files changed, 244 insertions(+), 78 deletions(-)

The current prefetching logic in vacuumlazy.c is pretty hairy, so I hoped that this would simplify it. I didn't look closely at that patch, so maybe it's simpler even though it's more code.

> There are more kinds of streaming I/O that would be useful, such as
> raw unbuffered files, and of course writes, and I've attached some
> early incomplete demo code for writes (just for fun), but the main
> idea I want to share in this thread is the idea of replacing lots of
> ReadBuffer() calls with the streaming model.

All this makes sense. Some random comments on the patches:

> +        /* Avoid a slightly more expensive kernel call if there is no benefit. */
> +        if (iovcnt == 1)
> +            returnCode = pg_pread(vfdP->fd,
> +                                  iov[0].iov_base,
> +                                  iov[0].iov_len,
> +                                  offset);
> +        else
> +            returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);

How about pushing down this optimization to pg_preadv() itself? pg_preadv() is currently just a macro if the system provides preadv(), but it could be a "static inline" that does the above dance. I think that optimization is platform-dependent anyway, pread() might not be any faster on some OSs. In particular, if the system doesn't provide preadv() and we use the implementation in src/port/preadv.c, it's the same kernel call anyway.

> v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch

No smgrextendv()? I guess none of the patches here needed it.

> /*
>  * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
>  * the returned buffer can be used immediately.  Otherwise, a physical read
>  * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
>  * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
>  * caller has the opportunity to coalesce reads of neighboring blocks into one
>  * CompleteReadBuffers() call.
>  *
>  * *found is set to true for a hit, and false for a miss.
>  *
>  * *allocated is set to true for a miss that allocates a buffer for the first
>  * time.  If there are multiple calls to PrepareReadBuffer() for the same
>  * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
>  * read, then only the first such call will receive *allocated == true, which
>  * the caller might use to issue just one prefetch hint.
>  */
> Buffer
> PrepareReadBuffer(BufferManagerRelation bmr,
>                   ForkNumber forkNum,
>                   BlockNumber blockNum,
>                   BufferAccessStrategy strategy,
>                   bool *found,
>                   bool *allocated)

If you decide you don't want to perform the read, after all, is there a way to abort it without calling CompleteReadBuffers()? Looking at the later patch that introduces the streaming read API, seems that it finishes all the reads, so I suppose we don't need an abort function. Does it all get cleaned up correctly on error?
> /*
>  * Convert an array of buffer address into an array of iovec objects, and
>  * return the number that were required.  'iov' must have enough space for up
>  * to PG_IOV_MAX elements.
>  */
> static int
> buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)

The comment is a bit inaccurate. There's an assertion that nblocks <= PG_IOV_MAX, so while it's true that 'iov' must have enough space for up to PG_IOV_MAX elements, that's only because we also assume that nblocks <= PG_IOV_MAX. I don't see anything in the callers (mdreadv() and mdwritev()) to prevent them from passing nblocks > PG_IOV_MAX.

in streaming_read.h:

> typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
>                                                       uintptr_t pgsr_private,
>                                                       void *per_io_private,
>                                                       BufferManagerRelation *bmr,
>                                                       ForkNumber *forkNum,
>                                                       BlockNumber *blockNum,
>                                                       ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on each read. I see that you used that in the WAL replay prefetching, so I guess that makes sense.

> extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
> extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_private);
> extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
> extern void pg_streaming_read_free(PgStreamingRead *pgsr);

Do we need to expose pg_streaming_read_prefetch()? It's only used in the WAL replay prefetching patch, and only after calling pg_streaming_read_reset(). Could pg_streaming_read_reset() call pg_streaming_read_prefetch() directly? Is there any need to "reset" the stream, without also starting prefetching?

In v1-0012-WIP-Use-streaming-reads-in-recovery.patch:

> @@ -1978,6 +1979,9 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
>   * If the WAL record contains a block reference with the given ID, *rlocator,
>   * *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
>   * returns true. Otherwise returns false.
> + *
> + * If prefetch_buffer is not NULL, the buffer is already pinned, and ownership
> + * of the pin is transferred to the caller.
>   */
>  bool
>  XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
> @@ -1998,7 +2002,15 @@ XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
>  	if (blknum)
>  		*blknum = bkpb->blkno;
>  	if (prefetch_buffer)
> +	{
>  		*prefetch_buffer = bkpb->prefetch_buffer;
> +
> +		/*
> +		 * Clear this flag is so that we can assert that redo records take
> +		 * ownership of all buffers pinned by xlogprefetcher.c.
> +		 */
> +		bkpb->prefetch_buffer = InvalidBuffer;
> +	}
>  	return true;
>  }

Could these changes be committed independently of all the other changes?

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Hi,

On 2023-09-27 21:33:15 +0300, Heikki Linnakangas wrote:
> I'm a bit disappointed and surprised by
> v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:
>
> 4 files changed, 244 insertions(+), 78 deletions(-)
>
> The current prefetching logic in vacuumlazy.c is pretty hairy, so I hoped
> that this would simplify it. I didn't look closely at that patch, so maybe
> it's simpler even though it's more code.

A good chunk of the changes is pretty boring stuff. A good chunk of the remainder could be simplified a lot - it's partially there because vacuumlazy changed a lot over the last couple years and because a bit more refactoring is needed. I do think it's actually simpler in some ways - besides being more efficient...

> > v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch
>
> No smgrextendv()? I guess none of the patches here needed it.

I can't really imagine needing it anytime soon - due to the desire to avoid ENOSPC for pages in the buffer pool the common pattern is to extend relations with zeroes on disk, then populate those buffers in memory. It's possible that you could use something like smgrextendv() when operating directly on the smgr level - but then I suspect you're going to be better off to extend the relation to the right size in one operation and then just use smgrwritev() to write out the contents.

> > /*
> >  * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
> >  * the returned buffer can be used immediately.  Otherwise, a physical read
> >  * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
> >  * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
> >  * caller has the opportunity to coalesce reads of neighboring blocks into one
> >  * CompleteReadBuffers() call.
> >  *
> >  * *found is set to true for a hit, and false for a miss.
> >  *
> >  * *allocated is set to true for a miss that allocates a buffer for the first
> >  * time.  If there are multiple calls to PrepareReadBuffer() for the same
> >  * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
> >  * read, then only the first such call will receive *allocated == true, which
> >  * the caller might use to issue just one prefetch hint.
> >  */
> > Buffer
> > PrepareReadBuffer(BufferManagerRelation bmr,
> >                   ForkNumber forkNum,
> >                   BlockNumber blockNum,
> >                   BufferAccessStrategy strategy,
> >                   bool *found,
> >                   bool *allocated)
>
> If you decide you don't want to perform the read, after all, is there a way
> to abort it without calling CompleteReadBuffers()?

When would that be needed?

> Looking at the later patch that introduces the streaming read API, seems
> that it finishes all the reads, so I suppose we don't need an abort
> function. Does it all get cleaned up correctly on error?

I think it should. The buffer error handling is one of the areas where I really would like to have some way of testing the various cases, it's easy to get things wrong, and basically impossible to write reliable tests for with our current infrastructure.

> > typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
> >                                                       uintptr_t pgsr_private,
> >                                                       void *per_io_private,
> >                                                       BufferManagerRelation *bmr,
> >                                                       ForkNumber *forkNum,
> >                                                       BlockNumber *blockNum,
> >                                                       ReadBufferMode *mode);
>
> I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
> each read. I see that you used that in the WAL replay prefetching, so I
> guess that makes sense.

Yea, that's the origin - I don't like it, but I don't really have a better idea.
> > extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
> > extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_private);
> > extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
> > extern void pg_streaming_read_free(PgStreamingRead *pgsr);
>
> Do we need to expose pg_streaming_read_prefetch()? It's only used in the WAL
> replay prefetching patch, and only after calling pg_streaming_read_reset().
> Could pg_streaming_read_reset() call pg_streaming_read_prefetch() directly?
> Is there any need to "reset" the stream, without also starting prefetching?

Heh, I think this is a discussion Thomas and I were having before...

Greetings,

Andres Freund
On Thu, Sep 28, 2023 at 9:13 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-09-27 21:33:15 +0300, Heikki Linnakangas wrote:
> > Looking at the later patch that introduces the streaming read API, seems
> > that it finishes all the reads, so I suppose we don't need an abort
> > function. Does it all get cleaned up correctly on error?
>
> I think it should. The buffer error handling is one of the areas where I
> really would like to have some way of testing the various cases, it's easy to
> get things wrong, and basically impossible to write reliable tests for with
> our current infrastructure.

One thing to highlight is that this patch doesn't create a new state in that case. In master, we already have the concept of a buffer with BM_TAG_VALID but not BM_VALID and not BM_IO_IN_PROGRESS, reachable if there is an I/O error. Eventually another reader will try the I/O again, or the buffer will fall out of the pool. With this patch it's the same, it's just a wider window: more kinds of errors might be thrown in code between Prepare() and Complete() before we even have BM_IO_IN_PROGRESS. So there is nothing extra to clean up. Right?

Yeah, it would be nice to test buffer pool logic directly. Perhaps with a C unit test framework[1] and pluggable smgr[2] we could mock up cases like I/O errors...

> > > typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
> > >                                                       uintptr_t pgsr_private,
> > >                                                       void *per_io_private,
> > >                                                       BufferManagerRelation *bmr,
> > >                                                       ForkNumber *forkNum,
> > >                                                       BlockNumber *blockNum,
> > >                                                       ReadBufferMode *mode);
> >
> > I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
> > each read. I see that you used that in the WAL replay prefetching, so I
> > guess that makes sense.
>
> Yea, that's the origin - I don't like it, but I don't really have a better
> idea.

Another idea I considered was that streams could be associated with a single relation, but recovery could somehow manage a set of them. From a certain point of view, that makes sense (we could be redoing work that was created by multiple concurrent streams at 'do' time, and with the approach shown here some clustering opportunities available at do time are lost at redo time), but it's not at all clear that it's worth the overheads or complexity, and I couldn't immediately figure out how to do it. But I doubt there would ever be any other users of a single stream with multiple relations, and I agree that this is somehow not quite satisfying... Perhaps we should think about that some more...

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com
On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > +        /* Avoid a slightly more expensive kernel call if there is no benefit. */
> > +        if (iovcnt == 1)
> > +            returnCode = pg_pread(vfdP->fd,
> > +                                  iov[0].iov_base,
> > +                                  iov[0].iov_len,
> > +                                  offset);
> > +        else
> > +            returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
>
> How about pushing down this optimization to pg_preadv() itself?
> pg_preadv() is currently just a macro if the system provides preadv(),
> but it could be a "static inline" that does the above dance. I think
> that optimization is platform-dependent anyway, pread() might not be any
> faster on some OSs. In particular, if the system doesn't provide
> preadv() and we use the implementation in src/port/preadv.c, it's the
> same kernel call anyway.

Done. I like it, I just feel a bit bad about moving the p*v() replacement functions around a couple of times already! I figured it might as well be static inline even if we use the fallback (= Solaris and Windows).

> I don't see anything in the callers (mdreadv() and mdwritev()) to
> prevent them from passing nblocks > PG_IOV_MAX.

The outer loop in md*v() copes with segment boundaries and also makes sure lengthof(iov) AKA PG_IOV_MAX isn't exceeded (though that couldn't happen with the current caller).

> in streaming_read.h:
>
> > typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
> >                                                       uintptr_t pgsr_private,
> >                                                       void *per_io_private,
> >                                                       BufferManagerRelation *bmr,
> >                                                       ForkNumber *forkNum,
> >                                                       BlockNumber *blockNum,
> >                                                       ReadBufferMode *mode);
>
> I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
> each read. I see that you used that in the WAL replay prefetching, so I
> guess that makes sense.

In this version I have introduced an alternative simple callback. It's approximately what we had already tried out in an earlier version before I started streamifying recovery, but in this version you can choose, so recovery can opt for the wider callback.

I've added some ramp-up logic. The idea is that after we streamify everything in sight, we don't want to penalise users that don't really need more than one or two blocks, but don't know that yet. Here is how the system calls look when you do pg_prewarm():

pread64(32, ..., 8192, 0) = 8192              <--- start with just one block
pread64(32, ..., 16384, 8192) = 16384
pread64(32, ..., 32768, 24576) = 32768
pread64(32, ..., 65536, 57344) = 65536
pread64(32, ..., 131072, 122880) = 131072     <--- soon reading 16 blocks at a time
pread64(32, ..., 131072, 253952) = 131072
pread64(32, ..., 131072, 385024) = 131072

I guess it could be done in quite a few different ways and I'm open to better ideas. This way inserts prefetching stalls but ramps up quickly and is soon out of the way. I wonder if we would want to make that a policy that a caller can disable, if you want to skip the ramp-up and go straight for the largest possible I/O size? Then I think we'd need a 'flags' argument to the streaming read constructor functions.

A small detour: While contemplating how this interacts with parallel sequential scan, which also has a notion of ramping up, I noticed another problem. One parallel seq scan process does this:

fadvise64(32, 35127296, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 35127296) = 131072
preadv(32, [...], 2, 35258368) = 131072
fadvise64(32, 36175872, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 36175872) = 131072
preadv(32, [...], 2, 36306944) = 131072
...

We don't really want those fadvise() calls.
We don't get them with parallelism disabled, because streaming_read.c is careful not to generate advice for sequential workloads based on ancient wisdom from this mailing list, re-confirmed on recent Linux: WILLNEED hints actually get in the way of Linux's own prefetching and slow you down, so we only want them for truly random access. But the logic can't see that another process is making holes in this process's sequence.

The two obvious solutions are (1) pass in a flag at the start saying "I promise this is sequential even if it doesn't look like it, no hints please" and (2) invent "shared" (cross-process) streaming reads, and teach all the parallel seq scan processes to get their buffers from there.

Idea (2) is interesting to think about but even if it is a useful idea (not sure) it is certainly overkill just to solve this little problem for now. So perhaps I should implement (1), which would be another reason to add a flags argument. It's not a perfect solution though because some more 'data driven' parallel scans (indexes, bitmaps, ...) have a similar problem that is less amenable to top-down kludgery.

I've included just the pg_prewarm example user for now while we discuss the basic infrastructure. The rest are rebased and in my public Github branch streaming-read (repo macdice/postgres) if anyone is interested (don't mind the red CI failures, they're just saying I ran out of monthly CI credits on the 29th, so close...)
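To make the ramp-up behaviour shown in the strace output above concrete, it amounts to something like the following sketch. The field names are invented for illustration; this is not the patch itself.

    /*
     * Sketch of the ramp-up described above: start with a single block and
     * double the I/O size on each read until hitting the 16-block (128KB with
     * 8KB blocks) cap, so users that only ever need one or two blocks never
     * pay for big speculative reads.
     */
    static int
    next_read_size(PgStreamingRead *pgsr)
    {
        int     nblocks = pgsr->distance;       /* 1 on the first call */

        /* Double for next time, capped at e.g. 16 blocks. */
        pgsr->distance = Min(pgsr->distance * 2, pgsr->max_io_blocks);
        return nblocks;
    }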
Attachment
- v2-0001-Optimize-pg_readv-pg_pwritev-for-single-vector.patch
- v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri.patch
- v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch
- v2-0004-Provide-vectored-variant-of-ReadBuffer.patch
- v2-0005-Provide-multi-block-smgrprefetch.patch
- v2-0006-Give-SMgrRelation-pointers-a-well-defined-lifetim.patch
- v2-0007-Provide-basic-streaming-read-API.patch
- v2-0008-Use-streaming-reads-in-pg_prewarm.patch
On 28/11/2023 14:17, Thomas Munro wrote:
> On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> +        /* Avoid a slightly more expensive kernel call if there is no benefit. */
>>> +        if (iovcnt == 1)
>>> +            returnCode = pg_pread(vfdP->fd,
>>> +                                  iov[0].iov_base,
>>> +                                  iov[0].iov_len,
>>> +                                  offset);
>>> +        else
>>> +            returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
>>
>> How about pushing down this optimization to pg_preadv() itself?
>> pg_preadv() is currently just a macro if the system provides preadv(),
>> but it could be a "static inline" that does the above dance. I think
>> that optimization is platform-dependent anyway, pread() might not be any
>> faster on some OSs. In particular, if the system doesn't provide
>> preadv() and we use the implementation in src/port/preadv.c, it's the
>> same kernel call anyway.
>
> Done. I like it, I just feel a bit bad about moving the p*v()
> replacement functions around a couple of times already! I figured it
> might as well be static inline even if we use the fallback (= Solaris
> and Windows).

LGTM. I think this 0001 patch is ready for commit, independently of the rest of the patches.

In v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch, fd.h:

> +/* Filename components */
> +#define PG_TEMP_FILES_DIR "pgsql_tmp"
> +#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
> +

These seem out of place, we already have them in common/file_utils.h.

Other than that, v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch and v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch look good to me.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
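The pushed-down optimization presumably looks roughly like the following. This is a sketch reconstructed from the discussion above, not the exact committed code; the fallback branch in particular is a simplification of what src/port provides.

    #include <sys/uio.h>
    #include <unistd.h>

    /* Now a static inline in port/pg_iovec.h rather than a macro. */
    static inline ssize_t
    pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
    {
        /*
         * Avoid the slightly more expensive preadv() kernel call when there
         * is only one vector; pread() may be faster on some OSs.
         */
        if (iovcnt == 1)
            return pread(fd, iov[0].iov_base, iov[0].iov_len, offset);
    #if HAVE_DECL_PREADV
        return preadv(fd, iov, iovcnt, offset);
    #else
        /* Fallback (e.g. Solaris, Windows): emulate with a pread() loop. */
        {
            ssize_t     sum = 0;

            for (int i = 0; i < iovcnt; ++i)
            {
                ssize_t     part = pread(fd, iov[i].iov_base,
                                         iov[i].iov_len, offset);

                if (part < 0)
                    return sum > 0 ? sum : -1;  /* error only if nothing read */
                sum += part;
                offset += part;
                if ((size_t) part < iov[i].iov_len)
                    break;                      /* short read: stop here */
            }
            return sum;
        }
    #endif
    }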
On 28/11/2023 14:17, Thomas Munro wrote:
> On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> in streaming_read.h:
>>
>>> typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
>>>                                                       uintptr_t pgsr_private,
>>>                                                       void *per_io_private,
>>>                                                       BufferManagerRelation *bmr,
>>>                                                       ForkNumber *forkNum,
>>>                                                       BlockNumber *blockNum,
>>>                                                       ReadBufferMode *mode);
>>
>> I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
>> each read. I see that you used that in the WAL replay prefetching, so I
>> guess that makes sense.
>
> In this version I have introduced an alternative simple callback.
> It's approximately what we had already tried out in an earlier version
> before I started streamifying recovery, but in this version you can
> choose, so recovery can opt for the wider callback.

Ok. Two APIs is a bit redundant, but because most callers would prefer the simpler API, that's probably a good tradeoff.

> I've added some ramp-up logic. The idea is that after we streamify
> everything in sight, we don't want to penalise users that don't really
> need more than one or two blocks, but don't know that yet. Here is
> how the system calls look when you do pg_prewarm():
>
> pread64(32, ..., 8192, 0) = 8192        <--- start with just one block
> pread64(32, ..., 16384, 8192) = 16384
> pread64(32, ..., 32768, 24576) = 32768
> pread64(32, ..., 65536, 57344) = 65536
> pread64(32, ..., 131072, 122880) = 131072    <--- soon reading 16
> blocks at a time
> pread64(32, ..., 131072, 253952) = 131072
> pread64(32, ..., 131072, 385024) = 131072
>
> I guess it could be done in quite a few different ways and I'm open to
> better ideas. This way inserts prefetching stalls but ramps up
> quickly and is soon out of the way. I wonder if we would want to make
> that a policy that a caller can disable, if you want to skip the
> ramp-up and go straight for the largest possible I/O size? Then I
> think we'd need a 'flags' argument to the streaming read constructor
> functions.

I think a 'flags' argument and a way to opt-out of the slow start would make sense. pg_prewarm in particular knows that it will read the whole relation.

> A small detour: While contemplating how this interacts with parallel
> sequential scan, which also has a notion of ramping up, I noticed
> another problem. One parallel seq scan process does this:
>
> fadvise64(32, 35127296, 131072, POSIX_FADV_WILLNEED) = 0
> preadv(32, [...], 2, 35127296) = 131072
> preadv(32, [...], 2, 35258368) = 131072
> fadvise64(32, 36175872, 131072, POSIX_FADV_WILLNEED) = 0
> preadv(32, [...], 2, 36175872) = 131072
> preadv(32, [...], 2, 36306944) = 131072
> ...
>
> We don't really want those fadvise() calls. We don't get them with
> parallelism disabled, because streaming_read.c is careful not to
> generate advice for sequential workloads based on ancient wisdom from
> this mailing list, re-confirmed on recent Linux: WILLNEED hints
> actually get in the way of Linux's own prefetching and slow you down,
> so we only want them for truly random access. But the logic can't see
> that another process is making holes in this process's sequence.

Hmm, aside from making the sequential pattern invisible to this process, are we defeating Linux's logic too, just by performing the reads from multiple processes? The processes might issue the reads to the kernel out-of-order. How bad is the slowdown when you issue WILLNEED hints on sequential access?
> The two obvious solutions are (1) pass in a flag at the start saying
> "I promise this is sequential even if it doesn't look like it, no
> hints please" and (2) invent "shared" (cross-process) streaming
> reads, and teach all the parallel seq scan processes to get their
> buffers from there.
>
> Idea (2) is interesting to think about but even if it is a useful idea
> (not sure) it is certainly overkill just to solve this little problem
> for now. So perhaps I should implement (1), which would be another
> reason to add a flags argument. It's not a perfect solution though
> because some more 'data driven' parallel scans (indexes, bitmaps, ...)
> have a similar problem that is less amenable to top-down kludgery.

(1) seems fine to me.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On Wed, Nov 29, 2023 at 01:17:19AM +1300, Thomas Munro wrote:

Thanks for posting a new version. I've included a review of 0004.

> I've included just the pg_prewarm example user for now while we
> discuss the basic infrastructure. The rest are rebased and in my
> public Github branch streaming-read (repo macdice/postgres) if anyone
> is interested (don't mind the red CI failures, they're just saying I
> ran out of monthly CI credits on the 29th, so close...)

I agree it makes sense to commit the interface with just prewarm as a user. Then we can start new threads for the various streaming read users (e.g. vacuum, sequential scan, bitmapheapscan).

> From db5de8ab5a1a804f41006239302fdce954cab331 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Sat, 22 Jul 2023 17:31:54 +1200
> Subject: [PATCH v2 4/8] Provide vectored variant of ReadBuffer().
>
> diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
> index f7c67d504c..8ae3a72053 100644
> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c
> @@ -1046,175 +1048,326 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>  		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
>  			flags |= EB_LOCK_FIRST;
>
> -		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
> -								 forkNum, strategy, flags);
> +		*hit = false;
> +
> +		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
>  	}
>
> -	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
> -									   smgr->smgr_rlocator.locator.spcOid,
> -									   smgr->smgr_rlocator.locator.dbOid,
> -									   smgr->smgr_rlocator.locator.relNumber,
> -									   smgr->smgr_rlocator.backend);
> +	buffer = PrepareReadBuffer(bmr,
> +							   forkNum,
> +							   blockNum,
> +							   strategy,
> +							   hit,
> +							   &allocated);
> +
> +	/* At this point we do NOT hold any locks. */
> +
> +	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
> +	{
> +		/* if we just want zeroes and a lock, we're done */
> +		ZeroBuffer(buffer, mode);
> +	}
> +	else if (!*hit)
> +	{
> +		/* we might need to perform I/O */
> +		CompleteReadBuffers(bmr,
> +							&buffer,
> +							forkNum,
> +							blockNum,
> +							1,
> +							mode == RBM_ZERO_ON_ERROR,
> +							strategy);
> +	}
> +
> +	return buffer;
> +}
>
> +/*
> + * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
> + * the returned buffer can be used immediately.  Otherwise, a physical read
> + * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
> + * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the

ot -> to

> + * caller has the opportunity to coalesce reads of neighboring blocks into one
> + * CompleteReadBuffers() call.
> + *
> + * *found is set to true for a hit, and false for a miss.
> + *
> + * *allocated is set to true for a miss that allocates a buffer for the first
> + * time.  If there are multiple calls to PrepareReadBuffer() for the same
> + * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
> + * read, then only the first such call will receive *allocated == true, which
> + * the caller might use to issue just one prefetch hint.
> + */
> +Buffer
> +PrepareReadBuffer(BufferManagerRelation bmr,
> +				  ForkNumber forkNum,
> +				  BlockNumber blockNum,
> +				  BufferAccessStrategy strategy,
> +				  bool *found,
> +				  bool *allocated)
> +{
> +	BufferDesc *bufHdr;
> +	bool		isLocalBuf;
> +	IOContext	io_context;
> +	IOObject	io_object;
>
> +	Assert(blockNum != P_NEW);
> +
> +	if (bmr.rel)
> +	{
> +		bmr.smgr = RelationGetSmgr(bmr.rel);
> +		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
> +	}
> +
> +	isLocalBuf = SmgrIsTemp(bmr.smgr);
>  	if (isLocalBuf)
>  	{
> -		/*
> -		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
> -		 * However, in some cases, the "strategy" may not be NULL, so we can't
> -		 * rely on IOContextForStrategy() to set the right IOContext for us.
> -		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
> -		 */
>  		io_context = IOCONTEXT_NORMAL;
>  		io_object = IOOBJECT_TEMP_RELATION;
> -		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
> -		if (found)
> -			pgBufferUsage.local_blks_hit++;
> -		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
> -				 mode == RBM_ZERO_ON_ERROR)
> -			pgBufferUsage.local_blks_read++;
>  	}
>  	else
>  	{
> -		/*
> -		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
> -		 * not currently in memory.
> -		 */
>  		io_context = IOContextForStrategy(strategy);
>  		io_object = IOOBJECT_RELATION;
> -		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
> -							 strategy, &found, io_context);
> -		if (found)
> -			pgBufferUsage.shared_blks_hit++;
> -		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
> -				 mode == RBM_ZERO_ON_ERROR)
> -			pgBufferUsage.shared_blks_read++;

You've lost this test in your new version. You can do the same thing (avoid counting zeroed buffers as blocks read) by moving this pgBufferUsage.shared/local_blks_read++ back into ReadBuffer_common() where you know if you called ZeroBuffer() or CompleteReadBuffers().

>  	}
>
> -	/* At this point we do NOT hold any locks. */
> +	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
> +									   bmr.smgr->smgr_rlocator.locator.spcOid,
> +									   bmr.smgr->smgr_rlocator.locator.dbOid,
> +									   bmr.smgr->smgr_rlocator.locator.relNumber,
> +									   bmr.smgr->smgr_rlocator.backend);
>
> -	/* if it was already in the buffer pool, we're done */
> -	if (found)
> +	ResourceOwnerEnlarge(CurrentResourceOwner);
> +	if (isLocalBuf)
> +	{
> +		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, found, allocated);
> +		if (*found)
> +			pgBufferUsage.local_blks_hit++;
> +		else
> +			pgBufferUsage.local_blks_read++;

See comment above.

> +	}
> +	else
> +	{
> +		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
> +							 strategy, found, allocated, io_context);
> +		if (*found)
> +			pgBufferUsage.shared_blks_hit++;
> +		else
> +			pgBufferUsage.shared_blks_read++;
> +	}
> +	if (bmr.rel)
> +	{
> +		pgstat_count_buffer_read(bmr.rel);

This is double-counting reads. You've left the call in ReadBufferExtended() as well as adding this here. It should be fine to remove it from ReadBufferExtended(). Because you test bmr.rel, leaving the call here in PrepareReadBuffer() wouldn't have an effect on ReadBuffer_common() callers who don't pass a relation (like recovery). The other current callers of ReadBuffer_common() (by way of ExtendBufferedRelTo()) who do pass a relation are visibility map and freespace map extension, and I don't think we track relation stats for the VM and FSM.

This does continue the practice of counting zeroed buffers as reads in table-level stats. But, that is the same as master.
> -	 * if we have gotten to this point, we have allocated a buffer for the
> -	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
> -	 * if it's a shared buffer.
> -	 */
> -	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
> +/*
> + * Complete a set reads prepared with PrepareReadBuffers().  The buffers must
> + * cover a cluster of neighboring block numbers.
> + *
> + * Typically this performs one physical vector read covering the block range,
> + * but if some of the buffers have already been read in the meantime by any
> + * backend, zero or multiple reads may be performed.
> + */
> +void
> +CompleteReadBuffers(BufferManagerRelation bmr,
> +					Buffer *buffers,
> +					ForkNumber forknum,
> +					BlockNumber blocknum,
> +					int nblocks,
> +					bool zero_on_error,
> +					BufferAccessStrategy strategy)
> +{

...

> -			pgstat_count_io_op_time(io_object, io_context,
> -									IOOP_READ, io_start, 1);
> +			/* We found a buffer that we have to read in. */
> +			io_buffers[0] = buffers[i];
> +			io_pages[0] = BufferGetBlock(buffers[i]);
> +			io_first_block = blocknum + i;
> +			io_buffers_len = 1;
>
> -			/* check for garbage data */
> -			if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
> -										PIV_LOG_WARNING | PIV_REPORT_STAT))
> +			/*
> +			 * How many neighboring-on-disk blocks can we can scatter-read into
> +			 * other buffers at the same time?
> +			 */
> +			while ((i + 1) < nblocks &&
> +				   CompleteReadBuffersCanStartIO(buffers[i + 1]))
> +			{
> +				/* Must be consecutive block numbers. */
> +				Assert(BufferGetBlockNumber(buffers[i + 1]) ==
> +					   BufferGetBlockNumber(buffers[i]) + 1);
> +
> +				io_buffers[io_buffers_len] = buffers[++i];
> +				io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
> +			}
> +
> +			io_start = pgstat_prepare_io_time();
> +			smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
> +			pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start, 1);

I'd pass io_buffers_len as cnt to pgstat_count_io_op_time(). op_bytes will be BLCKSZ and multiplying that by the number of reads should produce the number of bytes read.

> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
> index 4efb34b75a..ee9307b612 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> @@ -116,7 +116,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
>   */
>  BufferDesc *
>  LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> -				 bool *foundPtr)
> +				 bool *foundPtr, bool *allocPtr)
>  {
>  	BufferTag	newTag;			/* identity of requested block */
>  	LocalBufferLookupEnt *hresult;
> @@ -144,6 +144,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
>  		Assert(BufferTagsEqual(&bufHdr->tag, &newTag));
>
>  		*foundPtr = PinLocalBuffer(bufHdr, true);
> +		*allocPtr = false;
>  	}
>  	else
>  	{
> @@ -170,6 +171,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
>  		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
>
>  		*foundPtr = false;
> +		*allocPtr = true;
>  	}

I would prefer you use consistent naming for allocPtr/allocatedPtr/allocated. I also think that all the functions taking it as an output argument should explain what it is (BufferAlloc()/LocalBufferAlloc(), etc). I found myself doing a bit of digging around to figure it out. You have a nice comment about it above PrepareReadBuffer().
I think you may need to resign yourself to restating that bit (or some version of it) for all of the functions taking it as an argument.

> diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
> index 41e26d3e20..e29ca85077 100644
> --- a/src/include/storage/bufmgr.h
> +++ b/src/include/storage/bufmgr.h
> @@ -14,6 +14,8 @@
>  #ifndef BUFMGR_H
>  #define BUFMGR_H
>
> +#include "pgstat.h"

I don't know what we are supposed to do, but I would have included this in bufmgr.c (where I actually needed it) instead of including it here.

> +#include "port/pg_iovec.h"
>  #include "storage/block.h"
>  #include "storage/buf.h"
>  #include "storage/bufpage.h"
> @@ -47,6 +49,8 @@ typedef enum
>  	RBM_ZERO_AND_CLEANUP_LOCK,	/* Like RBM_ZERO_AND_LOCK, but locks the page
>  								 * in "cleanup" mode */
>  	RBM_ZERO_ON_ERROR,			/* Read, but return an all-zeros page on error */
> +	RBM_WILL_ZERO,				/* Don't read from disk, caller will call
> +								 * ZeroBuffer() */

It's confusing that this (RBM_WILL_ZERO) is part of this commit since it isn't used in this commit.

>  	RBM_NORMAL_NO_LOG,			/* Don't log page as invalid during WAL
>  								 * replay; otherwise same as RBM_NORMAL */
>  } ReadBufferMode;

- Melanie
On Wed, Nov 29, 2023 at 1:44 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> LGTM. I think this 0001 patch is ready for commit, independently of the
> rest of the patches.

Done.

> In v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch, fd.h:
>
> > +/* Filename components */
> > +#define PG_TEMP_FILES_DIR "pgsql_tmp"
> > +#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
> > +
>
> These seem out of place, we already have them in common/file_utils.h.

Yeah, they moved from there in f39b2658 and I messed up the rebase. Fixed.

> Other than that,
> v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch and
> v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch look
> good to me.

One thing I wasn't 100% happy with was the treatment of ENOSPC. A few callers believe that short writes set errno: they error out with a message including %m. We have historically set errno = ENOSPC inside FileWrite() if the write size was unexpectedly small AND the kernel didn't set errno to a non-zero value (having set it to zero ourselves earlier). In FileWriteV(), I didn't want to do that because it is expensive to compute the total write size from the vector array and we managed to measure an effect due to that in some workloads.

Note that the smgr patch actually handles short writes by continuing, instead of raising an error. Short writes do already occur in the wild on various systems for various rare technical reasons other than ENOSPC I have heard (imagine transient failure to acquire some temporary memory that the kernel chooses not to wait for, stuff like that, though certainly many people and programs believe they should not happen[1]), and it seems like a good idea to actually handle them as our write sizes increase and the probability of short writes might presumably increase.

With the previous version of the patch, we'd have to change a couple of other callers not to believe that short writes are errors and set errno (callers are inconsistent on this point). I don't really love that we have "fake" system errors but I also want to stay focused here, so in this new version V3 I tried a new approach: I realised I can just always set errno without needing the total size, so that (undocumented) aspect of the interface doesn't change. The point being that it doesn't matter if you clobber errno with a bogus value when the write was non-short. Thoughts?

[1] https://utcc.utoronto.ca/~cks/space/blog/unix/WritesNotShortOften
Attachment
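If it helps to see that errno trick spelled out: a successful system call leaves errno untouched, so it can be pre-set unconditionally before the write. A sketch of the shape being described (illustrative only, not the patch text; the wrapper name is invented):

    #include <errno.h>
    #include <sys/uio.h>
    #include "port/pg_iovec.h"      /* for pg_pwritev() */

    static ssize_t
    write_with_enospc_guess(int fd, const struct iovec *iov, int iovcnt,
                            off_t offset)
    {
        ssize_t     rc;

        /*
         * If the write comes back short and the caller reports it with %m,
         * it will see ENOSPC, the traditional guess.  A full write never
         * inspects errno, and a failed (-1) write overwrites it with the
         * real error, so the pre-set value is harmless in both other cases.
         */
        errno = ENOSPC;
        rc = pg_pwritev(fd, iov, iovcnt, offset);
        return rc;
    }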
On 29/11/2023 21:39, Thomas Munro wrote:
> One thing I wasn't 100% happy with was the treatment of ENOSPC. A few
> callers believe that short writes set errno: they error out with a
> message including %m. We have historically set errno = ENOSPC inside
> FileWrite() if the write size was unexpectedly small AND the kernel
> didn't set errno to a non-zero value (having set it to zero ourselves
> earlier). In FileWriteV(), I didn't want to do that because it is
> expensive to compute the total write size from the vector array and we
> managed to measure an effect due to that in some workloads.
>
> Note that the smgr patch actually handles short writes by continuing,
> instead of raising an error. Short writes do already occur in the
> wild on various systems for various rare technical reasons other than
> ENOSPC I have heard (imagine transient failure to acquire some
> temporary memory that the kernel chooses not to wait for, stuff like
> that, though certainly many people and programs believe they should
> not happen[1]), and it seems like a good idea to actually handle them
> as our write sizes increase and the probability of short writes might
> presumably increase.

Maybe we should bite the bullet and always retry short writes in FileWriteV(). Is that what you meant by "handling them"?

If the total size is expensive to calculate, how about passing it as an extra argument? Presumably it is cheap for the callers to calculate at the same time that they build the iovec array?

> With the previous version of the patch, we'd have to change a couple
> of other callers not to believe that short writes are errors and set
> errno (callers are inconsistent on this point). I don't really love
> that we have "fake" system errors but I also want to stay focused
> here, so in this new version V3 I tried a new approach: I realised I
> can just always set errno without needing the total size, so that
> (undocumented) aspect of the interface doesn't change. The point
> being that it doesn't matter if you clobber errno with a bogus value
> when the write was non-short. Thoughts?

Feels pretty ugly, but I don't see anything outright wrong with that.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 29/11/2023 21:39, Thomas Munro wrote:
> > One thing I wasn't 100% happy with was the treatment of ENOSPC. A few
> > callers believe that short writes set errno: they error out with a
> > message including %m. We have historically set errno = ENOSPC inside
> > FileWrite() if the write size was unexpectedly small AND the kernel
> > didn't set errno to a non-zero value (having set it to zero ourselves
> > earlier). In FileWriteV(), I didn't want to do that because it is
> > expensive to compute the total write size from the vector array and we
> > managed to measure an effect due to that in some workloads.
> >
> > Note that the smgr patch actually handles short writes by continuing,
> > instead of raising an error. Short writes do already occur in the
> > wild on various systems for various rare technical reasons other than
> > ENOSPC I have heard (imagine transient failure to acquire some
> > temporary memory that the kernel chooses not to wait for, stuff like
> > that, though certainly many people and programs believe they should
> > not happen[1]), and it seems like a good idea to actually handle them
> > as our write sizes increase and the probability of short writes might
> > presumably increase.
>
> Maybe we should bite the bullet and always retry short writes in
> FileWriteV(). Is that what you meant by "handling them"?
> If the total size is expensive to calculate, how about passing it as an
> extra argument? Presumably it is cheap for the callers to calculate at
> the same time that they build the iovec array?

It's cheap for md.c, because it already has nblocks_this_segment. That's one reason I chose to put the retry there. If we push it down to fd.c in order to be able to help other callers, you're right that we could pass in the total size (and I guess assert that it's correct), but that is sort of annoyingly redundant and further from the interface we're wrapping.

There is another problem with pushing it down to fd.c, though. Suppose you try to write 8192 bytes, and the kernel says "you wrote 4096 bytes" so your loop goes around again with the second half of the data and now the kernel says "-1, ENOSPC". What are you going to do? fd.c doesn't raise errors for I/O failure, it fails with -1 and errno, so you'd either have to return -1, ENOSPC (converting short writes into actual errors, a lie because you did write some data), or return 4096 (and possibly also set errno = ENOSPC as we have always done). So you can't really handle this problem at this level, can you? Unless you decide that fd.c should get into the business of raising errors for I/O failures, which would be a bit of a departure.

That's why I did the retry higher up in md.c.

> > With the previous version of the patch, we'd have to change a couple
> > of other callers not to believe that short writes are errors and set
> > errno (callers are inconsistent on this point). I don't really love
> > that we have "fake" system errors but I also want to stay focused
> > here, so in this new version V3 I tried a new approach: I realised I
> > can just always set errno without needing the total size, so that
> > (undocumented) aspect of the interface doesn't change. The point
> > being that it doesn't matter if you clobber errno with a bogus value
> > when the write was non-short. Thoughts?
>
> Feels pretty ugly, but I don't see anything outright wrong with that.

Cool.
I would consider cleaning up all the callers and getting rid of this ENOSPC stuff in independent work, but I didn't want discussion of that (eg what external/extension code knows about this API?) to derail THIS project, hence the desire to preserve existing behaviour.
Hi,

On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:
> On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > On 29/11/2023 21:39, Thomas Munro wrote:
> > > One thing I wasn't 100% happy with was the treatment of ENOSPC. A few
> > > callers believe that short writes set errno: they error out with a
> > > message including %m. We have historically set errno = ENOSPC inside
> > > FileWrite() if the write size was unexpectedly small AND the kernel
> > > didn't set errno to a non-zero value (having set it to zero ourselves
> > > earlier). In FileWriteV(), I didn't want to do that because it is
> > > expensive to compute the total write size from the vector array and we
> > > managed to measure an effect due to that in some workloads.
> > >
> > > Note that the smgr patch actually handles short writes by continuing,
> > > instead of raising an error. Short writes do already occur in the
> > > wild on various systems for various rare technical reasons other than
> > > ENOSPC I have heard (imagine transient failure to acquire some
> > > temporary memory that the kernel chooses not to wait for, stuff like
> > > that, though certainly many people and programs believe they should
> > > not happen[1]), and it seems like a good idea to actually handle them
> > > as our write sizes increase and the probability of short writes might
> > > presumably increase.
> >
> > Maybe we should bite the bullet and always retry short writes in
> > FileWriteV(). Is that what you meant by "handling them"?
> > If the total size is expensive to calculate, how about passing it as an
> > extra argument? Presumably it is cheap for the callers to calculate at
> > the same time that they build the iovec array?
>
> It's cheap for md.c, because it already has nblocks_this_segment.
> That's one reason I chose to put the retry there. If we push it down
> to fd.c in order to be able to help other callers, you're right that
> we could pass in the total size (and I guess assert that it's
> correct), but that is sort of annoyingly redundant and further from
> the interface we're wrapping.
>
> There is another problem with pushing it down to fd.c, though.
> Suppose you try to write 8192 bytes, and the kernel says "you wrote
> 4096 bytes" so your loop goes around again with the second half of the
> data and now the kernel says "-1, ENOSPC". What are you going to do?
> fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
> so you'd either have to return -1, ENOSPC (converting short writes
> into actual errors, a lie because you did write some data), or return
> 4096 (and possibly also set errno = ENOSPC as we have always done).
> So you can't really handle this problem at this level, can you?
> Unless you decide that fd.c should get into the business of raising
> errors for I/O failures, which would be a bit of a departure.
>
> That's why I did the retry higher up in md.c.

I think that's the right call. I think for AIO we can't do retry handling purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd buy us that much in md.c anyway, we still need to handle the cross segment case and such, from what I can tell?

Greetings,

Andres Freund
On Sat, Dec 9, 2023 at 7:25 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:
> > On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > > Maybe we should bite the bullet and always retry short writes in
> > > FileWriteV(). Is that what you meant by "handling them"?
> > > If the total size is expensive to calculate, how about passing it as an
> > > extra argument? Presumably it is cheap for the callers to calculate at
> > > the same time that they build the iovec array?
>
> > There is another problem with pushing it down to fd.c, though.
> > Suppose you try to write 8192 bytes, and the kernel says "you wrote
> > 4096 bytes" so your loop goes around again with the second half of the
> > data and now the kernel says "-1, ENOSPC". What are you going to do?
> > fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
> > so you'd either have to return -1, ENOSPC (converting short writes
> > into actual errors, a lie because you did write some data), or return
> > 4096 (and possibly also set errno = ENOSPC as we have always done).
> > So you can't really handle this problem at this level, can you?
> > Unless you decide that fd.c should get into the business of raising
> > errors for I/O failures, which would be a bit of a departure.
> >
> > That's why I did the retry higher up in md.c.
>
> I think that's the right call. I think for AIO we can't do retry handling
> purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
> buy us that much in md.c anyway, we still need to handle the cross segment
> case and such, from what I can tell?

Heikki, what do you think about this: we could go with the v3 fd.c and md.c patches, but move adjust_iovec_for_partial_transfer() into src/common/file_utils.c, so that at least that slightly annoying part of the job is available for re-use by future code that faces the same problem?

Note that in file_utils.c we already have pg_pwritev_with_retry(), which is clearly related to all this: that is a function that guarantees to either complete the full pwritev() or throw an ERROR, but leaves it undefined whether any data has been written on ERROR. It has to add up the size too, and it adjusts the iovec array at the same time, so it wouldn't use adjust_iovec_for_partial_transfer(). This is essentially the type of interface that I declined to put into fd.c's FileWrite() and FileRead() because I feel like it doesn't fit with the existing functions' primary business of adding vfd support to well known basic I/O functions that return bytes transferred and set errno. Perhaps someone might later want to introduce File*WithRetry() wrappers or something if that proves useful? I wouldn't want them for md.c though because I already know the size.
On 09/12/2023 02:41, Thomas Munro wrote:
> On Sat, Dec 9, 2023 at 7:25 AM Andres Freund <andres@anarazel.de> wrote:
>> On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:
>>> On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>>> Maybe we should bite the bullet and always retry short writes in
>>>> FileWriteV(). Is that what you meant by "handling them"?
>>>> If the total size is expensive to calculate, how about passing it as an
>>>> extra argument? Presumably it is cheap for the callers to calculate at
>>>> the same time that they build the iovec array?
>
>>> There is another problem with pushing it down to fd.c, though.
>>> Suppose you try to write 8192 bytes, and the kernel says "you wrote
>>> 4096 bytes" so your loop goes around again with the second half of the
>>> data and now the kernel says "-1, ENOSPC". What are you going to do?
>>> fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
>>> so you'd either have to return -1, ENOSPC (converting short writes
>>> into actual errors, a lie because you did write some data), or return
>>> 4096 (and possibly also set errno = ENOSPC as we have always done).
>>> So you can't really handle this problem at this level, can you?
>>> Unless you decide that fd.c should get into the business of raising
>>> errors for I/O failures, which would be a bit of a departure.
>>>
>>> That's why I did the retry higher up in md.c.
>>
>> I think that's the right call. I think for AIO we can't do retry handling
>> purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
>> buy us that much in md.c anyway, we still need to handle the cross segment
>> case and such, from what I can tell?
>
> Heikki, what do you think about this: we could go with the v3 fd.c
> and md.c patches, but move adjust_iovec_for_partial_transfer() into
> src/common/file_utils.c, so that at least that slightly annoying part
> of the job is available for re-use by future code that faces the same
> problem?

Ok, works for me.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
On Sat, Dec 9, 2023 at 10:23 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Ok, works for me. I finished up making a few more improvements: 1. I eventually figured out how to generalise compute_remaining_iovec() (as I now call it) so that the existing pg_pwritev_with_retry() in file_utils.c could also use it, so that's now done in a patch of its own. 2. FileReadV/FileWriteV patch: * further simplification of the traditional ENOSPC 'guess' * unconstify() changed to raw cast (pending [1]) * fixed the DO_DB()-wrapped debugging code 3. smgrreadv/smgrwritev patch: * improved ENOSPC handling * improve description of EOF and ENOSPC handling * fixed the sizes reported in dtrace static probes * fixed some words in the docs about that * changed error messages to refer to "blocks %u..%u" 4. smgrprefetch-with-nblocks patch has no change, hasn't drawn any comments hopefully because it is uncontroversial. I'm planning to commit these fairly soon. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGK3OXFjkOyZiw-DgL2bUqk9by1uGuCnViJX786W%2BfyDSw%40mail.gmail.com
Attachment
On 11/12/2023 11:12, Thomas Munro wrote: > 1. I eventually figured out how to generalise > compute_remaining_iovec() (as I now call it) so that the existing > pg_pwritev_with_retry() in file_utils.c could also use it, so that's > now done in a patch of its own. In compute_remaining_iovec(): > 'source' and 'destination' may point to the same array, in which > case it is adjusted in-place; otherwise 'destination' must have enough > space for 'iovcnt' elements. Is there any use case for not adjusting it in place? pg_pwritev_with_retry() takes a const iovec array, but maybe just remove the 'const' and document that it scribbles on it? > I'm planning to commit these fairly soon. +1 -- Heikki Linnakangas Neon (https://neon.tech)
On Mon, Dec 11, 2023 at 10:28 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On 11/12/2023 11:12, Thomas Munro wrote: > > 1. I eventually figured out how to generalise > > compute_remaining_iovec() (as I now call it) so that the existing > > pg_pwritev_with_retry() in file_utils.c could also use it, so that's > > now done in a patch of its own. > > In compute_remaining_iovec(): > > 'source' and 'destination' may point to the same array, in which > > case it is adjusted in-place; otherwise 'destination' must have enough > > space for 'iovcnt' elements. > Is there any use case for not adjusting it in place? > pg_pwritev_with_retry() takes a const iovec array, but maybe just remove > the 'const' and document that it scribbles on it? I guess I just wanted to preserve pg_pwritev_with_retry()'s existing prototype, primarily because it matches standard pwritev()/writev().
Le 11/12/2023 à 10:12, Thomas Munro a écrit : > 3. smgrreadv/smgrwritev patch: > > * improved ENOSPC handling > * improve description of EOF and ENOSPC handling > * fixed the sizes reported in dtrace static probes > * fixed some words in the docs about that > * changed error messages to refer to "blocks %u..%u" > > 4. smgrprefetch-with-nblocks patch has no change, hasn't drawn any > comments hopefully because it is uncontroversial. > > I'm planning to commit these fairly soon. Thanks, very useful additions. Not sure what you have already done to come next... I have 2 smalls patches here: * to use range prefetch in pg_prewarm (smgrprefetch only at the moment, using smgrreadv to come next). * to support nblocks=0 in smgrprefetch (posix_fadvise supports a len=0 to apply flag from offset to end of file). Should I add to commitfest ? --- Cédric Villemain +33 (0)6 20 30 22 52 https://Data-Bene.io PostgreSQL Expertise, Support, Training, R&D
I've written a new version of the vacuum streaming read user on top of the rebased patch set [1]. It differs substantially from Andres' and includes several refactoring patches that can apply on top of master. As such, I've proposed those in a separate thread [2]. I noticed mac and windows fail to build on CI for my branch with the streaming read code. I haven't had a chance to investigate -- but I must have done something wrong on rebase. - Melanie [1] https://github.com/melanieplageman/postgres/tree/stepwise_vac_streaming_read [2] https://www.postgresql.org/message-id/CAAKRu_Yf3gvXGcCnqqfoq0Q8LX8UM-e-qbm_B1LeZh60f8WhWA%40mail.gmail.com
On Wed, Nov 29, 2023 at 2:21 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > Thanks for posting a new version. I've included a review of 0004. Thanks! I committed the patches up as far as smgr.c before the holidays. The next thing to commit would be the bufmgr.c one for vectored ReadBuffer(), v5-0001. Here's my response to your review of that, which triggered quite a few changes. See also new version of the streaming_read.c patch, with change list at end. (I'll talk about v5-0002, the SMgrRelation lifetime one, over on the separate thread about that where Heikki posted a better version.) > ot -> to Fixed. > > - if (found) > > - pgBufferUsage.shared_blks_hit++; > > - else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG || > > - mode == RBM_ZERO_ON_ERROR) > > - pgBufferUsage.shared_blks_read++; > > You've lost this test in your new version. You can do the same thing > (avoid counting zeroed buffers as blocks read) by moving this > pgBufferUsage.shared/local_blks_read++ back into ReadBuffer_common() > where you know if you called ZeroBuffer() or CompleteReadBuffers(). Yeah, right. After thinking about that some more, that's true and that placement would be good, but only if we look just at this patch, where we are chopping ReadBuffer() into two parts (PrepareReadBuffer() and CompleteReadBuffers()), and then putting them back together again. However, soon we'll want to use the two functions separately, and we won't call ReadBuffer[_common](). New idea: PrepareReadBuffer() can continue to be in charge of bumping {local,shared}_blks_hit, but {local,shared}_blks_read++ can happen in CompleteReadBuffers(). There is no counter for zeroed buffers, but if there ever is in the future, it can go into ZeroBuffer(). In this version, I've moved that into CompleteReadBuffers(), along with a new comment to explain a pre-existing deficiency in the whole scheme: there is a race where you finish up counting a read but someone else actually does the read, and also counts it. I'm trying to preserve the existing bean counting logic to the extent possible across this refactoring. > > + } > > + else > > + { > > + bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum, > > + strategy, found, allocated, io_context); > > + if (*found) > > + pgBufferUsage.shared_blks_hit++; > > + else > > + pgBufferUsage.shared_blks_read++; > > + } > > + if (bmr.rel) > > + { > > + pgstat_count_buffer_read(bmr.rel); > > This is double-counting reads. You've left the call in > ReadBufferExtended() as well as adding this here. It should be fine to > remove it from ReadBufferExtended(). Because you test bmr.rel, leaving > the call here in PrepareReadBuffer() wouldn't have an effect on > ReadBuffer_common() callers who don't pass a relation (like recovery). > The other current callers of ReadBuffer_common() (by way of > ExtendBufferedRelTo()) who do pass a relation are visibility map and > freespace map extension, and I don't think we track relation stats for > the VM and FSM. Oh yeah. Right. Fixed. > This does continue the practice of counting zeroed buffers as reads in > table-level stats. But, that is the same as master. Right. It is a little strange that pgstast_count_buffer_read() finishes up in a different location than pgBufferUsage.{local,shared}_blks_read++, but that's precisely due to this pre-existing difference in accounting policy. That generally seems like POLA failure, so I've added a comment to help us remember about that, for another day. 
> > + io_start = pgstat_prepare_io_time(); > > + smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len); > > + pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start, 1); > > I'd pass io_buffers_len as cnt to pgstat_count_io_op_time(). op_bytes > will be BLCKSZ and multiplying that by the number of reads should > produce the number of bytes read. OK, thanks, fixed. > > BufferDesc * > > LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, > > - bool *foundPtr) > > + bool *foundPtr, bool *allocPtr) > > { > > BufferTag newTag; /* identity of requested block */ > > LocalBufferLookupEnt *hresult; > > @@ -144,6 +144,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, > > Assert(BufferTagsEqual(&bufHdr->tag, &newTag)); > > > > *foundPtr = PinLocalBuffer(bufHdr, true); > > + *allocPtr = false; > ... > > I would prefer you use consistent naming for > allocPtr/allocatedPtr/allocated. I also think that all the functions > taking it as an output argument should explain what it is > (BufferAlloc()/LocalBufferAlloc(), etc). I found myself doing a bit of > digging around to figure it out. You have a nice comment about it above > PrepareReadBuffer(). I think you may need to resign yourself to > restating that bit (or some version of it) for all of the functions > taking it as an argument. I worked on addressing your complaints, but while doing so I had second thoughts about even having that argument. The need for it came up while working on streaming recovery. Without it, whenever there were repeated references to the same block (as happens very often in the WAL), it'd issue extra useless POSIX_FADV_WILLNEED syscalls, so I wanted to find a way to distinguish the *first* miss for a given block, to be able to issue the advice just once before the actual read happens. But now I see that other users of streaming reads probably won't repeatedly stream the same block, and I am not proposing streaming recovery for PG 17. Time to simplify. I decided to kick allocatedPtr out to think about some more, but I really hope we'll be able to start a real background I/O instead of issuing advice in PG 18 proposals, and that'll be negotiated via IO_IN_PROGRESS, so then we'd never need allocatedPtr. Thus, removed. > > #ifndef BUFMGR_H > > #define BUFMGR_H > > > > +#include "pgstat.h" > > I don't know what we are supposed to do, but I would have included this > in bufmgr.c (where I actually needed it) instead of including it here. Fixed. > > +#include "port/pg_iovec.h" > > #include "storage/block.h" > > #include "storage/buf.h" > > #include "storage/bufpage.h" > > @@ -47,6 +49,8 @@ typedef enum > > RBM_ZERO_AND_CLEANUP_LOCK, /* Like RBM_ZERO_AND_LOCK, but locks the page > > * in "cleanup" mode */ > > RBM_ZERO_ON_ERROR, /* Read, but return an all-zeros page on error */ > > > + RBM_WILL_ZERO, /* Don't read from disk, caller will call > > + * ZeroBuffer() */ > > It's confusing that this (RBM_WILL_ZERO) is part of this commit since it > isn't used in this commit. Yeah. Removed. Bikeshedding call: I am open to better suggestions for the names PrepareReadBuffer() and CompleteReadBuffers(), they seem a little grammatically clumsy. I now also have a much simplified version of the streaming read patch. The short version is that it has all advanced features removed, so that now it *only* does the clustering required to build up large CompleteReadBuffers() calls. 
That's short and sweet, and enough for pg_prewarm to demonstrate 128KB reads, a good first step. Then WILLNEED advice, and then ramp-up are added as separate patches, for easier review. I've got some more patches in development that would re-add "extended" multi-relation mode with wider callback that can also stream zeroed buffers, as required so far only by recovery -- but I can propose that later. Other improvements: * Melanie didn't like the overloaded term "cluster" (off-list feedback). Now I use "block range" to describe a range of contiguous blocks (blocknum, nblocks). * KK didn't like "prefetch" being used with various meanings (off-list feedback). Better words found. * pgsr_private might as well be void *; uintptr_t is theoretically more general but doesn't seem to buy much for realistic use. * per_io_data might as well be called per_buffer_data (it's not per "I/O" and that isn't the level of abstraction for this API anyway which is about blocks and buffers). * I reordered some function arguments that jumped out after the above changes. * Once I started down the path of using a flags field to control various policy stuff as discussed up-thread, it started to seem clearer that callers probably shouldn't directly control I/O depth, which was all a bit ad-hoc and unfinished before. I think we'd likely want that to be centralised. New idea: it should be enough to be able to specify policies as required now and in future with flags. Thoughts? I've also included 4 of the WIP patches from earlier (based on an obsolete version of the vacuum thing, sorry I know you have a much better one now), which can be mostly ignored. Here I just wanted to share working code to drive vectored reads with different access patterns, and also show the API interactions and how they might each set flag bits. For example, parallel seq scan uses PGSR_FLAG_SEQUENTIAL to insist that its scans are sequential despite appearances, pg_prewarm uses PGSR_FLAG_FULL to declare that it'll read the whole relation so there is no point in ramping up, and the vacuum-related stuff uses PGSR_FLAG_MAINTENANCE to select tuning based on maintenance_io_concurrency instead of effective_io_concurrency.
Attachment
- v5-0001-Provide-vectored-variant-of-ReadBuffer.patch
- v5-0002-Give-SMgrRelation-pointers-a-well-defined-lifetim.patch
- v5-0003-Provide-API-for-streaming-reads-of-relations.patch
- v5-0004-Use-streaming-reads-in-pg_prewarm.patch
- v5-0005-Issue-advice-for-random-streaming-reads.patch
- v5-0006-Allow-streaming-reads-to-ramp-up-in-size.patch
- v5-0007-WIP-Use-streaming-reads-in-heapam-scans.patch
- v5-0008-WIP-Use-streaming-reads-in-vacuum.patch
- v5-0009-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patch
- v5-0010-WIP-Use-streaming-reads-in-bitmap-heapscan.patch
On 10/01/2024 06:13, Thomas Munro wrote: > Bikeshedding call: I am open to better suggestions for the names > PrepareReadBuffer() and CompleteReadBuffers(), they seem a little > grammatically clumsy. How will these functions work in the brave new async I/O world? I assume PrepareReadBuffer() will initiate the async I/O, and CompleteReadBuffers() will wait for it to complete. How about StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and WaitBufferRead()? About the signature of those functions: Does it make sense for CompleteReadBuffers() (or WaitReadBuffers()) function to take a vector of buffers? If StartReadBuffer() initiates the async I/O immediately, is there any benefit to batching the waiting? If StartReadBuffer() starts the async I/O, the idea that you can call ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for you in RBM_ZERO_* modes. Putting all that together, I propose: /* * Initiate reading a block from disk to the buffer cache. * * XXX: Until we have async I/O, this just allocates the buffer in the buffer * cache. The actual I/O happens in WaitReadBuffer(). */ Buffer StartReadBuffer(BufferManagerRelation bmr, ForkNumber forkNum, BlockNumber blockNum, BufferAccessStrategy strategy, ReadBufferMode mode, bool *foundPtr); /* * Wait for a read that was started earlier with StartReadBuffer() to finish. * * XXX: Until we have async I/O, this is the function that actually performs * the I/O. If multiple I/Os have been started with StartReadBuffer(), this * will try to perform all of them in one syscall. Subsequent calls to * WaitReadBuffer(), for those other buffers, will finish quickly. */ void WaitReadBuffer(Buffer buf); I'm not sure how well this fits with the streaming read API. The streaming read code performs grouping of adjacent blocks to one CompleteReadBuffers() call. If WaitReadBuffer() does the batching, that's not really required. But does that make sense with async I/O? With async I/O, will you need a vectorized version of StartReadBuffer() too? -- Heikki Linnakangas Neon (https://neon.tech)
On Thu, Jan 11, 2024 at 8:58 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On 10/01/2024 06:13, Thomas Munro wrote: > > Bikeshedding call: I am open to better suggestions for the names > > PrepareReadBuffer() and CompleteReadBuffers(), they seem a little > > grammatically clumsy. > > How will these functions work in the brave new async I/O world? I assume > PrepareReadBuffer() will initiate the async I/O, and > CompleteReadBuffers() will wait for it to complete. How about > StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and > WaitBufferRead()? What we have imagined so far is that the asynchronous version would probably have three steps, like this: * PrepareReadBuffer() -> pins one buffer, reports if found or IO/zeroing needed * StartReadBuffers() -> starts the I/O for n contiguous !found buffers * CompleteReadBuffers() -> waits, completes as necessary In the proposed synchronous version, the middle step is missing, but streaming_read.c directly calls smgrprefetch() instead. I thought about shoving that inside a prosthetic StartReadBuffers() function, but I backed out of simulating asynchronous I/O too fancifully. The streaming read API is where we really want to stabilise a nice API, so we can moves things around behind it if required. A bit of analysis of the one block -> nblocks change and the synchronous -> asynchronous change: Two things are new in a world with nblocks > 1. (1) We needed to be able to set BM_IO_IN_PROGRESS on more than one block at a time, but commit 12f3867f already provided that, and (2) someone else might come along and read a block in the middle of our range, effectively chopping our range into subranges. That's true also in master but when nblocks === 1 that's all-or-nothing, and now we have partial cases. In the proposed synchronous code, CompleteReadBuffers() claims as many contiguous BM_IO_IN_PROGRESS flags as it in the range, and then loops process the rest, skipping over any blocks that are already done. Further down in md.c, you might also cross a segment boundary. So that's two different reasons while a single call to CompleteReadBuffers() might finish up generating zero or more than one I/O system call, though very often it's one. Hmm, while spelling that out, I noticed an obvious problem and improvement to make to that part of v5. If backend #1 is trying to read blocks 101..103 and acquires BM_IO_IN_PROGRESS for 101, but backend #2 comes along and starts reading block 102 first, backend #1's StartBufferIO() call would wait for 102's I/O CV while it still holds BM_IO_IN_PROGRESS for block 101, potentially blocking a third backend #3 that wants to read block 101 even though no I/O is in progress for that block yet! At least that's deadlock free (because always in block order), but it seems like undesirable lock chaining. Here is my proposed improvement: StartBufferIO() gains a nowait flag. For the head block we wait, but while trying to build a larger range we don't. We'll try 102 again in the next loop, with a wait. Here is a small fixup for that. In an asynchronous version, that BM_IO_IN_PROGRESS negotiation would take place in StartReadBuffers() instead, which would be responsible for kicking off asynchronous I/Os (= asking a background worker to call pread(), or equivalent fancy kernel async APIs). One question is what it does if it finds a block in the middle that chops the read up, or for that matter a segment boundary. 
I don't think we have to decide now, but the two options seem to be that it starts one single I/O and reports its size, making it the client's problem to call again with the rest, or that it starts more than one I/O and they are somehow chained together so that the caller doesn't know about that and can later wait for all of them to finish using just one <handle thing>. (The reason why I'm not 100% certain how it will look is that the real, working code in the aio branch right now doesn't actually expose a vector/nblocks bufmgr interface at all, yet. Andres's original prototype had a single-block Start*(), Complete*() design, but a lower level of the AIO system notices if pending read operations are adjacent and could be merged. While discussing all this we decided it was a bit strange to have lower code deal with allocating, chaining and processing lots of separate I/O objects in shared memory, when higher level code could often work in bigger ranges up front, and then interact with the AIO subsystem with many fewer objects and steps. Also, the present simple and lightweight synchronous proposal that lacks the whole subsystem that could do that by magic.) > About the signature of those functions: Does it make sense for > CompleteReadBuffers() (or WaitReadBuffers()) function to take a vector > of buffers? If StartReadBuffer() initiates the async I/O immediately, is > there any benefit to batching the waiting? > > If StartReadBuffer() starts the async I/O, the idea that you can call > ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think > StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for > you in RBM_ZERO_* modes. Yeah, good thoughts, and topics that have occupied me for some time now. I also thought that StartReadBuffer() should take ReadBufferMode, but I came to the idea that it probably shouldn't like this: I started working on all this by trying to implement the most complicated case I could imagine, streaming recovery, and then working back to the easy cases that just do scans with RBM_NORMAL. In recovery, we can predict that a block will be zeroed using WAL flags, and pre-existing cross-checks at redo time that enforce that the flags and redo code definitely agree on that, but we can't predict which exact ReadBufferMode the redo code will use, RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK (or mode=RBM_NORMAL and get_cleanup_lock=true, as the comment warns them not to, but I digress). That's OK, because we can't take locks while looking ahead in recovery anyway (the redo routine carefully controls lock order/protocol), so the code to actually do the locking needs to be somewhere near the output end of the stream when the redo code calls XLogReadBufferForRedoExtended(). But if you try to use RBM_XXX in these interfaces, it begins to look pretty funny: the streaming callback needs to be able to say which ReadBufferMode, but anywhere near Prepare*(), Start*() or even Complete*() is too soon, so maybe we need to invent a new value RBM_WILL_ZERO that doesn't yet say which of the zero modes to use, and then the redo routine needs to pass in the RBM_ZERO_AND_{LOCK,CLEANUP_LOCK} value to XLogReadBufferForRedoExtended() and it does it in a separate step anyway, so we are ignoring ReadBufferMode. But that feels just wrong -- we'd be using RBM but implementing them only partially. 
Another way to put it is that ReadBufferMode actually conflates a bunch of different behaviour that applies at different times that we are now separating, and recovery reveals this most clearly because it doesn't have all the information needed while looking ahead. It might be possible to shove more information in the WAL to fix the information problem, but it seemed more natural to me to separate the aspects of ReadBufferMode, because that isn't the only problem: groupwise-processing of lock doesn't even make sense. So I teased a couple of those aspects out into separate flags, for example: The streaming read interface has two variants: the "simple" implicit RBM_NORMAL, single relation, single fork version that is used in most client examples and probably everything involving the executor, and the "extended" version (like the earlier versions in this thread, removed for now based on complaints that most early uses don't use it, will bring it back separately later with streaming recovery patches). In the extended version, the streaming callback can set *will_zero = true, which is about all the info that recovery can figure out from the WAL anyway, and then XLogReadBufferForRedoExtended() will later call ZeroBuffer() because at that time we have the ReadBufferMode. The _ZERO_ON_ERROR aspect is a case where CompleteReadBuffers() is the right time and makes sense to process as a batch, so it becomes a flag. > Putting all that together, I propose: > > /* > * Initiate reading a block from disk to the buffer cache. > * > * XXX: Until we have async I/O, this just allocates the buffer in the > buffer > * cache. The actual I/O happens in WaitReadBuffer(). > */ > Buffer > StartReadBuffer(BufferManagerRelation bmr, > ForkNumber forkNum, > BlockNumber blockNum, > BufferAccessStrategy strategy, > ReadBufferMode mode, > bool *foundPtr); > > /* > * Wait for a read that was started earlier with StartReadBuffer() to > finish. > * > * XXX: Until we have async I/O, this is the function that actually > performs > * the I/O. If multiple I/Os have been started with StartReadBuffer(), this > * will try to perform all of them in one syscall. Subsequent calls to > * WaitReadBuffer(), for those other buffers, will finish quickly. > */ > void > WaitReadBuffer(Buffer buf); I'm confused about where the extra state lives that would allow the communication required to build a larger I/O. In the AIO branch, it does look a little more like that, but there is more magic state and machinery hiding behind the curtain: the backend's pending I/O list builds up a chain of I/Os, and when you try to wait, if it hasn't already been submitted to the kernel/bgworkers yet it will be, and before that merging will happen. So you get bigger I/Os without having to say so explicitly. For this synchronous version (and hopefully soon a more efficient improved version in the AIO branch), we want to take advantage of the client's pre-existing and more cheaply obtained knowledge of ranges. > I'm not sure how well this fits with the streaming read API. The > streaming read code performs grouping of adjacent blocks to one > CompleteReadBuffers() call. If WaitReadBuffer() does the batching, > that's not really required. But does that make sense with async I/O? > With async I/O, will you need a vectorized version of StartReadBuffer() too? I think so, yes.
Attachment
On 11/01/2024 05:19, Thomas Munro wrote: > On Thu, Jan 11, 2024 at 8:58 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> On 10/01/2024 06:13, Thomas Munro wrote: >>> Bikeshedding call: I am open to better suggestions for the names >>> PrepareReadBuffer() and CompleteReadBuffers(), they seem a little >>> grammatically clumsy. >> >> How will these functions work in the brave new async I/O world? I assume >> PrepareReadBuffer() will initiate the async I/O, and >> CompleteReadBuffers() will wait for it to complete. How about >> StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and >> WaitBufferRead()? > > What we have imagined so far is that the asynchronous version would > probably have three steps, like this: > > * PrepareReadBuffer() -> pins one buffer, reports if found or IO/zeroing needed > * StartReadBuffers() -> starts the I/O for n contiguous !found buffers > * CompleteReadBuffers() -> waits, completes as necessary Ok. It feels surprising to have three steps. I understand that you need two steps, one to start the I/O and another to wait for them to finish, but why do you need separate Prepare and Start steps? What can you do in between them? (You explained that. I'm just saying that that's my initial reaction when seeing that API. It is surprising.) >> If StartReadBuffer() starts the async I/O, the idea that you can call >> ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think >> StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for >> you in RBM_ZERO_* modes. > > Yeah, good thoughts, and topics that have occupied me for some time > now. I also thought that StartReadBuffer() should take > ReadBufferMode, but I came to the idea that it probably shouldn't like > this: > > I started working on all this by trying to implement the most > complicated case I could imagine, streaming recovery, and then working > back to the easy cases that just do scans with RBM_NORMAL. In > recovery, we can predict that a block will be zeroed using WAL flags, > and pre-existing cross-checks at redo time that enforce that the flags > and redo code definitely agree on that, but we can't predict which > exact ReadBufferMode the redo code will use, RBM_ZERO_AND_LOCK or > RBM_ZERO_AND_CLEANUP_LOCK (or mode=RBM_NORMAL and > get_cleanup_lock=true, as the comment warns them not to, but I > digress). > > That's OK, because we can't take locks while looking ahead in recovery > anyway (the redo routine carefully controls lock order/protocol), so > the code to actually do the locking needs to be somewhere near the > output end of the stream when the redo code calls > XLogReadBufferForRedoExtended(). But if you try to use RBM_XXX in > these interfaces, it begins to look pretty funny: the streaming > callback needs to be able to say which ReadBufferMode, but anywhere > near Prepare*(), Start*() or even Complete*() is too soon, so maybe we > need to invent a new value RBM_WILL_ZERO that doesn't yet say which of > the zero modes to use, and then the redo routine needs to pass in the > RBM_ZERO_AND_{LOCK,CLEANUP_LOCK} value to > XLogReadBufferForRedoExtended() and it does it in a separate step > anyway, so we are ignoring ReadBufferMode. But that feels just wrong > -- we'd be using RBM but implementing them only partially. I see. When you're about to zero the page, there's not much point in splitting the operation into Prepare/Start/Complete stages anyway. You're not actually doing any I/O. 
Perhaps it's best to have a separate "Buffer ZeroBuffer(Relation, ForkNumber, BlockNumber, lockmode)" function that does the same as ReadBuffer(RBM_ZERO_AND_[LOCK|CLEANUP_LOCK]) today. > The _ZERO_ON_ERROR aspect is a case where CompleteReadBuffers() is the > right time and makes sense to process as a batch, so it becomes a > flag. +1 >> Putting all that together, I propose: >> >> /* >> * Initiate reading a block from disk to the buffer cache. >> * >> * XXX: Until we have async I/O, this just allocates the buffer in the >> buffer >> * cache. The actual I/O happens in WaitReadBuffer(). >> */ >> Buffer >> StartReadBuffer(BufferManagerRelation bmr, >> ForkNumber forkNum, >> BlockNumber blockNum, >> BufferAccessStrategy strategy, >> ReadBufferMode mode, >> bool *foundPtr); >> >> /* >> * Wait for a read that was started earlier with StartReadBuffer() to >> finish. >> * >> * XXX: Until we have async I/O, this is the function that actually >> performs >> * the I/O. If multiple I/Os have been started with StartReadBuffer(), this >> * will try to perform all of them in one syscall. Subsequent calls to >> * WaitReadBuffer(), for those other buffers, will finish quickly. >> */ >> void >> WaitReadBuffer(Buffer buf); > > I'm confused about where the extra state lives that would allow the > communication required to build a larger I/O. In the AIO branch, it > does look a little more like that, but there is more magic state and > machinery hiding behind the curtain: the backend's pending I/O list > builds up a chain of I/Os, and when you try to wait, if it hasn't > already been submitted to the kernel/bgworkers yet it will be, and > before that merging will happen. So you get bigger I/Os without > having to say so explicitly. Yeah, I was thinking that there would be a global variable that holds a list of operations started with StartReadBuffer(). > For this synchronous version (and hopefully soon a more efficient > improved version in the AIO branch), we want to take advantage of the > client's pre-existing and more cheaply obtained knowledge of ranges. Ok. > In an asynchronous version, that BM_IO_IN_PROGRESS negotiation would > take place in StartReadBuffers() instead, which would be responsible > for kicking off asynchronous I/Os (= asking a background worker to > call pread(), or equivalent fancy kernel async APIs). One question is > what it does if it finds a block in the middle that chops the read up, > or for that matter a segment boundary. I don't think we have to > decide now, but the two options seem to be that it starts one single > I/O and reports its size, making it the client's problem to call again > with the rest, or that it starts more than one I/O and they are > somehow chained together so that the caller doesn't know about that > and can later wait for all of them to finish using just one <handle > thing>. > > (The reason why I'm not 100% certain how it will look is that the > real, working code in the aio branch right now doesn't actually expose > a vector/nblocks bufmgr interface at all, yet. Andres's original > prototype had a single-block Start*(), Complete*() design, but a lower > level of the AIO system notices if pending read operations are > adjacent and could be merged. 
While discussing all this we decided it > was a bit strange to have lower code deal with allocating, chaining > and processing lots of separate I/O objects in shared memory, when > higher level code could often work in bigger ranges up front, and then > interact with the AIO subsystem with many fewer objects and steps. > Also, the present simple and lightweight synchronous proposal that > lacks the whole subsystem that could do that by magic.) Hmm, let's sketch out what this would look like with the approach that you always start one I/O and report the size: /* * Initiate reading a range of blocks from disk to the buffer cache. * * If the pages were already found in the buffer cache, returns true. * Otherwise false, and the caller must call WaitReadBufferRange() to * wait for the I/O to finish, before accessing the buffers. * * 'buffers' is a caller-supplied array large enough to hold (endBlk - * startBlk) buffers. It is filled with the buffers that the pages are * read into. * * This always starts a read of at least one block, and tries to * initiate one read I/O for the whole range if possible. But if the * read cannot be performed as a single I/O, a partial read is started, * and *endBlk is updated to reflect the range for which the read was * started. The caller can make another call to read the rest of the * range. A partial read can occur if some, but not all, of the pages * are already in the buffer cache, or because the range crosses a * segment boundary. * * XXX: Until we have async I/O, this just allocates the buffers in the * buffer cache. And perhaps calls smgrprefetch(). The actual I/O * happens in WaitReadBufferRange(). */ bool StartReadBufferRange(BufferManagerRelation bmr, ForkNumber forkNum, BlockNumber startBlk, BlockNumber *endBlk, BufferAccessStrategy strategy, Buffer *buffers); /* * Wait for a read that was started earlier with StartReadBufferRange() * to finish. * * XXX: Until we have async I/O, this is the function that actually * performs * the I/O. StartReadBufferRange already checked that the pages can be * read in one preadv() syscall. However, it's possible that another * backend performed the read for some of the pages in between. In that * case this will perform multiple syscalls, after all. */ void WaitReadBufferRange(Buffer *buffers, int nbuffers, bool zero_on_error); /* * Allocates a buffer for the given page in the buffer cache, and locks * the page. No I/O is initiated. The caller must initialize it and mark * the buffer dirty before releasing the lock. * * This is equivalent to ReadBuffer(RBM_ZERO_AND_LOCK) or * ReadBuffer(RBM_ZERO_AND_CLEANUP_LOCK). */ Buffer ZeroBuffer(BufferManagerRelation bmr, ForkNumber forkNum, BlockNumber blockNum, BufferAccessStrategy strategy, bool cleanup_lock); This range-oriented API fits the callers pretty well: the streaming read API works with block ranges already. There is no need for separate Prepare and Start steps. One weakness is that if StartReadBufferRange() finds that the range is "chopped up", it needs to return and throw away the work it had to do to look up the next buffer. So in the extreme case that every other block in the range is in the buffer cache, each call would look up two buffers in the buffer cache, startBlk and startBlk + 1, but only return one buffer to the caller. -- Heikki Linnakangas Neon (https://neon.tech)
On Fri, Jan 12, 2024 at 3:31 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Ok. It feels surprising to have three steps. I understand that you need > two steps, one to start the I/O and another to wait for them to finish, > but why do you need separate Prepare and Start steps? What can you do in > between them? (You explained that. I'm just saying that that's my > initial reaction when seeing that API. It is surprising.) Actually I don't think I explained that very well. First, some more detail about how a two-step version would work: * we only want to start one I/O in an StartReadBuffers() call, because otherwise it is hard/impossible for the caller to cap concurrency I/O * therefore StartReadBuffers() can handle sequences matching /^H*I*H*/ ("H" = hit, "I" = miss, I/O) in one call * in the asynchronous version, "I" in that pattern means we got BM_IO_IN_PROGRESS * in the synchronous version, "I" means that it's not valid, not BM_IN_IN_PROGRESS, but we won't actually try to get BM_IO_IN_PROGRESS until the later Complete/Wait call (and then the answer might chagne, but we'll just deal with that by looping in the synchronous version) * streaming_read.c has to deal with buffering up work that StartReadBuffers() didn't accept * that's actually quite easy, you just use the rest to create a new range in the next slot Previously I thought the requirement to deal with buffering future stuff that StartReadBuffers() couldn't accept yet was a pain, and life became so much simpler once I deleted all that and exposed PrepareReadBuffer() to the calling code. Perhaps I just hadn't done a good enough job of that. The other disadvantage you reminded me of was the duplicate buffer lookup in certain unlucky patterns, which I had forgotten about in my previous email. But I guess it's not fatal to the idea and there is a potential partial mitigation. (See below). A third thing was the requirement for communication between StartReadBuffers() and CompleteReadBuffers() which I originally had an "opaque" object that the caller has to keep around that held private state. It seemed nice to go back to talking just about buffer numbers, but that's not really an argument for anything... OK, I'm going to try the two-step version (again) with interfaces along the lines you sketched out... more soon. > I see. When you're about to zero the page, there's not much point in > splitting the operation into Prepare/Start/Complete stages anyway. > You're not actually doing any I/O. Perhaps it's best to have a separate > "Buffer ZeroBuffer(Relation, ForkNumber, BlockNumber, lockmode)" > function that does the same as > ReadBuffer(RBM_ZERO_AND_[LOCK|CLEANUP_LOCK]) today. That makes sense, but... hmm, sometimes just allocating a page generates I/O if it has to evict a dirty buffer. Nothing in this code does anything fancy about that, but imagine some hypothetical future thing that manages to do that asynchronously -- then we might want to take advantage of the ability to stream even a zeroed page, ie doing something ahead of time? Just a thought for another day, and perhaps that is just an argument for including it in the streaming read API, but it doesn't mean that the bufmgr.c API can't be as you say. > One weakness is that if StartReadBufferRange() finds that the range is > "chopped up", it needs to return and throw away the work it had to do to > look up the next buffer. 
So in the extreme case that every other block > in the range is in the buffer cache, each call would look up two buffers > in the buffer cache, startBlk and startBlk + 1, but only return one > buffer to the caller. Yeah, right. This was one of the observations that influenced my PrepareReadBuffer() three-step thing that I'd forgotten. To spell that out with an example, suppose the buffer pool contains every odd numbered block. Successive StartReadBuffers() calls would process "HMHm", "MHm", "MHm"... where "m" represents a miss that we can't do anything with for a block we'll look up in the buffer pool again in the next call. With the PrepareReadBuffer() design, that miss just starts a new range and we don't have to look it up again. Hmm, I suppose that could be mitigated somewhat with ReadRecentBuffer() if we can find somewhere decent to store it. BTW it was while thinking about and testing cases like that that I found Palak Chaturvedi's https://commitfest.postgresql.org/46/4426/ extremely useful. It can kick out every second page or any other range-chopping scenario you can express in a WHERE clause. I would quite like to get that tool into the tree...
Hi, Thanks for working on this! On Wed, 10 Jan 2024 at 07:14, Thomas Munro <thomas.munro@gmail.com> wrote: > Thanks! I committed the patches up as far as smgr.c before the > holidays. The next thing to commit would be the bufmgr.c one for > vectored ReadBuffer(), v5-0001. Here's my response to your review of > that, which triggered quite a few changes. > > See also new version of the streaming_read.c patch, with change list > at end. (I'll talk about v5-0002, the SMgrRelation lifetime one, over > on the separate thread about that where Heikki posted a better > version.) I have a couple of comments / questions. 0001-Provide-vectored-variant-of-ReadBuffer: - Do we need to pass the hit variable to ReadBuffer_common()? I think it can be just declared in the ReadBuffer_common() now. 0003-Provide-API-for-streaming-reads-of-relations: - Do we need to re-think about getting a victim buffer logic? StrategyGetBuffer() function errors if it can not find any unpinned buffers, this can be more common in the async world since we pin buffers before completing the read (while looking ahead). - If the returned block from the callback is an invalid block, pg_streaming_read_look_ahead() sets pgsr->finished = true. Could there be cases like the returned block being an invalid block but we should continue to read after this invalid block? - max_pinned_buffers and pinned_buffers_trigger variables are set in the initialization part (in the pg_streaming_read_buffer_alloc_internal() function) then they do not change. In some cases there could be no acquirable buffers to pin while initializing the pgsr (LimitAdditionalPins() set max_pinned_buffers to 1) but while the read is continuing there could be chances to create larger reads (other consecutive reads are finished while this read is continuing). Do you think that trying to reset max_pinned_buffers and pinned_buffers_trigger to have higher values after the initialization to have larger reads make sense? + /* Is there a head range that we can't extend? */ + head_range = &pgsr->ranges[pgsr->head]; + if (head_range->nblocks > 0 && + (!need_complete || + !head_range->need_complete || + head_range->blocknum + head_range->nblocks != blocknum)) + { + /* Yes, time to start building a new one. */ + head_range = pg_streaming_read_new_range(pgsr); - I think if both need_complete and head_range->need_complete are false, we can extend the head range regardless of the consecutiveness of the blocks. 0006-Allow-streaming-reads-to-ramp-up-in-size: - ramp_up_pin_limit variable is declared as an int but we do not check the overflow while doubling it. This could be a problem in longer reads. -- Regards, Nazir Bilal Yavuz Microsoft
On Wed, Feb 7, 2024 at 11:54 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > 0001-Provide-vectored-variant-of-ReadBuffer: > > - Do we need to pass the hit variable to ReadBuffer_common()? I think > it can be just declared in the ReadBuffer_common() now. Right, thanks! Done, in the version I'll post shortly. > 0003-Provide-API-for-streaming-reads-of-relations: > > - Do we need to re-think about getting a victim buffer logic? > StrategyGetBuffer() function errors if it can not find any unpinned > buffers, this can be more common in the async world since we pin > buffers before completing the read (while looking ahead). Hmm, well that is what the pin limit machinery is supposed to be the solution to. It has always been possible to see that error if shared_buffers is too small, so that your backends can't pin what they need to make progress because there are too many other backends, the whole buffer pool is pinned and there is nothing available to steal/evict. Here, sure, we pin more stuff per backend, but not more than a "fair share", that is, Buffers / max backends, so it's not any worse, is it? Well maybe it's marginally worse in some case, for example if a query that uses many streams has one pinned buffer per stream (which we always allow) where before we'd have acquired and released pins in a slightly different sequence or whatever, but there is always going to be a minimum shared_buffers that will work at all for a given workload and we aren't changing it by much if at all here. If you're anywhere near that limit, your performance must be so bad that it'd only be a toy setting anyway. Does that sound reasonable? Note that this isn't the first to use multi-pin logic or that limit mechanism: that was the new extension code that shipped in 16. This will do that more often, though. > - If the returned block from the callback is an invalid block, > pg_streaming_read_look_ahead() sets pgsr->finished = true. Could there > be cases like the returned block being an invalid block but we should > continue to read after this invalid block? Yeah, I think there will be, and I think we should do it with some kind of reset/restart function. I don't think we need it for the current users so I haven't included it yet (there is a maybe-related discussion about reset for another reasons, I think Melanie has an idea about that), but I think something like that will useful for future stuff like streaming recovery, where you can run out of WAL to read but more will come via the network soon. > - max_pinned_buffers and pinned_buffers_trigger variables are set in > the initialization part (in the > pg_streaming_read_buffer_alloc_internal() function) then they do not > change. In some cases there could be no acquirable buffers to pin > while initializing the pgsr (LimitAdditionalPins() set > max_pinned_buffers to 1) but while the read is continuing there could > be chances to create larger reads (other consecutive reads are > finished while this read is continuing). Do you think that trying to > reset max_pinned_buffers and pinned_buffers_trigger to have higher > values after the initialization to have larger reads make sense? That sounds hard! You're right that in the execution of a query there might well be cases like that (inner and outer scan of a hash join don't actually run at the same time, likewise for various other plan shapes), and something that would magically and dynamically balance resource usage might be ideal, but I don't know where to begin. 
Concretely, as long as your buffer pool is measured in gigabytes and your max backends is measured in hundreds, the per backend pin limit should actually be fairly hard to hit anyway, as it would be in the thousands. So I don't think it is as important as other resource usage balance problems that we also don't attempt (memory, CPU, I/O bandwidth). > + /* Is there a head range that we can't extend? */ > + head_range = &pgsr->ranges[pgsr->head]; > + if (head_range->nblocks > 0 && > + (!need_complete || > + !head_range->need_complete || > + head_range->blocknum + head_range->nblocks != blocknum)) > + { > + /* Yes, time to start building a new one. */ > + head_range = pg_streaming_read_new_range(pgsr); > > - I think if both need_complete and head_range->need_complete are > false, we can extend the head range regardless of the consecutiveness > of the blocks. Yeah, I think we can experiment with ideas like that. Not done yet but I'm thinking about it -- more shortly. > 0006-Allow-streaming-reads-to-ramp-up-in-size: > > - ramp_up_pin_limit variable is declared as an int but we do not check > the overflow while doubling it. This could be a problem in longer > reads. But it can't get above very high, because eventually it exceeds max_pinned_buffers, which is anchored to the ground by various small limits.
On Fri, Jan 12, 2024 at 12:32 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Jan 12, 2024 at 3:31 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > Ok. It feels surprising to have three steps. I understand that you need > > two steps, one to start the I/O and another to wait for them to finish, > > but why do you need separate Prepare and Start steps? What can you do in > > between them? (You explained that. I'm just saying that that's my > > initial reaction when seeing that API. It is surprising.) [...] > OK, I'm going to try the two-step version (again) with interfaces > along the lines you sketched out... more soon. Here's the 2 step version. The streaming_read.c API is unchanged, but the bugmfr.c API now has only the following extra functions: bool StartReadBuffers(..., int *nblocks, ..., ReadBuffersOperation *op) WaitReadBuffers(ReadBuffersOperation *op) That is, the PrepareReadBuffer() step is gone. StartReadBuffers() updates *nblocks to the number actually processed, which is always at least one. If it returns true, then you must call WaitReadBuffers(). When it finds a 'hit' (doesn't need I/O), then that one final (or only) buffer is processed, but no more. StartReadBuffers() always conceptually starts 0 or 1 I/Os. Example: if you ask for 16 blocks, and it finds two misses followed by a hit, it'll set *nblocks = 3, smgrprefetch(2 blocks), and smgrreadv(2 blocks) them in WaitReadBuffer(). The caller can't really tell that the third block was a hit. The only case it can distinguish is if the first one was a hit, and then it returns false and sets *nblocks = 1. This arrangement, where the results include the 'boundary' block that ends the readable range, avoids the double-lookup problem we discussed upthread. I think it should probably also be able to handle multiple consecutive 'hits' at the start of a sequence, but in this version I kept it simpler. It couldn't ever handle more than one after an I/O range though, because it can't guess if the one after will be a hit or a miss. If it turned out to be a miss, we don't want to start a second I/O, so unless we decide that we're happy unpinning and re-looking-up next time, it's better to give up then. Hence the idea of including the hit as a bonus block on the end. It took me a long time but I eventually worked my way around to preferring this way over the 3 step version. streaming_read.c now has to do a bit more work including sometimes 'ungetting' a block (ie deferring one that the callback has requested until next time), to resolve some circularities that come up with flow control. But I suspect you'd probably finish up having to deal with 'short' writes anyway, because in the asynchronous future, in a three-step version, the StartReadBuffers() (as 2nd step) might also be short when it fails to get enough BM_IO_IN_PROGRESS flags, so you have to deal with some version of these problems anyway. Thoughts? I am still thinking about how to improve the coding in streaming_read.c, ie to simplify and beautify the main control loop and improve the flow control logic. And looking for interesting test cases to hit various conditions in it and try to break it. And trying to figure out how this read-coalescing and parallel seq scan's block allocator might interfere with each other to produce non-idea patterns of system calls. Here are some example strace results generated by a couple of simple queries. See CF #4426 for pg_buffercache_invalidate(). 
=== Sequential scan example === create table big as select generate_series(1, 10000000); select count(*) from big; pread64(81, ...) = 8192 <-- starts small pread64(81, ...) = 16384 pread64(81, ...) = 32768 pread64(81, ...) = 65536 pread64(81, ...) = 131072 <-- fully ramped up size reached preadv(81, ...) = 131072 <-- more Vs seen as buffers fill up/fragments preadv(81, ...) = 131072 ...repeating... preadv(81, ...) = 131072 preadv(81, ...) = 131072 pread64(81, ...) = 8192 <-- end fragment create table small as select generate_series(1, 100000); select bool_and(pg_buffercache_invalidate(bufferid)) from pg_buffercache where relfilenode = pg_relation_filenode('small') and relblocknumber % 3 != 0; -- <-- kick out every 3rd block select count(*) from small; preadv(88, ...) = 16384 <-- just the 2-block fragments we need to load preadv(88, ...) = 16384 preadv(88, ...) = 16384 === Bitmap heapscan example === create table heap (i int primary key); insert into heap select generate_series(1, 1000000); select bool_and(pg_buffercache_invalidate(bufferid)) from pg_buffercache where relfilenode = pg_relation_filenode('heap'); select count(i) from heap where i in (10, 1000, 10000, 100000) or i in (20, 200, 2000, 200000); pread64(75, ..., 8192, 0) = 8192 fadvise64(75, 32768, 8192, POSIX_FADV_WILLNEED) = 0 fadvise64(75, 65536, 8192, POSIX_FADV_WILLNEED) = 0 pread64(75, ..., 8192, 32768) = 8192 pread64(75, ..., 8192, 65536) = 8192 fadvise64(75, 360448, 8192, POSIX_FADV_WILLNEED) = 0 fadvise64(75, 3620864, 8192, POSIX_FADV_WILLNEED) = 0 fadvise64(75, 7241728, 8192, POSIX_FADV_WILLNEED) = 0 pread64(75, ..., 8192, 360448) = 8192 pread64(75, ..., 8192, 3620864) = 8192 pread64(75, ..., 8192, 7241728) = 8192
Attachment
- v5-0001-Provide-vectored-variant-of-ReadBuffer.patch
- v5-0002-Provide-API-for-streaming-reads-of-relations.patch
- v5-0003-Use-streaming-reads-in-pg_prewarm.patch
- v5-0004-WIP-Use-streaming-reads-in-heapam-scans.patch
- v5-0005-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patch
- v5-0006-WIP-Use-streaming-reads-in-bitmap-heapscan.patch
On Tue, Feb 27, 2024 at 9:25 AM Thomas Munro <thomas.munro@gmail.com> wrote: > Here's the 2 step version. The streaming_read.c API is unchanged, but > the bugmfr.c API now has only the following extra functions: > > bool StartReadBuffers(..., int *nblocks, ..., ReadBuffersOperation *op) > WaitReadBuffers(ReadBuffersOperation *op) I wonder if there are semi-standard names that people use for this kind of API. Somehow I like "start" and "finish" or "start" and "complete" better than "start" and "wait". But I don't really know what's best. If there's a usual practice, it'd be good to adhere to it. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Feb 27, 2024 at 5:03 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 27, 2024 at 9:25 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Here's the 2 step version. The streaming_read.c API is unchanged, but
> > the bufmgr.c API now has only the following extra functions:
> >
> > bool StartReadBuffers(..., int *nblocks, ..., ReadBuffersOperation *op)
> > WaitReadBuffers(ReadBuffersOperation *op)
>
> I wonder if there are semi-standard names that people use for this
> kind of API. Somehow I like "start" and "finish" or "start" and
> "complete" better than "start" and "wait". But I don't really know
> what's best. If there's a usual practice, it'd be good to adhere to
> it.

I think "complete" has a subtly different meaning, which is why I liked Heikki's suggestion. One way to look at it is that only one backend can "complete" a read, but any number of backends can "wait". But I don't have strong views on this. It feels like this API is relatively difficult to use directly anyway, so almost all users will go through the streaming read API.

Here's a new version with two main improvements. (Note that I'm only talking about 0001-0003 here; the rest are useful for testing but are just outdated versions of patches that have their own threads.)

1. Melanie discovered small regressions in all-cached simple scans. Here's a better look-ahead distance control algorithm that addresses that. First let me state the updated goals of the algorithm:

A) for all-hit scans, pin very few buffers, since that can't help if we're not doing any I/O
B) for all-miss sequential scans, pin only as many buffers as it takes to build full-sized I/Os, since fully sequential scans are left to the OS to optimise for now (ie no "advice")
C) for all-miss random scans, pin as many buffers as it takes to reach our I/O concurrency level

In all cases, respect the per-backend pin limit as a last resort limit (roughly NBuffers / max_connections), but we're now actively trying *not* to use so many. For patterns in between the A, B, C extremes, do something in between. The way the new algorithm attempts to classify the scan adaptively over time is as follows:

* the look-ahead distance starts out at one block (behaviour A)
* every time we start an I/O, we double the distance until we reach the max pin limit (behaviour C), or, if we're not issuing "advice" because sequential access is detected, until we reach MAX_BUFFERS_PER_TRANSFER (behaviour B)
* every time we get a hit, we decrement the distance by one (we move slowly back to behaviour A)

Query to observe a system transitioning A->B->A->B, when doing a full scan that has ~50 contiguous blocks already in shared buffers somewhere in the middle:

create extension pg_prewarm;
create extension pg_buffercache;
set max_parallel_workers_per_gather = 0;
create table t (i int);
insert into t select generate_series(1, 100000);
select pg_prewarm('t');
select bool_and(pg_buffercache_invalidate(bufferid))
  from pg_buffercache
 where relfilenode = pg_relation_filenode('t')
   and (relblocknumber between 0 and 100 or relblocknumber > 150);
select count(*) from t;

pread(31,...) = 8192 (0x2000)     <--- start small (A)
preadv(31,...) = 16384 (0x4000)   <--- ramp up...
preadv(31,...) = 32768 (0x8000)
preadv(31,...) = 65536 (0x10000)
preadv(31,...) = 131072 (0x20000) <--- full size (B)
preadv(31,...) = 131072 (0x20000)
...
preadv(31,...) = 131072 (0x20000)
preadv(31,...) = 49152 (0xc000)   <--- end of misses, decay to A
pread(31,...) = 8192 (0x2000)     <--- start small again (A)
preadv(31,...) = 16384 (0x4000)
preadv(31,...) = 32768 (0x8000)
preadv(31,...) = 65536 (0x10000)
preadv(31,...) = 131072 (0x20000) <--- full size (B)
preadv(31,...) = 131072 (0x20000)
...
preadv(31,...) = 131072 (0x20000)
preadv(31,...) = 122880 (0x1e000) <--- end of relation

Note that this never pins more than 16 buffers, because it's case B, not case C in the description above. There is no benefit in looking further ahead if you're relying on the kernel's sequential prefetching. The fully cached regression Melanie reported now stays entirely in A. The previous coding would ramp up to a high look-ahead distance for no good reason and never ramp down, so it was wasteful and slightly slower than master.

2. Melanie and I noticed while discussing the pre-existing ad hoc bitmap heap scan prefetching that this thing should respect the {effective,maintenance}_io_concurrency setting of the containing tablespace. Note special cases to avoid problems while stream-reading pg_tablespace itself and pg_database in backend initialisation.

There is a third problem that I'm still working on: the behaviour for very small values of effective_io_concurrency isn't quite right, as discussed in detail by Tomas and Melanie on the bitmap heapscan thread. The attached makes 0 do approximately the right thing (though I'm hoping to make it less special), but other small numbers aren't quite right yet -- 1 is still issuing a useless fadvise at the wrong time, and 2 is working in groups of N at a time instead of interleaving as you might perhaps expect. These are probably side-effects of my focusing on coalescing large reads and losing sight of the small cases. I need a little more adaptivity and generality in the algorithm at the small end, not least because 1 is the default value. I'll share a patch to improve that very soon.
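To make the ramp-up and decay rules above concrete, here is a minimal C sketch of the distance adjustment as described in prose; every name here is illustrative, not the patch's actual code:

static inline void
distance_after_hit(int *distance)
{
	/* Hits decay the distance by one, drifting back towards behaviour A. */
	if (*distance > 1)
		(*distance)--;
}

static inline void
distance_after_io_start(int *distance, bool issuing_advice,
						int max_buffers_per_transfer, int max_pin_limit)
{
	/*
	 * Misses double the distance.  With "advice" (random access) we may
	 * ramp all the way to the pin limit (behaviour C); for sequential
	 * access there is no point looking past one full-sized read
	 * (behaviour B).
	 */
	int			limit = issuing_advice ? max_pin_limit : max_buffers_per_transfer;

	*distance = Min(*distance * 2, limit);
}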
Attachment
- v6-0001-Provide-vectored-variant-of-ReadBuffer.patch
- v6-0002-Provide-API-for-streaming-reads-of-relations.patch
- v6-0003-Use-streaming-reads-in-pg_prewarm.patch
- v6-0004-WIP-Use-streaming-reads-in-heapam-scans.patch
- v6-0005-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patch
- v6-0006-WIP-Use-streaming-reads-in-bitmap-heapscan.patch
- v6-0007-WIP-pg_buffercache_invalidate.patch
On Sat, Mar 9, 2024 at 3:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>

Hi Thomas,

I am planning to review this patch set, so I started going through 0001. I have a question related to how we are issuing smgrprefetch in StartReadBuffers():

+	if (operation->io_buffers_len > 0)
+	{
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
-			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
-			}
-			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
+			/*
+			 * In theory we should only do this if PrepareReadBuffers() had to
+			 * allocate new buffers above. That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity. Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
 		}

This is always issuing smgrprefetch starting with the input blockNum. Shouldn't we pass the first blockNum that we did not find in the buffer pool? So basically, in the loop above this call where we are doing PrepareReadBuffer(), we should track the first blockNum for which 'found' is not true and pass that blockNum into smgrprefetch() as the first block, right?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Mar 12, 2024 at 7:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I am planning to review this patch set, so I started going through 0001.
> I have a question related to how we are issuing smgrprefetch in
> StartReadBuffers()

Thanks!

> +			/*
> +			 * In theory we should only do this if PrepareReadBuffers() had to
> +			 * allocate new buffers above. That way, if two calls to
> +			 * StartReadBuffers() were made for the same blocks before
> +			 * WaitReadBuffers(), only the first would issue the advice.
> +			 * That'd be a better simulation of true asynchronous I/O, which
> +			 * would only start the I/O once, but isn't done here for
> +			 * simplicity. Note also that the following call might actually
> +			 * issue two advice calls if we cross a segment boundary; in a
> +			 * true asynchronous version we might choose to process only one
> +			 * real I/O at a time in that case.
> +			 */
> +			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
> 		}
>
> This is always issuing smgrprefetch starting with the input blockNum.
> Shouldn't we pass the first blockNum that we did not find in the buffer
> pool? So basically, in the loop above this call where we are doing
> PrepareReadBuffer(), we should track the first blockNum for which
> 'found' is not true and pass that blockNum into smgrprefetch() as the
> first block, right?

I think you'd be right if StartReadBuffers() were capable of processing a sequence consisting of a hit followed by misses, but currently it always gives up after the first hit. That is, it always processes some number of misses (0-16) and then at most one hit. So for now the variable would always turn out to be the same as blockNum.

The reason is that I wanted to allow "full sized" read system calls to form. If you said "hey please read these 16 blocks" (I'm calling that "full sized", AKA MAX_BUFFERS_PER_TRANSFER), and it found 2 hits, then it could only form a read of 14 blocks, but there might be more blocks that could be read after those. We would have some arbitrary shorter read system calls, when we wanted to make them all as big as possible. So in the current patch you say "hey please read these 16 blocks" and it returns saying "only read 1", you call again with 15 and it says "only read 1", and you call again and it says "read 16!" (assuming 2 more were readable after the original range we started with). Then physical reads are maximised. Maybe there is some nice way to solve that, but I thought this way was the simplest (and if there is some instruction-cache-locality/tight-loop/perf reason why we should work harder to find ranges of hits, it could be for later). Does that make sense?
On Tue, Mar 12, 2024 at 7:40 PM Thomas Munro <thomas.munro@gmail.com> wrote: > possible. So in the current patch you say "hey please read these 16 > blocks" and it returns saying "only read 1", you call again with 15 Oops, typo worth correcting: s/15/16/. Point being that the caller is interested in more blocks after the original 16, so it uses 16 again when it calls back (because that's the size of the Buffer array it provides).
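To illustrate that calling convention, a consumer might loop like this hedged sketch; the argument list follows the shape quoted later in the thread, but the details are approximate and not the patch's actual code:

/*
 * Hedged sketch of a caller draining a run of blocks through the
 * two-step API; bmr, forkNum, strategy and flags are assumed to be
 * set up elsewhere, and the exact argument list may differ.
 */
BlockNumber blocknum = first;

while (blocknum < end)
{
	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
	int			nblocks = Min(end - blocknum, MAX_BUFFERS_PER_TRANSFER);
	ReadBuffersOperation operation;

	/*
	 * Ask for up to 16 blocks.  On return, nblocks says how many were
	 * actually processed: possibly just one, on a hit.
	 */
	if (StartReadBuffers(bmr, buffers, forkNum, blocknum, &nblocks,
						 strategy, flags, &operation))
		WaitReadBuffers(&operation);

	/* ... consume buffers[0..nblocks-1] here ... */
	blocknum += nblocks;
}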
On Tue, Mar 12, 2024 at 12:10 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> I think you'd be right if StartReadBuffers() were capable of
> processing a sequence consisting of a hit followed by misses, but
> currently it always gives up after the first hit. That is, it always
> processes some number of misses (0-16) and then at most one hit. So
> for now the variable would always turn out to be the same as blockNum.

Okay, then shouldn't this "if (found)" block immediately break the loop, so that when we hit a block we just return that block? What you explained makes sense, but with the current code, if there are a few hits first followed by misses, then we will issue smgrprefetch() for the initial hit blocks as well.

+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit. It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range. We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = operation->nblocks = *nblocks = i + 1;

(Dilip: I think we should break after this?)

+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			operation->io_buffers_len++;
+		}
+	}

> The reason is that I wanted to allow "full sized" read system calls
> to form. If you said "hey please read these 16 blocks" (I'm calling
> that "full sized", AKA MAX_BUFFERS_PER_TRANSFER), and it found 2 hits,
> then it could only form a read of 14 blocks, but there might be more
> blocks that could be read after those. We would have some arbitrary
> shorter read system calls, when we wanted to make them all as big as
> possible. So in the current patch you say "hey please read these 16
> blocks" and it returns saying "only read 1", you call again with 15
> and it says "only read 1", and you call again and it says "read 16!"
> (assuming 2 more were readable after the original range we started
> with). Then physical reads are maximised. Maybe there is some nice
> way to solve that, but I thought this way was the simplest (and if
> there is some instruction-cache-locality/tight-loop/perf reason why we
> should work harder to find ranges of hits, it could be for later).
> Does that make sense?

Understood, I think this makes sense.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Mar 12, 2024 at 11:39 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> + actual_nblocks = operation->nblocks = *nblocks = i + 1;
> (Dilip: I think we should break after this?)

In the next loop iteration, i < actual_nblocks is false, so the loop terminates. But yeah, that was a bit obscure, so I have added an explicit break.

Here is also a new version of the streaming_read.c patch. This is based on feedback from the bitmap heap scan thread, where Tomas and Melanie noticed some problems when comparing effective_io_concurrency = 0 and 1 against master. The attached random.sql exercises a bunch of random scans with different settings, and random.txt shows the resulting system calls, unpatched vs patched. Looking at it that way, I was able to tweak the coding until I had the same behaviour as master, except with fewer system calls wherever possible due to coalescing or suppressing sequential advice.
Attachment
- random.txt
- v7-0001-Provide-vectored-variant-of-ReadBuffer.patch
- v7-0002-Provide-API-for-streaming-reads-of-relations.patch
- v7-0003-Use-streaming-reads-in-pg_prewarm.patch
- v7-0004-WIP-Use-streaming-reads-in-heapam-scans.patch
- v7-0005-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patch
- v7-0006-WIP-Use-streaming-reads-in-bitmap-heapscan.patch
- v7-0007-WIP-pg_buffercache_invalidate.patch
- random.sql
I am planning to push the bufmgr.c patch soon. At that point the new API won't have any direct callers yet, but the traditional ReadBuffer() family of functions will internally reach StartReadBuffers(nblocks=1) followed by WaitReadBuffers(), ZeroBuffer() or nothing as appropriate. Any more thoughts or objections? Naming, semantics, correctness of buffer protocol, sufficiency of comments, something else?
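In other words, the traditional entry point becomes a thin wrapper over the new API. Here is a hedged sketch of that shape; the argument lists and the ZeroBuffer() call are guesses for illustration, not the committed code:

/*
 * Hedged sketch of ReadBuffer()'s new internals as described above:
 * StartReadBuffers(nblocks = 1) followed by WaitReadBuffers(),
 * ZeroBuffer() or nothing, as appropriate for the mode.
 */
static Buffer
read_buffer_sketch(BufferManagerRelation bmr, ForkNumber forkNum,
				   BlockNumber blockNum, ReadBufferMode mode,
				   BufferAccessStrategy strategy)
{
	Buffer		buffer;
	int			nblocks = 1;
	ReadBuffersOperation operation;

	if (StartReadBuffers(bmr, &buffer, forkNum, blockNum, &nblocks,
						 strategy, 0, &operation))
	{
		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
			ZeroBuffer(buffer, mode);	/* caller wants a zeroed page */
		else
			WaitReadBuffers(&operation);	/* actually perform the read */
	}
	/* else: it was a hit, nothing more to do */
	return buffer;
}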
Hi,

On Sat, 16 Mar 2024 at 02:53, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> I am planning to push the bufmgr.c patch soon. At that point the new
> API won't have any direct callers yet, but the traditional
> ReadBuffer() family of functions will internally reach
> StartReadBuffers(nblocks=1) followed by WaitReadBuffers(),
> ZeroBuffer() or nothing as appropriate. Any more thoughts or
> objections? Naming, semantics, correctness of buffer protocol,
> sufficiency of comments, something else?

+	if (StartReadBuffers(bmr,
+						 &buffer,
+						 forkNum,
+						 blockNum,
+						 &nblocks,
+						 strategy,
+						 flags,
+						 &operation))
+		WaitReadBuffers(&operation);

I think we need to call WaitReadBuffers() only when 'mode != RBM_ZERO_AND_CLEANUP_LOCK && mode != RBM_ZERO_AND_LOCK', or am I missing something?

Couple of nitpicks:

It would be nice to explain what the PrepareReadBuffer() function does with a comment.

+	if (nblocks == 0)
+		return;					/* nothing to do */

It is guaranteed that nblocks will be bigger than 0. Can't we just use Assert(operation->io_buffers_len > 0)?

-- Regards, Nazir Bilal Yavuz Microsoft
Some quick comments:

On 12/03/2024 15:02, Thomas Munro wrote:
> src/backend/storage/aio/streaming_read.c
> src/include/storage/streaming_read.h

Standard file header comments missing.

It would be nice to have a comment at the top of streaming_read.c, explaining at a high level how the circular buffer, lookahead and all that works. Maybe even some diagrams. For example, what is head and what is tail? Before reading the code, I assumed that 'head' was the next block range to return in pg_streaming_read_buffer_get_next(). But I think it's actually the other way round?

> /*
>  * Create a new streaming read object that can be used to perform the
>  * equivalent of a series of ReadBuffer() calls for one fork of one relation.
>  * Internally, it generates larger vectored reads where possible by looking
>  * ahead.
>  */
> PgStreamingRead *
> pg_streaming_read_buffer_alloc(int flags,
>                                void *pgsr_private,
>                                size_t per_buffer_data_size,
>                                BufferAccessStrategy strategy,
>                                BufferManagerRelation bmr,
>                                ForkNumber forknum,
>                                PgStreamingReadBufferCB next_block_cb)

I'm not a fan of the name, especially the 'alloc' part. Yeah, most of the work it does is memory allocation. But I'd suggest something like 'pg_streaming_read_begin' instead.

Do we really need the pg_ prefix in these?

> Buffer
> pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)

Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next', for a shorter name.

> /*
>  * pgsr->ranges is a circular buffer. When it is empty, head == tail.
>  * When it is full, there is an empty element between head and tail. Head
>  * can also be empty (nblocks == 0), therefore we need two extra elements
>  * for non-occupied ranges, on top of max_pinned_buffers to allow for the
>  * maximum possible number of occupied ranges of the smallest possible
>  * size of one.
>  */
> size = max_pinned_buffers + 2;

I didn't understand this explanation for why it's + 2.

> /*
>  * Skip the initial ramp-up phase if the caller says we're going to be
>  * reading the whole relation. This way we start out doing full-sized
>  * reads.
>  */
> if (flags & PGSR_FLAG_FULL)
>     pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
> else
>     pgsr->distance = 1;

Should this be "Max(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So perhaps just 'pgsr->distance = pgsr->max_pinned_buffers'?

-- Heikki Linnakangas Neon (https://neon.tech)
On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 12/03/2024 15:02, Thomas Munro wrote:
> > src/backend/storage/aio/streaming_read.c
> > src/include/storage/streaming_read.h
>
> Standard file header comments missing.

Fixed.

> It would be nice to have a comment at the top of streaming_read.c,
> explaining at a high level how the circular buffer, lookahead and all
> that works. Maybe even some diagrams.

Done.

> For example, what is head and what is tail? Before reading the code, I
> assumed that 'head' was the next block range to return in
> pg_streaming_read_buffer_get_next(). But I think it's actually the other
> way round?

Yeah. People seem to have different natural intuitions about head vs tail in this sort of thing, so I've switched to descriptive names: stream->{oldest,next}_buffer_index (see also below).

> > /*
> >  * Create a new streaming read object that can be used to perform the
> >  * equivalent of a series of ReadBuffer() calls for one fork of one relation.
> >  * Internally, it generates larger vectored reads where possible by looking
> >  * ahead.
> >  */
> > PgStreamingRead *
> > pg_streaming_read_buffer_alloc(int flags,
> >                                void *pgsr_private,
> >                                size_t per_buffer_data_size,
> >                                BufferAccessStrategy strategy,
> >                                BufferManagerRelation bmr,
> >                                ForkNumber forknum,
> >                                PgStreamingReadBufferCB next_block_cb)
>
> I'm not a fan of the name, especially the 'alloc' part. Yeah, most of
> the work it does is memory allocation. But I'd suggest something like
> 'pg_streaming_read_begin' instead.

I like it. Done.

> Do we really need the pg_ prefix in these?

Good question. My understanding of our convention is that pg_ is needed for local replacements/variants/extensions of things with well known names (pg_locale_t, pg_strdup(), yada yada), and perhaps also in a few places where the word is very common/short and we want to avoid collisions and make sure it's obviously ours (pg_popcount?), and I guess places that reflect the name of a SQL identifier with a prefix, but this doesn't seem to qualify for any of those things. It's a new thing, our own thing entirely, and sufficiently distinctive and unconfusable with standard stuff. So, prefix removed.

Lots of other patches on top of this one are using "pgsr" as a variable name, ie containing that prefix; perhaps they would use "sr" or "streaming_read" or "stream". I used "stream" in a few places in this version.

Other names improved in this version IMHO: pgsr_private -> callback_private. I find it clearer, as a way to indicate that the provider of the callback "owns" it. I also reordered the arguments: now it's streaming_read_buffer_begin(..., callback, callback_private, per_buffer_data_size), to keep those three things together.

> > Buffer
> > pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
>
> Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next',
> for a shorter name.

Hmm. The idea of 'buffer' appearing in a couple of names is that there are conceptually other kinds of I/O that we might want to stream, like raw files or buffers other than the buffer pool, maybe even sockets, so this would be part of a family of similar interfaces. I think it needs to be clear that this variant gives you buffers. I'm OK with removing "get", but I guess it would be better to keep the words in the same order across the three functions? What about these?

streaming_read_buffer_begin();
streaming_read_buffer_next();
streaming_read_buffer_end();

Tried like that in this version.
Other ideas would be to make "stream" the main noun: buffered_read_stream_begin() or something. Ideas welcome.

It's also a bit grammatically weird to say StartReadBuffers() and WaitReadBuffers() in the bufmgr API... Hmm. Perhaps we should just call it ReadBuffers() and WaitForBufferIO()? Maybe surprising, because the former isn't just like ReadBuffer() ... but on the other hand no one said it has to be, and sometimes it even is (when it gets a hit). I suppose there could even be a flag READ_BUFFERS_WAIT, or the opposite, to make the asynchrony optional or explicit if someone has a problem with that. (Hmm, that'd be a bit like the Windows native file API, where ReadFile() is synchronous or asynchronous depending on flags.)

> > /*
> >  * pgsr->ranges is a circular buffer. When it is empty, head == tail.
> >  * When it is full, there is an empty element between head and tail. Head
> >  * can also be empty (nblocks == 0), therefore we need two extra elements
> >  * for non-occupied ranges, on top of max_pinned_buffers to allow for the
> >  * maximum possible number of occupied ranges of the smallest possible
> >  * size of one.
> >  */
> > size = max_pinned_buffers + 2;
>
> I didn't understand this explanation for why it's + 2.

I think the logic was right but the explanation was incomplete, sorry. It needed one gap between head and tail because head == tail means empty, and another because the tail item was still 'live' for one extra call, until you examined it one more time to notice that all the buffers had been extracted. In any case, I have now deleted all that. In this new version I don't need extra space between head and tail at all, because empty is detected with stream->pinned_buffers == 0, and the tail moves forwards immediately when you consume buffers.

> > /*
> >  * Skip the initial ramp-up phase if the caller says we're going to be
> >  * reading the whole relation. This way we start out doing full-sized
> >  * reads.
> >  */
> > if (flags & PGSR_FLAG_FULL)
> >     pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
> > else
> >     pgsr->distance = 1;
>
> Should this be "Max(MAX_BUFFERS_PER_TRANSFER,
> pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than
> MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So
> perhaps just 'pgsr->distance = pgsr->max_pinned_buffers'?

Right, done.

Here are some other changes:

* I'm fairly happy with the ABC adaptive distance algorithm so far, I think, but I spent more time tidying up the way it is implemented. I didn't like the way each 'range' had buffer[MAX_BUFFERS_PER_TRANSFER], so I created a new dense array stream->buffers that behaved as a second circular queue.

* The above also made it trivial for MAX_BUFFERS_PER_TRANSFER to become the GUC that it always wanted to be: buffer_io_size, defaulting to 128kB. Seems like a reasonable thing to have? Could also influence things like bulk write? (The main problem I have with the GUC currently is choosing a category; async resources is wrong....)

* By analogy, it started to look a bit funny that each range had room for a ReadBuffersOperation, and we had enough ranges for max_pinned_buffers * 1 block range. So I booted that out to another dense array, of size max_ios.
* At the same time, Bilal and Andres had been complaining privately about 'range' management overheads showing up in perf and creating a regression against master on fully cached scans that do nothing else (eg pg_prewarm, where we look up, pin and unpin every page and do no I/O and no CPU work with the page -- a somewhat extreme case, but a reasonable way to isolate the management costs); having made the above change, it suddenly seemed obvious that I should make the buffers array the 'main' circular queue, pointing off to another place for the information required for dealing with misses. In this version, there are no more range objects. This feels better and occupies and touches less memory. See pictures below.

* The 'head range' is replaced by the pending_read_{blocknum,nblocks} variables, which seems easier to understand. Essentially the callback's block numbers are run-length compressed into there until they can't be, at which point we start a read and start forming a new pending read.

* A micro-optimisation arranges for the zero slots to be reused over and over again if we're doing distance = 1, to avoid rotating through memory for no benefit; I don't yet know if that pays for itself, it's just a couple of lines...

* Various indexes and sizes that couldn't quite fit in uint8_t, but couldn't possibly exceed a few thousand because they are bounded by numbers deriving from range-limited GUCs, are now int16_t (while I was looking for low-hanging opportunities to reduce memory usage...).

In pictures, the circular queue arrangement changed from max_pinned_buffers * fat range objects like this, where only the per-buffer data was outside the range object (because its size is variable), and in the worst all-hit case there were a lot of ranges with only one buffer in them:

            ranges                         buf/data
            +--------+--------------+----+        +-----+
            |        |              |    |        |     |
            +--------+--------------+----+        +-----+
    tail -> | 10..10 | buffers[MAX] | op +------->|  ?  |
            +--------+--------------+----+        +-----+
            | 42..44 | buffers[MAX] | op +------->|  ?  |
            +--------+--------------+----+        +-----+
            | 60..60 | buffers[MAX] | op +--+     |  ?  |
            +--------+--------------+----+  |     +-----+
    head -> |        |              |    |  |     |  ?  |
            +--------+--------------+----+  |     +-----+
            |        |              |    |  +---->|  ?  |
            +--------+--------------+----+        +-----+
            |        |              |    |        |     |
            +--------+--------------+----+        +-----+

... to something that you might call a "columnar" layout, where ops are kicked out to their own array of size max_ios (a much smaller number), and the buffers, per-buffer data and indexes pointing to optional I/O objects are in parallel arrays of size max_pinned_buffers, like this:

                          buffers   buf/data   buf/io        ios (= ops)
                          +----+    +-----+    +---+         +--------+
                          |    |    |     |    |   |   +---->| 42..44 |
                          +----+    +-----+    +---+   |     +--------+
  oldest_buffer_index ->  | 10 |    |  ?  |    |   |   | +-->| 60..60 |
                          +----+    +-----+    +---+   | |   +--------+
                          | 42 |    |  ?  |    | 0 +---+ |   |        |
                          +----+    +-----+    +---+     |   +--------+
                          | 43 |    |  ?  |    |   |     |
                          +----+    +-----+    +---+     |
                          | 44 |    |  ?  |    |   |     |
                          +----+    +-----+    +---+     |
                          | 60 |    |  ?  |    | 1 +-----+
                          +----+    +-----+    +---+
    next_buffer_index ->  |    |    |     |    |   |
                          +----+    +-----+    +---+

In other words, there is essentially no waste/padding now, and in the all-hit case we stay in the zeroth elements of those arrays so they can stay red hot.

Still working on validating this refactoring with other patches and test scenarios. I hope it's easier to understand, and does a better job of explaining itself.
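In struct form, the "columnar" arrangement in the second picture might look roughly like this hedged sketch; the member names are approximate, not the patch's exact definitions:

/*
 * Rough sketch of the "columnar" stream state pictured above; member
 * names and types are approximate.
 */
struct stream_sketch
{
	/* Circular queue cursors over the parallel arrays below. */
	int16		oldest_buffer_index;	/* next buffer to hand out */
	int16		next_buffer_index;		/* next free slot */

	/* In-progress reads, bounded by max_ios (a small number). */
	ReadBuffersOperation *ios;

	/* Parallel arrays of size max_pinned_buffers. */
	int16	   *buffer_io_indexes;	/* -1, or an index into ios[] */
	char	   *per_buffer_data;	/* per_buffer_data_size bytes each */
	Buffer	   *buffers;
};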
I'm also still processing a bunch of performance-related fixups mostly for bufmgr.c sent by Andres off-list (things like: StartReadBuffer() argument list is too wide, some things need inline, we should only initialise the op if it will be needed, oh I squashed that last one into the patch already), after he and Bilal studied some regressions in cases with no I/O. And thinking about Bilal's earlier message (extra read even when we're going to zero, oops, he's quite right about that) and a patch he sent me for that. More on those soon.
Attachment
On Sun, Mar 24, 2024 at 9:02 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 12/03/2024 15:02, Thomas Munro wrote: > > > src/backend/storage/aio/streaming_read.c > > > src/include/storage/streaming_read.h > > > > Standard file header comments missing. > > Fixed. > > > It would be nice to have a comment at the top of streaming_read.c, > > explaining at a high level how the circular buffer, lookahead and all > > that works. Maybe even some diagrams. > > Done. > > > For example, what is head and what is tail? Before reading the code, I > > assumed that 'head' was the next block range to return in > > pg_streaming_read_buffer_get_next(). But I think it's actually the other > > way round? > > Yeah. People seem to have different natural intuitions about head vs > tail in this sort of thing, so I've switched to descriptive names: > stream->{oldest,next}_buffer_index (see also below). > > > > /* > > > * Create a new streaming read object that can be used to perform the > > > * equivalent of a series of ReadBuffer() calls for one fork of one relation. > > > * Internally, it generates larger vectored reads where possible by looking > > > * ahead. > > > */ > > > PgStreamingRead * > > > pg_streaming_read_buffer_alloc(int flags, > > > void *pgsr_private, > > > size_t per_buffer_data_size, > > > BufferAccessStrategy strategy, > > > BufferManagerRelation bmr, > > > ForkNumber forknum, > > > PgStreamingReadBufferCB next_block_cb) > > > > I'm not a fan of the name, especially the 'alloc' part. Yeah, most of > > the work it does is memory allocation. But I'd suggest something like > > 'pg_streaming_read_begin' instead. > > I like it. Done. > > > Do we really need the pg_ prefix in these? > > Good question. My understanding of our convention is that pg_ is > needed for local replacements/variants/extensions of things with well > known names (pg_locale_t, pg_strdup(), yada yada), and perhaps also in > a few places where the word is very common/short and we want to avoid > collisions and make sure it's obviously ours (pg_popcount?), and I > guess places that reflect the name of a SQL identifier with a prefix, > but this doesn't seem to qualify for any of those things. It's a new > thing, our own thing entirely, and sufficiently distinctive and > unconfusable with standard stuff. So, prefix removed. > > Lots of other patches on top of this one are using "pgsr" as a > variable name, ie containing that prefix; perhaps they would use "sr" > or "streaming_read" or "stream". I used "stream" in a few places in > this version. > > Other names improved in this version IMHO: pgsr_private -> > callback_private. I find it clearer, as a way to indicate that the > provider of the callback "owns" it. I also reordered the arguments: > now it's streaming_read_buffer_begin(..., callback, callback_private, > per_buffer_data_size), to keep those three things together. I haven't reviewed the whole patch, but as I was rebasing bitmapheapscan streaming read user, I found callback_private confusing because it seems like it is a private callback, not private data belonging to the callback. Perhaps call it callback_private_data? Also maybe mention what it is for in the comment above streaming_read_buffer_begin() and in the StreamingRead structure itself. - Melanie
On Mon, Mar 25, 2024 at 6:30 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > I haven't reviewed the whole patch, but as I was rebasing > bitmapheapscan streaming read user, I found callback_private confusing > because it seems like it is a private callback, not private data > belonging to the callback. Perhaps call it callback_private_data? Also WFM. > maybe mention what it is for in the comment above > streaming_read_buffer_begin() and in the StreamingRead structure > itself. Yeah. I've tried to improve the comments on all three public functions. I also moved the three public functions _begin(), _next(), _end() to be next to each other after the static helper functions. Working on perf regression/tuning reports today, more soon...
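Put together, the consumer-facing shape under discussion looks something like this hedged sketch; the callback signature, argument order and MyScanState type are illustrative assumptions, not the patch's actual definitions:

/*
 * Hedged sketch of a streaming read consumer, using the names settled
 * on above; signatures and argument order are illustrative.
 */
static BlockNumber
next_block_cb(PgStreamingRead *stream, void *callback_private_data,
			  void *per_buffer_data)
{
	MyScanState *scan = callback_private_data;	/* hypothetical state */

	if (scan->next_block >= scan->last_block)
		return InvalidBlockNumber;	/* end of stream */
	return scan->next_block++;
}

/* ... then, in the scan code ... */
stream = streaming_read_buffer_begin(0, strategy, bmr, forknum,
									 next_block_cb, scan, 0);
while ((buffer = streaming_read_buffer_next(stream, NULL)) != InvalidBuffer)
{
	/* examine the page, then drop the pin */
	ReleaseBuffer(buffer);
}
streaming_read_buffer_end(stream);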
Attachment
On 24/03/2024 15:02, Thomas Munro wrote: > On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >> Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next', >> for a shorter name. > > Hmm. The idea of 'buffer' appearing in a couple of names is that > there are conceptually other kinds of I/O that we might want to > stream, like raw files or buffers other than the buffer pool, maybe > even sockets, so this would be part of a family of similar interfaces. > I think it needs to be clear that this variant gives you buffers. I'm > OK with removing "get" but I guess it would be better to keep the > words in the same order across the three functions? What about these? > > streaming_read_buffer_begin(); > streaming_read_buffer_next(); > streaming_read_buffer_end(); > > Tried like that in this version. Other ideas would be to make > "stream" the main noun, buffered_read_stream_begin() or something. > Ideas welcome. Works for me, although "streaming_read_buffer" is a pretty long prefix. The flags like "STREAMING_READ_MAINTENANCE" probably ought to be "STREAMING_READ_BUFFER_MAINTENANCE" as well. Maybe "buffer_stream_next()"? > Here are some other changes: > > * I'm fairly happy with the ABC adaptive distance algorithm so far, I > think, but I spent more time tidying up the way it is implemented. I > didn't like the way each 'range' had buffer[MAX_BUFFERS_PER_TRANSFER], > so I created a new dense array stream->buffers that behaved as a > second circular queue. > > * The above also made it trivial for MAX_BUFFERS_PER_TRANSFER to > become the GUC that it always wanted to be: buffer_io_size defaulting > to 128kB. Seems like a reasonable thing to have? Could also > influence things like bulk write? (The main problem I have with the > GUC currently is choosing a category, async resources is wrong....) > > * By analogy, it started to look a bit funny that each range had room > for a ReadBuffersOperation, and we had enough ranges for > max_pinned_buffers * 1 block range. So I booted that out to another > dense array, of size max_ios. > > * At the same time, Bilal and Andres had been complaining privately > about 'range' management overheads showing up in perf and creating a > regression against master on fully cached scans that do nothing else > (eg pg_prewarm, where we lookup, pin, unpin every page and do no I/O > and no CPU work with the page, a somewhat extreme case but a > reasonable way to isolate the management costs); having made the above > change, it suddenly seemed obvious that I should make the buffers > array the 'main' circular queue, pointing off to another place for > information required for dealing with misses. In this version, there > are no more range objects. This feels better and occupies and touches > less memory. See pictures below. +1 for all that. Much better! > * Various indexes and sizes that couldn't quite fit in uint8_t but > couldn't possibly exceed a few thousand because they are bounded by > numbers deriving from range-limited GUCs are now int16_t (while I was > looking for low hanging opportunities to reduce memory usage...) Is int16 enough though? It seems so, because: max_pinned_buffers = Max(max_ios * 4, buffer_io_size); and max_ios is constrained by the GUC's maximum MAX_IO_CONCURRENCY, and buffer_io_size is constrained by MAX_BUFFER_IO_SIZE == PG_IOV_MAX == 32. If someone changes those constants though, int16 might overflow and fail in weird ways. 
I'd suggest being more careful here and explicitly clamp max_pinned_buffers at PG_INT16_MAX, or have a static assertion or something. (I think it needs to be somewhat less than PG_INT16_MAX, because of the extra "overflow buffers" stuff and some other places where you do arithmetic.)

> /*
>  * We gave a contiguous range of buffer space to StartReadBuffers(), but
>  * we want it to wrap around at max_pinned_buffers. Move values that
>  * overflowed into the extra space. At the same time, put -1 in the I/O
>  * slots for the rest of the buffers to indicate no I/O. They are covered
>  * by the head buffer's I/O, if there is one. We avoid a % operator.
>  */
> overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
> if (overflow > 0)
> {
>     memmove(&stream->buffers[0],
>             &stream->buffers[stream->max_pinned_buffers],
>             sizeof(stream->buffers[0]) * overflow);
>     for (int i = 0; i < overflow; ++i)
>         stream->buffer_io_indexes[i] = -1;
>     for (int i = 1; i < nblocks - overflow; ++i)
>         stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
> }
> else
> {
>     for (int i = 1; i < nblocks; ++i)
>         stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
> }

Instead of clearing buffer_io_indexes here, it might be cheaper/simpler to initialize the array to -1 in streaming_read_buffer_begin(), and reset buffer_io_indexes[io_index] = -1 in streaming_read_buffer_next(), after the WaitReadBuffers() call. In other words, except when an I/O is in progress, keep all the elements at -1, even the elements that are not currently in use.

Alternatively, you could remember the first buffer that the I/O applies to in the 'ios' array. In other words, instead of pointing from buffer to the I/O that it depends on, point from the I/O to the buffer that depends on it. The last attached patch implements that approach. I'm not wedded to it, but it feels a little simpler.

> if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
> {
>     /* Distance ramps up fast (behavior C). */
>     ...
> }
> else
> {
>     /* No advice; move towards full I/O size (behavior B). */
>     ...
> }

The comment on ReadBuffersOperation says "Declared in public header only to allow inclusion in other structs, but contents should not be accessed", but here you access the 'flags' field.

You also mentioned that the StartReadBuffers() argument list is too long. Perhaps the solution is to redefine ReadBuffersOperation so that it consists of two parts: the first part is filled in by the caller and contains the arguments, and the second part is private to bufmgr.c. The signature for StartReadBuffers() would then be just:

bool StartReadBuffers(ReadBuffersOperation *operation);

That would make it OK to read the 'flags' field. It would also allow reusing the same ReadBuffersOperation struct for multiple I/Os for the same relation; you only need to change the changing parts of the struct on each operation.

In the attached patch set, the first three patches are your v9 with no changes. The last patch refactors away 'buffer_io_indexes' like I mentioned above. The others are fixes for some other trivial things that caught my eye.

-- Heikki Linnakangas Neon (https://neon.tech)
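For concreteness, the two-part struct described above might be sketched like this; the field names are illustrative, grounded only in the argument list discussed in this thread:

/*
 * Hedged sketch of a two-part ReadBuffersOperation; field names are
 * illustrative, not a definitive layout.
 */
typedef struct ReadBuffersOperation
{
	/* Filled in by the caller before calling StartReadBuffers(). */
	BufferManagerRelation bmr;
	ForkNumber	forknum;
	BufferAccessStrategy strategy;
	Buffer	   *buffers;
	BlockNumber blocknum;
	int			nblocks;
	int			flags;

	/* Everything below here is private to bufmgr.c. */
	int			io_buffers_len;
} ReadBuffersOperation;

extern bool StartReadBuffers(ReadBuffersOperation *operation);
extern void WaitReadBuffers(ReadBuffersOperation *operation);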
Attachment
- v9.heikki-0001-Provide-vectored-variant-of-ReadBuffer.patch
- v9.heikki-0002-Provide-API-for-streaming-reads-of-relatio.patch
- v9.heikki-0003-Use-streaming-reads-in-pg_prewarm.patch
- v9.heikki-0004-Cleanup-boilerplate-file-header-comments.patch
- v9.heikki-0005-Tidy-up-includes.patch
- v9.heikki-0006-Mention-that-STREAMING_READ_MAINTENANCE-is.patch
- v9.heikki-0007-Trivial-comment-fixes.patch
- v9.heikki-0008-Use-wipe_mem-for-callback-data-that-should.patch
- v9.heikki-0009-Replace-buffer_io_indexes-array-with-ref-f.patch
On Wed, Mar 27, 2024 at 1:40 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > Is int16 enough though? It seems so, because: > > max_pinned_buffers = Max(max_ios * 4, buffer_io_size); > > and max_ios is constrained by the GUC's maximum MAX_IO_CONCURRENCY, and > buffer_io_size is constrained by MAX_BUFFER_IO_SIZE == PG_IOV_MAX == 32. > > If someone changes those constants though, int16 might overflow and fail > in weird ways. I'd suggest being more careful here and explicitly clamp > max_pinned_buffers at PG_INT16_MAX or have a static assertion or > something. (I think it needs to be somewhat less than PG_INT16_MAX, > because of the extra "overflow buffers" stuff and some other places > where you do arithmetic.) Clamp added. > > /* > > * We gave a contiguous range of buffer space to StartReadBuffers(), but > > * we want it to wrap around at max_pinned_buffers. Move values that > > * overflowed into the extra space. At the same time, put -1 in the I/O > > * slots for the rest of the buffers to indicate no I/O. They are covered > > * by the head buffer's I/O, if there is one. We avoid a % operator. > > */ > > overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers; > > if (overflow > 0) > > { > > memmove(&stream->buffers[0], > > &stream->buffers[stream->max_pinned_buffers], > > sizeof(stream->buffers[0]) * overflow); > > for (int i = 0; i < overflow; ++i) > > stream->buffer_io_indexes[i] = -1; > > for (int i = 1; i < nblocks - overflow; ++i) > > stream->buffer_io_indexes[stream->next_buffer_index + i] = -1; > > } > > else > > { > > for (int i = 1; i < nblocks; ++i) > > stream->buffer_io_indexes[stream->next_buffer_index + i] = -1; > > } > > Instead of clearing buffer_io_indexes here, it might be cheaper/simpler > to initialize the array to -1 in streaming_read_buffer_begin(), and > reset buffer_io_indexes[io_index] = -1 in streaming_read_buffer_next(), > after the WaitReadBuffers() call. In other words, except when an I/O is > in progress, keep all the elements at -1, even the elements that are not > currently in use. Yeah that wasn't nice and I had already got as far as doing exactly that ↑ on my own, but your second idea ↓ is better! > Alternatively, you could remember the first buffer that the I/O applies > to in the 'ios' array. In other words, instead of pointing from buffer > to the I/O that it depends on, point from the I/O to the buffer that > depends on it. The last attached patch implements that approach. I'm not > wedded to it, but it feels a little simpler. Yeah, nice improvement. > > if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE) > > { > > /* Distance ramps up fast (behavior C). */ > > ... > > } > > else > > { > > /* No advice; move towards full I/O size (behavior B). */ > > ... > > } > > The comment on ReadBuffersOperation says "Declared in public header only > to allow inclusion in other structs, but contents should not be > accessed", but here you access the 'flags' field. > > You also mentioned that the StartReadBuffers() argument list is too > long. Perhaps the solution is to redefine ReadBuffersOperation so that > it consists of two parts: 1st part is filled in by the caller, and > contains the arguments, and 2nd part is private to bufmgr.c. The > signature for StartReadBuffers() would then be just: > > bool StartReadBuffers(ReadBuffersOperation *operation); Yeah. I had already got as far as doing this on the regression hunting expedition, but I kept some arguments for frequently changing things, eg blocknum. 
It means that the stuff that never changes is in there, and the stuff that changes each time doesn't have to be written to memory at all.

> That would make it OK to read the 'flags' field. It would also allow
> reusing the same ReadBuffersOperation struct for multiple I/Os for the
> same relation; you only need to change the changing parts of the struct
> on each operation.

Right. Done.

> In the attached patch set, the first three patches are your v9 with no
> changes. The last patch refactors away 'buffer_io_indexes' like I
> mentioned above. The others are fixes for some other trivial things that
> caught my eye.

Thanks, all squashed into the patch.

In an offline chat with Robert and Andres, we searched for a better name for the GUC. We came up with "io_combine_limit". It's easier to document a general purpose limit than to explain what "buffer_io_size" does (particularly since the set of affected features will grow over time but starts so small). I'll feel better about using it to control bulk writes too, with that name.

I collapsed the various memory allocations into one palloc. The buffer array is now a buffers[FLEXIBLE_ARRAY_MEMBER].

I got rid of "finished" (now represented by distance == 0; I was removing branches and variables). I got rid of "started", which can now be deduced (it was used for suppressing advice when you're calling _next() because you need a block and we need to read it immediately); see the function argument suppress_advice.

Here is a new proposal for the names, updated in v10:

read_stream_begin_relation()
read_stream_next_buffer()
read_stream_end()

I think we'll finish up with different 'constructor' functions for different kinds of streams. For example, I already want one that can provide a multi-relation callback for use by recovery (shown in v1). Others might exist for raw file access, etc. The defining characteristic of this one is that it accesses one specific relation/fork. Well, _relfork() might be more accurate but less easy on the eye. I won't be surprised if people not following this thread have ideas after commit; it has certainly happened before that something got renamed in beta, and I won't mind a bit if that happens...

I fixed a thinko in the new ReadBuffer() implementation mentioned before (thanks to Bilal for pointing this out): it didn't handle the RBM_ZERO_XXX flags properly. Well, it did work and the tests passed, but it performed a useless read first. I probably got mixed up when I removed the extended interface, which was capable of streaming zeroed buffers; this simple one isn't, and it wasn't suppressing the read as it would need to. Later I'll propose to add that back in for recovery.

I fixed a recently added thinko in the circular queue: I mistakenly thought I didn't need a spare gap between head and tail anymore because now we never compare them, since we track the number of pinned buffers instead, but after read_stream_next_buffer() returns, users of per-buffer data need that data to remain valid until the next call. So the recent refactoring work didn't survive contact with Melanie's BHS work, which uses that. We need to "wipe" the previous (spare) one, which can't possibly be in use yet, and if anyone ever sees 0x7f in their per-buffer data, it will mean that they illegally accessed the older value after a new call to read_stream_next_buffer(). Fixed, with comments to clarify.

Retesting with Melanie's latest BHS patch set and my random.sql (upthread) gives the same system call trace as before.
The intention of that was to demonstrate what exact sequences effective_io_concurrency values give. Now you can also run that with different values of the new io_combine_limit. If you run it with io_combine_limit = '8kB', it looks almost exactly like master, ie no big reads allowed; the only difference is that read_stream.c refuses to issue advice for strictly sequential reads:

effective_io_concurrency = 1, range size = 2
unpatched                              patched
==============================================================================
pread(93,...,8192,0x58000) = 8192      pread(84,...,8192,0x58000) = 8192
posix_fadvise(93,0x5a000,0x2000,...)   pread(84,...,8192,0x5a000) = 8192
pread(93,...,8192,0x5a000) = 8192      posix_fadvise(84,0xb0000,0x2000,...)
posix_fadvise(93,0xb0000,0x2000,...)   pread(84,...,8192,0xb0000) = 8192
pread(93,...,8192,0xb0000) = 8192      pread(84,...,8192,0xb2000) = 8192
posix_fadvise(93,0xb2000,0x2000,...)   posix_fadvise(84,0x108000,0x2000,...)
pread(93,...,8192,0xb2000) = 8192      pread(84,...,8192,0x108000) = 8192
posix_fadvise(93,0x108000,0x2000,...)  pread(84,...,8192,0x10a000) = 8192

You wouldn't normally see that though, as the default io_combine_limit would just merge those adjacent reads after the first one triggers ramp-up towards behaviour C:

effective_io_concurrency = 1, range size = 2
unpatched                              patched
==============================================================================
pread(93,...,8192,0x58000) = 8192      pread(80,...,8192,0x58000) = 8192
posix_fadvise(93,0x5a000,0x2000,...)   pread(80,...,8192,0x5a000) = 8192
pread(93,...,8192,0x5a000) = 8192      posix_fadvise(80,0xb0000,0x4000,...)
posix_fadvise(93,0xb0000,0x2000,...)   preadv(80,...,2,0xb0000) = 16384
pread(93,...,8192,0xb0000) = 8192      posix_fadvise(80,0x108000,0x4000,...)
posix_fadvise(93,0xb2000,0x2000,...)   preadv(80,...,2,0x108000) = 16384
pread(93,...,8192,0xb2000) = 8192      posix_fadvise(80,0x160000,0x4000,...)

I spent most of the past few days trying to regain some lost performance. Thanks to Andres for some key observations and help! That began with reports from Bilal and Melanie (possibly related to things Tomas had seen too, not sure) of regressions in all-cached workloads, which I already improved a bit with the ABC algorithm that minimised pinning for this case. That is, if there's no recent I/O so we reach what I call behaviour A, it should try to do as little magic as possible. But it turns out that wasn't enough! It is very hard to beat a tight loop that just does ReadBuffer(), ReleaseBuffer() over millions of already-cached blocks, if you have to do exactly the same work AND extra instructions for management.

There were two layers to the solution in this new version: First, I now have a special case in read_stream_next_buffer(), a sort of open-coded specialisation for behaviour A with no per-buffer data, and that got rid of most of the regression, but some remained. Next, Andres pointed out that ReadBuffer() itself, even though it is now implemented on top of StartReadBuffers(nblocks = 1), was still beating my special case code that calls StartReadBuffers(nblocks = 1), even though it looks about the same, because bufmgr.c was able to inline and specialise the latter for one block. To give streaming_read.c that power from its home inside another translation unit, we needed to export a special case singular StartReadBuffer() (investigation and patch by Andres, added as co-author). It just calls the plural function with nblocks = 1, but it gets inlined.
So now the special case for behaviour A drills through both layers, and hopefully now there is no measurable regression... need to test a bit more. Of course we can't *beat* the old code in this case, yet, but... (We speculate that a future tree-based buffer mapping table might allow efficient lookup for a range of block numbers in one go, and then it could be worth paying the book-keeping costs to find ranges. Perhaps behaviour A and the associated special case code could then be deleted, as you'd probably want to use multi-block magic all the time, both for I/O and mapping table lookups. Or something like that?)
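The exported singular function could be as simple as this hedged sketch; StartReadBuffersImpl() is a hypothetical name for the shared implementation, and the point is only that a header-visible inline lets the compiler specialise the nblocks == 1 path:

/*
 * Hedged sketch of a singular StartReadBuffer() as described above;
 * the helper name and argument list are hypothetical.
 */
static pg_attribute_always_inline bool
StartReadBuffer(ReadBuffersOperation *operation, Buffer *buffer,
				BlockNumber blocknum)
{
	int			nblocks = 1;
	bool		need_wait;

	need_wait = StartReadBuffersImpl(operation, buffer, blocknum, &nblocks);
	Assert(nblocks == 1);		/* a single-block read can't be cut short */
	return need_wait;
}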
Attachment
On Wed, Mar 27, 2024 at 10:11 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > I got rid of "finished" (now represented by distance == 0, I was > removing branches and variables). I got rid of "started", which can > now be deduced (used for suppressing advice when you're calling > _next() because you need a block and we need to read it immediately), > see the function argument suppress_advice. I started rebasing the sequential scan streaming read user over this new version, and this change (finished now represented with distance == 0) made me realize that I'm not sure what to set distance to on rescan. For sequential scan, I added a little reset function to the streaming read API (read_stream_reset()) that just releases all the buffers. Previously, it set finished to true before releasing the buffers (to indicate it was done) and then set it back to false after. Now, I'll set distance to 0 before releasing the buffers and !0 after. I could just restore whatever value distance had before I set it to 0. Or I could set it to 1. But, thinking about it, are we sure we want to ramp up in the same way on rescans? Maybe we want to use some information from the previous scan to determine what to set distance to? Maybe I'm overcomplicating it... > Here is a new proposal for the names, updated in v10: > > read_stream_begin_relation() > read_stream_next_buffer() > void read_stream_end() Personally, I'm happy with these. - Melanie
On Thu, Mar 28, 2024 at 9:43 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > For sequential scan, I added a little reset function to the streaming > read API (read_stream_reset()) that just releases all the buffers. > Previously, it set finished to true before releasing the buffers (to > indicate it was done) and then set it back to false after. Now, I'll > set distance to 0 before releasing the buffers and !0 after. I could > just restore whatever value distance had before I set it to 0. Or I > could set it to 1. But, thinking about it, are we sure we want to ramp > up in the same way on rescans? Maybe we want to use some information > from the previous scan to determine what to set distance to? Maybe I'm > overcomplicating it... I think 1 is good, as a rescan is even more likely to find the pages in cache, and if that turns out to be wrong it'll very soon adjust.
On Thu, Mar 28, 2024 at 10:52 AM Thomas Munro <thomas.munro@gmail.com> wrote: > I think 1 is good, as a rescan is even more likely to find the pages > in cache, and if that turns out to be wrong it'll very soon adjust. Hmm, no I take that back, it probably won't be due to the strategy/ring... I see your point now... when I had a separate flag, the old distance was remembered across but now I'm zapping it. I was trying to minimise the number of variables that have to be tested in the fast path by consolidating. Hmm, it is signed -- would it be too weird if we used a negative number for "finished", so we can just flip it on reset?
On Mon, Mar 25, 2024 at 2:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > > /*
> > >  * Skip the initial ramp-up phase if the caller says we're going to be
> > >  * reading the whole relation. This way we start out doing full-sized
> > >  * reads.
> > >  */
> > > if (flags & PGSR_FLAG_FULL)
> > >     pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
> > > else
> > >     pgsr->distance = 1;
> >
> > Should this be "Max(MAX_BUFFERS_PER_TRANSFER,
> > pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than
> > MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So
> > perhaps just 'pgsr->distance = pgsr->max_pinned_buffers'?
>
> Right, done.

BTW I forgot to mention that in v10 I changed my mind and debugged my way back to the original coding, which now looks like this:

/*
 * Skip the initial ramp-up phase if the caller says we're going to be
 * reading the whole relation. This way we start out assuming we'll be
 * doing full io_combine_limit sized reads (behavior B).
 */
if (flags & READ_STREAM_FULL)
    stream->distance = Min(max_pinned_buffers, io_combine_limit);
else
    stream->distance = 1;

It's not OK for distance to exceed max_pinned_buffers. But if max_pinned_buffers is huge, remember that the goal here is to reach 'behavior B', meaning wide read calls but no unnecessary extra look-ahead beyond what is needed for that, so we also don't want to exceed io_combine_limit. Therefore we want the minimum of those two numbers. In practice on a non-toy system, that's always going to be io_combine_limit.

But I'm not sure how many users of READ_STREAM_FULL there will be, and I am starting to wonder if it's a good name for the flag, or even generally very useful. It's sort of saying "I expect to do I/O, and it'll be sequential, and I won't give up until the end". But how many users can really make those claims? pg_prewarm is unusual in that it contains an explicit assumption that the cache is cold and we want to warm it up. But maybe we should just let the adaptive algorithm do its thing. It only takes a few reads to go from 1 -> io_combine_limit.

Thinking harder, if we're going to keep this and not just be fully adaptive, perhaps there should be a flag READ_STREAM_COLD, where you hint that the data is not expected to be cached, and you'd combine that with the _SEQUENTIAL hint. pg_prewarm hints _COLD | _SEQUENTIAL. Then the initial distance would be something that uses the flag combinations to select initial behavior A, B or C (and we'll quickly adjust if you're wrong):

if (!(flags & READ_STREAM_COLD))
    stream->distance = 1;
else if (flags & READ_STREAM_SEQUENTIAL)
    stream->distance = Min(max_pinned_buffers, io_combine_limit);
else
    stream->distance = max_pinned_buffers;

But probably almost all users, especially in the executor, haven't really got much of a clue what they're going to do, so they'd use the initial starting position of 1 (A) and we'd soon figure it out. Maybe overengineering for pg_prewarm is a waste of time, and we should just delete the flag instead and hard-code 1.
On Thu, Mar 28, 2024 at 2:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... In practice on a non-toy system, that's always going to be
> io_combine_limit. ...

And to be more explicit about that: you're right that we initialise max_pinned_buffers such that it's usually at least io_combine_limit, but then if you have a very small buffer pool it gets clobbered back down again by LimitAdditionalPins() and may finish up as low as 1. You're not allowed to pin more than 1/Nth of the whole buffer pool, where N is approximately max connections (well, it's not exactly that, but that's the general idea). So it's a degenerate case, but it can happen that max_pinned_buffers is lower than io_combine_limit, and then it's important not to set distance higher or you'd exceed the allowed limits (or, more likely, the circular data structure would implode).
New version with some cosmetic/comment changes, and Melanie's read_stream_reset() function merged, as required by her sequential scan user patch. I tweaked it slightly: it might as well share code with read_stream_end(). I think setting distance = 1 is fine for now, and we might later want to adjust that as we learn more about more interesting users of _reset().
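Based on the description above, the merged reset presumably reduces to something like this hedged sketch (not the actual patch code):

/*
 * Hedged sketch of read_stream_reset() as described above: unpin
 * everything the stream holds, then let look-ahead ramp up again
 * from a distance of one on the next use (rescan).
 */
void
read_stream_reset(ReadStream *stream)
{
	Buffer		buffer;

	/* Stop looking ahead. */
	stream->distance = 0;

	/* Drain and unpin whatever is in flight, sharing code with _end(). */
	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
		ReleaseBuffer(buffer);

	/* Start over with minimal look-ahead (behaviour A). */
	stream->distance = 1;
}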
Attachment
Small bug fix: the condition in the final test at the end of read_stream_look_ahead() wasn't quite right. In general when looking ahead, we don't need to start a read just because the pending read would bring us up to stream->distance if submitted now (we'd prefer to build it all the way up to size io_combine_limit if we can), but if that condition is met AND we have nothing pinned yet, then there is no chance for the read to grow bigger by a pinned buffer being consumed. Fixed, comment updated.
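In code form, the corrected test might read like this hedged sketch; the names follow the patch's terminology, but the details are approximate:

/*
 * Hedged sketch of the end-of-look-ahead condition described above.
 * Start the pending read only if it can't possibly grow any further.
 */
if (stream->pending_read_nblocks == io_combine_limit ||
	(stream->pinned_buffers == 0 &&
	 stream->pending_read_nblocks == stream->distance))
{
	/*
	 * Either the read is already full sized, or nothing is pinned yet,
	 * so no consumed buffer can make room for it to grow bigger.
	 */
	read_stream_start_pending_read(stream, suppress_advice);
}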
Attachment
On Fri, Mar 29, 2024 at 12:06 AM Thomas Munro <thomas.munro@gmail.com> wrote: > Small bug fix: the condition in the final test at the end of > read_stream_look_ahead() wasn't quite right. In general when looking > ahead, we don't need to start a read just because the pending read > would bring us up to stream->distance if submitted now (we'd prefer to > build it all the way up to size io_combine_limit if we can), but if > that condition is met AND we have nothing pinned yet, then there is no > chance for the read to grow bigger by a pinned buffer being consumed. > Fixed, comment updated. Oops, I sent the wrong/unfixed version. This version has the fix described above.
Attachment
> I spent most of the past few days trying to regain some lost
> performance. Thanks to Andres for some key observations and help!
> That began with reports from Bilal and Melanie (possibly related to
> things Tomas had seen too, not sure) of regressions in all-cached
> workloads, which I already improved a bit with the ABC algorithm that
> minimised pinning for this case. That is, if there's no recent I/O so
> we reach what I call behaviour A, it should try to do as little magic
> as possible. But it turns out that wasn't enough! It is very hard to
> beat a tight loop that just does ReadBuffer(), ReleaseBuffer() over
> millions of already-cached blocks, if you have to do exactly the same
> work AND extra instructions for management.

I got a little nerd-sniped by that, and did some micro-benchmarking of my own. I tested essentially this, with small values of 'nblocks' so that all pages are in cache:

for (int i = 0; i < niters; i++)
{
    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        buf = ReadBuffer(rel, blkno);
        ReleaseBuffer(buf);
    }
}

The results look like this (lower is better; test program and script attached):

master (213c959a29): 8.0 s
streaming-api v13:   9.5 s

This test exercises just the ReadBuffer() codepath, to check if there is a regression there. It does not exercise the new streaming APIs. So it looks like the streaming API patches add some overhead to the simple non-streaming ReadBuffer() case. This is a highly narrow micro-benchmark, of course, so even though this is a very performance-sensitive codepath, we could perhaps accept a small regression there. In any real workload, you'd at least need to take the buffer lock and read something from the page. But can we do better?

Aside from performance, I was never quite happy with the BMR_REL/BMR_SMGR stuff we introduced in PG v16. I like having one common struct like BufferManagerRelation that is used in all the functions, instead of having separate Relation and SMgrRelation variants of every function. But those macros feel a bit hacky, and we are not consistently using them in all the functions. Why is there no ReadBuffer() variant that takes a BufferManagerRelation?

The attached patch expands the use of BufferManagerRelations. The principle now is that before calling any bufmgr function, you first initialize a BufferManagerRelation struct and pass that to the function. The initialization is done by the InitBMRForRel() or InitBMRForSmgr() function, which replace the BMR_REL/BMR_SMGR macros. They are full-blown functions now because they do more setup upfront than BMR_REL/BMR_SMGR. For example, InitBMRForRel() always initializes the 'smgr' field, so that you don't need to repeat this pattern in all the other functions:

- /* Make sure our bmr's smgr and persistent are populated. */
- if (bmr.smgr == NULL)
- {
-     bmr.smgr = RelationGetSmgr(bmr.rel);
-     bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
- }

Initializing the BufferManagerRelation is still pretty cheap, so it's feasible to call it separately for every ReadBuffer() call. But you can also reuse it across calls, if you read multiple pages in a loop, for example. That saves a few cycles. The microbenchmark results with these changes:

master (213c959a29):        8.0 s
streaming-api v13:          9.5 s
bmr-refactor:               8.4 s
bmr-refactor, InitBMR once: 7.7 s

The difference between "bmr-refactor" and "InitBMR once" is that in the "InitBMR once" test, I modified the benchmark to call InitBMRForRel() just once, outside the loop.
Those numbers show the benefit of reusing the BufferManagerRelation.
The refactoring also makes the performance regression smaller, even if
you don't take advantage of the reuse.

This also moves things around a little in ReadBuffer_common() (now
called ReadBufferBMR). Instead of calling StartReadBuffer(), it calls
PinBufferForBlock() directly. I tried doing that before the other
refactorings, but that alone didn't seem to make much difference. Not
sure if it's needed, it's perhaps an orthogonal refactoring, but it's
included here nevertheless.

What do you think? The first three attached patches are your v13
patches unchanged. The fourth is the micro-benchmark I used. The last
patch is the interesting one.

PS. To be clear, I'm happy with your v13 streaming patch set as it is.
I don't think this BufferManagerRelation refactoring is a
show-stopper.

--
Heikki Linnakangas
Neon (https://neon.tech)
Attachment
On Fri, Mar 29, 2024 at 9:45 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> master (213c959a29): 8.0 s
> streaming-api v13: 9.5 s

Hmm, that's not great, and I think I know one factor that has
confounded my investigation and the conflicting reports I have
received from a couple of people: some are using meson, which defaults
to -O3, and others are using make, which gives you -O2 by default. At
-O2, GCC doesn't inline the StartReadBuffer specialisation that is
used in the "fast path", and possibly more. Some of that gap is closed
by using pg_attribute_always_inline. Clang fails to inline at any
level. So I should probably use the "always" macro there, because that
is the intention. Still processing the rest of your email...
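(As an aside for anyone following along: on GCC and Clang,
pg_attribute_always_inline expands to the attribute shown in this
minimal standalone illustration; the toy function is mine, not
PostgreSQL code.)

    #include <stdio.h>

    /*
     * At -O2, GCC's cost model may decline to inline a large static
     * function; the always_inline attribute overrides that decision.
     */
    static inline __attribute__((always_inline)) int
    add_two(int a, int b)
    {
        return a + b;
    }

    int
    main(void)
    {
        printf("%d\n", add_two(40, 2));
        return 0;
    }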
On 29/03/2024 09:01, Thomas Munro wrote:
> On Fri, Mar 29, 2024 at 9:45 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> master (213c959a29): 8.0 s
>> streaming-api v13: 9.5 s
>
> Hmm, that's not great, and I think I know one factor that has
> confounded my investigation and the conflicting reports I have
> received from a couple of people: some are using meson, which defaults
> to -O3, and others are using make, which gives you -O2 by default. At
> -O2, GCC doesn't inline the StartReadBuffer specialisation that is
> used in the "fast path", and possibly more. Some of that gap is closed
> by using pg_attribute_always_inline. Clang fails to inline at any
> level. So I should probably use the "always" macro there, because that
> is the intention. Still processing the rest of your email...

Ah yeah, I also noticed that the inlining didn't happen with some
compilers and flags. I use a mix of gcc and clang and meson and
autoconf in my local environment. The above micro-benchmarks were with
meson and gcc -O3. GCC version:

    $ gcc --version
    gcc (Debian 12.2.0-14) 12.2.0
    Copyright (C) 2022 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

--
Heikki Linnakangas
Neon (https://neon.tech)
1. I tried out Tomas's suggestion ALTER TABLESPACE ts SET
(io_combine_limit = ...). I like it, it's simple and works nicely.
Unfortunately we don't have support for units like '128kB' in
reloptions.c, so for now it requires a number of blocks. That's not
great, so we should probably fix that before merging it, so I'm
leaving that patch (v14-0004) separate, hopefully for later.

2. I also tried Tomas's suggestion of inventing a way to tell
PostgreSQL what the OS readahead window size is. That allows for
better modelling of whether the kernel is going to consider an access
pattern to be sequential. Again, the ALTER TABLESPACE version would
probably need unit support. This is initially just for
experimentation, as it came up in discussions of BHS behaviour. I was
against this idea originally, as it seemed like more complexity to
have to explain and tune, and I'm not sure if it's worth the
trouble... but in fact we are already in the business of
second-guessing kernel logic, so there doesn't seem to be any good
reason not to do it a bit better if we can. Do you think I have
correctly understood what Linux is doing? The name I came up with is:
effective_io_readahead_window. I wanted to make clear that it's a
property of the system we are telling it about, not to be confused
with our own look-ahead concept or whatever. Better names very
welcome. This is also in a separate patch (v14-0005), left for later.

3. Another question I wondered about while retesting: does this need
to be so low?

    src/include/port/pg_iovec.h:#define PG_IOV_MAX Min(IOV_MAX, 32)

I don't think so, so I've added a patch for that (sketched below). Now
that I'm not using an array full of arrays of that size, I don't care
so much how big we make that 32 (= 256kB @ 8kB), which clamps
io_combine_limit. I think 128 (= 1MB @ 8kB) might be a decent
arbitrary number. Sometimes we use it to size stack arrays, so I don't
want to make it insanely large, but 128 should be fine. I think it
would be good to be able to at least experiment with up to 1MB (I'm
not saying it's a good idea to do it, who knows?, just that there
isn't a technical reason not to allow it AFAIK). FWIW every system on
our target list that has p{read,write}v has IOV_MAX == 1024 (I checked
{Free,Net,Open}BSD, macOS, illumos and Linux), so the Min(IOV_MAX, ...)
really only clamps the systems where pg_{read,write}v fall back to
loop-based emulation (Windows, Solaris), which is fine. PG_IOV_MAX
also affects the routine that initialises new WAL files. I don't
currently see a downside to doing that in 1MB chunks, as there was
nothing sacred about the previous arbitrary number, and the code deals
with short writes by retrying as it should.

4. I agree with Heikki's complaints about the BMR interface. It should
be made more consistent and faster. I didn't want to make all of those
changes touching AMs etc a dependency though, so I spent some time
trying to squeeze out regressions using some of these clues about
calling conventions, likely hints, memory access and batching. I'm
totally open to improvements and refactoring of that stuff later!
Attached is the version with the best results I've managed to get. My
test is GCC -O3, pg_prewarm of a table of 220_000_000 integers =
7.6GB, which sometimes comes out around the same ~250ms on master and
streaming pg_prewarm v14 on a random cloud ARM box I'm testing with,
but not always; sometimes it's ~5-7ms more. (Unfortunately I don't
have access to good benchmarking equipment right now, better numbers
welcome.)
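To make item 3 concrete, the change amounts to something like this
(a sketch; the actual v14-0006 patch may differ in details):

    --- a/src/include/port/pg_iovec.h
    +++ b/src/include/port/pg_iovec.h
    -#define PG_IOV_MAX Min(IOV_MAX, 32)
    +#define PG_IOV_MAX Min(IOV_MAX, 128)  /* io_combine_limit up to 1MB @ 8kB */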
Two new ideas:

* give fast path mode a single flag, instead of testing all the
  conditions for every block

* give fast path mode a little buffer of future block numbers, so it
  can call the callback in batches

I'd tried that batch-calling thing before, and results were
inconclusive, but I think sometimes it helps a bit. Note that it
replaces the 'unget' thing from before, and it is possibly a tiny bit
nicer anyway.

I'm a bit stumped about how to improve this further -- if anyone has
any ideas for further improvements I'm all ears.

Zooming back out of micro-benchmark mode, it must be pretty hard to
see in a real workload that actually does something with the buffers,
like a sequential scan. Earlier complaints about all-cached sequential
scan regressions were resolved many versions ago AFAIK by minimising
the pin count in that case. I just tried Melanie's streaming
sequential scan patch, with a simple SELECT COUNT(*) WHERE i = -1, on
the same all-cached table of 220 million integers. Patched
consistently comes out ahead for all-in-kernel-cache,
none-in-PG-cache: ~14.7s -> ~14.4s, and all-in-PG-cache: ~13.5s ->
~13.3s (which I don't have an explanation for). I don't claim any of
that is particularly scientific, I just wanted to note that
single-digit numbers of milliseconds of regression while pinning a
million pages is clearly lost in the noise of other effects once you
add in real query execution. That's half a dozen nanoseconds per page
if I counted right.

So, I am finally starting to think we should commit this, and decide
which user patches are candidates.
Attachment
- v14-0001-Provide-vectored-variant-of-ReadBuffer.patch
- v14-0002-Provide-API-for-streaming-relation-data.patch
- v14-0003-Use-streaming-I-O-in-pg_prewarm.patch
- v14-0004-ALTER-TABLESPACE-.-SET-io_combine_limit.patch
- v14-0005-Add-effective_io_readahead_window-setting.patch
- v14-0006-Increase-PG_IOV_MAX-for-bigger-io_combine_limit.patch
I had been planning to commit v14 this morning but got cold feet with
the BMR-based interface. Heikki didn't like it much, and in the end,
neither did I. I have now removed it, and it seems much better. No
other significant changes, just parameter types and inlining details.
For example:

* read_stream_begin_relation() now takes a Relation, like its name says
* StartReadBuffers()'s operation takes smgr and optional rel
* ReadBuffer_common() takes smgr and optional rel

ReadBuffer() (which calls ReadBuffer_common() which calls
StartReadBuffer() as before) now shows no regression in a tight loop
over ~1 million already-in-cache pages (something Heikki had observed
before and could only completely fix with a change that affected all
callers). The same test using read_stream.c is still slightly slower,
~1 million already-in-cache pages, 301ms -> 308ms, which seems
acceptable to me and could perhaps be chased down with more study of
inlining/specialisation. As mentioned before, it doesn't seem to be
measurable once you actually do something with the pages.

In some ways BMR was better than the "fake RelationData" concept
(another attempt at wrestling with the relation vs storage duality,
that is, the online vs recovery duality). But in other ways it was
worse: a weird inconsistent mixture of pass-by-pointer and
pass-by-value interfaces that required several code paths to handle it
being only partially initialised, which turned out to be wasted cycles
implicated in regressions, despite which it is not even very nice to
use anyway. I'm sure it could be made to work better, but I'm not yet
sure it's really needed. In later work for recovery I will need to add
a separate constructor read_stream_begin_smgr_something() anyway for
other reasons (multi-relation streaming, different callback), and
perhaps also a separate StartReadBuffersSmgr() if it saves measurable
cycles to strip out branches. Maybe it was all just premature
pessimisation.

So this is the version I'm going to commit shortly, barring objections.
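For readers following along, here is a minimal sketch of how a client
drives this API, modelled loosely on the pg_prewarm conversion (the
callback contents, state struct and function names other than the
read_stream.c entry points are illustrative, not the committed code):

    typedef struct
    {
        BlockNumber next;
        BlockNumber nblocks;
    } PrewarmState;

    /* Hand the stream the next block number, or InvalidBlockNumber
     * to signal end-of-stream. */
    static BlockNumber
    next_block_cb(ReadStream *stream, void *callback_private_data,
                  void *per_buffer_data)
    {
        PrewarmState *state = callback_private_data;

        return (state->next < state->nblocks) ? state->next++
                                              : InvalidBlockNumber;
    }

    static void
    prewarm_fork(Relation rel, ForkNumber forknum)
    {
        PrewarmState state = {0, RelationGetNumberOfBlocksInFork(rel, forknum)};
        ReadStream *stream;
        Buffer      buf;

        stream = read_stream_begin_relation(READ_STREAM_FULL, NULL, rel,
                                            forknum, next_block_cb,
                                            &state, 0);
        while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
            ReleaseBuffer(buf);     /* real code would touch the page here */
        read_stream_end(stream);
    }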
Attachment
On Tue, Apr 2, 2024 at 9:39 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> So this is the version I'm going to commit shortly, barring objections.

And done, after fixing a small snafu with smgr-only reads coming from
CreateAndCopyRelationData() (BM_PERMANENT would be
incorrectly/unnecessarily set for unlogged tables).

Here are the remaining patches discussed in this thread. They give
tablespace-specific io_combine_limit, effective_io_readahead_window
(is this useful?), and up-to-1MB io_combine_limit (is this useful?). I
think the first two would probably require teaching reloptions.c how
to use guc.c's parse_int() and unit flags, but I won't have time to
look at that for this release, so I'll just leave these here.

On the subject of guc.c, this is a terrible error message... did I do
something wrong?

    postgres=# set io_combine_limit = '42MB';
    ERROR:  5376 8kB is outside the valid range for parameter
    "io_combine_limit" (1 .. 32)
Attachment
On Tue, Apr 2, 2024 at 8:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Here are the remaining patches discussed in this thread. They give
> tablespace-specific io_combine_limit, effective_io_readahead_window
> (is this useful?), and up-to-1MB io_combine_limit (is this useful?).
> I think the first two would probably require teaching reloptions.c
> how to use guc.c's parse_int() and unit flags, but I won't have time
> to look at that for this release, so I'll just leave these here.
>
> On the subject of guc.c, this is a terrible error message... did I do
> something wrong?
>
> postgres=# set io_combine_limit = '42MB';
> ERROR: 5376 8kB is outside the valid range for parameter
> "io_combine_limit" (1 .. 32)

Well, GUC_UNIT_BLOCKS interpolates the block size into the error
message string (see get_config_unit_name()). But I can't imagine this
error message is clear for any of the GUCs using GUC_UNIT_BLOCKS. I
would think some combination of the two would be helpful, like "43008
kB (5376 blocks) is outside the valid range for parameter". The user
can check what their block size is; I don't think we need to
interpolate and print the block size in the error message.

On another note, since io_combine_limit, when specified as a size,
rounds up to the nearest multiple of the block size, it might be worth
mentioning this in the io_combine_limit docs at some point. I checked
the docs for another GUC_UNIT_BLOCKS GUC, backend_flush_after, and it
alludes to this.

- Melanie
Hi,

On Tue, 2 Apr 2024 at 11:40, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> I had been planning to commit v14 this morning but got cold feet with
> the BMR-based interface. Heikki didn't like it much, and in the end,
> neither did I. I have now removed it, and it seems much better. No
> other significant changes, just parameter types and inlining details.
> For example:
>
> * read_stream_begin_relation() now takes a Relation, like its name says
> * StartReadBuffers()'s operation takes smgr and optional rel
> * ReadBuffer_common() takes smgr and optional rel

Read stream objects can be created only using Relations now. There
could be read stream users which do not have a Relation but only an
SMgrRelation. So, I created another constructor for read streams which
uses SMgrRelations instead of Relations. The related patch is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachment
On Sun, Apr 7, 2024 at 1:33 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Tue, 2 Apr 2024 at 11:40, Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > I had been planning to commit v14 this morning but got cold feet with
> > the BMR-based interface. Heikki didn't like it much, and in the end,
> > neither did I. I have now removed it, and it seems much better. No
> > other significant changes, just parameter types and inlining details.
> > For example:
> >
> > * read_stream_begin_relation() now takes a Relation, like its name says
> > * StartReadBuffers()'s operation takes smgr and optional rel
> > * ReadBuffer_common() takes smgr and optional rel
>
> Read stream objects can be created only using Relations now. There
> could be read stream users which do not have a Relation but only an
> SMgrRelation. So, I created another constructor for read streams which
> uses SMgrRelations instead of Relations. The related patch is attached.

This patch LGTM

- Melanie
Hi,

On Sun, 7 Apr 2024 at 20:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Tue, 2 Apr 2024 at 11:40, Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > I had been planning to commit v14 this morning but got cold feet with
> > the BMR-based interface. Heikki didn't like it much, and in the end,
> > neither did I. I have now removed it, and it seems much better. No
> > other significant changes, just parameter types and inlining details.
> > For example:
> >
> > * read_stream_begin_relation() now takes a Relation, like its name says
> > * StartReadBuffers()'s operation takes smgr and optional rel
> > * ReadBuffer_common() takes smgr and optional rel
>
> Read stream objects can be created only using Relations now. There
> could be read stream users which do not have a Relation but only an
> SMgrRelation. So, I created another constructor for read streams which
> uses SMgrRelations instead of Relations. The related patch is attached.

After sending this, I realized that I forgot to add a persistence
value to the new constructor. While working on it I also realized that
the current code sets persistence in the PinBufferForBlock() function,
and that function is called for each block, which can be costly. So, I
moved the setting of persistence out of the PinBufferForBlock()
function.

Setting persistence outside of the PinBufferForBlock() function (0001)
and creating the new constructor that uses SMgrRelations (0002) are
attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
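The SMgrRelation-based constructor would look something like this (a
sketch of the shape described above; the exact signature in the patch
may differ):

    ReadStream *
    read_stream_begin_smgr_relation(int flags,
                                    BufferAccessStrategy strategy,
                                    SMgrRelation smgr,
                                    char smgr_persistence,
                                    ForkNumber forknum,
                                    ReadStreamBlockNumberCB callback,
                                    void *callback_private_data,
                                    size_t per_buffer_data_size);

With the persistence supplied by the caller once at stream creation,
PinBufferForBlock() no longer has to derive it for every block.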
Attachment
Hi, On Mon, 8 Apr 2024 at 00:01, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > Hi, > > On Sun, 7 Apr 2024 at 20:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > > > Hi, > > > > On Tue, 2 Apr 2024 at 11:40, Thomas Munro <thomas.munro@gmail.com> wrote: > > > > > > I had been planning to commit v14 this morning but got cold feet with > > > the BMR-based interface. Heikki didn't like it much, and in the end, > > > neither did I. I have now removed it, and it seems much better. No > > > other significant changes, just parameter types and inlining details. > > > For example: > > > > > > * read_stream_begin_relation() now takes a Relation, likes its name says > > > * StartReadBuffers()'s operation takes smgr and optional rel > > > * ReadBuffer_common() takes smgr and optional rel > > > > Read stream objects can be created only using Relations now. There > > could be read stream users which do not have a Relation but > > SMgrRelations. So, I created another constructor for the read streams > > which use SMgrRelations instead of Relations. Related patch is > > attached. > > After sending this, I realized that I forgot to add persistence value > to the new constructor. While working on it I also realized that > current code sets persistence in PinBufferForBlock() function and this > function is called for each block, which can be costly. So, I moved > setting persistence to the out of PinBufferForBlock() function. > > Setting persistence outside of the PinBufferForBlock() function (0001) > and creating the new constructor that uses SMgrRelations (0002) are > attached. Melanie noticed there was a 'sgmr -> smgr' typo in 0002. Fixed in attached. -- Regards, Nazir Bilal Yavuz Microsoft
Attachment
I've attached a patch with a few typo fixes and what looks like an incorrect type for max_ios. It's an int16 and I think it needs to be an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do anything when max_ios is int16. David
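To spell out why that clamp is a no-op, here is a standalone sketch
(not the read_stream.c code; Min() written out as c.h defines it):

    #include <stdio.h>
    #include <stdint.h>

    #define PG_INT16_MAX  INT16_MAX
    #define Min(x, y)     ((x) < (y) ? (x) : (y))

    int
    main(void)
    {
        int     computed = 100000;  /* e.g. a large product of settings */
        int16_t max_ios;

        /* The damage happens at assignment: the value is already
         * truncated before Min() runs, so the clamp can never fire. */
        max_ios = computed;
        max_ios = Min(max_ios, PG_INT16_MAX);   /* no-op for an int16 */
        printf("int16: %d\n", max_ios);         /* typically -31072 */

        /* With a plain int, the clamp does what was intended. */
        int max_ios_int = Min(computed, PG_INT16_MAX);
        printf("int:   %d\n", max_ios_int);     /* 32767 */
        return 0;
    }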
Attachment
On Wed, 24 Apr 2024 at 14:32, David Rowley <dgrowleyml@gmail.com> wrote:
> I've attached a patch with a few typo fixes and what looks like an
> incorrect type for max_ios. It's an int16 and I think it needs to be
> an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do
> anything when max_ios is int16.

No feedback, so I'll just push this in a few hours unless anyone has
anything.

David
On Wed, May 1, 2024 at 2:51 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Wed, 24 Apr 2024 at 14:32, David Rowley <dgrowleyml@gmail.com> wrote:
> > I've attached a patch with a few typo fixes and what looks like an
> > incorrect type for max_ios. It's an int16 and I think it needs to be
> > an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do
> > anything when max_ios is int16.
>
> No feedback, so I'll just push this in a few hours unless anyone has
> anything.

Patch looks correct, thanks. Please do.

(Sorry, running a bit behind on email ATM... I also have a few more
typos around here from an off-list email from Mr Lakhin, will get to
that soon...)
On Wed, Nov 29, 2023 at 1:17 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Done. I like it, I just feel a bit bad about moving the p*v()
> replacement functions around a couple of times already! I figured it
> might as well be static inline even if we use the fallback (= Solaris
> and Windows).

Just for the record, since I'd said things like the above a few times
while writing about this stuff: Solaris 11.4.69 has gained preadv()
and pwritev(). That's interesting because it means that there will
soon be no live Unixoid operating systems left without them, and the
fallback code in src/include/port/pg_iovec.h will, in practice, be
only for Windows. I wondered if that might have implications for how
we code or comment stuff like that, but it still seems to make sense
as we have it.

(I don't think Windows can have a real synchronous implementation; the
kernel knows how to do scatter/gather, a feature implemented
specifically for databases, but only in asynchronous ("overlapped") +
direct I/O mode, a difference I don't know how to hide at this level.
In later AIO work we should be able to use it as intended, but not by
pretending to be Unix like this.)
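For context, the fallback in question is a simple loop over pread(),
roughly like this sketch (illustrative, not the exact pg_iovec.h
code):

    #include <sys/types.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Emulate preadv() with one pread() per iovec entry. */
    static ssize_t
    fallback_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
    {
        ssize_t sum = 0;

        for (int i = 0; i < iovcnt; i++)
        {
            ssize_t part = pread(fd, iov[i].iov_base, iov[i].iov_len, offset);

            if (part < 0)
                return sum > 0 ? sum : -1;  /* report progress, or the error */
            sum += part;
            offset += part;
            if ((size_t) part < iov[i].iov_len)
                break;                      /* short read: caller retries */
        }
        return sum;
    }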
Hi,

It seems that Heikki's 'v9.heikki-0007-Trivial-comment-fixes.patch'
[1] was only partially applied; the top comment was not updated. The
attached patch just updates it.

[1] https://www.postgresql.org/message-id/289a1c0e-8444-4009-a8c2-c2d77ced6f07%40iki.fi

--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachment
On Wed, Jul 10, 2024 at 07:21:59PM +0300, Nazir Bilal Yavuz wrote:
> Hi,
>
> It seems that Heikki's 'v9.heikki-0007-Trivial-comment-fixes.patch'
> [1] was only partially applied; the top comment was not updated. The
> attached patch just updates it.
>
> [1] https://www.postgresql.org/message-id/289a1c0e-8444-4009-a8c2-c2d77ced6f07%40iki.fi

Thanks, patch applied to master.

--
Bruce Momjian  <bruce@momjian.us>  https://momjian.us
EDB                                https://enterprisedb.com

Only you can decide what is important to you.