Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Streaming I/O, vectored I/O (WIP) |
Date | |
Msg-id | b336fbe8-1ad0-c289-5ed8-bbb74513e5bb@iki.fi Whole thread Raw |
In response to | Streaming I/O, vectored I/O (WIP) (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: Streaming I/O, vectored I/O (WIP)
|
List | pgsql-hackers |
On 31/08/2023 07:00, Thomas Munro wrote: > Currently PostgreSQL reads (and writes) data files 8KB at a time. > That's because we call ReadBuffer() one block at a time, with no > opportunity for lower layers to do better than that. This thread is > about a model where you say which block you'll want next with a > callback, and then you pull the buffers out of a "stream". I love this idea! Makes it a lot easier to perform prefetch, as evidenced by the 011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch: 13 files changed, 289 insertions(+), 637 deletions(-) I'm a bit disappointed and surprised by v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though: 4 files changed, 244 insertions(+), 78 deletions(-) The current prefetching logic in vacuumlazy.c is pretty hairy, so I hoped that this would simplify it. I didn't look closely at that patch, so maybe it's simpler even though it's more code. > There are more kinds of streaming I/O that would be useful, such as > raw unbuffered files, and of course writes, and I've attached some > early incomplete demo code for writes (just for fun), but the main > idea I want to share in this thread is the idea of replacing lots of > ReadBuffer() calls with the streaming model. All this makes sense. Some random comments on the patches: > + /* Avoid a slightly more expensive kernel call if there is no benefit. */ > + if (iovcnt == 1) > + returnCode = pg_pread(vfdP->fd, > + iov[0].iov_base, > + iov[0].iov_len, > + offset); > + else > + returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset); How about pushing down this optimization to pg_preadv() itself? pg_readv() is currently just a macro if the system provides preadv(), but it could be a "static inline" that does the above dance. I think that optimization is platform-dependent anyway, pread() might not be any faster on some OSs. In particular, if the system doesn't provide preadv() and we use the implementation in src/port/preadv.c, it's the same kernel call anyway. > v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch No smgrextendv()? I guess none of the patches here needed it. > /* > * Prepare to read a block. The buffer is pinned. If this is a 'hit', then > * the returned buffer can be used immediately. Otherwise, a physical read > * should be completed with CompleteReadBuffers(). PrepareReadBuffer() > * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the > * caller has the opportunity to coalesce reads of neighboring blocks into one > * CompleteReadBuffers() call. > * > * *found is set to true for a hit, and false for a miss. > * > * *allocated is set to true for a miss that allocates a buffer for the first > * time. If there are multiple calls to PrepareReadBuffer() for the same > * block before CompleteReadBuffers() or ReadBuffer_common() finishes the > * read, then only the first such call will receive *allocated == true, which > * the caller might use to issue just one prefetch hint. > */ > Buffer > PrepareReadBuffer(BufferManagerRelation bmr, > ForkNumber forkNum, > BlockNumber blockNum, > BufferAccessStrategy strategy, > bool *found, > bool *allocated) > If you decide you don't want to perform the read, after all, is there a way to abort it without calling CompleteReadBuffers()? Looking at the later patch that introduces the streaming read API, seems that it finishes all the reads, so I suppose we don't need an abort function. Does it all get cleaned up correctly on error? > /* > * Convert an array of buffer address into an array of iovec objects, and > * return the number that were required. 'iov' must have enough space for up > * to PG_IOV_MAX elements. > */ > static int > buffers_to_iov(struct iovec *iov, void **buffers, int nblocks) The comment is a bit inaccurate. There's an assertion that If nblocks <= PG_IOV_MAX, so while it's true that 'iov' must have enough space for up to PG_IOV_MAX elements, that's only because we also assume that nblocks <= PG_IOV_MAX. I don't see anything in the callers (mdreadv() and mdwritev()) to prevent them from passing nblocks > PG_IOV_MAX. in streaming_read.h: > typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr, > uintptr_t pgsr_private, > void *per_io_private, > BufferManagerRelation *bmr, > ForkNumber *forkNum, > BlockNumber *blockNum, > ReadBufferMode *mode); I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on each read. I see that you used that in the WAL replay prefetching, so I guess that makes sense. > extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr); > extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_private); > extern void pg_streaming_read_reset(PgStreamingRead *pgsr); > extern void pg_streaming_read_free(PgStreamingRead *pgsr); Do we need to expose pg_streaming_read_prefetch()? It's only used in the WAL replay prefetching patch, and only after calling pg_streaming_read_reset(). Could pg_streaming_read_reset() call pg_streaming_read_prefetch() directly? Is there any need to "reset" the stream, without also starting prefetching? In v1-0012-WIP-Use-streaming-reads-in-recovery.patch: > @@ -1978,6 +1979,9 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id, > * If the WAL record contains a block reference with the given ID, *rlocator, > * *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and > * returns true. Otherwise returns false. > + * > + * If prefetch_buffer is not NULL, the buffer is already pinned, and ownership > + * of the pin is transferred to the caller. > */ > bool > XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id, > @@ -1998,7 +2002,15 @@ XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id, > if (blknum) > *blknum = bkpb->blkno; > if (prefetch_buffer) > + { > *prefetch_buffer = bkpb->prefetch_buffer; > + > + /* > + * Clear this flag is so that we can assert that redo records take > + * ownership of all buffers pinned by xlogprefetcher.c. > + */ > + bkpb->prefetch_buffer = InvalidBuffer; > + } > return true; > } Could these changes be committed independently of all the other changes? -- Heikki Linnakangas Neon (https://neon.tech)
pgsql-hackers by date: