Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Streaming I/O, vectored I/O (WIP)
Msg-id: CA+hUKG+00TZ=POYk309j0pPSKcLe1vQOZ5=4iTsEZoU9dxQDoA@mail.gmail.com
In response to: Re: Streaming I/O, vectored I/O (WIP) (Melanie Plageman <melanieplageman@gmail.com>)
Responses: Re: Streaming I/O, vectored I/O (WIP); Re: Streaming I/O, vectored I/O (WIP)
List: pgsql-hackers
On Wed, Nov 29, 2023 at 2:21 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Thanks for posting a new version. I've included a review of 0004.

Thanks! I committed the patches up as far as smgr.c before the
holidays. The next thing to commit would be the bufmgr.c one for
vectored ReadBuffer(), v5-0001. Here's my response to your review of
that, which triggered quite a few changes. See also the new version of
the streaming_read.c patch, with a change list at the end. (I'll talk
about v5-0002, the SMgrRelation lifetime one, over on the separate
thread about that, where Heikki posted a better version.)

> ot -> to

Fixed.

> > -       if (found)
> > -               pgBufferUsage.shared_blks_hit++;
> > -       else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
> > -                        mode == RBM_ZERO_ON_ERROR)
> > -               pgBufferUsage.shared_blks_read++;
>
> You've lost this test in your new version. You can do the same thing
> (avoid counting zeroed buffers as blocks read) by moving this
> pgBufferUsage.shared/local_blks_read++ back into ReadBuffer_common()
> where you know if you called ZeroBuffer() or CompleteReadBuffers().

Yeah, right. After thinking about that some more: that's true, and
that placement would be good, but only if we look just at this patch,
where we chop ReadBuffer() into two parts (PrepareReadBuffer() and
CompleteReadBuffers()) and then put them back together again. However,
soon we'll want to use the two functions separately, and we won't call
ReadBuffer[_common]().

New idea: PrepareReadBuffer() can continue to be in charge of bumping
{local,shared}_blks_hit, but {local,shared}_blks_read++ can happen in
CompleteReadBuffers(). There is no counter for zeroed buffers, but if
there ever is one in the future, it can go into ZeroBuffer(). In this
version, I've moved that into CompleteReadBuffers(), along with a new
comment to explain a pre-existing deficiency in the whole scheme:
there is a race where you finish up counting a read but someone else
actually does the read, and also counts it. I'm trying to preserve the
existing bean-counting logic to the extent possible across this
refactoring.

> > +       }
> > +       else
> > +       {
> > +               bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
> > +                                                        strategy, found, allocated, io_context);
> > +               if (*found)
> > +                       pgBufferUsage.shared_blks_hit++;
> > +               else
> > +                       pgBufferUsage.shared_blks_read++;
> > +       }
> > +       if (bmr.rel)
> > +       {
> > +               pgstat_count_buffer_read(bmr.rel);
>
> This is double-counting reads. You've left the call in
> ReadBufferExtended() as well as adding this here. It should be fine
> to remove it from ReadBufferExtended(). Because you test bmr.rel,
> leaving the call here in PrepareReadBuffer() wouldn't have an effect
> on ReadBuffer_common() callers who don't pass a relation (like
> recovery). The other current callers of ReadBuffer_common() (by way
> of ExtendBufferedRelTo()) who do pass a relation are visibility map
> and freespace map extension, and I don't think we track relation
> stats for the VM and FSM.

Oh yeah. Right. Fixed.

> This does continue the practice of counting zeroed buffers as reads
> in table-level stats. But, that is the same as master.

Right. It is a little strange that pgstat_count_buffer_read() finishes
up in a different location than
pgBufferUsage.{local,shared}_blks_read++, but that's precisely due to
this pre-existing difference in accounting policy. That generally
seems like a POLA failure, so I've added a comment to help us remember
about that, for another day.
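To make the intended division of labour concrete, here's a
hypothetical caller-side sketch; the argument lists of
PrepareReadBuffer() and CompleteReadBuffers() below are simplified
guesses for illustration, not the exact v5-0001 signatures:

    /*
     * Hypothetical illustration only: signatures are approximate.
     */
    static Buffer
    read_one_block(BufferManagerRelation bmr, ForkNumber forknum,
                   BlockNumber blocknum, BufferAccessStrategy strategy)
    {
        bool        found;
        Buffer      buffer;

        /* Pin the buffer; bumps {local,shared}_blks_hit on a cache hit. */
        buffer = PrepareReadBuffer(bmr, forknum, blocknum, strategy, &found);

        if (!found)
        {
            /*
             * Perform the I/O; bumps {local,shared}_blks_read here,
             * where the read actually happens (modulo the counting
             * race noted above).
             */
            CompleteReadBuffers(bmr, &buffer, forknum, blocknum, 1,
                                false /* zero_on_error */, strategy);
        }

        return buffer;
    }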
> > +       io_start = pgstat_prepare_io_time();
> > +       smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
> > +       pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start, 1);
>
> I'd pass io_buffers_len as cnt to pgstat_count_io_op_time(). op_bytes
> will be BLCKSZ and multiplying that by the number of reads should
> produce the number of bytes read.

OK, thanks, fixed.
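With the fix, the hunk becomes, roughly (reconstructed from the quoted
lines above):

    io_start = pgstat_prepare_io_time();
    smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);

    /*
     * Pass io_buffers_len as cnt: op_bytes is BLCKSZ, so the stats
     * machinery can derive the bytes read as BLCKSZ * cnt.
     */
    pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
                            io_buffers_len);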
> > BufferDesc *
> > LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> > -                                bool *foundPtr)
> > +                                bool *foundPtr, bool *allocPtr)
> > {
> >         BufferTag       newTag;                 /* identity of requested block */
> >         LocalBufferLookupEnt *hresult;
> > @@ -144,6 +144,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> >                 Assert(BufferTagsEqual(&bufHdr->tag, &newTag));
> >
> >                 *foundPtr = PinLocalBuffer(bufHdr, true);
> > +               *allocPtr = false;
> ...
>
> I would prefer you use consistent naming for
> allocPtr/allocatedPtr/allocated. I also think that all the functions
> taking it as an output argument should explain what it is
> (BufferAlloc()/LocalBufferAlloc(), etc). I found myself doing a bit
> of digging around to figure it out. You have a nice comment about it
> above PrepareReadBuffer(). I think you may need to resign yourself to
> restating that bit (or some version of it) for all of the functions
> taking it as an argument.

I worked on addressing your complaints, but while doing so I had
second thoughts about even having that argument. The need for it came
up while working on streaming recovery. Without it, whenever there
were repeated references to the same block (as happens very often in
the WAL), it'd issue extra useless POSIX_FADV_WILLNEED syscalls, so I
wanted to find a way to distinguish the *first* miss for a given
block, to be able to issue the advice just once before the actual read
happens. But now I see that other users of streaming reads probably
won't repeatedly stream the same block, and I am not proposing
streaming recovery for PG 17. Time to simplify. I decided to kick
allocatedPtr out and think about it some more, but I really hope we'll
be able to start real background I/O instead of issuing advice in the
PG 18 proposals, and that'll be negotiated via IO_IN_PROGRESS, so then
we'd never need allocatedPtr. Thus, removed.

> > #ifndef BUFMGR_H
> > #define BUFMGR_H
> >
> > +#include "pgstat.h"
>
> I don't know what we are supposed to do, but I would have included
> this in bufmgr.c (where I actually needed it) instead of including it
> here.

Fixed.

> > +#include "port/pg_iovec.h"
> >  #include "storage/block.h"
> >  #include "storage/buf.h"
> >  #include "storage/bufpage.h"
> > @@ -47,6 +49,8 @@ typedef enum
> >         RBM_ZERO_AND_CLEANUP_LOCK,      /* Like RBM_ZERO_AND_LOCK, but locks the page
> >                                                                          * in "cleanup" mode */
> >         RBM_ZERO_ON_ERROR,                      /* Read, but return an all-zeros page on error */
> > +       RBM_WILL_ZERO,                          /* Don't read from disk, caller will call
> > +                                                                        * ZeroBuffer() */
>
> It's confusing that this (RBM_WILL_ZERO) is part of this commit since
> it isn't used in this commit.

Yeah. Removed.

Bikeshedding call: I am open to better suggestions for the names
PrepareReadBuffer() and CompleteReadBuffers(); they seem a little
grammatically clumsy.

I now also have a much simplified version of the streaming read patch.
The short version is that it has all the advanced features removed, so
that now it *only* does the clustering required to build up large
CompleteReadBuffers() calls. That's short and sweet, and enough for
pg_prewarm to demonstrate 128KB reads, a good first step. Then
WILLNEED advice and ramp-up are added as separate patches, for easier
review. I've got some more patches in development that would re-add an
"extended" multi-relation mode with a wider callback that can also
stream zeroed buffers, as required so far only by recovery -- but I
can propose that later.

Other improvements:

* Melanie didn't like the overloaded term "cluster" (off-list
  feedback). Now I use "block range" to describe a range of contiguous
  blocks (blocknum, nblocks).
* KK didn't like "prefetch" being used with various meanings (off-list
  feedback). Better words found.
* pgsr_private might as well be void *; uintptr_t is theoretically
  more general but doesn't seem to buy much for realistic uses.
* per_io_data might as well be called per_buffer_data (it's not per
  "I/O", and that isn't the level of abstraction for this API anyway,
  which is about blocks and buffers).
* I reordered some function arguments that jumped out after the above
  changes.
* Once I started down the path of using a flags field to control
  various policy stuff as discussed up-thread, it started to seem
  clearer that callers probably shouldn't directly control I/O depth,
  which was all a bit ad hoc and unfinished before. I think we'd
  likely want that to be centralised. New idea: it should be enough to
  be able to specify policies as required now and in the future with
  flags. Thoughts?

I've also included 4 of the WIP patches from earlier (based on an
obsolete version of the vacuum thing, sorry, I know you have a much
better one now), which can be mostly ignored. Here I just wanted to
share working code to drive vectored reads with different access
patterns, and also show the API interactions and how each caller might
set flag bits, as sketched below. For example, parallel seq scan uses
PGSR_FLAG_SEQUENTIAL to insist that its scans are sequential despite
appearances, pg_prewarm uses PGSR_FLAG_FULL to declare that it'll read
the whole relation so there is no point in ramping up, and the
vacuum-related stuff uses PGSR_FLAG_MAINTENANCE to select tuning based
on maintenance_io_concurrency instead of effective_io_concurrency.
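For a rough feel of the interaction, here's a hypothetical
pg_prewarm-style sketch. Only the PGSR_FLAG_* names, pgsr_private and
per_buffer_data come from this thread; the PgStreamingRead type, the
pg_streaming_read_* function names and the callback signature are my
shorthand, not necessarily what v5-0003 actually declares:

    /*
     * Hypothetical sketch; see v5-0003/v5-0004 for the real interface.
     */
    typedef struct PrewarmReadState
    {
        BlockNumber next_block;     /* next block to hand to the stream */
        BlockNumber nblocks;        /* relation size in blocks */
    } PrewarmReadState;

    /* Callback: supply the next block number, or end the stream. */
    static BlockNumber
    prewarm_next_block(PgStreamingRead *pgsr, void *pgsr_private,
                       void *per_buffer_data)
    {
        PrewarmReadState *state = (PrewarmReadState *) pgsr_private;

        if (state->next_block >= state->nblocks)
            return InvalidBlockNumber;      /* end of stream */
        return state->next_block++;
    }

    static void
    prewarm_relation(BufferManagerRelation bmr, ForkNumber forknum,
                     BlockNumber nblocks, BufferAccessStrategy strategy)
    {
        PrewarmReadState state = {0, nblocks};
        PgStreamingRead *pgsr;
        Buffer      buf;

        /* PGSR_FLAG_FULL: whole relation will be read, so no ramp-up. */
        pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_FULL, &state,
                                              0,  /* no per-buffer data */
                                              strategy, bmr, forknum,
                                              prewarm_next_block);

        while ((buf = pg_streaming_read_buffer_get_next(pgsr, NULL)) !=
               InvalidBuffer)
            ReleaseBuffer(buf);     /* buffer is now pinned and warm */

        pg_streaming_read_free(pgsr);
    }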
Attachment
- v5-0001-Provide-vectored-variant-of-ReadBuffer.patch
- v5-0002-Give-SMgrRelation-pointers-a-well-defined-lifetim.patch
- v5-0003-Provide-API-for-streaming-reads-of-relations.patch
- v5-0004-Use-streaming-reads-in-pg_prewarm.patch
- v5-0005-Issue-advice-for-random-streaming-reads.patch
- v5-0006-Allow-streaming-reads-to-ramp-up-in-size.patch
- v5-0007-WIP-Use-streaming-reads-in-heapam-scans.patch
- v5-0008-WIP-Use-streaming-reads-in-vacuum.patch
- v5-0009-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patch
- v5-0010-WIP-Use-streaming-reads-in-bitmap-heapscan.patch