Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: Streaming I/O, vectored I/O (WIP) |
Date | |
Msg-id | CA+hUKGKOc4=TTNv_r4F03QafvK7qo7itN+OwUdMX5UMjYiqybA@mail.gmail.com Whole thread Raw |
In response to | Re: Streaming I/O, vectored I/O (WIP) (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: Streaming I/O, vectored I/O (WIP)
(Heikki Linnakangas <hlinnaka@iki.fi>)
Re: Streaming I/O, vectored I/O (WIP) (Heikki Linnakangas <hlinnaka@iki.fi>) Re: Streaming I/O, vectored I/O (WIP) (Melanie Plageman <melanieplageman@gmail.com>) Re: Streaming I/O, vectored I/O (WIP) (Melanie Plageman <melanieplageman@gmail.com>) Re: Streaming I/O, vectored I/O (WIP) (Thomas Munro <thomas.munro@gmail.com>) |
List | pgsql-hackers |
On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > + /* Avoid a slightly more expensive kernel call if there is no benefit. */ > > + if (iovcnt == 1) > > + returnCode = pg_pread(vfdP->fd, > > + iov[0].iov_base, > > + iov[0].iov_len, > > + offset); > > + else > > + returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset); > > How about pushing down this optimization to pg_preadv() itself? > pg_readv() is currently just a macro if the system provides preadv(), > but it could be a "static inline" that does the above dance. I think > that optimization is platform-dependent anyway, pread() might not be any > faster on some OSs. In particular, if the system doesn't provide > preadv() and we use the implementation in src/port/preadv.c, it's the > same kernel call anyway. Done. I like it, I just feel a bit bad about moving the p*v() replacement functions around a couple of times already! I figured it might as well be static inline even if we use the fallback (= Solaris and Windows). > I don't see anything in the callers (mdreadv() and mdwritev()) to > prevent them from passing nblocks > PG_IOV_MAX. The outer loop in md*v() copes with segment boundaries and also makes sure lengthof(iov) AKA PG_IOV_MAX isn't exceeded (though that couldn't happen with the current caller). > in streaming_read.h: > > > typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr, > > uintptr_t pgsr_private, > > void *per_io_private, > > BufferManagerRelation *bmr, > > ForkNumber *forkNum, > > BlockNumber *blockNum, > > ReadBufferMode *mode); > > I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on > each read. I see that you used that in the WAL replay prefetching, so I > guess that makes sense. In this version I have introduced an alternative simple callback. It's approximately what we had already tried out in an earlier version before I started streamifying recovery, but in this version you can choose, so recovery can opt for the wider callback. I've added some ramp-up logic. The idea is that after we streamify everything in sight, we don't want to penalise users that don't really need more than one or two blocks, but don't know that yet. Here is how the system calls look when you do pg_prewarm(): pread64(32, ..., 8192, 0) = 8192 <--- start with just one block pread64(32, ..., 16384, 8192) = 16384 pread64(32, ..., 32768, 24576) = 32768 pread64(32, ..., 65536, 57344) = 65536 pread64(32, ..., 131072, 122880) = 131072 <--- soon reading 16 blocks at a time pread64(32, ..., 131072, 253952) = 131072 pread64(32, ..., 131072, 385024) = 131072 I guess it could be done in quite a few different ways and I'm open to better ideas. This way inserts prefetching stalls but ramps up quickly and is soon out of the way. I wonder if we would want to make that a policy that a caller can disable, if you want to skip the ramp-up and go straight for the largest possible I/O size? Then I think we'd need a 'flags' argument to the streaming read constructor functions. A small detour: While contemplating how this interacts with parallel sequential scan, which also has a notion of ramping up, I noticed another problem. One parallel seq scan process does this: fadvise64(32, 35127296, 131072, POSIX_FADV_WILLNEED) = 0 preadv(32, [...], 2, 35127296) = 131072 preadv(32, [...], 2, 35258368) = 131072 fadvise64(32, 36175872, 131072, POSIX_FADV_WILLNEED) = 0 preadv(32, [...], 2, 36175872) = 131072 preadv(32, [...], 2, 36306944) = 131072 ... We don't really want those fadvise() calls. We don't get them with parallelism disabled, because streaming_read.c is careful not to generate advice for sequential workloads based on ancient wisdom from this mailing list, re-confirmed on recent Linux: WILLNEED hints actually get in the way of Linux's own prefetching and slow you down, so we only want them for truly random access. But the logic can't see that another process is making holes in this process's sequence. The two obvious solutions are (1) pass in a flag at the start saying "I promise this is sequential even if it doesn't look like it, no hints please" and (2) invent "shared" (cross-process) streaming reads, and teach all the parallel seq scan processes to get their buffers from there. Idea (2) is interesting to think about but even if it is a useful idea (not sure) it is certainly overkill just to solve this little problem for now. So perhaps I should implement (1), which would be another reason to add a flags argument. It's not a perfect solution though because some more 'data driven' parallel scans (indexes, bitmaps, ...) have a similar problem that is less amenable to top-down kludgery. I've included just the pg_prewarm example user for now while we discuss the basic infrastructure. The rest are rebased and in my public Github branch streaming-read (repo macdice/postgres) if anyone is interested (don't mind the red CI failures, they're just saying I ran out of monthly CI credits on the 29th, so close...)
Attachment
- v2-0001-Optimize-pg_readv-pg_pwritev-for-single-vector.patch
- v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri.patch
- v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch
- v2-0004-Provide-vectored-variant-of-ReadBuffer.patch
- v2-0005-Provide-multi-block-smgrprefetch.patch
- v2-0006-Give-SMgrRelation-pointers-a-well-defined-lifetim.patch
- v2-0007-Provide-basic-streaming-read-API.patch
- v2-0008-Use-streaming-reads-in-pg_prewarm.patch
pgsql-hackers by date: