On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > +     /* Avoid a slightly more expensive kernel call if there is no benefit. */
> > +     if (iovcnt == 1)
> > +         returnCode = pg_pread(vfdP->fd,
> > +                               iov[0].iov_base,
> > +                               iov[0].iov_len,
> > +                               offset);
> > +     else
> > +         returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
>
> How about pushing down this optimization to pg_preadv() itself?
> pg_preadv() is currently just a macro if the system provides preadv(),
> but it could be a "static inline" that does the above dance. I think
> that optimization is platform-dependent anyway; pread() might not be any
> faster on some OSs. In particular, if the system doesn't provide
> preadv() and we use the implementation in src/port/preadv.c, it's the
> same kernel call anyway.
Done. I like it; I just feel a bit bad about having moved the p*v()
replacement functions around a couple of times already! I figured they
might as well be static inline even if we use the fallback (= Solaris
and Windows).
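To spell it out, the shape is roughly this (a sketch, not the patch
verbatim; HAVE_DECL_PREADV stands for the relevant configure test, and
the fallback loop is simplified):

#include <sys/uio.h>            /* preadv(), struct iovec */
#include <unistd.h>             /* pread() */

static inline ssize_t
pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
{
#if HAVE_DECL_PREADV
    /* Avoid the slightly more expensive vectored call for one iovec. */
    if (iovcnt == 1)
        return pread(fd, iov[0].iov_base, iov[0].iov_len, offset);
    return preadv(fd, iov, iovcnt, offset);
#else
    /* No native preadv() (Solaris, Windows): loop over pg_pread(). */
    ssize_t     sum = 0;

    for (int i = 0; i < iovcnt; ++i)
    {
        ssize_t     part = pg_pread(fd, iov[i].iov_base,
                                    iov[i].iov_len, offset);

        if (part < 0)
            return (i == 0) ? -1 : sum;
        sum += part;
        offset += part;
        if ((size_t) part < iov[i].iov_len)
            break;              /* short read, stop here */
    }
    return sum;
#endif
}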
> I don't see anything in the callers (mdreadv() and mdwritev()) to
> prevent them from passing nblocks > PG_IOV_MAX.
The outer loop in md*v() copes with segment boundaries and also makes
sure that lengthof(iov), AKA PG_IOV_MAX, isn't exceeded (though the
latter couldn't happen with the current caller); see the sketch below.
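That is, paraphrasing (the segment-boundary clamping and iovec
building are elided, and the vectored fd.c call is only indicated in a
comment):

    /* Paraphrased: the outer loop in mdreadv()/mdwritev(). */
    while (nblocks > 0)
    {
        struct iovec iov[PG_IOV_MAX];
        int          iovcnt;

        /* Cap this batch at lengthof(iov), i.e. PG_IOV_MAX. */
        iovcnt = Min(nblocks, lengthof(iov));

        /*
         * ... further clamp iovcnt at the relation segment boundary,
         * build the iovecs from the caller's buffers, then issue one
         * vectored read/write for the whole batch here ...
         */

        blocknum += iovcnt;
        nblocks -= iovcnt;
    }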
> in streaming_read.h:
>
> > typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
> >                                                       uintptr_t pgsr_private,
> >                                                       void *per_io_private,
> >                                                       BufferManagerRelation *bmr,
> >                                                       ForkNumber *forkNum,
> >                                                       BlockNumber *blockNum,
> >                                                       ReadBufferMode *mode);
>
> I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
> each read. I see that you used that in the WAL replay prefetching, so I
> guess that makes sense.
In this version I have introduced an alternative simple callback.
It's approximately what we had already tried out in an earlier version
before I started streamifying recovery, but now you can choose between
the two, so recovery can opt for the wider callback.
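The narrow variant looks roughly like this (modulo naming); the
relation, fork and mode are fixed at construction time instead, and
returning InvalidBlockNumber ends the stream:

/* Return the next block number to read, or InvalidBlockNumber. */
typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
                                                uintptr_t pgsr_private,
                                                void *per_io_private);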
I've added some ramp-up logic. The idea is that once we streamify
everything in sight, we don't want to penalise callers that only
really need one or two blocks but don't know that yet. Here is how
the system calls look when you run pg_prewarm():
pread64(32, ..., 8192, 0) = 8192 <--- start with just one block
pread64(32, ..., 16384, 8192) = 16384
pread64(32, ..., 32768, 24576) = 32768
pread64(32, ..., 65536, 57344) = 65536
pread64(32, ..., 131072, 122880) = 131072 <--- soon reading 16 blocks at a time
pread64(32, ..., 131072, 253952) = 131072
pread64(32, ..., 131072, 385024) = 131072
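In other words, the combined I/O size doubles each time until it hits
the cap; conceptually something like this (field names invented for
illustration):

    /* Each time a full-sized combined I/O completes, allow bigger
     * ones, doubling up to the maximum (16 blocks = 128kB above). */
    if (pgsr->combine_limit < pgsr->max_combine_blocks)
        pgsr->combine_limit = Min(pgsr->combine_limit * 2,
                                  pgsr->max_combine_blocks);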
I guess it could be done in quite a few different ways and I'm open to
better ideas. This way admits some prefetching stalls at first, but
ramps up quickly and is soon out of the way. I wonder if we should
make that a policy the caller can disable, for callers that want to
skip the ramp-up and go straight to the largest possible I/O size?
Then I think we'd need a 'flags' argument to the streaming read
constructor functions.
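For example (a made-up name, just to make the idea concrete):

/* Hypothetical constructor flag: skip the ramp-up phase and issue
 * maximum-sized I/Os from the start. */
#define PGSR_FLAG_FULL_SIZE     0x01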
A small detour: While contemplating how this interacts with parallel
sequential scan, which also has a notion of ramping up, I noticed
another problem. One parallel seq scan process does this:
fadvise64(32, 35127296, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 35127296) = 131072
preadv(32, [...], 2, 35258368) = 131072
fadvise64(32, 36175872, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 36175872) = 131072
preadv(32, [...], 2, 36306944) = 131072
...
We don't really want those fadvise() calls. We don't get them with
parallelism disabled, because streaming_read.c is careful not to
generate advice for sequential workloads, based on ancient wisdom from
this mailing list, re-confirmed on recent Linux: WILLNEED hints
actually get in the way of Linux's own prefetching and slow you down,
so we only want them for truly random access. But that logic can't
see that another process is making holes in this process's sequence.
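(Conceptually, the stream tracks where a purely sequential scan would
read next and only advises on jumps; something like the following,
with illustrative field names.)

    /* Issue WILLNEED advice only for jumps, i.e. access that looks
     * random from this stream's point of view. */
    if (pgsr->advice_enabled && blocknum != pgsr->seq_blocknum)
        posix_fadvise(fd, offset, nbytes, POSIX_FADV_WILLNEED);
    pgsr->seq_blocknum = blocknum + nblocks;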
The two obvious solutions are (1) pass in a flag at the start saying
"I promise this is sequential even if it doesn't look like it, no
hints please", and (2) invent "shared" (cross-process) streaming
reads, and teach all the parallel seq scan processes to get their
buffers from there.
Idea (2) is interesting to think about, but even if it is a useful
idea (not sure), it is certainly overkill just to solve this little
problem for now. So perhaps I should implement (1), which would be
another reason to add a flags argument. It's not a perfect solution,
though, because some more 'data driven' parallel scans (indexes,
bitmaps, ...) have a similar problem that is less amenable to
top-down kludgery.
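If we do grow a flags argument, (1) might be no more than another
hypothetical flag alongside the one sketched earlier:

/* Hypothetical: the caller promises the combined workload is
 * sequential, so never issue WILLNEED advice, even when the blocks
 * this particular process sees are not consecutive. */
#define PGSR_FLAG_SEQUENTIAL    0x02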
I've included just the pg_prewarm example user for now while we
discuss the basic infrastructure. The rest are rebased and available
on my public GitHub branch streaming-read (repo macdice/postgres) if
anyone is interested. (Don't mind the red CI failures; they're just
saying I ran out of monthly CI credits on the 29th, so close...)