Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Streaming I/O, vectored I/O (WIP)
Date
Msg-id CA+hUKGJabs5ug9H8KGeoMtOyx-erHuVBSdHwpm3fdwVV4qjzoQ@mail.gmail.com
In response to Re: Streaming I/O, vectored I/O (WIP)  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers
On Tue, Mar 12, 2024 at 7:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I am planning to review this patch set, so started going through 0001,
> I have a question related to how we are issuing smgrprefetch in
> StartReadBuffers()

Thanks!

> + /*
> + * In theory we should only do this if PrepareReadBuffers() had to
> + * allocate new buffers above.  That way, if two calls to
> + * StartReadBuffers() were made for the same blocks before
> + * WaitReadBuffers(), only the first would issue the advice.
> + * That'd be a better simulation of true asynchronous I/O, which
> + * would only start the I/O once, but isn't done here for
> + * simplicity.  Note also that the following call might actually
> + * issue two advice calls if we cross a segment boundary; in a
> + * true asynchronous version we might choose to process only one
> + * real I/O at a time in that case.
> + */
> + smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
>   }
>
>  This is always issuing smgrprefetch starting with the input blockNum,
> shouldn't we pass the first blockNum which we did not find in the
>  Buffer pool?  So basically in the loop above this call where we are
> doing PrepareReadBuffer() we should track the first blockNum for which
>  the found is not true and pass that blockNum into the smgrprefetch()
> as a first block right?

I think you'd be right if StartReadBuffers() were capable of
processing a sequence consisting of a hit followed by misses, but
currently it always gives up after the first hit.  That is, it always
processes some number of misses (0-16) and then at most one hit.  So
for now the variable would always turn out to be the same as blockNum.
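
To make that invariant concrete, here is a toy model of a single call. It is
only a sketch, not the patch's code: model_start_read(), the cached[] array
and MODEL_MAX_BLOCKS are hypothetical names I'm using for illustration, with
MODEL_MAX_BLOCKS standing in for MAX_BUFFERS_PER_TRANSFER. Because the scan
gives up after the first hit, the misses always form a prefix that begins at
the block passed in.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cap, standing in for MAX_BUFFERS_PER_TRANSFER. */
#define MODEL_MAX_BLOCKS 16

/*
 * Toy model of one StartReadBuffers() call (not the patch's code).
 *
 * cached[i] says whether block i is already in the buffer pool.  We scan
 * forward from "block" and stop after the first hit, or once "want" blocks
 * (capped at MODEL_MAX_BLOCKS) have been examined.
 *
 * Returns the number of blocks processed.  *io_len receives the number of
 * leading misses, which is the length of the read that starts at "block".
 */
static int
model_start_read(const bool *cached, int block, int want, int *io_len)
{
	int			processed = 0;

	assert(want > 0);
	if (want > MODEL_MAX_BLOCKS)
		want = MODEL_MAX_BLOCKS;

	*io_len = 0;
	while (processed < want)
	{
		if (cached[block + processed])
		{
			processed++;		/* accept one hit, then give up */
			break;
		}
		(*io_len)++;			/* another miss joins the read */
		processed++;
	}

	return processed;
}
```

With blocks 0-2 missing and block 3 cached, a call starting at block 0
processes 4 blocks with a read length of 3. Whenever the read length is
nonzero, the read starts at the block that was passed in, so a separate
"first miss" variable would carry no extra information in this model.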

The reason is that I wanted to allow "full sized" read system calls
to form.  If you said "hey please read these 16 blocks" (I'm calling
that "full sized", AKA MAX_BUFFERS_PER_TRANSFER), and it found 2 hits,
then it could only form a read of 14 blocks, but there might be more
blocks that could be read after those.  We would have some arbitrary
shorter read system calls, when we wanted to make them all as big as
possible.  So in the current patch you say "hey please read these 16
blocks" and it returns saying "only read 1", you call again with 15
and it says "only read 1", and you call again and it says "read 16!"
(assuming 2 more were readable after the original range we started
with).  Then physical reads are maximised.  Maybe there is some nice
way to solve that, but I thought this way was the simplest (and if
there is some instruction-cache-locality/tight-loop/perf reason why we
should work harder to find ranges of hits, it could be for later).
Does that make sense?
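
The arithmetic in that example can be checked with a small simulation. This
is a sketch under assumptions, not code from the patch: both function names
and MODEL_MAX_BLOCKS (standing in for MAX_BUFFERS_PER_TRANSFER) are
hypothetical, and the "fixed windows" variant is just my reading of the
alternative described above, where hits inside a 16-block request cut the
read short.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cap, standing in for MAX_BUFFERS_PER_TRANSFER. */
#define MODEL_MAX_BLOCKS 16

/*
 * Alternative being avoided: the caller asks for fixed 16-block windows and
 * each run of misses inside a window becomes one read.  Returns the largest
 * read formed.
 */
static int
largest_read_fixed_windows(const bool *cached, int nblocks)
{
	int			largest = 0;

	assert(nblocks >= 0);
	for (int start = 0; start < nblocks; start += MODEL_MAX_BLOCKS)
	{
		int			end = start + MODEL_MAX_BLOCKS;
		int			run = 0;

		if (end > nblocks)
			end = nblocks;
		for (int b = start; b < end; b++)
		{
			run = cached[b] ? 0 : run + 1;
			if (run > largest)
				largest = run;
		}
	}
	return largest;
}

/*
 * Stop-at-first-hit model: each call gathers misses until it meets a hit or
 * the cap, consumes at most one hit, and the next call starts right after
 * the blocks just processed.  Returns the largest read formed.
 */
static int
largest_read_stop_at_hit(const bool *cached, int nblocks)
{
	int			largest = 0;
	int			block = 0;

	assert(nblocks >= 0);
	while (block < nblocks)
	{
		int			run = 0;

		while (block < nblocks && run < MODEL_MAX_BLOCKS && !cached[block])
		{
			run++;
			block++;
		}
		if (run > largest)
			largest = run;
		if (block < nblocks && run < MODEL_MAX_BLOCKS && cached[block])
			block++;			/* the single hit that ends this call */
	}
	return largest;
}
```

For 18 blocks with only blocks 0 and 1 cached, the fixed-window model tops
out at a 14-block read (blocks 2-15, followed by a 2-block leftover), while
the stop-at-first-hit model forms one 16-block read covering blocks 2-17.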
