Thread: Using read stream in autoprewarm

Using read stream in autoprewarm

From

Nazir Bilal Yavuz

Date:

08 August 2024, 07:32:16

Hi,

I am working on using the read stream in autoprewarm. I observed ~10%
performance gain with this change. The patch is attached.

The downside of the read stream approach is that a new read stream
object needs to be created for each database, relation and fork. I was
wondering if this would cause a regression but it did not (at least
depending on results of my testing). Another downside could be the
code getting complicated.

For the testing,
- I created 50 databases with each of them having 50 tables and the
size of the tables are 520KB.
    - patched: 51157 ms
    - master: 56769 ms
- I created 5 databases with each of them having 1 table and the size
of the tables are 3GB.
    - patched: 32679 ms
    - master: 36706 ms

I put debugging message with timing information in
autoprewarm_database_main() function, then run autoprewarm 100 times
(by restarting the server) and cleared the OS cache before each
restart. Also, I ensured that the block number of the buffer returning
from the read stream API is correct. I am not sure if that much
testing is enough for this kind of change.

Any feedback would be appreciated.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

v1-0001-Use-read-stream-in-autoprewarm.patch

Re: Using read stream in autoprewarm

From

"Andrey M. Borodin"

Date:

31 October 2024, 21:18:21

> On 8 Aug 2024, at 11:32, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Any feedback would be appreciated.

I've took a look into the patch. It seems to me that you add new block numbers to the read stream until you have
buffers.So when there are no more buffers you will still have some queued blocks. 
Maybe can you change the logic so that number of free buffers must be enough to allocate all blocks in look-ahead
distance?

Thanks!

Best regards, Andrey Borodin.

Re: Using read stream in autoprewarm

From

Stepan Neretin

Date:

31 October 2024, 22:01:52

Dear Nazir,

At first A quick look it looks good. I will take a closer look at it tomorrow. Could you please let me know about the performance tests and graphics?

Best regards, Stepan Neretin!

Re: Using read stream in autoprewarm

From

Nazir Bilal Yavuz

Date:

01 November 2024, 11:51:01

Hi,

Thanks for looking into this!

On Thu, 31 Oct 2024 at 21:18, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 8 Aug 2024, at 11:32, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > Any feedback would be appreciated.
>
> I've took a look into the patch. It seems to me that you add new block numbers to the read stream until you have
buffers.So when there are no more buffers you will still have some queued blocks.

> Maybe can you change the logic so that number of free buffers must be enough to allocate all blocks in look-ahead
distance?

I see what you mean. When the have_free_buffer() function returns
false in the callback, there are still queued blocks in the stream
although there are no free buffers in the buffer pool. I think the
best way to solve this is to get the number of free buffers in the
buffer pool by 'BufferStrategyControl.lastFreeBuffer -
BufferStrategyControl.firstFreeBuffer' and then compare it with
'stream->pending_read_nblocks'. When the 'stream->pending_read_nblocks
== number_of_free_buffers_in_buffer_pool', end the stream. The problem
with that is stream->pending_read_nblocks isn't public, also I am not
sure whether 'BufferStrategyControl.lastFreeBuffer -
BufferStrategyControl.firstFreeBuffer' is safe to use.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Using read stream in autoprewarm

From

Nazir Bilal Yavuz

Date:

01 November 2024, 11:51:39

Hi,

Thanks for looking into this!

On Thu, 31 Oct 2024 at 22:02, Stepan Neretin <sndcppg@gmail.com> wrote:
>
> At first A quick look it looks good. I will take a closer look at it tomorrow. Could you please let me know about the
performancetests and graphics?

Sorry but I didn't understand what you mean by performance tests and
graphics. Do you want something else than the information in the first
email [1]?

[1] postgr.es/m/CAN55FZ3n8Gd%2BhajbL%3D5UkGzu_aHGRqnn%2BxktXq2fuds%3D1AOR6Q%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Using read stream in autoprewarm

From

"Andrey M. Borodin"

Date:

01 November 2024, 21:06:27

> On 1 Nov 2024, at 12:51, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
>  am not
> sure whether 'BufferStrategyControl.lastFreeBuffer -
> BufferStrategyControl.firstFreeBuffer' is safe to use.

Ugh... it will work. But it seems to me too dirty hack. There's no scalable way to know size of a free list.
Let's just comment that we might read some more buffers if database does not fit into memory?
Alternatively we can count size of a free list on the start.

Best regards, Andrey Borodin.

Re: Using read stream in autoprewarm

From

Matheus Alcantara

Date:

27 November 2024, 16:50:29

Hi,

Newer reviewer here, trying to understand more about the read stream API.

On Tuesday, November 26th, 2024 at 11:07 AM, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

> Any feedback would be appreciated.

I've executed the same test of 5 databases with each of them having 1 table of
3GB of size and I've got very similar results.

I've also tested using a single database with 4 tables with ~60GB of size and
the results compared with master was more closer but still an improvement. Note
that I've also increased the default shared_buffers to 7GB to see how it works
with large buffer pools.
  - patched: 5.4259 s
  - master: 5.53186 s

Not to much to say about the code, I'm currently learning more about the read
stream API and Postgresql hacking itself. Just some minor points and questions
about the patches.


v2-0002-Count-free-buffers-at-the-start-of-the-autoprewar.patch
--- a/src/backend/storage/buffer/freelist.c
+/*
+ * get_number_of_free_buffers -- a lockless way to get the number of free
+ *                                 buffers in buffer pool.
+ *
+ * Note that result continuosly changes as free buffers are moved out by other
+ * operations.
+ */
+int
+get_number_of_free_buffers(void)

typo on continuosly -> continuously


v2-0001-Use-read-stream-in-autoprewarm.patch
+    bool       *rs_have_free_buffer = per_buffer_data;
+
+
+    *rs_have_free_buffer = true;
+

Not sure if I understand why this variable is needed, it seems that it is only
written and never read? Just as comparison, the block_range_read_stream_cb
callback used on pg_prewarm seems to not use the per_buffer_data parameter.


--
Matheus Alcantara
EDB: https://www.enterprisedb.com

Re: Using read stream in autoprewarm

From

Matheus Alcantara

Date:

28 November 2024, 19:45:40

On Wednesday, November 27th, 2024 at 11:19 AM, Nazir Bilal Yavuz 
<byavuz81@gmail.com> wrote:
 > > v2-0001-Use-read-stream-in-autoprewarm.patch
 > > + bool *rs_have_free_buffer = per_buffer_data;
 > > +
 > > +
 > > + *rs_have_free_buffer = true;
 > > +
 > >
 > > Not sure if I understand why this variable is needed, it seems that 
it is only
 > > written and never read? Just as comparison, the 
block_range_read_stream_cb
 > > callback used on pg_prewarm seems to not use the per_buffer_data 
parameter.
 >
 >
 > Actually, it is read in the main loop of the
 > autoprewarm_database_main() function:
 >
 > /* There are no free buffers left in shared buffers, break the loop. */
 > else if (!(*rs_have_free_buffer))
 > break;
 >
 > apw_read_stream_next_block() callback function sets
 > rs_have_free_buffer's value to false when there is no free buffer left
 > in the shared buffers. And the code above terminates the main loop in
 > the autoprewarm_database_main() function when it is set to false.
 >
 > block_range_read_stream_cb() callback is used when the callback only
 > needs to loop over the block numbers. However, for the autoprewarm
 > case; the callback function needs to do additional checks so another
 > callback and the use of this variable are required.

Ohh, I see, thanks very much for the explanation.

 > v3 is attached.

Thanks.

I don't know if there is another way that this patch could be tested? 
Looking
forward on other reviews on this.

--
Matheus Alcantara
EDB: https://www.enterprisedb.com

Re: Using read stream in autoprewarm

From

Kirill Reshke

Date:

29 November 2024, 06:55:03

On Wed, 27 Nov 2024 at 19:20, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> Thank you for looking into this!
>
> On Wed, 27 Nov 2024 at 16:50, Matheus Alcantara <mths.dev@pm.me> wrote:
> > I've executed the same test of 5 databases with each of them having 1 table of
> > 3GB of size and I've got very similar results.
> >
> > I've also tested using a single database with 4 tables with ~60GB of size and
> > the results compared with master was more closer but still an improvement. Note
> > that I've also increased the default shared_buffers to 7GB to see how it works
> > with large buffer pools.
> >   - patched: 5.4259 s
> >   - master: 5.53186 s
>
> Thanks for the testing.
>
> > Not to much to say about the code, I'm currently learning more about the read
> > stream API and Postgresql hacking itself. Just some minor points and questions
> > about the patches.
> >
> >
> > v2-0002-Count-free-buffers-at-the-start-of-the-autoprewar.patch
> > --- a/src/backend/storage/buffer/freelist.c
> > +/*
> > + * get_number_of_free_buffers -- a lockless way to get the number of free
> > + *                                                              buffers in buffer pool.
> > + *
> > + * Note that result continuosly changes as free buffers are moved out by other
> > + * operations.
> > + */
> > +int
> > +get_number_of_free_buffers(void)
> >
> > typo on continuosly -> continuously
>
> Done.
>
> > v2-0001-Use-read-stream-in-autoprewarm.patch
> > +       bool       *rs_have_free_buffer = per_buffer_data;
> > +
> > +
> > +       *rs_have_free_buffer = true;
> > +
> >
> > Not sure if I understand why this variable is needed, it seems that it is only
> > written and never read? Just as comparison, the block_range_read_stream_cb
> > callback used on pg_prewarm seems to not use the per_buffer_data parameter.
>
> Actually, it is read in the main loop of the
> autoprewarm_database_main() function:
>
>         /* There are no free buffers left in shared buffers, break the loop. */
>         else if (!(*rs_have_free_buffer))
>             break;
>
> apw_read_stream_next_block() callback function sets
> rs_have_free_buffer's value to false when there is no free buffer left
> in the shared buffers. And the code above terminates the main loop in
> the autoprewarm_database_main() function when it is set to false.
>
> block_range_read_stream_cb() callback is used when the callback only
> needs to loop over the block numbers. However, for the autoprewarm
> case; the callback function needs to do additional checks so another
> callback and the use of this variable are required.
>
> v3 is attached.
>
> --
> Regards,
> Nazir Bilal Yavuz
> Microsoft

Hi!

> + old_blk = &(p->block_info[p->pos - 1]);
> + cur_blk = &(p->block_info[p->pos]);
Should we Assert(p->pos > 0 && p->pos < *something*)

Patch tested with no regression.


-- 
Best regards,
Kirill Reshke

Re: Using read stream in autoprewarm

From

Kirill Reshke

Date:

02 December 2024, 16:16:56

On Fri, 29 Nov 2024 at 16:19, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

> v4 is attached.
Hi!

I feel like we are ready to mark this as RFC, WDYT?

-- 
Best regards,
Kirill Reshke

Re: Using read stream in autoprewarm

From

"Andrey M. Borodin"

Date:

02 December 2024, 16:28:24


> On 2 Dec 2024, at 16:16, Kirill Reshke <reshkekirill@gmail.com> wrote:
> 
> I feel like we are ready to mark this as RFC, WDYT?

+1

Best regards, Andrey Borodin.

Re: Using read stream in autoprewarm

From

Nazir Bilal Yavuz

Date:

02 December 2024, 17:55:32

Hi,

On Mon, 2 Dec 2024 at 16:30, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 2 Dec 2024, at 16:16, Kirill Reshke <reshkekirill@gmail.com> wrote:
> >
> > I feel like we are ready to mark this as RFC, WDYT?
>
> +1

Done.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

29 March, 23:09:56

On Fri, Nov 29, 2024 at 6:20 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> v4 is attached.

I've started looking at this version. What I don't like about this
structure is that we are forced to increment the position in the array
of BlockInfoRecords in both the callback and the main loop in
autoprewarm_database_main(). There isn't a way around it because we
have to return control to the user when we encounter a new relation
(we can't close the relation and destroy the read stream in the
callback). And in the main loop in autoprewarm_database_main(), we may
fail to open the next relation and then need to keep advancing the
position in the array of BlockInfoRecords.

It isn't just that we have to advance the position in both places --
we also have to have a special case for the first block. All in all,
given that in the current read stream API, a single stream must only
concern itself with a single relation, fork combo, I think there is no
elegant way to deal with this in autoprewarm.

One alternative is to loop through the array of BlockInfoRecords and
get the start and end positions of the blocks in the arary for a
single relation/fork combo. Then we could make the read stream and
pass those two positions and the array as callback_private_data. That
would mean we loop through the whole array twice, but I wonder if the
improvement in clarity is worth it?

Some review feedback on your v4: I don't think we need the
rs_have_free_buffer per_buffer_data. We can just check
have_free_buffers() in both the callback and main loop in
autoprewarm_database_main(). I also think you want a comment about why
first_block is needed. And I think you need to guard the
read_stream_end() after the loop -- what if we never made a read
stream because we errored out for all the block's relations or
something?

- Melanie

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

29 March, 23:33:09

On Sat, Mar 29, 2025 at 4:09 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> One alternative is to loop through the array of BlockInfoRecords and
> get the start and end positions of the blocks in the arary for a
> single relation/fork combo. Then we could make the read stream and
> pass those two positions and the array as callback_private_data. That
> would mean we loop through the whole array twice, but I wonder if the
> improvement in clarity is worth it?

An alternative to this alternative is to somehow include the length of
each "span" (BlockInfoRecords from a single relation/fork) in the
first BlockInfoRecord of that span when building the array. Dunno how
hard that would be, but then you wouldn't have to loop through it
twice.

- Melanie

Re: Using read stream in autoprewarm

From

Andres Freund

Date:

29 March, 23:44:16

Hi,

On 2025-03-29 16:09:56 -0400, Melanie Plageman wrote:
> I've started looking at this version. What I don't like about this
> structure is that we are forced to increment the position in the array
> of BlockInfoRecords in both the callback and the main loop in
> autoprewarm_database_main(). There isn't a way around it because we
> have to return control to the user when we encounter a new relation
> (we can't close the relation and destroy the read stream in the
> callback). And in the main loop in autoprewarm_database_main(), we may
> fail to open the next relation and then need to keep advancing the
> position in the array of BlockInfoRecords.
> 
> It isn't just that we have to advance the position in both places --
> we also have to have a special case for the first block. All in all,
> given that in the current read stream API, a single stream must only
> concern itself with a single relation, fork combo, I think there is no
> elegant way to deal with this in autoprewarm.

How about having an iterator function operating on a pointer to iterator state
that's used both by the main loop and the read stream callback? If the
iterator reaches the next relation, it returns InvalidBlockNumber and the main
loop starts the next stream?

Greetings,

Andres Freund

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

30 March, 03:42:52

On Sat, Mar 29, 2025 at 4:44 PM Andres Freund <andres@anarazel.de> wrote:
>
> How about having an iterator function operating on a pointer to iterator state
> that's used both by the main loop and the read stream callback? If the
> iterator reaches the next relation, it returns InvalidBlockNumber and the main
> loop starts the next stream?

I don't think that removes the need for the first_block special case.
And we still need to duplicate the logic for detecting the next
database, block, or filenumber in both places. It maybe reduces the
potential for error a little bit. But I don't think it improves the
clarity.

- Melanie

Re: Using read stream in autoprewarm

From

Nazir Bilal Yavuz

Date:

30 March, 17:00:45

Hi,

Thank you for looking into this!

On Sat, 29 Mar 2025 at 23:10, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Fri, Nov 29, 2024 at 6:20 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > v4 is attached.
>
> I've started looking at this version. What I don't like about this
> structure is that we are forced to increment the position in the array
> of BlockInfoRecords in both the callback and the main loop in
> autoprewarm_database_main(). There isn't a way around it because we
> have to return control to the user when we encounter a new relation
> (we can't close the relation and destroy the read stream in the
> callback). And in the main loop in autoprewarm_database_main(), we may
> fail to open the next relation and then need to keep advancing the
> position in the array of BlockInfoRecords.
>
> It isn't just that we have to advance the position in both places --
> we also have to have a special case for the first block. All in all,
> given that in the current read stream API, a single stream must only
> concern itself with a single relation, fork combo, I think there is no
> elegant way to deal with this in autoprewarm.
>
> One alternative is to loop through the array of BlockInfoRecords and
> get the start and end positions of the blocks in the arary for a
> single relation/fork combo. Then we could make the read stream and
> pass those two positions and the array as callback_private_data. That
> would mean we loop through the whole array twice, but I wonder if the
> improvement in clarity is worth it?

I think this is a good alternative. I will work on this and try to
propose a patch.

> Some review feedback on your v4: I don't think we need the
> rs_have_free_buffer per_buffer_data. We can just check
> have_free_buffers() in both the callback and main loop in
> autoprewarm_database_main(). I also think you want a comment about why
> first_block is needed. And I think you need to guard the
> read_stream_end() after the loop -- what if we never made a read
> stream because we errored out for all the block's relations or
> something?

All of these are addressed. One extra thing I noticed is we were not
checking if blocknum < number_of_block_in_relation at the first_block
case in the stream callback, this is fixed now.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

31 March, 17:42:02

On Sun, Mar 30, 2025 at 10:01 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> > Some review feedback on your v4: I don't think we need the
> > rs_have_free_buffer per_buffer_data. We can just check
> > have_free_buffers() in both the callback and main loop in
> > autoprewarm_database_main(). I also think you want a comment about why
> > first_block is needed. And I think you need to guard the
> > read_stream_end() after the loop -- what if we never made a read
> > stream because we errored out for all the block's relations or
> > something?
>
> All of these are addressed. One extra thing I noticed is we were not
> checking if blocknum < number_of_block_in_relation at the first_block
> case in the stream callback, this is fixed now.

I'm wondering why you need to check if have_free_buffer() in the else
branch after getting the buffer from the read stream API. Can't you
just put it back in the for loop condition? Seems like it would have
the same effect.

-   for (; pos < apw_state->prewarm_stop_idx; pos++)
+   for (; pos < apw_state->prewarm_stop_idx && have_free_buffer(); pos++)
    {
        BlockInfoRecord *blk = &block_info[pos];
        Buffer      buf;
@@ -640,9 +640,6 @@ autoprewarm_database_main(Datum main_arg)
            apw_state->prewarmed_blocks++;
            ReleaseBuffer(buf);
        }
-       /* There are no free buffers left in shared buffers, break the loop. */
-       else if (!have_free_buffer())
-           break;

Looking at the code some more, I feel stuck on my original point about
incrementing the position in two places.
AFAICT, you are going to end up going through the array twice with this design.
Because you don't set the pos variable in autoprewarm_database_main()
from the p->pos variable which the read stream callback is
incrementing, if the read stream callback increments p->pos it a few
positions yielding those blocks to the read stream machinery to read,
you are then going to iterate over those positions in the array again
in the autoprewarm_database_main() loop.

I think you can get around this by setting pos from p->pos in
autoprewarm_database_main() after read_stream_next_buffer(). Or by
using p->pos in the loop in autoprewarm_database_main() (which is
basically what Andres suggested). I'm not sure, though, if this has
any problems. Like not closing a relation in the right place.

- Melanie

Re: Using read stream in autoprewarm

From

Nazir Bilal Yavuz

Date:

31 March, 19:27:32

Hi,

Thank you for looking into this!

On Mon, 31 Mar 2025 at 17:42, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Sun, Mar 30, 2025 at 10:01 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > > Some review feedback on your v4: I don't think we need the
> > > rs_have_free_buffer per_buffer_data. We can just check
> > > have_free_buffers() in both the callback and main loop in
> > > autoprewarm_database_main(). I also think you want a comment about why
> > > first_block is needed. And I think you need to guard the
> > > read_stream_end() after the loop -- what if we never made a read
> > > stream because we errored out for all the block's relations or
> > > something?
> >
> > All of these are addressed. One extra thing I noticed is we were not
> > checking if blocknum < number_of_block_in_relation at the first_block
> > case in the stream callback, this is fixed now.
>
> I'm wondering why you need to check if have_free_buffer() in the else
> branch after getting the buffer from the read stream API. Can't you
> just put it back in the for loop condition? Seems like it would have
> the same effect.
>
> -   for (; pos < apw_state->prewarm_stop_idx; pos++)
> +   for (; pos < apw_state->prewarm_stop_idx && have_free_buffer(); pos++)
>     {
>         BlockInfoRecord *blk = &block_info[pos];
>         Buffer      buf;
> @@ -640,9 +640,6 @@ autoprewarm_database_main(Datum main_arg)
>             apw_state->prewarmed_blocks++;
>             ReleaseBuffer(buf);
>         }
> -       /* There are no free buffers left in shared buffers, break the loop. */
> -       else if (!have_free_buffer())
> -           break;
>

You are right, done. Attached as v6.

> Looking at the code some more, I feel stuck on my original point about
> incrementing the position in two places.
> AFAICT, you are going to end up going through the array twice with this design.
> Because you don't set the pos variable in autoprewarm_database_main()
> from the p->pos variable which the read stream callback is
> incrementing, if the read stream callback increments p->pos it a few
> positions yielding those blocks to the read stream machinery to read,
> you are then going to iterate over those positions in the array again
> in the autoprewarm_database_main() loop.
>
> I think you can get around this by setting pos from p->pos in
> autoprewarm_database_main() after read_stream_next_buffer(). Or by
> using p->pos in the loop in autoprewarm_database_main() (which is
> basically what Andres suggested). I'm not sure, though, if this has
> any problems. Like not closing a relation in the right place.

I worked on an alternative approach, I refactored code a bit. It does
not traverse the list two times and I think the code is more suitable
to use read streams now. I simply get how many blocks are processed by
read streams and move the list forward by this number, so the actual
loop skips these blocks. This approach is attached with 'alternative'
prefix.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

31 March, 21:15:23

On Mon, Mar 31, 2025 at 12:27 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> I worked on an alternative approach, I refactored code a bit. It does
> not traverse the list two times and I think the code is more suitable
> to use read streams now. I simply get how many blocks are processed by
> read streams and move the list forward by this number, so the actual
> loop skips these blocks. This approach is attached with 'alternative'
> prefix.

I am leaning toward the refactored approach because I don't think we
want to go through the array twice and I think it is hard to get it
right with incrementing p.pos in both places and being sure we
correctly close the relation etc.

Looking at your alternative approach, I don't see how the innermost
while loop in autoprewarm_database_main() is correct

        /* Check whether blocknum is valid and within fork file size. */
        while (cur_filenumber == blk->filenumber &&
               blk->blocknum >= nblocks_in_fork)
        {
            /* Move to next forknum. */
            pos++;
            continue;
        }

Won't this just infinitely loop?

- Melanie

Re: Using read stream in autoprewarm

From

Nazir Bilal Yavuz

Date:

31 March, 21:58:30

Hi,

On Mon, 31 Mar 2025 at 21:15, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Mar 31, 2025 at 12:27 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > I worked on an alternative approach, I refactored code a bit. It does
> > not traverse the list two times and I think the code is more suitable
> > to use read streams now. I simply get how many blocks are processed by
> > read streams and move the list forward by this number, so the actual
> > loop skips these blocks. This approach is attached with 'alternative'
> > prefix.
>
> I am leaning toward the refactored approach because I don't think we
> want to go through the array twice and I think it is hard to get it
> right with incrementing p.pos in both places and being sure we
> correctly close the relation etc.

I liked it more as well.

> Looking at your alternative approach, I don't see how the innermost
> while loop in autoprewarm_database_main() is correct
>
>         /* Check whether blocknum is valid and within fork file size. */
>         while (cur_filenumber == blk->filenumber &&
>                blk->blocknum >= nblocks_in_fork)
>         {
>             /* Move to next forknum. */
>             pos++;
>             continue;
>         }
>
> Won't this just infinitely loop?

Oops, you are right. It should be an if statement instead of a while
loop, fixed now. Also, moved 'dsm_detach(seg);' to its previous place
to reduce diff. Attached as v7. Do you think that I should continue to
attach both approaches?

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

31 March, 22:27:54

On Mon, Mar 31, 2025 at 2:58 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Do you think that I should continue to
> attach both approaches?

No, for now let's try and get this approach to a good place and then
see which one we like.

I think there might be another problem with the code. We only set
cur_database in the loop in autoprewarm_databas_main() when it is 0

        if (cur_database != blk->database)
        {
            if (cur_database == 0)
                cur_database = blk->database;

I know that the read stream will return InvalidBlockNumber when we
move onto the next database, but I don't see how we will end up
actually stopping prewarming in autoprewarm_database_main() when we
move on to the next database.

Another thing:
I don't know if it is a correctness issue but in
autoprewarm_database_main(), in this case
        if (!rel && cur_filenumber != blk->filenumber)
        {
you have removed the Assert(rel == NULL) -- I worry that we will end
up with a rel from a previous iteration after failing to open th enext
rel. I think we want this assert.

And a last thing
I noticed is that the initial values for cur_database, cur_filenumber,
and nblocks_in_fork in autoprewarm_database_main() are all initialized
to different kinds of initial values for different reasons. I'm
thinking if there is a way to make it consistent.

    cur_database = block_info[pos].database;
    cur_filenumber = InvalidOid;
    nblocks_in_fork = InvalidBlockNumber;

cur_database is set to be the same as the first database in the array
so that we won't hit
        if (cur_database != blk->database)
on the first block.

However, we make cur_filenumber InvalidOid because for the first block
we want to hit code that forces us to open a new relation
        if (!rel && cur_filenumber != blk->filenumber)

And nblocks_in_fork to InvalidBlockNumber so that 1) we don't have to
get the number before starting the loop and 2) so we would move past
BlockInfoRecords with invalid filenumber and invalid blocknumber

        if (cur_filenumber == blk->filenumber &&
            blk->blocknum >= nblocks_in_fork)

So, I'm just thinking if it would be correct to initialize
cur_database to InvalidOid and to check for that before skipping a
block, or if that doesn't work when the first blocks' database is
InvalidOid.

- Melanie

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

31 March, 22:45:09

On Mon, Mar 31, 2025 at 3:27 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> I think there might be another problem with the code. We only set
> cur_database in the loop in autoprewarm_databas_main() when it is 0
>
>         if (cur_database != blk->database)
>         {
>             if (cur_database == 0)
>                 cur_database = blk->database;
>
> I know that the read stream will return InvalidBlockNumber when we
> move onto the next database, but I don't see how we will end up
> actually stopping prewarming in autoprewarm_database_main() when we
> move on to the next database.

Whoops, this isn't right. It does work. I'm going to draft a version
suggesting slightly different variable naming and a couple comments to
make this more clear.

- Melanie

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

01 April, 05:14:07

On Mon, Mar 31, 2025 at 3:45 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Whoops, this isn't right. It does work. I'm going to draft a version
> suggesting slightly different variable naming and a couple comments to
> make this more clear.

Okay, so after further study, I think there are multiple issues still
with the code. We could end up comparing a blocknumber to nblocks
calculated from a different fork. To address this, you'll need to keep
track of the last fork_number. At that point, you kind of have to
bring back old_blk -- because that is what we are recreating with
multiple local variables.

But, I think, overall, what we actually want to do is actually be
really explicit about fast-forwarding in the failure cases (when we
want to skip ahead because a relation is invalid or a fork is
invalid). We were trying to use the main loop control and just add
special cases to allow us to do this fast-forwarding. But, I think
instead, we want to just go to a function or loop somewhere and fast
forward through those bad blocks until we get to the next run of
blocks from a different relation or fork.

I've sketched out an idea like this in the attached. I don't think it
is 100% correct. It does pass tests, but I think we might incorrectly
advance pos twice after skipping a run of blocks belonging to a bad
fork or relation -- and thus skip the first good block after some bad
blocks.

It also needs some more refactoring.

maybe instead of having the skip code like this
        skip_forknumber:;
            while ((blk = next_record(block_info, &i)) != NULL &&
blk->database == database && blk->filenumber == filenumber &&
blk->forknum == forknum);

we make it a function? to avoid the back-to-back while loop conditions
(because of the outer do while loop).

But the explicit looping for skipping the bad blocks and the nested
loops for rel and fork -- I think these are less error prone.

What do you think?

- Melanie

Attachment

pgsr-autoprewarm-nested-loops.patch

Re: Using read stream in autoprewarm

From

Nazir Bilal Yavuz

Date:

01 April, 14:20:55

Hi,

On Tue, 1 Apr 2025 at 05:14, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Mon, Mar 31, 2025 at 3:45 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > Whoops, this isn't right. It does work. I'm going to draft a version
> > suggesting slightly different variable naming and a couple comments to
> > make this more clear.
>
> Okay, so after further study, I think there are multiple issues still
> with the code. We could end up comparing a blocknumber to nblocks
> calculated from a different fork. To address this, you'll need to keep
> track of the last fork_number. At that point, you kind of have to
> bring back old_blk -- because that is what we are recreating with
> multiple local variables.

Yes, I realized the same.

> But, I think, overall, what we actually want to do is actually be
> really explicit about fast-forwarding in the failure cases (when we
> want to skip ahead because a relation is invalid or a fork is
> invalid). We were trying to use the main loop control and just add
> special cases to allow us to do this fast-forwarding. But, I think
> instead, we want to just go to a function or loop somewhere and fast
> forward through those bad blocks until we get to the next run of
> blocks from a different relation or fork.
>
> I've sketched out an idea like this in the attached. I don't think it
> is 100% correct. It does pass tests, but I think we might incorrectly
> advance pos twice after skipping a run of blocks belonging to a bad
> fork or relation -- and thus skip the first good block after some bad
> blocks.

The test actually does not pass, it prewarms 4 of 211 blocks. It
prewarms all 211 blocks in the master.
From 'pg_prewarm/001_basic/log/001_basic_main.log':
'autoprewarm successfully prewarmed 4 of 211 previously-loaded blocks'

> It also needs some more refactoring.
>
> maybe instead of having the skip code like this
>         skip_forknumber:;
>             while ((blk = next_record(block_info, &i)) != NULL &&
> blk->database == database && blk->filenumber == filenumber &&
> blk->forknum == forknum);
>
> we make it a function? to avoid the back-to-back while loop conditions
> (because of the outer do while loop).

+1 for using the functions. I think it is hard to follow / maintain
this with the do-while loops and goto statements.

> But the explicit looping for skipping the bad blocks and the nested
> loops for rel and fork -- I think these are less error prone.

One question in my mind is, the outermost loop stops when the database
changes, we do not check if it is changed from the database oid = 0.
Handling this might require some structural changes.

> What do you think?

I think what you said is right but the current version of the patch
looks more complicated to me. I may be biased because I do not like
do-while loops and goto statements.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Using read stream in autoprewarm

From

Nazir Bilal Yavuz

Date:

01 April, 15:50:32

Hi,

On Tue, 1 Apr 2025 at 05:14, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Mon, Mar 31, 2025 at 3:45 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > Whoops, this isn't right. It does work. I'm going to draft a version
> > suggesting slightly different variable naming and a couple comments to
> > make this more clear.
>
> Okay, so after further study, I think there are multiple issues still
> with the code. We could end up comparing a blocknumber to nblocks
> calculated from a different fork. To address this, you'll need to keep
> track of the last fork_number. At that point, you kind of have to
> bring back old_blk -- because that is what we are recreating with
> multiple local variables.

I am attaching v8, which is an updated version of the v7. I tried to
get rid of these local variables and refactored code to make logic
more straightforward instead of going back and forth.

0001 and 0002 are v8. 0003 is another refactoring attempt to make code
more straightforward. I did not squash 0003 to previous patches as you
might not like it.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachment

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

01 April, 16:54:47

On Tue, Apr 1, 2025 at 7:21 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> On Tue, 1 Apr 2025 at 05:14, Melanie Plageman <melanieplageman@gmail.com> wrote:
> >
> +1 for using the functions. I think it is hard to follow / maintain
> this with the do-while loops and goto statements.

I'll take a look at your downthread proposal in a bit.

But the attached patch is a new version of what I proposed with the
functions. It's still not totally correct, but I wanted to see what
you thought.

> > But the explicit looping for skipping the bad blocks and the nested
> > loops for rel and fork -- I think these are less error prone.
>
> One question in my mind is, the outermost loop stops when the database
> changes, we do not check if it is changed from the database oid = 0.
> Handling this might require some structural changes.

I don't understand why each database has global objects at the
beginning. If there are global objects, they are global to all
databases, so surely the sort function would have put them all at the
beginning? One problem is we need a database connection to prewarm
these, but if the global objects are all at the beginning, then maybe
we can handle those with a special case and not force ourselves to
check for them when trying to load blocks from every database.

- Melanie

Attachment

pgsr-autoprewarm-loopsv2.patch

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

02 April, 01:35:48

On Tue, Apr 1, 2025 at 8:50 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> I am attaching v8, which is an updated version of the v7. I tried to
> get rid of these local variables and refactored code to make logic
> more straightforward instead of going back and forth.
>
> 0001 and 0002 are v8. 0003 is another refactoring attempt to make code
> more straightforward. I did not squash 0003 to previous patches as you
> might not like it.

I looked at the code on your github branch that has all three of these
squashed together.
I think our approaches are converging. I like that you are
fast-forwarding to the next filenumber or fork number explicitly when
there is a bad relation or fork. I've changed my version (see newest
one attached) to do the fast-forwarding inline instead of in a
separate function like yours (the function didn't save many LOC and
actually may have added to cognitive overhead).

Compared to my version, I think you avoided one level of loop nesting with your

        if (!rel)
        else if (smgrexists(RelationGetSmgr(rel), blk->forknum))
        else

but for starters, I don't think you can do this:

        else if (smgrexists(RelationGetSmgr(rel), blk->forknum))

because you didn't check if you have a legal forknum first

And, I actually kind of prefer the explicitly nested structure

loop through all relations
      loop through all forks
            loop through all buffers

While in the old structure, I liked your
autoprewarm_prewarm_relation() function, but I think it is nicer
inlined like in my version. It makes the loop through all buffers
explicit too.

I know you mentioned off-list that you don't like the handling of
global objects in my version, but I prefer doing it this way (even
though we have to check for in the loop condition) to having to set
the current database once we reach non-shared objects. It feels too
fiddly. This way seems less error prone. Looking at this version, what
do you think? Could we do it better?

Let me know what you think of this version. I think it is the best of
both our approaches. I've separated it into two commits -- the first
does all the refactoring without using the read stream API and the
second one uses the read stream API.

On another topic, what are the minimal places we need to call
have_free_buffers() (in this version)? I haven't even started looking
at the last patch you've been sending that is about checking the
freelist. I'll have to do that next.

- Melanie

Attachment

Re: Using read stream in autoprewarm

From

Nazir Bilal Yavuz

Date:

02 April, 13:26:42

Hi,

On Wed, 2 Apr 2025 at 01:36, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Tue, Apr 1, 2025 at 8:50 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > I am attaching v8, which is an updated version of the v7. I tried to
> > get rid of these local variables and refactored code to make logic
> > more straightforward instead of going back and forth.
> >
> > 0001 and 0002 are v8. 0003 is another refactoring attempt to make code
> > more straightforward. I did not squash 0003 to previous patches as you
> > might not like it.
>
> I looked at the code on your github branch that has all three of these
> squashed together.

Thank you!

> I think our approaches are converging. I like that you are
> fast-forwarding to the next filenumber or fork number explicitly when
> there is a bad relation or fork. I've changed my version (see newest
> one attached) to do the fast-forwarding inline instead of in a
> separate function like yours (the function didn't save many LOC and
> actually may have added to cognitive overhead).
>
> Compared to my version, I think you avoided one level of loop nesting with your
>
>         if (!rel)
>         else if (smgrexists(RelationGetSmgr(rel), blk->forknum))
>         else
>
> but for starters, I don't think you can do this:
>
>         else if (smgrexists(RelationGetSmgr(rel), blk->forknum))
>
> because you didn't check if you have a legal forknum first

You are right, I missed that. I think smgrexists() should return NULL
if the forknum is invalid but it is not a topic for this thread.

> And, I actually kind of prefer the explicitly nested structure
>
> loop through all relations
>       loop through all forks
>             loop through all buffers

I prefer this as well. We know when we opened the relation, so we do
not need to close it in two places like I did.

> While in the old structure, I liked your
> autoprewarm_prewarm_relation() function, but I think it is nicer
> inlined like in my version. It makes the loop through all buffers
> explicit too.

Yes, I liked your approach.

> I know you mentioned off-list that you don't like the handling of
> global objects in my version, but I prefer doing it this way (even
> though we have to check for in the loop condition) to having to set
> the current database once we reach non-shared objects. It feels too
> fiddly. This way seems less error prone. Looking at this version, what
> do you think? Could we do it better?

I think there might be a problem with that approach. Let's say that we
are able to open relation when database oid = 0 and filenumber = 18.
Then we are trying to find a valid fork now. We couldn't find a valid
fork immediately, so we continued looping. Then database oid is
changed from 0 to let's say 1 but filenumber remains the same. We are
still in the valid fork loop, so relation remains from the database
oid = 0. Isn't that wrong?

> Let me know what you think of this version. I think it is the best of
> both our approaches. I've separated it into two commits -- the first
> does all the refactoring without using the read stream API and the
> second one uses the read stream API.

Some comments,

0001:

ReadBufferExtended() can be called in its own minimal loop, otherwise
we end up doing unnecessary checks for each ReadBufferExtended() call.
This is not a problem when the 0002 is applied.

0002:

We don't skip blocks whose blocknum is more than nblocks_in_fork. We
can add that either to a stream callback like you did before or after
the read_stream_end. I prefer stream callback because of the reason
below [1].

> On another topic, what are the minimal places we need to call
> have_free_buffers() (in this version)? I haven't even started looking
> at the last patch you've been sending that is about checking the
> freelist. I'll have to do that next.

I think its current places are good enough. We may add one after the
read_stream_end if we want to handle blk->blocknum >= nblocks_in_fork
after the read stream finishes. If we handle that in the stream
callback then no need to add have_free_buffers() [1].

Other than these comments, I think the current structure looks good.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

02 April, 18:54:24

On Wed, Apr 2, 2025 at 6:26 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> On Wed, 2 Apr 2025 at 01:36, Melanie Plageman <melanieplageman@gmail.com> wrote:
> >
> > I know you mentioned off-list that you don't like the handling of
> > global objects in my version, but I prefer doing it this way (even
> > though we have to check for in the loop condition) to having to set
> > the current database once we reach non-shared objects. It feels too
> > fiddly. This way seems less error prone. Looking at this version, what
> > do you think? Could we do it better?
>
> I think there might be a problem with that approach. Let's say that we
> are able to open relation when database oid = 0 and filenumber = 18.
> Then we are trying to find a valid fork now. We couldn't find a valid
> fork immediately, so we continued looping. Then database oid is
> changed from 0 to let's say 1 but filenumber remains the same. We are
> still in the valid fork loop, so relation remains from the database
> oid = 0. Isn't that wrong?

Yep, you are totally right. The code was wrong. We could fix it by
setting current_db to the valid database once we've prewarmed the
global objects, but we need that logic in three places, so that seems
quite undesirable.

In attached v9, I've added a patch to apw_load_buffers() which invokes
autoprewarm_database_main() for the global objects alone but while
connected to the first valid database. It's not the best solution but
I think it is better than having that fiddly logic everywhere about
database 0.

This made me think, I wonder if we could connect to template0 or
template1 to prewarm the global objects. Then we could also prewarm if
only global objects are present (that doesn't seem very important but
it would be a side effect). It might be more clear to connect to
template0/1 instead of the first valid database to prewarm global
objects. I don't know if there is some reason not to do this -- like
maybe bg workers aren't allowed or something?

> 0001:
>
> ReadBufferExtended() can be called in its own minimal loop, otherwise
> we end up doing unnecessary checks for each ReadBufferExtended() call.
> This is not a problem when the 0002 is applied.

Could you provide a snippet of example code? If we call
ReadBufferExtended() in a loop, the block won't be changing, so I
don't see how that will help.

> 0002:
>
> We don't skip blocks whose blocknum is more than nblocks_in_fork. We
> can add that either to a stream callback like you did before or after
> the read_stream_end. I prefer stream callback because of the reason
> below [1].

Yep, I also thought we had to have that logic, but because we sort by
db,rel,fork,blkno, I think blocks with blocknumber >= nblocks_in_fork
will be last and so we just want to move on to the next fork.

> > On another topic, what are the minimal places we need to call
> > have_free_buffers() (in this version)? I haven't even started looking
> > at the last patch you've been sending that is about checking the
> > freelist. I'll have to do that next.
>
> I think its current places are good enough. We may add one after the
> read_stream_end if we want to handle blk->blocknum >= nblocks_in_fork
> after the read stream finishes. If we handle that in the stream
> callback then no need to add have_free_buffers() [1].

As long as we have it in the callback, I don't think that we need
have_free_buffers() after read_stream_end() since it is in the while
loop condition which we will immediately execute after
read_stream_end().

I was also wondering about the other patch in your earlier set that
set stop_idx from get_number_of_free_buffers(). Could you tell me more
about that? What does it do and why is it needed with the read stream
but wasn't needed before?

- Melanie

Attachment

Re: Using read stream in autoprewarm

From

Nazir Bilal Yavuz

Date:

02 April, 20:20:38

Hi,

On Wed, 2 Apr 2025 at 18:54, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Wed, Apr 2, 2025 at 6:26 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > On Wed, 2 Apr 2025 at 01:36, Melanie Plageman <melanieplageman@gmail.com> wrote:
> > >
> > > I know you mentioned off-list that you don't like the handling of
> > > global objects in my version, but I prefer doing it this way (even
> > > though we have to check for in the loop condition) to having to set
> > > the current database once we reach non-shared objects. It feels too
> > > fiddly. This way seems less error prone. Looking at this version, what
> > > do you think? Could we do it better?
> >
> > I think there might be a problem with that approach. Let's say that we
> > are able to open relation when database oid = 0 and filenumber = 18.
> > Then we are trying to find a valid fork now. We couldn't find a valid
> > fork immediately, so we continued looping. Then database oid is
> > changed from 0 to let's say 1 but filenumber remains the same. We are
> > still in the valid fork loop, so relation remains from the database
> > oid = 0. Isn't that wrong?
>
> Yep, you are totally right. The code was wrong. We could fix it by
> setting current_db to the valid database once we've prewarmed the
> global objects, but we need that logic in three places, so that seems
> quite undesirable.
>
> In attached v9, I've added a patch to apw_load_buffers() which invokes
> autoprewarm_database_main() for the global objects alone but while
> connected to the first valid database. It's not the best solution but
> I think it is better than having that fiddly logic everywhere about
> database 0.

I liked this. I think it is better compared to handling global objects
in autoprewarm_database_main().

> This made me think, I wonder if we could connect to template0 or
> template1 to prewarm the global objects. Then we could also prewarm if
> only global objects are present (that doesn't seem very important but
> it would be a side effect). It might be more clear to connect to
> template0/1 instead of the first valid database to prewarm global
> objects. I don't know if there is some reason not to do this -- like
> maybe bg workers aren't allowed or something?

I am not sure but I think the current implementation is good enough.

> > 0001:
> >
> > ReadBufferExtended() can be called in its own minimal loop, otherwise
> > we end up doing unnecessary checks for each ReadBufferExtended() call.
> > This is not a problem when the 0002 is applied.
>
> Could you provide a snippet of example code? If we call
> ReadBufferExtended() in a loop, the block won't be changing, so I
> don't see how that will help.

I don't have an example code right now. But what I mean is we may call
ReadBufferExtended() in a loop for the blocks in the same fork. We
don't need to call smgrexists() and RelationGetNumberOfBlocksInFork()
for each block, we will call these for each fork not for each block.
However, like I said before, this is not important when the read
stream code is applied.

> > 0002:
> >
> > We don't skip blocks whose blocknum is more than nblocks_in_fork. We
> > can add that either to a stream callback like you did before or after
> > the read_stream_end. I prefer stream callback because of the reason
> > below [1].
>
> Yep, I also thought we had to have that logic, but because we sort by
> db,rel,fork,blkno, I think blocks with blocknumber >= nblocks_in_fork
> will be last and so we just want to move on to the next fork.

I agree that they will be the last, but won't we end up creating a
read stream object for each block?

> > > On another topic, what are the minimal places we need to call
> > > have_free_buffers() (in this version)? I haven't even started looking
> > > at the last patch you've been sending that is about checking the
> > > freelist. I'll have to do that next.
> >
> > I think its current places are good enough. We may add one after the
> > read_stream_end if we want to handle blk->blocknum >= nblocks_in_fork
> > after the read stream finishes. If we handle that in the stream
> > callback then no need to add have_free_buffers() [1].
>
> As long as we have it in the callback, I don't think that we need
> have_free_buffers() after read_stream_end() since it is in the while
> loop condition which we will immediately execute after
> read_stream_end().

This is okay for me.

> I was also wondering about the other patch in your earlier set that
> set stop_idx from get_number_of_free_buffers(). Could you tell me more
> about that? What does it do and why is it needed with the read stream
> but wasn't needed before?

In the read stream code, we use callbacks to create bigger I/Os. These
I/Os aren't processed until the io_combine_limit or we hit not
sequential blocknum. In other words, when the have_free_buffer()
function returns false in the callback; there are still queued blocks
in the stream, although there are no free buffers in the buffer pool.
We can end up creating I/Os bigger than free buffers in the shared
buffers.

To solve that a bit, we try to get a number of free buffers in the
shared buffers. So, we try to minimize the problem above by using the
actual free buffer count. That optimization has problems like if other
processes fill shared buffers at the same time while the read stream
is running, then this optimization will not work well.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

02 April, 22:34:22

On Wed, Apr 2, 2025 at 1:20 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> On Wed, 2 Apr 2025 at 18:54, Melanie Plageman <melanieplageman@gmail.com> wrote:
> >
> > On Wed, Apr 2, 2025 at 6:26 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > >
> I don't have an example code right now. But what I mean is we may call
> ReadBufferExtended() in a loop for the blocks in the same fork. We
> don't need to call smgrexists() and RelationGetNumberOfBlocksInFork()
> for each block, we will call these for each fork not for each block.
> However, like I said before, this is not important when the read
> stream code is applied.

Ah, you are so right. That was totally messed up in the last version.
I've fixed it in attached v10. I think having it correct in the 0002
patch makes it easier to understand how the read stream callback is
replacing it.

> > > We don't skip blocks whose blocknum is more than nblocks_in_fork. We
> > > can add that either to a stream callback like you did before or after
> > > the read_stream_end. I prefer stream callback because of the reason
> > > below [1].
> >
> > Yep, I also thought we had to have that logic, but because we sort by
> > db,rel,fork,blkno, I think blocks with blocknumber >= nblocks_in_fork
> > will be last and so we just want to move on to the next fork.
>
> I agree that they will be the last, but won't we end up creating a
> read stream object for each block?

Ah, yes, you are right. That would have been really broken. I think I
fixed it. See attached. Now we'll only do that for the first block if
it is invalid (which is probably okay IMO).

> > I was also wondering about the other patch in your earlier set that
> > set stop_idx from get_number_of_free_buffers(). Could you tell me more
> > about that? What does it do and why is it needed with the read stream
> > but wasn't needed before?
>
> In the read stream code, we use callbacks to create bigger I/Os. These
> I/Os aren't processed until the io_combine_limit or we hit not
> sequential blocknum. In other words, when the have_free_buffer()
> function returns false in the callback; there are still queued blocks
> in the stream, although there are no free buffers in the buffer pool.
> We can end up creating I/Os bigger than free buffers in the shared
> buffers.
>
> To solve that a bit, we try to get a number of free buffers in the
> shared buffers. So, we try to minimize the problem above by using the
> actual free buffer count. That optimization has problems like if other
> processes fill shared buffers at the same time while the read stream
> is running, then this optimization will not work well.

Hmm. Yea, I do find it confusing that it will get so easily out of
date. Let's circle back to this after getting the other patches to a
good place (but before committing all of this).

- Melanie

Attachment

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

03 April, 03:25:39

On Wed, Apr 2, 2025 at 3:34 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> attached v10

Attached v11 has an assortment of cosmetic updates.

- Melanie

Attachment

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

03 April, 17:31:04

On Wed, Apr 2, 2025 at 8:25 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Wed, Apr 2, 2025 at 3:34 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > attached v10
>
> Attached v11 has an assortment of cosmetic updates.

Attached v12 fixes a bug Bilal found off-list in 0002 related to
handling invalid blocks.

- Melanie

Attachment

Re: Using read stream in autoprewarm

From

Heikki Linnakangas

Date:

03 April, 18:17:08

On 03/04/2025 17:31, Melanie Plageman wrote:
> Attached v12 fixes a bug Bilal found off-list in 0002 related to
> handling invalid blocks.

I had a quick look at this. Looks good overall, some small remarks:

v12-0001-Autoprewarm-global-objects-separately.patch

> Instead, modify apw_load_buffers() to prewarm the shared objects in one
> invocation of autoprewarm_database_main() while connected to the first
> valid database.

So it effectively treats "global objects" as one extra database, 
launching a separate worker process to handle global objects. It took me 
a while to understand that. From the commit message, I understood that 
it still does that within the first worker process invocation, but no. A 
comment somewhere would be good.

One extra worker process invocation is obviously not an improvement 
performance-wise, but seems acceptable.

v12-0002-Refactor-autoprewarm_database_main-in-preparatio.patch

Yes, I agree this makes the logic more clear

v12-0003-Use-streaming-read-I-O-in-autoprewarm.patch

I wonder if the have_free_buffer() calls work correctly with read 
streams? Or will you "overshoot", prewarming a few more pages after the 
buffer cache is already full? I guess that depends on when exactly the 
read stream code allocates the buffer.

While reviewing this, I noticed a pre-existing bug: The code ignores 
'tablespace' when deciding if it's reached the end of the current 
relation. I believe it's possible to have two different relations with 
the same relnumber, in different tablespaces.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

03 April, 22:25:15

On Thu, Apr 3, 2025 at 11:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I had a quick look at this. Looks good overall, some small remarks:

Thanks for taking a look!

> v12-0001-Autoprewarm-global-objects-separately.patch
>
> > Instead, modify apw_load_buffers() to prewarm the shared objects in one
> > invocation of autoprewarm_database_main() while connected to the first
> > valid database.
>
> So it effectively treats "global objects" as one extra database,
> launching a separate worker process to handle global objects. It took me
> a while to understand that. From the commit message, I understood that
> it still does that within the first worker process invocation, but no. A
> comment somewhere would be good.

Yea, I could have been more explicit about that.

Actually, I was chatting about this with Andres off-list and he was
like, why do you need to check the database at all? Won't
prewarm_stop_idx already have that built in? And I think he's right.
In attached v13, I've added a separate patch (0002) which turns this
check into an assert. And I removed the check from all of the other
loops in the later patches.

> v12-0003-Use-streaming-read-I-O-in-autoprewarm.patch
>
> I wonder if the have_free_buffer() calls work correctly with read
> streams? Or will you "overshoot", prewarming a few more pages after the
> buffer cache is already full? I guess that depends on when exactly the
> read stream code allocates the buffer.

It does have some overshoot -- but a max of io_combine_limit blocks
will be evicted. The read stream code builds up an IO of up to
io_combine_limit blocks before calling StartReadBuffer(). So you could
be in a situation where you weren't quite out of buffers on the
freelist while you are building up the IO and then when you go to pin
those buffers, there aren't enough on the freelist. But I think that's
okay.

> While reviewing this, I noticed a pre-existing bug: The code ignores
> 'tablespace' when deciding if it's reached the end of the current
> relation. I believe it's possible to have two different relations with
> the same relnumber, in different tablespaces.

Good catch. I've included a fix for this in the attached set (0001)

- Melanie

Attachment

Re: Using read stream in autoprewarm

From

Daniel Gustafsson

Date:

03 April, 23:22:25

> On 3 Apr 2025, at 21:25, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Thu, Apr 3, 2025 at 11:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> I had a quick look at this. Looks good overall

Same here, this seemed like a good piece to bite into with my limited AIO
knowledge to learn more, and reading it over it seems like a good change.

A few small comments:

+    * `pos` is the read stream callback's index into block_info. Because the
I'm not a fan of markdown in code comments (also in a few more places).


+                       /* Time to try and open our new found relation */
s/new found/newfound/


+   while (p->pos < apw_state->prewarm_stop_idx)
+   {
+       BlockInfoRecord blk = p->block_info[p->pos];
+
+       CHECK_FOR_INTERRUPTS();
Isn't checking inside this loop increasing the frequency of checks compared to
the current version?


+           Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
Is there a non programmer-error case where this can happen?  The Assert right
after a loop around the same function seems to imply there is a race or toctou
case which if so could use a comment.

--
Daniel Gustafsson

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

03 April, 23:54:48

On Thu, Apr 3, 2025 at 4:22 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> >> I had a quick look at this. Looks good overall
>
> Same here, this seemed like a good piece to bite into with my limited AIO
> knowledge to learn more, and reading it over it seems like a good change.

Thanks for taking a look!

> A few small comments:
>
> +    * `pos` is the read stream callback's index into block_info. Because the
> I'm not a fan of markdown in code comments (also in a few more places).

Removed them. I got the idea of doing this to distinguish variable
names in comments from English words. But I can see how it is kind of
distracting -- since it is not common in the codebase.

> +                       /* Time to try and open our new found relation */
> s/new found/newfound/

Fixed

> +   while (p->pos < apw_state->prewarm_stop_idx)
> +   {
> +       BlockInfoRecord blk = p->block_info[p->pos];
> +
> +       CHECK_FOR_INTERRUPTS();
> Isn't checking inside this loop increasing the frequency of checks compared to
> the current version?

It's unclear. The current version does seem to execute the main while
loop (including the CFI) once per block -- even for blocks that it
doesn't end up reading for whatever reason. Things get murkier with
the read stream code. But I put it in the callback to keep the general
idea of doing a CFI once per block. In attached v14, I moved the CFI
to the top of the callback, outside of the loop, to make that
intention more clear.

> +           Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
> Is there a non programmer-error case where this can happen?  The Assert right
> after a loop around the same function seems to imply there is a race or toctou
> case which if so could use a comment.

Yep. Good call. At some point one read stream user had this assert
because its invocation of read_stream_buffer() was interleaved with
other stuff, so it wasn't obvious that the stream would be exhausted
when it was time to end it. And the assert helped defend that
invariant against future innovation :) I think I've copy-pasta'd this
assert around for no good reason to other read stream users. I've
removed it in v14 and I should probably do a follow-on commit to
master to remove it from the other places it obviously doesn't belong
and is a confusing distraction for future readers.

- Melanie

Attachment

Re: Using read stream in autoprewarm

From

Daniel Gustafsson

Date:

04 April, 10:59:08

> On 3 Apr 2025, at 22:54, Melanie Plageman <melanieplageman@gmail.com> wrote:
> On Thu, Apr 3, 2025 at 4:22 PM Daniel Gustafsson <daniel@yesql.se> wrote:

>> +   while (p->pos < apw_state->prewarm_stop_idx)
>> +   {
>> +       BlockInfoRecord blk = p->block_info[p->pos];
>> +
>> +       CHECK_FOR_INTERRUPTS();
>> Isn't checking inside this loop increasing the frequency of checks compared to
>> the current version?
>
> It's unclear. The current version does seem to execute the main while
> loop (including the CFI) once per block -- even for blocks that it
> doesn't end up reading for whatever reason. Things get murkier with
> the read stream code. But I put it in the callback to keep the general
> idea of doing a CFI once per block. In attached v14, I moved the CFI
> to the top of the callback, outside of the loop, to make that
> intention more clear.

LGTM.

>> +           Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
>> Is there a non programmer-error case where this can happen?  The Assert right
>> after a loop around the same function seems to imply there is a race or toctou
>> case which if so could use a comment.
>
> Yep. Good call. At some point one read stream user had this assert
> because its invocation of read_stream_buffer() was interleaved with
> other stuff, so it wasn't obvious that the stream would be exhausted
> when it was time to end it. And the assert helped defend that
> invariant against future innovation :) I think I've copy-pasta'd this
> assert around for no good reason to other read stream users. I've
> removed it in v14 and I should probably do a follow-on commit to
> master to remove it from the other places it obviously doesn't belong
> and is a confusing distraction for future readers.

Makes sense, thanks for clarifying and I agree with removing the assertion.

This patch is already marked Ready for Committer and I concur with that.

--
Daniel Gustafsson

Re: Using read stream in autoprewarm

From

Nazir Bilal Yavuz

Date:

04 April, 11:17:06

Hi,

On Fri, 4 Apr 2025 at 10:59, Daniel Gustafsson <daniel@yesql.se> wrote:
>
> This patch is already marked Ready for Committer and I concur with that.

Same on my end, v14 LGTM.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Using read stream in autoprewarm

From

Melanie Plageman

Date:

04 April, 22:46:52

On Fri, Apr 4, 2025 at 4:17 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Same on my end, v14 LGTM.

Cool. Pushed and marked as such in the CF app.

- Melanie