Thread: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

From

Cedric Villemain

Date:

04 January, 02:23:43

Hi,

I wonder what you think of making pg_prewarm use recent addition on 
smgrprefetch and readv ?


In order to try, I did it anyway in the attached patches. They contain 
no doc update, but I will proceed if it is of interest.

In summary:

1. The first one adds a new check on parameters (checking last block is 
indeed not before first block).
Consequence is an ERROR is raised instead of silently doing nothing.

2. The second one does implement smgrprefetch with range and loops by 
default per segment to still have a check for interrupts.

3. The third one provides smgrreadv instead of smgrread,  by default on 
a range of 8 buffers. I am absolutely unsure that I used readv correctly...

Q: posix_fadvise may not work exactly the way you think it does, or does 
it ?


In details, and for the question:

It's not so obvious that the "feature" is really required or wanted, 
depending on what are the expectations from user point of view.

The kernel decides on what to do with posix_fadvise calls, and how we 
pass parameters does impact the decision.
With the current situation where prefetch is done step by step, block by 
block, they are very probably most of the time all loaded even if those 
from the beginning of the relation can be discarded at the end of the 
prefetch.

However,  if instead you provide a real range, or the magic len=0 to 
posix_fadvise, then blocks are "more" loaded according to effective vm 
pressure (which is not the case on the previous example).
As a result only a small part of the relation might be loaded, and this 
is probably not what end-users expect despite being probably a good 
choice (you can still free cache beforehand to help the kernel).

An example, below I'm using vm_relation_cachestat() which provides linux 
cachestat output, and vm_relation_fadvise() to unload cache, and 
pg_prewarm for the demo:

# clear cache: (nr_cache is the number of file system pages in cache, 
not postgres blocks)

```
postgres=# select block_start, block_count, nr_pages, nr_cache from 
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
           0 |       32768 |    65536 |        0
       32768 |       32768 |    65536 |        0
       65536 |       32768 |    65536 |        0
       98304 |       32768 |    65536 |        0
      131072 |        1672 |     3344 |        0
```

# load full relation with pg_prewarm (patched)

```
postgres=# select pg_prewarm('foo','prefetch');
pg_prewarm
------------
     132744
(1 row)
```

# Checking results:

```
postgres=# select block_start, block_count, nr_pages, nr_cache from 
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
           0 |       32768 |    65536 |      320
       32768 |       32768 |    65536 |        0
       65536 |       32768 |    65536 |        0
       98304 |       32768 |    65536 |        0
      131072 |        1672 |     3344 |      320  <-- segment 1

```

# Load block by block and check:

```
postgres=# select from generate_series(0, 132743) g(n), lateral 
pg_prewarm('foo','prefetch', 'main', n, n);
postgres=# select block_start, block_count, nr_pages, nr_cache from 
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
           0 |       32768 |    65536 |    65536
       32768 |       32768 |    65536 |    65536
       65536 |       32768 |    65536 |    65536
       98304 |       32768 |    65536 |    65536
      131072 |        1672 |     3344 |     3344

```

The duration of the last example is also really significant: full 
relation is 0.3ms and block by block is 1550ms!
You might think it's because of generate_series or whatever, but I have 
the exact same behavior with pgfincore.
I can compare loading and unloading duration for similar "async" work, 
here each call is from block 0 with len of 132744 and a range of 1 block 
(i.e. posix_fadvise on 8kB at a time).
So they have exactly the same number of operations doing DONTNEED or 
WILLNEED, but distinct duration on the first "load":

```

postgres=# select * from 
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 25.202 ms
postgres=# select * from 
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 1523.636 ms (00:01.524) <----- not free !
postgres=# select * from 
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 24.967 ms
```

Thank you for your time reading this longer than expected email.

Comments ?


---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D

Attachment

Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

From

Nazir Bilal Yavuz

Date:

05 March, 14:07:01

Hi,

Thanks for working on this!

The patches are cleanly applied on top of the current master and all
tests are passed.

On Thu, 4 Jan 2024 at 02:23, Cedric Villemain
<Cedric.Villemain+pgsql@abcsql.com> wrote:
>
> Hi,
>
> I wonder what you think of making pg_prewarm use recent addition on
> smgrprefetch and readv ?
>
>
> In order to try, I did it anyway in the attached patches. They contain
> no doc update, but I will proceed if it is of interest.
>
> In summary:
>
> 1. The first one adds a new check on parameters (checking last block is
> indeed not before first block).
> Consequence is an ERROR is raised instead of silently doing nothing.

This is a general improvement and can be committed without other patches.

> 2. The second one does implement smgrprefetch with range and loops by
> default per segment to still have a check for interrupts.

It looks good codewise but RELSEG_SIZE is too big to prefetch. Man
page of posix_fadvise [1] states that: "The amount of data read may be
decreased by the kernel depending on virtual memory load. (A few
megabytes will usually be fully satisfied, and more is rarely
useful.)". It is trying to prefetch 1GB data now. That could explain
your observation about differences between nr_cache numbers.

> 3. The third one provides smgrreadv instead of smgrread,  by default on
> a range of 8 buffers. I am absolutely unsure that I used readv correctly...

Looks good to me.

> Q: posix_fadvise may not work exactly the way you think it does, or does
> it ?
>
>
> In details, and for the question:
>
> It's not so obvious that the "feature" is really required or wanted,
> depending on what are the expectations from user point of view.
>
> The kernel decides on what to do with posix_fadvise calls, and how we
> pass parameters does impact the decision.
> With the current situation where prefetch is done step by step, block by
> block, they are very probably most of the time all loaded even if those
> from the beginning of the relation can be discarded at the end of the
> prefetch.
>
> However,  if instead you provide a real range, or the magic len=0 to
> posix_fadvise, then blocks are "more" loaded according to effective vm
> pressure (which is not the case on the previous example).
> As a result only a small part of the relation might be loaded, and this
> is probably not what end-users expect despite being probably a good
> choice (you can still free cache beforehand to help the kernel).
>
> An example, below I'm using vm_relation_cachestat() which provides linux
> cachestat output, and vm_relation_fadvise() to unload cache, and
> pg_prewarm for the demo:
>
> # clear cache: (nr_cache is the number of file system pages in cache,
> not postgres blocks)
>
> ```
> postgres=# select block_start, block_count, nr_pages, nr_cache from
> vm_relation_cachestat('foo',range:=1024*32);
> block_start | block_count | nr_pages | nr_cache
> -------------+-------------+----------+----------
>            0 |       32768 |    65536 |        0
>        32768 |       32768 |    65536 |        0
>        65536 |       32768 |    65536 |        0
>        98304 |       32768 |    65536 |        0
>       131072 |        1672 |     3344 |        0
> ```
>
> # load full relation with pg_prewarm (patched)
>
> ```
> postgres=# select pg_prewarm('foo','prefetch');
> pg_prewarm
> ------------
>      132744
> (1 row)
> ```
>
> # Checking results:
>
> ```
> postgres=# select block_start, block_count, nr_pages, nr_cache from
> vm_relation_cachestat('foo',range:=1024*32);
> block_start | block_count | nr_pages | nr_cache
> -------------+-------------+----------+----------
>            0 |       32768 |    65536 |      320
>        32768 |       32768 |    65536 |        0
>        65536 |       32768 |    65536 |        0
>        98304 |       32768 |    65536 |        0
>       131072 |        1672 |     3344 |      320  <-- segment 1
>
> ```
>
> # Load block by block and check:
>
> ```
> postgres=# select from generate_series(0, 132743) g(n), lateral
> pg_prewarm('foo','prefetch', 'main', n, n);
> postgres=# select block_start, block_count, nr_pages, nr_cache from
> vm_relation_cachestat('foo',range:=1024*32);
> block_start | block_count | nr_pages | nr_cache
> -------------+-------------+----------+----------
>            0 |       32768 |    65536 |    65536
>        32768 |       32768 |    65536 |    65536
>        65536 |       32768 |    65536 |    65536
>        98304 |       32768 |    65536 |    65536
>       131072 |        1672 |     3344 |     3344
>
> ```
>
> The duration of the last example is also really significant: full
> relation is 0.3ms and block by block is 1550ms!
> You might think it's because of generate_series or whatever, but I have
> the exact same behavior with pgfincore.
> I can compare loading and unloading duration for similar "async" work,
> here each call is from block 0 with len of 132744 and a range of 1 block
> (i.e. posix_fadvise on 8kB at a time).
> So they have exactly the same number of operations doing DONTNEED or
> WILLNEED, but distinct duration on the first "load":
>
> ```
>
> postgres=# select * from
> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
> vm_relation_fadvise
> ---------------------
>
> (1 row)
>
> Time: 25.202 ms
> postgres=# select * from
> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
> vm_relation_fadvise
> ---------------------
>
> (1 row)
>
> Time: 1523.636 ms (00:01.524) <----- not free !
> postgres=# select * from
> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
> vm_relation_fadvise
> ---------------------
>
> (1 row)
>
> Time: 24.967 ms
> ```

I confirm that there is a time difference between calling pg_prewarm
by full relation and block by block, but IMO this is expected. When
pg_prewarm is called by full relation, it does the initialization part
just once but when it is called block by block, it does initialization
for each call, right?

I run 'select pg_prewarm('foo','prefetch', 'main', n, n) FROM
generate_series(0, 132744)n;' a couple of times consecutively but I
could not see the time difference between first run (first load) and
the consecutive runs. Am I doing something wrong?

[1] https://man7.org/linux/man-pages/man2/posix_fadvise.2.html#DESCRIPTION

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

From

Nazir Bilal Yavuz

Date:

07 March, 14:19:16

Hi,

On Wed, 6 Mar 2024 at 18:23, Cédric Villemain
<Cedric.Villemain@abcsql.com> wrote:
>
> Hi Nazir,
>
>
> thank you for your review. I comment below.
>
>
> On 05/03/2024 12:07, Nazir Bilal Yavuz wrote:
> >> 2. The second one does implement smgrprefetch with range and loops by
> >> default per segment to still have a check for interrupts.
> > It looks good codewise but RELSEG_SIZE is too big to prefetch. Man
> > page of posix_fadvise [1] states that: "The amount of data read may be
> > decreased by the kernel depending on virtual memory load. (A few
> > megabytes will usually be fully satisfied, and more is rarely
> > useful.)". It is trying to prefetch 1GB data now. That could explain
> > your observation about differences between nr_cache numbers.
>
>  From an "adminsys" point of view I will find beneficial to get a single
> syscall per file, respecting the logic and behavior of underlying system
> call.

I agree.

> The behavior is 100% OK, and in fact it might a bad idea to prefetch
> block by block as the result is just to put more pressure on a system if
> it is already under pressure.
>
> Though there are use cases and it's nice to be able to do that too at
> this per page level.

Yes, I do not know which one is more important, cache more blocks but
create more pressure or create less pressure but cache less blocks.
Also, pg_prewarm is designed to be run at startup so I guess there
will not be much pressure.

> About [1], it's very old statement about resources. And Linux manages a
> part of the problem for us here I think [2]:
>
> /*
>   * Chunk the readahead into 2 megabyte units, so that we don't pin too much
>   * memory at once.
>   */
> void force_page_cache_ra(....)

Thanks for pointing out the actual code. Yes, it looks like the kernel
is already doing that. I would like to do more testing when you
forward vm_relation functions into pgfincore.

> >> Q: posix_fadvise may not work exactly the way you think it does, or does
> >> it ?
> >>
> >>
> >> In details, and for the question:
> >>
> >> However,  if instead you provide a real range, or the magic len=0 to
> >> posix_fadvise, then blocks are "more" loaded according to effective vm
> >> pressure (which is not the case on the previous example).
> >> As a result only a small part of the relation might be loaded, and this
> >> is probably not what end-users expect despite being probably a good
> >> choice (you can still free cache beforehand to help the kernel).
>
> I think it's a matter of documenting well the feature, and if at all
> possible, as usual, not let users be negatively impacted by default.
>
>
> >> An example, below I'm using vm_relation_cachestat() which provides linux
> >> cachestat output, and vm_relation_fadvise() to unload cache, and
> >> pg_prewarm for the demo:
> >>
> >> # clear cache: (nr_cache is the number of file system pages in cache,
> >> not postgres blocks)
> >>
> >> ```
> >> postgres=# select block_start, block_count, nr_pages, nr_cache from
> >> vm_relation_cachestat('foo',range:=1024*32);
> >> block_start | block_count | nr_pages | nr_cache
> >> -------------+-------------+----------+----------
> >>             0 |       32768 |    65536 |        0
> >>         32768 |       32768 |    65536 |        0
> >>         65536 |       32768 |    65536 |        0
> >>         98304 |       32768 |    65536 |        0
> >>        131072 |        1672 |     3344 |        0
> >> ```
> >>
> >> # load full relation with pg_prewarm (patched)
> >>
> >> ```
> >> postgres=# select pg_prewarm('foo','prefetch');
> >> pg_prewarm
> >> ------------
> >>       132744
> >> (1 row)
> >> ```
> >>
> >> # Checking results:
> >>
> >> ```
> >> postgres=# select block_start, block_count, nr_pages, nr_cache from
> >> vm_relation_cachestat('foo',range:=1024*32);
> >> block_start | block_count | nr_pages | nr_cache
> >> -------------+-------------+----------+----------
> >>             0 |       32768 |    65536 |      320
> >>         32768 |       32768 |    65536 |        0
> >>         65536 |       32768 |    65536 |        0
> >>         98304 |       32768 |    65536 |        0
> >>        131072 |        1672 |     3344 |      320  <-- segment 1
> >>
> >> ```
> >>
> >> # Load block by block and check:
> >>
> >> ```
> >> postgres=# select from generate_series(0, 132743) g(n), lateral
> >> pg_prewarm('foo','prefetch', 'main', n, n);
> >> postgres=# select block_start, block_count, nr_pages, nr_cache from
> >> vm_relation_cachestat('foo',range:=1024*32);
> >> block_start | block_count | nr_pages | nr_cache
> >> -------------+-------------+----------+----------
> >>             0 |       32768 |    65536 |    65536
> >>         32768 |       32768 |    65536 |    65536
> >>         65536 |       32768 |    65536 |    65536
> >>         98304 |       32768 |    65536 |    65536
> >>        131072 |        1672 |     3344 |     3344
> >>
> >> ```
> >>
> >> The duration of the last example is also really significant: full
> >> relation is 0.3ms and block by block is 1550ms!
> >> You might think it's because of generate_series or whatever, but I have
> >> the exact same behavior with pgfincore.
> >> I can compare loading and unloading duration for similar "async" work,
> >> here each call is from block 0 with len of 132744 and a range of 1 block
> >> (i.e. posix_fadvise on 8kB at a time).
> >> So they have exactly the same number of operations doing DONTNEED or
> >> WILLNEED, but distinct duration on the first "load":
> >>
> >> ```
> >>
> >> postgres=# select * from
> >> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
> >> vm_relation_fadvise
> >> ---------------------
> >>
> >> (1 row)
> >>
> >> Time: 25.202 ms
> >> postgres=# select * from
> >> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
> >> vm_relation_fadvise
> >> ---------------------
> >>
> >> (1 row)
> >>
> >> Time: 1523.636 ms (00:01.524) <----- not free !
> >> postgres=# select * from
> >> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
> >> vm_relation_fadvise
> >> ---------------------
> >>
> >> (1 row)
> >>
> >> Time: 24.967 ms
> >> ```
> > I confirm that there is a time difference between calling pg_prewarm
> > by full relation and block by block, but IMO this is expected. When
> > pg_prewarm is called by full relation, it does the initialization part
> > just once but when it is called block by block, it does initialization
> > for each call, right?
>
>
> Not sure what initialization is here exactly, in my example with
> WILLNEED/DONTNEED there are exactly the same code pattern and syscall
> request(s), just the flag is distinct, so initialization cost are
> expected to be very similar.

Sorry, there was a miscommunication. I was talking about pg_prewarm's
initialization, meaning if the pg_prewarm is called block by block (by
using generate_series); it will make block_count times initialization
and if it is called by full relation it will just do it once but it
seems that is not the case, see below.

> I'll try to move forward on those vm_relation functions into pgfincore
> so it'll be easier to run similar tests and compare.

Thanks, that will be helpful for the testing.

> >
> > I run 'select pg_prewarm('foo','prefetch', 'main', n, n) FROM
> > generate_series(0, 132744)n;' a couple of times consecutively but I
> > could not see the time difference between first run (first load) and
> > the consecutive runs. Am I doing something wrong?
>
>
> Maybe the system is overloaded and thus by the time you're done
> prefetching tail blocks, the heads ones have been dropped already. So
> looping on that leads to similar duration.
> If it's already in cache and not removed from it, execution time is
> stable. This point (in cache or not) is hard to guess right until you do
> check the status, or you ensure to clean it first.

My bad. I was trying to drop buffers from the postgres cache, not from
the kernel cache. See my results now:

patched | prefetch test

$ create_the_data [3]
$ drop_kernel_cache [4]
$ first_run_full_relation_prefetch [5] -> Time: 11.395 ms
$ second_run_full_relation_prefetch [5] -> Time: 0.887 ms

master | prefetch test

$ create_the_data [3]
$ drop_kernel_cache [4]
$ first_run_full_relation_prefetch [5] -> Time: 3208.944 ms
$ second_run_full_relation_prefetch [5] -> Time: 283.905 ms

I did more perf tests about comparison between first and second run
for the prefetch and found this on master:

first run:
- 86.40% generic_fadvise
    - 86.24% force_page_cache_ra
        - 85.99% page_cache_ra_unbounded
            + 37.36% filemap_add_folio
            + 34.14% read_pages
            + 8.31% folio_alloc
            + 4.55% up_read
                0.77% xa_load

second run:
- 20.64% generic_fadvise
    - 18.64% force_page_cache_ra
        - 17.46% page_cache_ra_unbounded
            + 8.54% xa_load
            2.82% down_read
            2.29% read_pages
            1.45% up_read

So, it looks like the difference between the first and the second run
comes from kernel optimization that does not do prefetch if the page
is already in the cache [6]. Saying that, I do not know the difference
between WILLNEED/DONTNEED and I do not have enough materials to test
it but I guess it is something similar.

I did not test read performance but I am planning to do that soon.

> > [1] https://man7.org/linux/man-pages/man2/posix_fadvise.2.html#DESCRIPTION
>
> [2] https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L303

[3]
CREATE EXTENSION pg_prewarm;
drop table if exists foo;
create table foo ( id int, c text) with (autovacuum_enabled=false);
insert into foo select i, repeat('a', 1000) from generate_series(1,10000000)i;

[4] echo 3 | sudo tee /proc/sys/vm/drop_caches

[5] select pg_prewarm('foo', 'prefetch', 'main');

[6] https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L232

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

From

Cédric Villemain

Date:

07 March, 15:26:00

Hi Nazir,

On 07/03/2024 12:19, Nazir Bilal Yavuz wrote:
> On Wed, 6 Mar 2024 at 18:23, Cédric Villemain
> <Cedric.Villemain@abcsql.com> wrote:
>> The behavior is 100% OK, and in fact it might a bad idea to prefetch
>> block by block as the result is just to put more pressure on a system if
>> it is already under pressure.
>>
>> Though there are use cases and it's nice to be able to do that too at
>> this per page level.
> Yes, I do not know which one is more important, cache more blocks but
> create more pressure or create less pressure but cache less blocks.
> Also, pg_prewarm is designed to be run at startup so I guess there
> will not be much pressure.

autowarm is designed for that purpose but pg_prewarm is free to use when 
neeed.

>> About [1], it's very old statement about resources. And Linux manages a
>> part of the problem for us here I think [2]:
>>
>> /*
>>    * Chunk the readahead into 2 megabyte units, so that we don't pin too much
>>    * memory at once.
>>    */
>> void force_page_cache_ra(....)
> Thanks for pointing out the actual code. Yes, it looks like the kernel
> is already doing that. I would like to do more testing when you
> forward vm_relation functions into pgfincore.


I hope to be able to get back there next week max.


>>>> An example, below I'm using vm_relation_cachestat() which provides linux
>>>> cachestat output, and vm_relation_fadvise() to unload cache, and
>>>> pg_prewarm for the demo:
>>>>
>>>> # clear cache: (nr_cache is the number of file system pages in cache,
>>>> not postgres blocks)
>>>>
>>>> ```
>>>> postgres=# select block_start, block_count, nr_pages, nr_cache from
>>>> vm_relation_cachestat('foo',range:=1024*32);
>>>> block_start | block_count | nr_pages | nr_cache
>>>> -------------+-------------+----------+----------
>>>>              0 |       32768 |    65536 |        0
>>>>          32768 |       32768 |    65536 |        0
>>>>          65536 |       32768 |    65536 |        0
>>>>          98304 |       32768 |    65536 |        0
>>>>         131072 |        1672 |     3344 |        0
>>>> ```
>>>>
>>>> # load full relation with pg_prewarm (patched)
>>>>
>>>> ```
>>>> postgres=# select pg_prewarm('foo','prefetch');
>>>> pg_prewarm
>>>> ------------
>>>>        132744
>>>> (1 row)
>>>> ```
>>>>
>>>> # Checking results:
>>>>
>>>> ```
>>>> postgres=# select block_start, block_count, nr_pages, nr_cache from
>>>> vm_relation_cachestat('foo',range:=1024*32);
>>>> block_start | block_count | nr_pages | nr_cache
>>>> -------------+-------------+----------+----------
>>>>              0 |       32768 |    65536 |      320
>>>>          32768 |       32768 |    65536 |        0
>>>>          65536 |       32768 |    65536 |        0
>>>>          98304 |       32768 |    65536 |        0
>>>>         131072 |        1672 |     3344 |      320  <-- segment 1
>>>>
>>>> ```
>>>>
>>>> # Load block by block and check:
>>>>
>>>> ```
>>>> postgres=# select from generate_series(0, 132743) g(n), lateral
>>>> pg_prewarm('foo','prefetch', 'main', n, n);
>>>> postgres=# select block_start, block_count, nr_pages, nr_cache from
>>>> vm_relation_cachestat('foo',range:=1024*32);
>>>> block_start | block_count | nr_pages | nr_cache
>>>> -------------+-------------+----------+----------
>>>>              0 |       32768 |    65536 |    65536
>>>>          32768 |       32768 |    65536 |    65536
>>>>          65536 |       32768 |    65536 |    65536
>>>>          98304 |       32768 |    65536 |    65536
>>>>         131072 |        1672 |     3344 |     3344
>>>>
>>>> ```
>>>>
>>>> The duration of the last example is also really significant: full
>>>> relation is 0.3ms and block by block is 1550ms!
>>>> You might think it's because of generate_series or whatever, but I have
>>>> the exact same behavior with pgfincore.
>>>> I can compare loading and unloading duration for similar "async" work,
>>>> here each call is from block 0 with len of 132744 and a range of 1 block
>>>> (i.e. posix_fadvise on 8kB at a time).
>>>> So they have exactly the same number of operations doing DONTNEED or
>>>> WILLNEED, but distinct duration on the first "load":
>>>>
>>>> ```
>>>>
>>>> postgres=# select * from
>>>> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
>>>> vm_relation_fadvise
>>>> ---------------------
>>>>
>>>> (1 row)
>>>>
>>>> Time: 25.202 ms
>>>> postgres=# select * from
>>>> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
>>>> vm_relation_fadvise
>>>> ---------------------
>>>>
>>>> (1 row)
>>>>
>>>> Time: 1523.636 ms (00:01.524) <----- not free !
>>>> postgres=# select * from
>>>> vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
>>>> vm_relation_fadvise
>>>> ---------------------
>>>>
>>>> (1 row)
>>>>
>>>> Time: 24.967 ms
>>>> ```
>>> I confirm that there is a time difference between calling pg_prewarm
>>> by full relation and block by block, but IMO this is expected. When
>>> pg_prewarm is called by full relation, it does the initialization part
>>> just once but when it is called block by block, it does initialization
>>> for each call, right?
>>
>> Not sure what initialization is here exactly, in my example with
>> WILLNEED/DONTNEED there are exactly the same code pattern and syscall
>> request(s), just the flag is distinct, so initialization cost are
>> expected to be very similar.
> Sorry, there was a miscommunication. I was talking about pg_prewarm's
> initialization, meaning if the pg_prewarm is called block by block (by
> using generate_series); it will make block_count times initialization
> and if it is called by full relation it will just do it once but it
> seems that is not the case, see below.


OK.

>> I'll try to move forward on those vm_relation functions into pgfincore
>> so it'll be easier to run similar tests and compare.
> Thanks, that will be helpful for the testing.
>
>>> I run 'select pg_prewarm('foo','prefetch', 'main', n, n) FROM
>>> generate_series(0, 132744)n;' a couple of times consecutively but I
>>> could not see the time difference between first run (first load) and
>>> the consecutive runs. Am I doing something wrong?
>>
>> Maybe the system is overloaded and thus by the time you're done
>> prefetching tail blocks, the heads ones have been dropped already. So
>> looping on that leads to similar duration.
>> If it's already in cache and not removed from it, execution time is
>> stable. This point (in cache or not) is hard to guess right until you do
>> check the status, or you ensure to clean it first.
> My bad. I was trying to drop buffers from the postgres cache, not from
> the kernel cache. See my results now:
>
> patched | prefetch test
>
> $ create_the_data [3]
> $ drop_kernel_cache [4]
> $ first_run_full_relation_prefetch [5] -> Time: 11.395 ms
> $ second_run_full_relation_prefetch [5] -> Time: 0.887 ms
>
> master | prefetch test
>
> $ create_the_data [3]
> $ drop_kernel_cache [4]
> $ first_run_full_relation_prefetch [5] -> Time: 3208.944 ms
> $ second_run_full_relation_prefetch [5] -> Time: 283.905 ms
>
> I did more perf tests about comparison between first and second run
> for the prefetch and found this on master:
>
> first run:
> - 86.40% generic_fadvise
>      - 86.24% force_page_cache_ra
>          - 85.99% page_cache_ra_unbounded
>              + 37.36% filemap_add_folio
>              + 34.14% read_pages
>              + 8.31% folio_alloc
>              + 4.55% up_read
>                  0.77% xa_load
>
> second run:
> - 20.64% generic_fadvise
>      - 18.64% force_page_cache_ra
>          - 17.46% page_cache_ra_unbounded
>              + 8.54% xa_load
>              2.82% down_read
>              2.29% read_pages
>              1.45% up_read
>
> So, it looks like the difference between the first and the second run
> comes from kernel optimization that does not do prefetch if the page
> is already in the cache [6]. Saying that, I do not know the difference
> between WILLNEED/DONTNEED and I do not have enough materials to test
> it but I guess it is something similar.

Patched: Clearly, only a small part has been read and put into VM during 
the first pass, but still some pages, and the second one probably did 
nothing at all.
Master: Apparently it takes around 3.2 seconds to read all (which 
outlines that the first pass, patched, read few). On the second pass 
it's already in cache, so it goes fast. you're correct. But given it 
still required 2803ms, there is something.
You may want to test the status with vm_relation_cachestat() [7], it's 
in a branch, not main or master. It requires linux 6.5, but allows to 
get information about memory eviction, which is super handy (and super 
fast)!
It returns:
  - nr_cache is Number of cached pages
  - nr_dirty is Number of dirty pages
  - nr_writeback is Number of pages marked for writeback
  - nr_evicted is Number of pages evicted from the cache
  - nr_recently_evicted is Number of pages recently evicted from the cache
/*
  * A page is recently evicted if its last eviction was recent enough 
that its
  * reentry to the cache would indicate that it is actively being used 
by the
  * system, and that there is memory pressure on the system.
  */

WILLNEED posix fadvise flag leads to what used to be call "prefetch": 
reading from disk, and put into VM. (it's not as simple, but this is the 
idea).
DONTNEED flushes from VM.

Might be interesting to compare with prewarm called on each block of the 
relation, one way to do it with current path is to change the constant:
#define PREWARM_PREFETCH_RANGE    RELSEG_SIZE

RELSEG_SIZE is 131071 IIRC

Here you can set to 1 and you'll have prewarm working on all pages, one 
by one, which should be similar to current behavior.
In pgfincore I have a "range" parameter for that purpose so end-user can 
adjust exactly as desired.
I was not looking after change to prewarm function parameters but if 
it's better...

> I did not test read performance but I am planning to do that soon.


Nice, thank you for the effort!

>>> [1] https://man7.org/linux/man-pages/man2/posix_fadvise.2.html#DESCRIPTION
>> [2] https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L303
> [3]
> CREATE EXTENSION pg_prewarm;
> drop table if exists foo;
> create table foo ( id int, c text) with (autovacuum_enabled=false);
> insert into foo select i, repeat('a', 1000) from generate_series(1,10000000)i;
>
> [4] echo 3 | sudo tee /proc/sys/vm/drop_caches
>
> [5] select pg_prewarm('foo', 'prefetch', 'main');
>
> [6] https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L232

[7] 
https://github.com/klando/pgfincore/blob/vm_relation_cachestat/pgfincore--1.3.1--1.4.0.sql#L54

---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D

Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

From

Nazir Bilal Yavuz

Date:

15 March, 15:12:31

Hi,

On Thu, 7 Mar 2024 at 15:26, Cédric Villemain
<Cedric.Villemain@abcsql.com> wrote:
>
> On 07/03/2024 12:19, Nazir Bilal Yavuz wrote:
> >
> > I did not test read performance but I am planning to do that soon.

I did not have the time to check other things you mentioned but I
tested the read performance. The table size is 5.5GB, I did 20 runs in
total.

When the OS cache is cleared:

Master -> Median: 2266.293ms, Avg: 2265.5038ms
Patched -> Median: 2166.493ms, Avg: 2183.6208ms

When the buffers are in the OS cache:

Master -> Median: 585.719ms, Avg: 583.5032ms
Patched -> Median: 533.071ms, Avg: 532.7119ms

Patched version is better on both. ~4% when the OS cache is cleared,
~%9 when the buffers are in the OS cache.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

From

"Andrey M. Borodin"

Date:

07 April, 08:29:40

> On 15 Mar 2024, at 17:12, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> I did not have the time to check other things you mentioned but I
> tested the read performance. The table size is 5.5GB, I did 20 runs in
> total.

Hi Nazir!

Do you plan to review anything else? Or do you think it worth to look at by someone else? Or is the patch Ready for
Committer?If so, please swap CF entry [0] to status accordingly, currently it's "Waiting on Author". 

Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/47/4763/

Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

From

Nazir Bilal Yavuz

Date:

07 April, 11:26:34

Hi Andrey,

On Sun, 7 Apr 2024 at 08:29, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
>
>
> > On 15 Mar 2024, at 17:12, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > I did not have the time to check other things you mentioned but I
> > tested the read performance. The table size is 5.5GB, I did 20 runs in
> > total.
>
> Hi Nazir!
>
> Do you plan to review anything else? Or do you think it worth to look at by someone else? Or is the patch Ready for
Committer?If so, please swap CF entry [0] to status accordingly, currently it's "Waiting on Author".
 

Thanks for reminding me! I think this needs review by someone else
(especially the prefetch part) so I changed it to 'Needs review'.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft