Thread: linux cachestat in file Readv and Prefetch

linux cachestat in file Readv and Prefetch

From

Cedric Villemain

Date:

18 January 2024, 00:25:41

Hi,

I was testing the index prefetch and streamIO patches and I added 
cachestat() syscall to get a better view of the prefetching.

It's a new linux syscall, it requires 6.5, it provides numerous 
interesting information from the VM for the range of pages examined.
It's way way faster than the old mincore() and provides much more 
valuable information:

     uint64 nr_cache;        /* Number of cached pages */
     uint64 nr_dirty;           /* Number of dirty pages */
     uint64 nr_writeback;  /* Number of pages marked for writeback. */
     uint64 nr_evicted;       /* Number of pages evicted from the cache. */
     /*
     * Number of recently evicted pages. A page is recently evicted if its
     * last eviction was recent enough that its reentry to the cache would
     * indicate that it is actively being used by the system, and that there
     * is memory pressure on the system.
     */
     uint64 nr_recently_evicted;


While here I also added some quick tweaks to suspend prefetching on 
memory pressure.
It's working but I have absolutely not checked the performance impact of 
my additions.

Sharing here for others to tests and adding in CF in case there is 
interest to go further in this direction.


---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D

-- 
---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D

Attachment

v1-0001-Add-cachestat-requires-linux-6.5.patch

Re: linux cachestat in file Readv and Prefetch

From

Tomas Vondra

Date:

17 February 2024, 23:10:44

On 1/18/24 01:25, Cedric Villemain wrote:
> Hi,
>
> I was testing the index prefetch and streamIO patches and I added
> cachestat() syscall to get a better view of the prefetching.
>
> It's a new linux syscall, it requires 6.5, it provides numerous
> interesting information from the VM for the range of pages examined.
> It's way way faster than the old mincore() and provides much more
> valuable information:
>
>     uint64 nr_cache;        /* Number of cached pages */
>     uint64 nr_dirty;           /* Number of dirty pages */
>     uint64 nr_writeback;  /* Number of pages marked for writeback. */
>     uint64 nr_evicted;       /* Number of pages evicted from the cache. */
>     /*
>     * Number of recently evicted pages. A page is recently evicted if its
>     * last eviction was recent enough that its reentry to the cache would
>     * indicate that it is actively being used by the system, and that
there
>     * is memory pressure on the system.
>     */
>     uint64 nr_recently_evicted;
>
>
> While here I also added some quick tweaks to suspend prefetching on
> memory pressure.

I may be missing some important bit behind this idea, but this does not
seem like a great idea to me. The comment added to FilePrefetch says this:

  /*
   * last time we visit this file (somewhere), nr_recently_evicted pages
   * of the range were just removed from vm cache, it's a sign a memory
   * pressure. so do not prefetch further.
   * it is hard to guess if it is always the right choice in absence of
   * more information like:
   *  - prefetching distance expected overall
   *  - access pattern/backend maybe
   */

Firstly, is this even a good way to detect memory pressure? It's clearly
limited to a single 1GB segment, so what's the chance we'll even see the
"pressure" on a big database with many files?

If we close/reopen the file (which on large databases we tend to do very
often) how does that affect the data reported for the file descriptor?

I'm not sure I even agree with the idea that we should stop prefetching
when there is memory pressure. IMHO it's perfectly fine to keep
prefetching stuff even if it triggers eviction of unnecessary pages from
page cache. That's kinda why the eviction exists.

> It's working but I have absolutely not checked the performance impact of
> my additions.
>

Well ... I'd argue at least some basic evaluation of performance is a
rather important / expected part of a proposal for a patch that aims to
improve a performance-focused feature. It's impossible to have any sort
of discussion about such patch without that.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: linux cachestat in file Readv and Prefetch

From

Robert Haas

Date:

14 March 2024, 18:04:15

On Sat, Feb 17, 2024 at 6:10 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> I may be missing some important bit behind this idea, but this does not
> seem like a great idea to me. The comment added to FilePrefetch says this:
>
>   /*
>    * last time we visit this file (somewhere), nr_recently_evicted pages
>    * of the range were just removed from vm cache, it's a sign a memory
>    * pressure. so do not prefetch further.
>    * it is hard to guess if it is always the right choice in absence of
>    * more information like:
>    *  - prefetching distance expected overall
>    *  - access pattern/backend maybe
>    */
>
> Firstly, is this even a good way to detect memory pressure? It's clearly
> limited to a single 1GB segment, so what's the chance we'll even see the
> "pressure" on a big database with many files?
>
> If we close/reopen the file (which on large databases we tend to do very
> often) how does that affect the data reported for the file descriptor?
>
> I'm not sure I even agree with the idea that we should stop prefetching
> when there is memory pressure. IMHO it's perfectly fine to keep
> prefetching stuff even if it triggers eviction of unnecessary pages from
> page cache. That's kinda why the eviction exists.

I agree with all of these criticisms. I think it's the job of
pg_prewarm to do what the user requests, not to second-guess whether
the user requested the right thing. One of the things that frustrates
people about the ring-buffer system is that it's hard to get all of
your data cached in shared_buffers by just reading it, e.g. SELECT *
FROM my_table. If pg_prewarm also isn't guaranteed to actually read
your data, and may decide that your data didn't need to be read after
all, then what exactly is a user supposed to do if they're absolutely
sure that they know better than PostgreSQL itself and want to
guarantee that their data actually does get read?

So I think a feature like this would at the very least need to be
optional, but it's unclear to me why we'd want it at all, and I feel
like Cedric's email doesn't really answer that question. I suppose
that if you could detect useless prefetching and skip it, you'd save a
bit of work, but under what circumstances does anyone use pg_prewarm
so aggressively as to make that a problem, and why wouldn't the
solution be for the user to just calm down a little bit? There
shouldn't be any particular reason why the user can't know both the
size of shared_buffers and the approximate size of the OS cache;
indeed, they can probably know the latter much more reliably than
PostgreSQL itself can. So it should be fairly easy to avoid just
prefetching more data than will fit, and then you don't have to worry
about any of this. And you'll probably get a better result, too,
because, along the same lines as Tomas's remarks above, I doubt that
this would be an accurate method anyway.

> Well ... I'd argue at least some basic evaluation of performance is a
> rather important / expected part of a proposal for a patch that aims to
> improve a performance-focused feature. It's impossible to have any sort
> of discussion about such patch without that.

Right.

I'm going to mark this patch as Rejected in the CommitFest application
for now. If in subsequent discussion that comes to seem like the wrong
result, then we can revise accordingly, but right now it looks
extremely unlikely to me that this is something that we'd want.

--
Robert Haas
EDB: http://www.enterprisedb.com