Re: Direct I/O - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Direct I/O
Msg-id CA+TgmoYa5HTvrBEFD6VSVVAvB_kX+vsrVUk6=kt8gXrOVQR3zA@mail.gmail.com
In response to Re: Direct I/O  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Wed, Apr 19, 2023 at 12:54 PM Andres Freund <andres@anarazel.de> wrote:
> Interestingly, I haven't seen that as much in more recent benchmarks as it
> used to. Partially I think because common s_b settings have gotten bigger, I
> guess. But I wonder if we also accidentally improved something else, e.g. by
> pin/unpin-ing the same buffer a bit less frequently.

I think the problem with the algorithm is pretty fundamental. The rate
of usage count increase is tied to how often we access buffers, and
the rate of usage count decrease is tied to buffer eviction. But a
given workload can have no eviction at all (in which case the usage
counts must converge to 5) or it can evict on every buffer access (in
which case the usage counts must mostly converge to 0, because we'll
be decreasing usage counts at least once per buffer and generally
more). ISTM that the only way that you can end up with a good mix of
usage counts is if you have a workload that falls into some kind of a
sweet spot where the rate of usage count bumps and the rate of usage
count de-bumps are close enough together that things don't skew all
the way to one end or the other. Bigger s_b might make that more
likely to happen in practice, but it seems like bad algorithm design
on a theoretical level. We should be comparing the frequency of access
of buffers to the frequency of access of other buffers, not to the
frequency of buffer eviction. Or to put the same thing another way,
the average value of the usage count shouldn't suddenly change from 5
to 1 when the server evicts 1 buffer.
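
To make that concrete, here's a toy clock-sweep simulation -- not
PostgreSQL code, just a 100-buffer pool with the usual 0..5 usage
count cap -- run against the two extreme workloads:

#include <stdio.h>

#define NBUFFERS 100
#define MAX_USAGE 5

static int usage[NBUFFERS];
static int hand;

/* Bump the usage count on access, capped at MAX_USAGE. */
static void access_buffer(int buf)
{
    if (usage[buf] < MAX_USAGE)
        usage[buf]++;
}

/* Clock sweep: decrement usage counts until a count-0 victim turns up. */
static int evict_one(void)
{
    for (;;)
    {
        int buf = hand;

        hand = (hand + 1) % NBUFFERS;
        if (usage[buf] == 0)
            return buf;
        usage[buf]--;
    }
}

static double avg_usage(void)
{
    long total = 0;

    for (int i = 0; i < NBUFFERS; i++)
        total += usage[i];
    return (double) total / NBUFFERS;
}

static void reset(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        usage[i] = 0;
    hand = 0;
}

int main(void)
{
    /* Workload 1: the working set fits, so nothing is ever evicted. */
    reset();
    for (int i = 0; i < 100000; i++)
        access_buffer(i % NBUFFERS);
    printf("no eviction:         avg usage count = %.2f\n", avg_usage());

    /* Workload 2: every access is a miss, so every access also evicts. */
    reset();
    for (int i = 0; i < 100000; i++)
        access_buffer(evict_one());
    printf("eviction per access: avg usage count = %.2f\n", avg_usage());

    return 0;
}

Here the first run prints an average usage count of 5.00 and the
second roughly 1.00, which is the sudden jump from 5 to 1 I mean.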

> I do think we likely should (as IIRC Peter Geoghegan suggested) provide more
> information to the buffer replacement layer:
>
> Independent of the concrete buffer replacement algorithm, the recency
> information we do provide is somewhat lacking. In some places we do repeated
> ReadBuffer() calls for the same buffer, leading to over-inflating usagecount.

Yeah, that would be good to fix. I don't think it solves the whole
problem by itself, but it seems like a good step.
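
Just as an illustrative sketch of one shape that fix could take -- the
names here are made up for the example, this isn't anything that
exists in bufmgr.c -- a scan could remember the buffer it bumped most
recently and skip the bump when it immediately re-reads the same page:

#include <stdio.h>

#define MAX_USAGE 5

/* Invented for illustration -- not PostgreSQL's actual API. */
typedef struct ScanRecencyState
{
    int last_buffer;   /* buffer this scan bumped most recently, or -1 */
} ScanRecencyState;

static int usage[16];

/*
 * Bump a buffer's usage count, but only once per streak of accesses to
 * the same buffer, so a loop re-reading one page doesn't inflate it.
 */
static void bump_usage_once(ScanRecencyState *state, int buf)
{
    if (buf == state->last_buffer)
        return;
    if (usage[buf] < MAX_USAGE)
        usage[buf]++;
    state->last_buffer = buf;
}

int main(void)
{
    ScanRecencyState state = { .last_buffer = -1 };

    for (int i = 0; i < 10; i++)
        bump_usage_once(&state, 3);    /* ten re-reads of one buffer */
    bump_usage_once(&state, 7);

    printf("usage[3] = %d, usage[7] = %d\n", usage[3], usage[7]);
    return 0;
}

That collapses the ten re-reads into a single bump, so the buffer
isn't valued any higher than one the scan touched once.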

> We should seriously consider using the cost of the IO into account, basically
> making it more likely that s_b is increased when we need to synchronously wait
> for IO. The cost of a miss is much lower for e.g. a sequential scan or a
> bitmap heap scan, because both can do some form of prefetching. Whereas index
> pages and the heap fetch for plain index scans aren't prefetchable (which
> could be improved some, but not generally).

I guess that's reasonable if we can pass the information around well
enough, but I still think the algorithm ought to get some fundamental
improvement first.
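
As a strawman for what taking the I/O cost into account might look
like -- again, invented names and weights, not proposed code -- the
usage bump could be scaled by whether a miss on that page would have
been prefetchable:

#include <stdio.h>

#define MAX_USAGE 10       /* a wider range than today's 0..5 cap */

/* Invented for illustration -- not PostgreSQL's actual API. */
typedef enum AccessKind
{
    ACCESS_PREFETCHABLE,   /* seq scan, bitmap heap scan, ... */
    ACCESS_SYNCHRONOUS     /* index page, plain index scan's heap fetch */
} AccessKind;

/* A synchronous miss costs a full I/O wait, so value the page more. */
static int usage_bump_for(AccessKind kind)
{
    return (kind == ACCESS_SYNCHRONOUS) ? 3 : 1;
}

static void bump_usage(int *usage, AccessKind kind)
{
    *usage += usage_bump_for(kind);
    if (*usage > MAX_USAGE)
        *usage = MAX_USAGE;
}

int main(void)
{
    int usage = 0;

    bump_usage(&usage, ACCESS_PREFETCHABLE);    /* +1 */
    bump_usage(&usage, ACCESS_SYNCHRONOUS);     /* +3 */
    printf("usage = %d\n", usage);
    return 0;
}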

> FWIW, my experience is that linux' page replacement doesn't work very well
> either. Partially because we "hide" a lot of the recency information from
> it. But also just because it doesn't scale all that well to large amounts of
> memory (there's ongoing work on that though).  So I am not really convinced by
> this argument - for plenty workloads just caching in PG will be far better
> than caching both in the kernel and in PG, as long as some adaptiveness to
> memory pressure avoids running into OOMs.

Even if the Linux algorithm is bad, and it may well be, the Linux
cache is often a lot bigger than our cache. Which can cover a
multitude of problems.

--
Robert Haas
EDB: http://www.enterprisedb.com


