
From: Andres Freund
Subject: Re: Direct I/O
Date: 2023-04-19
Msg-id: 20230419165438.x4x7dix7wqkozuhv@awork3.anarazel.de
In response to: Re: Direct I/O (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 2023-04-19 10:11:32 -0400, Robert Haas wrote:
> On this point specifically, one fairly large problem that we have
> currently is that our buffer replacement algorithm is terrible. In
> workloads I've examined, either almost all buffers end up with a usage
> count of 5 or almost all buffers end up with a usage count of 0 or 1.
> Either way, we lose all or nearly all information about which buffers
> are actually hot, and we are not especially unlikely to evict some
> extremely hot buffer. This is quite bad for performance as it is, and
> it would be a lot worse if recovering from a bad eviction decision
> routinely required rereading from disk instead of only rereading from
> the OS buffer cache.

Interestingly, I haven't seen that as much in more recent benchmarks as I used
to. Partially, I think, because common s_b settings have gotten bigger. But I
wonder if we also accidentally improved something else, e.g. by pinning and
unpinning the same buffer a bit less frequently.


> I've sometimes wondered whether our current algorithm is just a more
> expensive version of random eviction. I suspect that's a bit too
> pessimistic, but I don't really know.

I am quite certain that it's better than that. If you e.g. have a pkey lookup
workload with a working set >> RAM, you can actually end up seeing inner index
pages staying reliably in s_b. But clearly we can do better.


I do think we likely should (as IIRC Peter Geoghegan suggested) provide more
information to the buffer replacement layer:

Independent of the concrete buffer replacement algorithm, the recency
information we do provide is somewhat lacking. In some places we do repeated
ReadBuffer() calls for the same buffer, which over-inflates the usagecount.
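
For anybody who doesn't have the mechanism paged in, the scheme is roughly the
following. This is a heavily simplified sketch of the clock-sweep logic, not
the actual bufmgr.c/freelist.c code (which does this with atomics on the
buffer state word and bails out if everything is pinned); the names are
illustrative only:

#define MAX_USAGE_COUNT 5       /* corresponds to BM_MAX_USAGE_COUNT */

typedef struct BufferSketch
{
    int         refcount;       /* current pins */
    int         usage_count;    /* bumped on each pin, capped */
} BufferSketch;

/*
 * Every ReadBuffer() of an already-cached page ends up pinning the buffer,
 * so repeated ReadBuffer() calls for the same buffer bump usage_count
 * multiple times even when they amount to a single logical access.
 */
static void
pin_buffer(BufferSketch *buf)
{
    buf->refcount++;
    if (buf->usage_count < MAX_USAGE_COUNT)
        buf->usage_count++;
}

/*
 * Clock sweep: keep decrementing usage counts until we find an unpinned
 * buffer with usage_count == 0 - that one becomes the eviction victim.
 */
static BufferSketch *
clock_sweep(BufferSketch *buffers, int nbuffers, int *hand)
{
    for (;;)
    {
        BufferSketch *buf = &buffers[*hand];

        *hand = (*hand + 1) % nbuffers;

        if (buf->refcount == 0)
        {
            if (buf->usage_count == 0)
                return buf;
            buf->usage_count--;
        }
    }
}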

We should seriously consider taking the cost of the IO into account, basically
making it more likely that a buffer's usagecount is increased when we needed
to synchronously wait for the IO. The cost of a miss is much lower for e.g. a
sequential scan or a bitmap heap scan, because both can do some form of
prefetching, whereas index pages and the heap fetches for plain index scans
aren't prefetchable (which could be improved some, but not generally).
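
To illustrate the direction - purely hypothetical, nothing like this exists
today, and the weights are made up - the bump from the sketch above could take
the miss cost into account along these lines:

/*
 * Hypothetical variant of pin_buffer() from the sketch above (reusing
 * BufferSketch / MAX_USAGE_COUNT): weight the usagecount bump by whether
 * filling the buffer forced a synchronous IO wait.
 */
static void
pin_buffer_weighted(BufferSketch *buf, bool synchronous_io_wait)
{
    int         bump = synchronous_io_wait ? 2 : 1;

    buf->refcount++;

    /*
     * Prefetchable misses (seqscans, bitmap heap scans) are cheap to repeat,
     * so they shouldn't crowd out buffers whose misses stall the backend
     * (inner index pages, heap fetches of plain index scans).
     */
    if (buf->usage_count + bump > MAX_USAGE_COUNT)
        buf->usage_count = MAX_USAGE_COUNT;
    else
        buf->usage_count += bump;
}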


> I'm not saying that it isn't possible to fix this. I bet it is, and I
> hope someone does. I'm just making the point that even if we knew the
> amount of kernel memory pressure and even if we also had the ability
> to add and remove shared_buffers at will, it probably wouldn't help
> much as things stand today, because we're not in a good position to
> judge how large the cache would need to be in order to be useful, or
> what we ought to be storing in it.

FWIW, my experience is that Linux's page replacement doesn't work very well
either. Partially because we "hide" a lot of the recency information from
it, but also just because it doesn't scale all that well to large amounts of
memory (there's ongoing work on that though).  So I am not really convinced by
this argument - for plenty of workloads just caching in PG will be far better
than caching both in the kernel and in PG, as long as some adaptiveness to
memory pressure avoids running into OOMs.

Some forms of adaptive s_b sizing aren't particularly hard, I think. Instead
of actually changing the s_b shmem allocation - which would be very hard in a
process-based model - we can tell the kernel that some parts of that memory
aren't currently in use with madvise(MADV_REMOVE).  It's not quite as trivial
as it sounds, because we'd have to free in multiples of huge_page_size.
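
Roughly along these lines - just a sketch, the helper name and the hard-coded
2MB huge page size are made up, and the real thing would obviously have to
evict the affected buffers and coordinate with the buffer mapping first:

#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE  (2 * 1024 * 1024)   /* assume 2MB huge pages */

/*
 * Give a huge-page-aligned slice of the s_b shmem mapping back to the
 * kernel.  MADV_REMOVE frees the backing pages of a shared mapping; the
 * range reads back as zeroes and only consumes memory again once touched.
 */
static int
shrink_shared_buffers(void *shmem_base, size_t offset, size_t len)
{
    uintptr_t   start = (uintptr_t) shmem_base + offset;
    uintptr_t   aligned_start;
    uintptr_t   aligned_end;

    /* madvise() has to cover whole huge pages: round start up, end down */
    aligned_start = (start + HUGE_PAGE_SIZE - 1) & ~((uintptr_t) (HUGE_PAGE_SIZE - 1));
    aligned_end = (start + len) & ~((uintptr_t) (HUGE_PAGE_SIZE - 1));

    if (aligned_end <= aligned_start)
        return 0;               /* nothing big enough to give back */

    return madvise((void *) aligned_start, aligned_end - aligned_start,
                   MADV_REMOVE);
}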

Greetings,

Andres Freund


