Re: Direct I/O - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: Direct I/O |
| Msg-id | 20230419165438.x4x7dix7wqkozuhv@awork3.anarazel.de |
| In response to | Re: Direct I/O (Robert Haas <robertmhaas@gmail.com>) |
| List | pgsql-hackers |
Hi,

On 2023-04-19 10:11:32 -0400, Robert Haas wrote:
> On this point specifically, one fairly large problem that we have
> currently is that our buffer replacement algorithm is terrible. In
> workloads I've examined, either almost all buffers end up with a usage
> count of 5 or almost all buffers end up with a usage count of 0 or 1.
> Either way, we lose all or nearly all information about which buffers
> are actually hot, and we are not especially unlikely to evict some
> extremely hot buffer. This is quite bad for performance as it is, and
> it would be a lot worse if recovering from a bad eviction decision
> routinely required rereading from disk instead of only rereading from
> the OS buffer cache.

Interestingly, I haven't seen that as much in more recent benchmarks as I used to. Partially, I suspect, because common s_b settings have gotten bigger. But I wonder if we also accidentally improved something else, e.g. by pinning/unpinning the same buffer a bit less frequently.

> I've sometimes wondered whether our current algorithm is just a more
> expensive version of random eviction. I suspect that's a bit too
> pessimistic, but I don't really know.

I am quite certain that it's better than that. If you e.g. have a pkey lookup workload >> RAM, you can actually end up seeing inner index pages staying reliably in s_b. But clearly we can do better.

I do think we likely should (as IIRC Peter Geoghegan suggested) provide more information to the buffer replacement layer: independent of the concrete buffer replacement algorithm, the recency information we do provide is somewhat lacking. In some places we do repeated ReadBuffer() calls for the same buffer, which over-inflates the usagecount.

We should seriously consider taking the cost of the IO into account, basically making it more likely that s_b is increased when we need to synchronously wait for IO. The cost of a miss is much lower for e.g. a sequential scan or a bitmap heap scan, because both can do some form of prefetching, whereas index pages and the heap fetches for plain index scans aren't prefetchable (which could be improved some, but not in general).

> I'm not saying that it isn't possible to fix this. I bet it is, and I
> hope someone does. I'm just making the point that even if we knew the
> amount of kernel memory pressure and even if we also had the ability
> to add and remove shared_buffers at will, it probably wouldn't help
> much as things stand today, because we're not in a good position to
> judge how large the cache would need to be in order to be useful, or
> what we ought to be storing in it.

FWIW, my experience is that Linux's page replacement doesn't work very well either, partially because we "hide" a lot of the recency information from it, but also just because it doesn't scale all that well to large amounts of memory (there is ongoing work on that, though). So I am not really convinced by this argument: for plenty of workloads, just caching in PG will be far better than caching both in the kernel and in PG, as long as some adaptiveness to memory pressure avoids running into OOMs.

Some forms of adaptive s_b sizing aren't particularly hard, I think. Instead of actually changing the s_b shmem allocation, which would be very hard in a process-based model, we can tell the kernel that some parts of that memory aren't currently in use, with madvise(MADV_REMOVE). It's not quite as trivial as it sounds, because we'd have to free in multiples of huge_page_size.

Greetings,

Andres Freund
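As background for the usage-count discussion above, here is a minimal, self-contained sketch of a clock-sweep replacement loop in the style PostgreSQL uses: usage counts are capped at 5, bumped on access, and decremented as the clock hand passes. It is an illustration only, not the actual StrategyGetBuffer() code; BumpUsage and EvictVictim are invented names, and pinning/locking is heavily simplified.

```c
/*
 * Simplified clock-sweep sketch, loosely modeled on PostgreSQL's buffer
 * replacement (usage counts capped at 5).  Illustration only.
 */
#include <stdint.h>

#define NBUFFERS      1024
#define MAX_USAGE_CNT 5

typedef struct
{
    uint8_t usage_count;    /* bumped on access, capped at MAX_USAGE_CNT */
    uint8_t pinned;         /* nonzero while the buffer is in use */
} BufferDesc;

static BufferDesc buffers[NBUFFERS];
static int        clock_hand;

/* Called on access; repeated accesses quickly saturate the count at 5. */
void
BumpUsage(int buf)
{
    if (buffers[buf].usage_count < MAX_USAGE_CNT)
        buffers[buf].usage_count++;
}

/*
 * Sweep until an unpinned buffer with usage_count == 0 is found, decrementing
 * counts along the way.  (The real code gives up if every buffer turns out to
 * be pinned; that check is omitted here.)
 */
int
EvictVictim(void)
{
    for (;;)
    {
        BufferDesc *b = &buffers[clock_hand];
        int         victim = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;

        if (b->pinned)
            continue;
        if (b->usage_count > 0)
        {
            b->usage_count--;   /* grant one more trip around the clock */
            continue;
        }
        return victim;          /* count reached 0: evict this buffer */
    }
}
```

The failure mode described in the quoted text falls out of this structure: if accesses are frequent enough, every count saturates at 5; if the hand sweeps faster than buffers are re-accessed, every count sits at 0 or 1, and either way the counts no longer discriminate hot from cold.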
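For the adaptive s_b sizing idea at the end of the message, a rough sketch of what returning an unused slice of the buffer pool to the kernel could look like. ShrinkSharedBuffers and its parameters are hypothetical names for illustration; the only real interface used is madvise(2) with MADV_REMOVE, and the sketch assumes huge_page_size is a power of two.

```c
/*
 * Hypothetical sketch of giving an unused part of shared_buffers back to the
 * kernel, as discussed above.  ShrinkSharedBuffers is an invented name, not a
 * PostgreSQL API; only madvise(2) with MADV_REMOVE is real.
 */
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Advise the kernel that [shmem_base + offset, shmem_base + offset + len) of
 * the shared buffer pool mapping is currently unused.  MADV_REMOVE punches a
 * hole in the backing store, so the memory is actually freed rather than just
 * marked reclaimable, but the range must cover whole huge pages, hence the
 * alignment dance below (huge_page_size is assumed to be a power of two).
 */
int
ShrinkSharedBuffers(void *shmem_base, size_t offset, size_t len,
                    size_t huge_page_size)
{
    uintptr_t start = (uintptr_t) shmem_base + offset;

    /* Round the start up, and shrink the length, to huge page boundaries. */
    uintptr_t aligned_start = (start + huge_page_size - 1) & ~(huge_page_size - 1);
    size_t    slack = aligned_start - start;

    if (len <= slack)
        return 0;               /* no whole huge page to give back */

    len = (len - slack) & ~(huge_page_size - 1);
    if (len == 0)
        return 0;

    return madvise((void *) aligned_start, len, MADV_REMOVE);
}
```

Subsequent accesses to a released range see zero-filled pages, so the buffer manager would also have to make sure nothing in that range is still treated as containing valid buffers.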