Thread: Increase default maintenance_io_concurrency to 16
On Tue, Mar 18, 2025 at 04:13:26PM -0400, Andres Freund wrote:
> Hi,
>
> On 2025-03-18 16:08:22 -0400, Bruce Momjian wrote:
> > This commit makes our default random_page_cost = 4 out of line with
> > these new settings (assumes modern SSD/NAS/SAN hardware) and more out of
> > line with reality.
>
> How so? That seems like an independent consideration to me.

[thread moved to hackers]

Uh, I think our old random_page_cost and *_io_concurrency assumed
magnetic disks --- now *_io_concurrency assumes more modern hardware and
random_page_cost assumes magnetic.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.
Hi,

On 2025-03-18 16:22:45 -0400, Bruce Momjian wrote:
> On Tue, Mar 18, 2025 at 04:13:26PM -0400, Andres Freund wrote:
> > Hi,
> >
> > On 2025-03-18 16:08:22 -0400, Bruce Momjian wrote:
> > > This commit makes our default random_page_cost = 4 out of line with
> > > these new settings (assumes modern SSD/NAS/SAN hardware) and more out of
> > > line with reality.
> >
> > How so? That seems like an independent consideration to me.
>
> [thread moved to hackers]
>
> Uh, I think our old random_page_cost and *_io_concurrency assumed
> magnetic disks --- now *_io_concurrency assumes more modern hardware and
> random_page_cost assumes magnetic.

The cost difference between random and non-random IO is actually still
reasonably accurate with NVMEs. You can argue that random_page_cost should be
2.5, but that really depends on the specific hardware.

Particularly for cloud style networked storage, you could even argue that the
difference between sequential and random IO has *grown* given recent changes
in PG (io combining in PG 17), as random IOs much more quickly lead to
exhausting IOPS quotas.

I still don't think adjusting random_page_cost has any meaningful relation to
the change at hand.

Greetings,

Andres Freund
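To make the knobs under discussion concrete, here is a minimal psql sketch for
checking the shipped defaults and trying a lower random_page_cost in a single
session; the 2.5 figure is the illustrative value from the message above, not
a recommendation:

  -- Current values of the settings being discussed.
  SELECT name, setting, boot_val
    FROM pg_settings
   WHERE name IN ('random_page_cost',
                  'seq_page_cost',
                  'effective_io_concurrency',
                  'maintenance_io_concurrency');

  -- Session-only experiment; compare EXPLAIN output before and after
  -- to see which index-vs-seqscan decisions flip.
  SET random_page_cost = 2.5;

Nothing here changes the cluster; making it stick would require ALTER SYSTEM
or an edit to postgresql.conf.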
On Tue, Mar 18, 2025 at 04:27:18PM -0400, Andres Freund wrote:
> Hi,
>
> On 2025-03-18 16:22:45 -0400, Bruce Momjian wrote:
> > On Tue, Mar 18, 2025 at 04:13:26PM -0400, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2025-03-18 16:08:22 -0400, Bruce Momjian wrote:
> > > > This commit makes our default random_page_cost = 4 out of line with
> > > > these new settings (assumes modern SSD/NAS/SAN hardware) and more out of
> > > > line with reality.
> > >
> > > How so? That seems like an independent consideration to me.
> >
> > [thread moved to hackers]
> >
> > Uh, I think our old random_page_cost and *_io_concurrency assumed
> > magnetic disks --- now *_io_concurrency assumes more modern hardware and
> > random_page_cost assumes magnetic.
>
> The cost difference between random and non-random IO is actually still
> reasonably accurate with NVMEs. You can argue that random_page_cost should be
> 2.5, but that really depends on the specific hardware.

Uh, the random_page_cost = 4 assumes caching, so it is assuming actual
random I/O to be 40x slower, which I doubt is true for SSDs:

	https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS

	Random access to mechanical disk storage is normally much more expensive
	than four times sequential access. However, a lower default is used
	(4.0) because the majority of random accesses to disk, such as indexed
	reads, are assumed to be in cache. The default value can be thought of
	as modeling random access as 40 times slower than sequential, while
	expecting 90% of random reads to be cached.

> Particularly for cloud style networked storage, you could even argue that the
> difference between sequential and random IO has *grown* given recent changes
> in PG (io combining in PG 17), as random IOs much more quickly lead to
> exhausting IOPS quotas.
>
> I still don't think adjusting random_page_cost has any meaningful relation to
> the change at hand.

Okay, I was just asking.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.
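The arithmetic behind that documentation sentence can be checked directly; the
40x and 90% are the documentation's figures, not measurements, and a cached
read is treated as effectively free:

  -- 10% of random reads hit storage at 40x a sequential read,
  -- 90% hit shared buffers / OS cache at ~zero cost:
  SELECT 0.9 * 0 + 0.1 * 40 AS effective_random_page_cost;   -- 4.0

So the 4.0 default only follows from the 40x figure if you also accept the 90%
cache-hit assumption.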
Hi,

On 2025-03-18 16:35:29 -0400, Bruce Momjian wrote:
> Uh, the random_page_cost = 4 assumes caching, so it is assuming actual
> random I/O to be 40x slower, which I doubt is true for SSDs:

Uh, huh:

> 	https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS
>
> 	Random access to mechanical disk storage is normally much more expensive
> 	than four times sequential access. However, a lower default is used
> 	(4.0) because the majority of random accesses to disk, such as indexed
> 	reads, are assumed to be in cache. The default value can be thought of
> 	as modeling random access as 40 times slower than sequential, while
> 	expecting 90% of random reads to be cached.

Is that actually a good description of what we assume? I don't know where that
90% is coming from? Briefly skimming through selfuncs.c and costsize.c I don't
see anything.

The relevant change:

commit c1d9df4fa227781b31be44a5a3024865a7f48049
Author: Bruce Momjian <bruce@momjian.us>
Date:   2012-02-14 16:54:54 -0500

    Document random page cost is only 4x seqeuntial, and not 40x.

The relevant discussion seems to be:
https://postgr.es/m/4F31A05A.1060506%402ndQuadrant.com

But I don't see any origin of that number in that thread.

I am not sure if I found the correct email for Greg Smith?

Greetings,

Andres Freund
On Tue, Mar 18, 2025 at 05:04:46PM -0400, Andres Freund wrote:
> Hi,
>
> On 2025-03-18 16:35:29 -0400, Bruce Momjian wrote:
> > Uh, the random_page_cost = 4 assumes caching, so it is assuming actual
> > random I/O to be 40x slower, which I doubt is true for SSDs:
>
> Uh, huh:
>
> > 	https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS
> >
> > 	Random access to mechanical disk storage is normally much more expensive
> > 	than four times sequential access. However, a lower default is used
> > 	(4.0) because the majority of random accesses to disk, such as indexed
> > 	reads, are assumed to be in cache. The default value can be thought of
> > 	as modeling random access as 40 times slower than sequential, while
> > 	expecting 90% of random reads to be cached.
>
> Is that actually a good description of what we assume? I don't know where that
> 90% is coming from? Briefly skimming through selfuncs.c and costsize.c I don't
> see anything.

The next paragraph is:

	If you believe a 90% cache rate is an incorrect assumption for your
	workload, you can increase random_page_cost to better reflect the true
	cost of random storage reads. Correspondingly, if your data is likely
	to be completely in cache, such as when the database is smaller than
	the total server memory, decreasing random_page_cost can be
	appropriate. Storage that has a low random read cost relative to
	sequential, e.g., solid-state drives, might also be better modeled
	with a lower value for random_page_cost, e.g., 1.1.

> The relevant change:
>
> commit c1d9df4fa227781b31be44a5a3024865a7f48049
> Author: Bruce Momjian <bruce@momjian.us>
> Date:   2012-02-14 16:54:54 -0500
>
>     Document random page cost is only 4x seqeuntial, and not 40x.
>
> The relevant discussion seems to be:
> https://postgr.es/m/4F31A05A.1060506%402ndQuadrant.com
>
> But I don't see any origin of that number in that thread.
>
> I am not sure if I found the correct email for Greg Smith?

Yes, I can't say there is much research behind the value, and even if
there was, the assumed hardware is unlikely to be relevant today.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.
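For reference, applying that 1.1 suggestion looks like the sketch below;
fast_ssd is a hypothetical tablespace name, and nothing in this thread argues
for changing the shipped default:

  -- Cluster-wide, persisted to postgresql.auto.conf:
  ALTER SYSTEM SET random_page_cost = 1.1;
  SELECT pg_reload_conf();

  -- Or scoped to just the tables on a faster tablespace:
  ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1);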
Bruce Momjian <bruce@momjian.us> writes: >>> https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS >>> >>> Random access to mechanical disk storage is normally much more expensive >>> than four times sequential access. However, a lower default is used >>> (4.0) because the majority of random accesses to disk, such as indexed >>> reads, are assumed to be in cache. The default value can be thought of >>> as modeling random access as 40 times slower than sequential, while >>> expecting 90% of random reads to be cached. This explanation is, in fact, made up out of whole cloth. It has diddly-squat to do with where the number came from, although maybe it's an okay rationalization for continuing to use 4.0 on modern hardware. Where the number came from is that I did a bunch of testing back in the late 90s and random_page_cost = 4.0 made the planner's ratios of seqscan vs indexscan costs match up with observed timings. Of course, those observations were made with spinning-rust drives, so you'd not get the same results on SSDs. > Yes, I can't say there is much research behind the value, and even if > there was, the assumed hardware is unlikely to be relevant today. Yes as to the latter, no as to the former. regards, tom lane
On Tue, Mar 18, 2025 at 5:04 PM Andres Freund <andres@anarazel.de> wrote:
> Is that actually a good description of what we assume? I don't know where that
> 90% is coming from?
I agree with your observation that the penalty for a non-sequential read stall on cloud storage is not markedly better than the original random:sequential ratio of mechanical drives. And the PG17 refactoring to improve I/O chunking only magnifies that further.
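As a side note, the PG 17 combining mentioned above is bounded by the io_combine_limit GUC; a minimal sketch, assuming a stock PG 17 install, of checking it and varying it per session while testing on IOPS-metered storage:

  SHOW io_combine_limit;              -- 128kB by default in PG 17
  SET io_combine_limit = '64kB';      -- session-level experiment only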
The part of this problem I'm working on again is assembling a useful mix of workloads so that I can try changing one of these magic constants with higher confidence. My main working set so far is write-performance regression tests against the Open Street Map loading workload I've been blogging about, plus the old read-only SELECT-only queries spaced along a scale/client grid. My experiments so far have been around another Tom special, the maximum buffer usage count limit, which turned into another black hole full of work I have only recently escaped.

I haven't really thought much yet about a workload set that would allow adjusting random_page_cost. On the query side we've been pretty heads-down on the TPC-H and Clickbench sets. I don't have buffer internals data from those yet, though, so I will have to add that to the work queue (see the sketch below for one way to start).
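One low-effort starting point for that buffer internals data, a sketch assuming PG 16 or later where pg_stat_io is available:

  -- Cache-hit vs storage-read breakdown for relation data, per backend
  -- type and I/O context; reset with pg_stat_reset_shared('io') between runs.
  SELECT backend_type, context, reads, hits,
         round(hits::numeric / nullif(hits + reads, 0), 3) AS hit_ratio
    FROM pg_stat_io
   WHERE object = 'relation'
   ORDER BY reads DESC NULLS LAST;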
--
Greg Smith
Director of Open Source Strategy, Crunchy Data