Thread: Increase default maintenance_io_concurrency to 16

Increase default maintenance_io_concurrency to 16

From
Bruce Momjian
Date:
On Tue, Mar 18, 2025 at 04:13:26PM -0400, Andres Freund wrote:
> Hi,
> 
> On 2025-03-18 16:08:22 -0400, Bruce Momjian wrote:
> > This commit makes our default random_page_cost = 4 out of line with
> > these new settings (assumes modern SSD/NAS/SAN hardware) and more out of
> > line with reality.
> 
> How so? That seems like an independent consideration to me.

[thread moved to hackers]

Uh, I think our old random_page_cost and *_io_concurrency assumed
magnetic disks --- now *_io_concurrency assumes more modern hardware and
random_page_cost assumes magnetic.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.



Re: Increase default maintenance_io_concurrency to 16

From
Andres Freund
Date:
Hi,

On 2025-03-18 16:22:45 -0400, Bruce Momjian wrote:
> On Tue, Mar 18, 2025 at 04:13:26PM -0400, Andres Freund wrote:
> > Hi,
> > 
> > On 2025-03-18 16:08:22 -0400, Bruce Momjian wrote:
> > > This commit makes our default random_page_cost = 4 out of line with
> > > these new settings (assumes modern SSD/NAS/SAN hardware) and more out of
> > > line with reality.
> > 
> > How so? That seems like an independent consideration to me.
> 
> [thread moved to hackers]
> 
> Uh, I think our old random_page_cost and *_io_concurrency assumed
> magnetic disks --- now *_io_concurrency assumes more modern hardware and
> random_page_cost assumes magnetic.

The cost difference between random and non-random IO is actually still
reasonably accurate with NVMEs. You can argue that random_page_cost should be
2.5, but that really depends on the specific hardware.

Particularly for cloud-style networked storage, you could even argue that the
difference between sequential and random IO has *grown* given recent changes in
PG (io combining in PG 17), as random IOs much more quickly lead to exhausting
IOPS quotas.
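[Editor's illustration: a back-of-the-envelope sketch of the IOPS-quota point.
The block size is PostgreSQL's default; the combined read size is an assumed
figure for illustration, not one taken from this thread.]

```python
# How io combining changes the IOPS cost of reading 1 GiB on
# quota-limited cloud storage: sequential reads can be merged into
# larger I/Os, random reads cannot.

GIB = 1024 ** 3
block = 8 * 1024       # PostgreSQL default block size: 8 kB
combined = 128 * 1024  # assumed combined read size (illustrative)

random_ios = GIB // block     # every random block fetch is its own I/O
seq_ios = GIB // combined     # sequential blocks merged into larger reads

print(random_ios)              # 131072
print(seq_ios)                 # 8192
print(random_ios // seq_ios)   # 16 -> random drains the quota 16x faster
```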

I still don't think adjusting random_page_cost has any meaningful relation to
the change at hand.

Greetings,

Andres Freund



Re: Increase default maintenance_io_concurrency to 16

From
Bruce Momjian
Date:
On Tue, Mar 18, 2025 at 04:27:18PM -0400, Andres Freund wrote:
> Hi,
> 
> On 2025-03-18 16:22:45 -0400, Bruce Momjian wrote:
> > On Tue, Mar 18, 2025 at 04:13:26PM -0400, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2025-03-18 16:08:22 -0400, Bruce Momjian wrote:
> > > > This commit makes our default random_page_cost = 4 out of line with
> > > > these new settings (assumes modern SSD/NAS/SAN hardware) and more out of
> > > > line with reality.
> > > 
> > > How so? That seems like an independent consideration to me.
> > 
> > [thread moved to hackers]
> > 
> > Uh, I think our old random_page_cost and *_io_concurrency assumed
> > magnetic disks --- now *_io_concurrency assumes more modern hardware and
> > random_page_cost assumes magnetic.
> 
> The cost difference between random and non-random IO is actually still
> reasonably accurate with NVMEs. You can argue that random_page_cost should be
> 2.5, but that really depends on the specific hardware.

Uh, the random_page_cost = 4 assumes caching, so it is assuming actual
random I/O to be 40x slower, which I doubt is true for SSDs:

    https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS
    
    Random access to mechanical disk storage is normally much more expensive
    than four times sequential access. However, a lower default is used
    (4.0) because the majority of random accesses to disk, such as indexed
    reads, are assumed to be in cache. The default value can be thought of
    as modeling random access as 40 times slower than sequential, while
    expecting 90% of random reads to be cached.
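[Editor's illustration: the arithmetic behind that doc paragraph, showing how a
40x raw slowdown and a 90% cache hit rate blend into the 4.0 default.]

```python
# The documented mental model: the effective random_page_cost is the
# raw random-vs-sequential slowdown weighted by the cache miss rate
# (cached reads are treated as essentially free here).

raw_slowdown = 40.0   # random read modeled as 40x a sequential page fetch
cache_hit = 0.90      # 90% of random reads assumed to be cached

effective = (1 - cache_hit) * raw_slowdown + cache_hit * 0.0
print(effective)  # 4.0 -- the default random_page_cost
```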

> Particularly for cloud-style networked storage, you could even argue that the
> difference between sequential and random IO has *grown* given recent changes in
> PG (io combining in PG 17), as random IOs much more quickly lead to exhausting
> IOPS quotas.
> 
> I still don't think adjusting random_page_cost has any meaningful relation to
> the change at hand.

Okay, I was just asking.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.



Re: Increase default maintenance_io_concurrency to 16

From
Andres Freund
Date:
Hi,

On 2025-03-18 16:35:29 -0400, Bruce Momjian wrote:
> Uh, the random_page_cost = 4 assumes caching, so it is assuming actual
> random I/O to be 40x slower, which I doubt is true for SSDs:

Uh, huh:

>     https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS
>
>     Random access to mechanical disk storage is normally much more expensive
>     than four times sequential access. However, a lower default is used
>     (4.0) because the majority of random accesses to disk, such as indexed
>     reads, are assumed to be in cache. The default value can be thought of
>     as modeling random access as 40 times slower than sequential, while
>     expecting 90% of random reads to be cached.

Is that actually a good description of what we assume? I don't know where that
90% is coming from? Briefly skimming through selfuncs.c and costsize.c I don't
see anything.

The relevant change:

commit c1d9df4fa227781b31be44a5a3024865a7f48049
Author: Bruce Momjian <bruce@momjian.us>
Date:   2012-02-14 16:54:54 -0500

    Document random page cost is only 4x seqeuntial, and not 40x.

The relevant discussion seems to be:
https://postgr.es/m/4F31A05A.1060506%402ndQuadrant.com

But I don't see any origin of that number in that thread.

I am not sure if I found the correct email for Greg Smith?

Greetings,

Andres Freund



Re: Increase default maintenance_io_concurrency to 16

From
Bruce Momjian
Date:
On Tue, Mar 18, 2025 at 05:04:46PM -0400, Andres Freund wrote:
> Hi,
> 
> On 2025-03-18 16:35:29 -0400, Bruce Momjian wrote:
> > Uh, the random_page_cost = 4 assumes caching, so it is assuming actual
> > random I/O to be 40x slower, which I doubt is true for SSDs:
> 
> Uh, huh:
> 
> >     https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS
> >
> >     Random access to mechanical disk storage is normally much more expensive
> >     than four times sequential access. However, a lower default is used
> >     (4.0) because the majority of random accesses to disk, such as indexed
> >     reads, are assumed to be in cache. The default value can be thought of
> >     as modeling random access as 40 times slower than sequential, while
> >     expecting 90% of random reads to be cached.
> 
> Is that actually a good description of what we assume? I don't know where that
> 90% is coming from? Briefly skimming through selfuncs.c and costsize.c I don't
> see anything.

The next paragraph is:

    If you believe a 90% cache rate is an incorrect assumption
    for your workload, you can increase random_page_cost to better
    reflect the true cost of random storage reads. Correspondingly,
    if your data is likely to be completely in cache, such as when
    the database is smaller than the total server memory, decreasing
    random_page_cost can be appropriate. Storage that has a low random
    read cost relative to sequential, e.g., solid-state drives, might
    also be better modeled with a lower value for random_page_cost,
    e.g., 1.1.
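[Editor's illustration: the doc's advice maps onto settings like these. The
value 1.1 is the doc's own SSD example, and "fast_ssd" is a hypothetical
tablespace name, not anything proposed in this thread.]

```sql
-- Cluster-wide: model low-random-cost storage, then reload the config.
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

-- Or scope the override to just the storage that differs
-- ("fast_ssd" is a hypothetical tablespace name).
ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1);
```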

> The relevant change:
> 
> commit c1d9df4fa227781b31be44a5a3024865a7f48049
> Author: Bruce Momjian <bruce@momjian.us>
> Date:   2012-02-14 16:54:54 -0500
> 
>     Document random page cost is only 4x seqeuntial, and not 40x.
> 
> The relevant discussion seems to be:
> https://postgr.es/m/4F31A05A.1060506%402ndQuadrant.com
> 
> But I don't see any origin of that number in that thread.
> 
> I am not sure if I found the correct email for Greg Smith?

Yes, I can't say there is much research behind the value, and even if
there was, the assumed hardware is unlikely to be relevant today.
-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.



Re: Increase default maintenance_io_concurrency to 16

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
>>> https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS
>>>
>>> Random access to mechanical disk storage is normally much more expensive
>>> than four times sequential access. However, a lower default is used
>>> (4.0) because the majority of random accesses to disk, such as indexed
>>> reads, are assumed to be in cache. The default value can be thought of
>>> as modeling random access as 40 times slower than sequential, while
>>> expecting 90% of random reads to be cached.

This explanation is, in fact, made up out of whole cloth.  It has
diddly-squat to do with where the number came from, although maybe
it's an okay rationalization for continuing to use 4.0 on modern
hardware.

Where the number came from is that I did a bunch of testing back in
the late 90s and random_page_cost = 4.0 made the planner's ratios of
seqscan vs indexscan costs match up with observed timings.  Of course,
those observations were made with spinning-rust drives, so you'd not
get the same results on SSDs.

> Yes, I can't say there is much research behind the value, and even if
> there was, the assumed hardware is unlikely to be relevant today.

Yes as to the latter, no as to the former.

            regards, tom lane



Re: Increase default maintenance_io_concurrency to 16

From
Gregory Smith
Date:
On Tue, Mar 18, 2025 at 5:04 PM Andres Freund <andres@anarazel.de> wrote:
> Is that actually a good description of what we assume? I don't know where that
> 90% is coming from?

That one's all my fault.  It was an attempt to curve-fit backwards why the 4.0 number Tom set with his initial commit worked as well as it did, given that underlying storage was closer to 50X as slow, and I sold the idea well enough for Bruce to follow the reasoning and commit it.  Back then there was a regular procession of people who measured the actual rate and wondered why there was an order of magnitude difference between those measurements and the parameter.  Pointing them toward thinking in terms of the cached read percentage did a reasonable job of deflecting them onto why the model is more complicated than it seems.  I intended to follow that up with more measurements, only to lose the whole project into a non-disclosure void I have only recently escaped.

I agree with your observation that the underlying cost of a non-sequential read stall on cloud storage is not markedly better than the original random:sequential ratio of mechanical drives.  And the PG17 refactoring to improve I/O chunking worked to magnify that further.

The end of this problem I'm working on again is assembling some useful mix of workloads so that I can try changing one of these magic constants with higher confidence.  My main working set so far is write-performance regression test sets against the Open Street Map loading workload I've been blogging about, plus the old read-only queries of the SELECT-only test, spaced along a scale/client grid.  My experiments so far have been around another Tom special, the maximum buffer usage count limit, which turned into another black hole full of work I have only recently escaped.  I haven't really thought much yet about a workload set that would allow adjusting random_page_cost.  On the query side we've been pretty heads-down on the TPC-H and ClickBench sets.  I don't have buffer-internals data from those yet though; will have to add that to the work queue.

--
Greg Smith
Director of Open Source Strategy, Crunchy Data