Thread: Changing the default random_page_cost value
tl;dr let's assume SSDs are popular and HDDs are the exception and flip our default
As I write this email, it's the year 2024. I think it is time we lower our "default" setting of random_page_cost (as set in postgresql.conf.sample and the docs). Even a decade ago, the current default of 4 was considered fairly conservative and was often lowered. The git log shows that this value was last touched in 2006, during the age of spinning metal. We are now in a new era, the age of SSDs, and thus we should lower this default value to reflect the fact that the vast majority of people using Postgres these days are doing so on solid state drives. We tend to stay ultra-conservative in all of our settings, but we also need to recognize when there has been a major shift in the underlying hardware - and in the calculations our defaults are based on.
Granted, there are other factors involved, and yes, perhaps we should tweak some of the similar settings as well, but random_page_cost is the one setting most out of sync with today's hardware realities. So I'll be brave and throw a number out there: 1.2. And change our docs to use wording like "if you are using an older hard disk drive technology, you may want to try raising rpc" to replace our fairly-hidden note about SSDs buried in the last sentence - of the fourth paragraph - of the rpc docs.
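For anyone who wants to experiment before (or regardless of) any change to the default, here is roughly how you would override it today - a quick sketch, not part of any proposed patch, and the 1.2 value is just my strawman number from above:

-- Cluster-wide override; takes effect after a config reload:
ALTER SYSTEM SET random_page_cost = 1.2;
SELECT pg_reload_conf();

-- Or only for the current session, to compare plans first:
SET random_page_cost = 1.2;
SHOW random_page_cost;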
Real data about performance on today's SSDs are welcome, as are ideas for generating a more accurate default.
Cheers,
Greg
On Fri, Sep 27, 2024 at 8:07 AM Greg Sabino Mullane <htamfids@gmail.com> wrote:
> tl;dr let's assume SSDs are popular and HDDs are the exception and flip our default

<snip>

> Granted, there are other factors involved, and yes, perhaps we should tweak some of the similar settings as well, but random_page_cost is the one setting most out of sync with today's hardware realities. So I'll be brave and throw a number out there: 1.2. And change our docs to use wording like "if you are using an older hard disk drive technology, you may want to try raising rpc" to replace our fairly-hidden note about SSDs buried in the last sentence - of the fourth paragraph - of the rpc docs.
+1
I suggest a slightly nicer comment in the default conf file, like "For spinning hard drives, raise this to at least 3 and test"
Roberto
On Fri, 2024-09-27 at 10:07 -0400, Greg Sabino Mullane wrote:
> So I'll be brave and throw a number out there: 1.2.

+1

Laurenz Albe
Greg Sabino Mullane <htamfids@gmail.com> writes:
> So I'll be brave and throw a number out there: 1.2. And change our
> docs to use wording like "if you are using an older hard disk drive
> technology, you may want to try raising rpc" to replace our
> fairly-hidden note about SSDs buried in the last sentence - of the
> fourth paragraph - of the rpc docs.

It might also be worth mentioning cloudy block storage (e.g. AWS' EBS), which is typically backed by SSDs, but has extra network latency.

- ilmari
On Mon, Sep 30, 2024 at 10:05:29AM -0400, Greg Sabino Mullane wrote:
> On Fri, Sep 27, 2024 at 12:03 PM Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> wrote:
>
> > It might also be worth mentioning cloudy block storage (e.g. AWS' EBS),
> > which is typically backed by SSDs, but has extra network latency.
>
> That seems a little too in the weeds for me, but wording suggestions are
> welcome. To get things moving forward, I made a doc patch which changes a few
> things, namely:
>
> * Mentions the distinction between ssd and hdd right up front.
> * Moves the tablespace talk to the very end, as tablespace use is getting rarer
>   (again, thanks in part to ssds)
> * Mentions the capability to set per-database and per-role since we mention
>   per-tablespace.
> * Removes a lot of the talk of caches and justifications for the 4.0 setting.
>   While those are interesting, I've been tuning this parameter for many years and
>   never really cared about the "90% cache rate". The proof is in the pudding: rpc
>   is the canonical "try it and see" parameter. Tweak. Test. Repeat.

I am not a fan of this patch. I don't see why _removing_ the magnetic part helps, because you then have no logic for why 1.2 was chosen. I would put the magnetic in a separate paragraph perhaps, and recommend 4.0 for it. Also, per-tablespace makes sense because of physical media differences, but what purpose would per-database and per-role serve? Also, per-tablespace is not a connection-activated item like the other two.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means
  "Am I going to die soon?"
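(For reference, the per-tablespace override Bruce is referring to, and the per-database/per-role forms the doc patch mentioned, look roughly like this; the object names here are made up.)

-- Per-tablespace: tied to the physical media the tablespace lives on.
ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1);

-- Per-database / per-role: applied when a session connects, not tied to storage.
ALTER DATABASE analytics SET random_page_cost = 1.1;
ALTER ROLE reporting_user SET random_page_cost = 1.1;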
On Tue, 15 Oct 2024 at 10:15, Bruce Momjian <bruce@momjian.us> wrote:
> I am not a fan of this patch. I don't see why _removing_ the magnetic
> part helps, because you then have no logic for why 1.2 was chosen. I
> would put the magnetic in a separate paragraph perhaps, and recommend
> 4.0 for it. Also, per-tablespace makes sense because of physical media
> differences, but what purpose would per-database and per-role serve?
> Also, per-tablespace is not a connection-activated item like the other
> two.

Yeah, I think any effort to change the default value for this setting would require some analysis to prove that the newly proposed default is a more suitable setting than the current default. I mean, why 1.2 and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value for this?

I don't think just providing evidence that random read times are closer to sequential read times on SSDs than they are with HDDs is going to be enough. What we want to know is whether the planner costs become more closely related to the execution time or not. From my experience, random_page_cost really only has a loose grasp on reality, so you might find that it's hard to prove this with any degree of confidence (just have a look at how inconsiderate index_pages_fetched() is to other queries running on the database, for example).

I suggest first identifying all the locations that use random_page_cost, then coming up with some test cases that run queries exercising those locations in a way that, for example, varies the actual selectivity of some value to make the planner switch plans, and then varying random_page_cost to show that the switchover point is more correct with the new value than with the old one. It would be nice to have this as a script so that other people could easily run it on their hardware, to ensure that the random_page_cost we choose as the new default is representative of the average hardware. You'll likely need to do this with varying index sizes.

I imagine that testing this properly, so that we'd have any useful degree of confidence that the new value is better than the old one, would likely require a benchmark that runs for several hours. At the upper end, you'd likely want the data sizes to exceed the size of RAM. Another dimension the tests should likely explore is varying data locality.

David
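(A rough sketch of the kind of crossover test David describes - the table name, row counts, and selectivities are arbitrary placeholders, and a real benchmark would sweep them and repeat each run.)

-- Build a table that is large relative to RAM, with an indexed column:
CREATE TABLE rpc_test (id int, val int, pad text);
INSERT INTO rpc_test
SELECT g, (random() * 100000)::int, repeat('x', 200)
FROM generate_series(1, 10000000) g;
CREATE INDEX ON rpc_test (val);
VACUUM ANALYZE rpc_test;

-- For a given random_page_cost, see which plan the planner picks near the
-- suspected switchover selectivity...
SET random_page_cost = 1.2;
EXPLAIN (ANALYZE, BUFFERS) SELECT sum(id) FROM rpc_test WHERE val < 1000;

-- ...then force the alternative and compare actual runtimes, to check whether
-- the planner switched plans at the right point.
SET enable_indexscan = off;
SET enable_bitmapscan = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT sum(id) FROM rpc_test WHERE val < 1000;
RESET ALL;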
David Rowley <dgrowleyml@gmail.com> writes:
> Yeah, I think any effort to change the default value for this setting
> would require some analysis to prove that the newly proposed default
> is a more suitable setting than the current default. I mean, why 1.2
> and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
> for this?

Yeah, that's been my main concern about this proposal too.

I recall that when we settled on 4.0 as a good number for spinning-rust drives, it came out of some experimentation that I'd done that involved multiple-day-long tests. I don't recall any more details than that sadly, but perhaps trawling the mailing list archives would yield useful info. It looks like the 4.0 value came in with b1577a7c7 of 2000-02-15, so late 1999/early 2000 would be the time frame to look in.

			regards, tom lane
I wrote:
> I recall that when we settled on 4.0 as a good number for
> spinning-rust drives, it came out of some experimentation that
> I'd done that involved multiple-day-long tests. I don't recall any
> more details than that sadly, but perhaps trawling the mailing list
> archives would yield useful info. It looks like the 4.0 value came
> in with b1577a7c7 of 2000-02-15, so late 1999/early 2000 would be the
> time frame to look in.

I tried asking https://www.postgresql.org/search/ about random_page_cost, and got nothing except search engine timeouts :-(. However, some digging in my own local archives yielded

https://www.postgresql.org/message-id/flat/25387.948414692%40sss.pgh.pa.us
https://www.postgresql.org/message-id/flat/14601.949786166%40sss.pgh.pa.us

That confirms my recollection of multiple-day test runs, but doesn't offer much more useful detail than that :-(. What I think I did, though, was to create some large tables (much bigger than the RAM on the machine I had) and actually measure the runtime of seq scans versus full-table index scans, repeating plenty 'o times to try to average out the noise.

There was some talk in those threads of reducing that to a publishable script, but it was never followed up on.

			regards, tom lane
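(A rough reconstruction of that kind of measurement, assuming a hypothetical table big_random(id int, val int) with an index on id, populated to well beyond RAM size and in physically random order so that index order does not match heap order.)

\timing on

-- One full pass over the heap, sequentially:
SELECT sum(val) FROM big_random;

-- The same pass forced through the index, so heap pages are fetched in
-- index order, i.e. effectively at random:
SET enable_seqscan = off;
SET enable_bitmapscan = off;
SELECT sum(val) FROM big_random WHERE id >= 0;
RESET ALL;

-- Repeating both many times and comparing the averaged runtimes gives a
-- crude estimate of the random-vs-sequential read ratio.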
On Mon, Oct 14, 2024 at 5:15 PM Bruce Momjian <bruce@momjian.us> wrote:
> I am not a fan of this patch. I don't see why _removing_ the magnetic
> part helps, because you then have no logic for why 1.2 was chosen.
Okay, but we have no documented logic on why 4.0 was chosen either. :)
> I would put the magnetic in a separate paragraph perhaps, and recommend
> 4.0 for it.
Sounds doable. Even in the pre-SSD age I recall lowering this as a fairly standard practice, but I'm fine with a recommendation of 4.0, partly because I doubt anyone will use it much.
> Also, per-tablespace makes sense because of physical media
> differences, but what purpose would per-database and per-role serve?
> Also, per-tablespace is not a connection-activated item like the other
> two.
Good point, I withdraw that part.
Cheers,
Greg
On Mon, Oct 14, 2024 at 10:20 PM David Rowley <dgrowleyml@gmail.com> wrote:
> Yeah, I think any effort to change the default value for this setting would require some analysis to prove that the newly proposed default
> is a more suitable setting than the current default. I mean, why 1.2 and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
> for this?
As I said, I was just throwing that 1.2 number out there. It felt right, although perhaps a tad high (which seems appropriate, since we keep things very conservative). I agree we should make a best effort to have an accurate, defensible default. We all know (I hope) that 4.0 is wrong for SSDs.
> I don't think just providing evidence that random read times are closer to sequential read times on SSDs than they are with
> HDDs is going to be enough.
...
> It would be nice to have this as a script so that other people could easily run it on their hardware, to ensure that the random_page_cost we
> choose as the new default is representative of the average hardware.
Heh, this is starting to feel like belling the cat (see https://fablesofaesop.com/belling-the-cat.html)
Remember this is still just a default, and we should encourage people to tweak it themselves based on their own workloads. I just want people to start in the right neighborhood. I'll see about working on some more research / generating a script, but help from others is more than welcome.
Cheers,
Greg
On Fri, 25 Oct 2024 at 13:14, Greg Sabino Mullane <htamfids@gmail.com> wrote:
> On Mon, Oct 14, 2024 at 10:20 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> Yeah, I think any effort to change the default value for this setting would require some analysis to prove that the newly proposed default
>> is a more suitable setting than the current default. I mean, why 1.2 and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
>> for this?
>
> As I said, I was just throwing that 1.2 number out there. It felt right, although perhaps a tad high (which seems appropriate, since we keep things very conservative). I agree we should make a best effort to have an accurate, defensible default. We all know (I hope) that 4.0 is wrong for SSDs.

I don't think we're going to find the correct new value for this setting by throwing randomly chosen numbers at each other on an email thread. Unfortunately, someone is going to have to do some work to figure out what the number should be, and then hopefully someone else can verify that work to check that person is correct.

I'm not trying to be smart or funny here, but I just fail to comprehend why you think that offering a number, without any information about how you selected it, would be acceptable as the new default random_page_cost. Are you expecting someone else to go and do the work to prove that your selected number is the correct one? It's been 4 weeks since your first email and nobody has done that yet, so maybe you need to consider other ways to achieve your goal.

>> I don't think just providing evidence that random read times are closer to sequential read times on SSDs than they are with
>> HDDs is going to be enough.
>
> ...
>
>> It would be nice to have this as a script so that other people could easily run it on their hardware, to ensure that the random_page_cost we
>> choose as the new default is representative of the average hardware.
>
> Heh, this is starting to feel like belling the cat (see https://fablesofaesop.com/belling-the-cat.html)

I don't see the similarity. Changing the default random_page_cost requires analysis to find what the new default should be. The execution of the actual change in default is dead simple. With belling the cat, it seems like the execution is the hard part and nobody is debating the idea itself.

> Remember this is still just a default, and we should encourage people to tweak it themselves based on their own workloads. I just want people to start in the right neighborhood. I'll see about working on some more research / generating a script, but help from others is more than welcome.

You might be mistakenly thinking that the best random_page_cost is an exact ratio of how much slower a random seek and read is than a sequential read. There are unfortunately many other factors to consider. The correct setting is going to be the one where the chosen plan uses the scan method that's the fastest, and knowing the answer to that is going to take some benchmarks on PostgreSQL. Our cost model simply isn't perfect enough for you to assume that I/O is the only factor that changes between an Index Scan and a Seq Scan.

I'd say it's not overly difficult to come up with test cases that go to prove the value you select is "correct". I've done this before for CPU-related costs. I think with I/O the main difference will be that your tests should be much larger, and doing that will mean getting the results takes much more time.
Here's a link to some analysis I did to help solve a problem relating to partition-wise aggregates [1]. Maybe you can use a similar method to determine random_page_cost.

David

[1] https://www.postgresql.org/message-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B%3DZRh-rxy9qxfPA5Gw%40mail.gmail.com
Hi Greg Sabino Mullane,

Another thing: simply changing the configuration template is not enough to change the default; the DEFAULT_RANDOM_PAGE_COST value needs to be modified as well. The GUC table entry (in src/backend/utils/misc/guc_tables.c) takes its boot value from that constant:
    {
        {"random_page_cost", PGC_USERSET, QUERY_TUNING_COST,
            gettext_noop("Sets the planner's estimate of the cost of a "
                         "nonsequentially fetched disk page."),
            NULL,
            GUC_EXPLAIN
        },
        &random_page_cost,
        DEFAULT_RANDOM_PAGE_COST, 0, DBL_MAX,
        NULL, NULL, NULL
    },
src/include/optimizer/cost.h
/* defaults for costsize.c's Cost parameters */
/* NB: cost-estimation code should use the variables, not these constants! */
/* If you change these, update backend/utils/misc/postgresql.conf.sample */
#define DEFAULT_SEQ_PAGE_COST 1.0
#define DEFAULT_RANDOM_PAGE_COST 4.0
#define DEFAULT_CPU_TUPLE_COST 0.01
#define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
#define DEFAULT_CPU_OPERATOR_COST 0.0025
#define DEFAULT_PARALLEL_TUPLE_COST 0.1
#define DEFAULT_PARALLEL_SETUP_COST 1000.0
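(To confirm which default a given build is actually using, the compiled-in constant is visible from SQL as pg_settings.boot_val, independent of what postgresql.conf or the sample file says - a quick sanity check rather than anything specific to this proposal.)

SELECT name, setting, boot_val, source
FROM pg_settings
WHERE name = 'random_page_cost';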