Thread: Changing the default random_page_cost value
tl;dr let's assume SSDs are popular and HDDs are the exception and flip our default
As I write this email, it's the year 2024. I think it is time we lower our "default" setting of random_page_cost (as set in postgresql.conf.sample and the docs). Even a decade ago, the current default of 4 was considered fairly conservative and was often lowered. The git log shows that this value was last touched in 2006, during the age of spinning metal. We are now in a new era, the age of SSDs, and thus we should lower this default value to reflect the fact that the vast majority of people using Postgres these days are doing so on solid state drives. We tend to stay ultra-conservative in all of our settings, but we also need to recognize when there has been a major shift in the underlying hardware - and in the calculations our defaults are based on.
Granted, there are other factors involved, and yes, perhaps we should tweak some of the similar settings as well, but random_page_cost is the one setting most out of sync with today's hardware realities. So I'll be brave and throw a number out there: 1.2. And change our docs to use wording like "if you are using an older hard disk drive technology, you may want to try raising rpc" to replace our fairly-hidden note about SSDs buried in the last sentence - of the fourth paragraph - of the rpc docs.
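For anyone who wants to experiment before (or regardless of) any change to the default, here is roughly how you would override it today - a quick sketch, not part of any proposed patch, and the 1.2 value is just my strawman number from above:

-- Cluster-wide override; takes effect after a config reload:
ALTER SYSTEM SET random_page_cost = 1.2;
SELECT pg_reload_conf();

-- Or only for the current session, to compare plans first:
SET random_page_cost = 1.2;
SHOW random_page_cost;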
Real data about performance on today's SSDs are welcome, as are ideas for generating a more accurate default.
Cheers,
Greg
On Fri, Sep 27, 2024 at 8:07 AM Greg Sabino Mullane <htamfids@gmail.com> wrote:
> tl;dr let's assume SSDs are popular and HDDs are the exception and flip our default

<snip>

> Granted, there are other factors involved, and yes, perhaps we should tweak some of the similar settings as well, but random_page_cost is the one setting most out of sync with today's hardware realities. So I'll be brave and throw a number out there: 1.2. And change our docs to use wording like "if you are using an older hard disk drive technology, you may want to try raising rpc" to replace our fairly-hidden note about SSDs buried in the last sentence - of the fourth paragraph - of the rpc docs.
+1
I suggest a slightly nicer comment in the default conf file, like "For spinning hard drives, raise this to at least 3 and test"
Roberto
On Fri, 2024-09-27 at 10:07 -0400, Greg Sabino Mullane wrote:
> So I'll be brave and throw a number out there: 1.2.

+1

Laurenz Albe
Greg Sabino Mullane <htamfids@gmail.com> writes:
> So I'll be brave and throw a number out there: 1.2. And change our
> docs to use wording like "if you are using an older hard disk drive
> technology, you may want to try raising rpc" to replace our
> fairly-hidden note about SSDs buried in the last sentence - of the
> fourth paragraph - of the rpc docs.

It might also be worth mentioning cloudy block storage (e.g. AWS' EBS), which is typically backed by SSDs, but has extra network latency.

- ilmari
On Mon, Sep 30, 2024 at 10:05:29AM -0400, Greg Sabino Mullane wrote:
> On Fri, Sep 27, 2024 at 12:03 PM Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> wrote:
>
> > It might also be worth mentioning cloudy block storage (e.g. AWS' EBS),
> > which is typically backed by SSDs, but has extra network latency.
>
> That seems a little too in the weeds for me, but wording suggestions are
> welcome. To get things moving forward, I made a doc patch which changes a few
> things, namely:
>
> * Mentions the distinction between ssd and hdd right up front.
> * Moves the tablespace talk to the very end, as tablespace use is getting rarer
>   (again, thanks in part to ssds)
> * Mentions the capability to set per-database and per-role since we mention
>   per-tablespace.
> * Removes a lot of the talk of caches and justifications for the 4.0 setting.
>   While those are interesting, I've been tuning this parameter for many years and
>   never really cared about the "90% cache rate". The proof is in the pudding: rpc
>   is the canonical "try it and see" parameter. Tweak. Test. Repeat.

I am not a fan of this patch. I don't see why _removing_ the magnetic part helps, because you then have no logic for why 1.2 was chosen. I would put the magnetic in a separate paragraph perhaps, and recommend 4.0 for it. Also, per-tablespace makes sense because of physical media differences, but what purpose would per-database and per-role serve? Also, per-tablespace is not a connection-activated item like the other two.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means
  "Am I going to die soon?"
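(For reference, the per-tablespace override Bruce is referring to, and the per-database/per-role forms the doc patch mentioned, look roughly like this; the object names here are made up.)

-- Per-tablespace: tied to the physical media the tablespace lives on.
ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1);

-- Per-database / per-role: applied when a session connects, not tied to storage.
ALTER DATABASE analytics SET random_page_cost = 1.1;
ALTER ROLE reporting_user SET random_page_cost = 1.1;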
On Tue, 15 Oct 2024 at 10:15, Bruce Momjian <bruce@momjian.us> wrote:
> I am not a fan of this patch. I don't see why _removing_ the magnetic
> part helps, because you then have no logic for why 1.2 was chosen. I
> would put the magnetic in a separate paragraph perhaps, and recommend
> 4.0 for it. Also, per-tablespace makes sense because of physical media
> differences, but what purpose would per-database and per-role serve?
> Also, per-tablespace is not a connection-activated item like the other
> two.

Yeah, I think any effort to change the default value for this setting would require some analysis to prove that the newly proposed default is a more suitable setting than the current default. I mean, why 1.2 and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value for this?

I don't think just providing evidence that random read times are closer to sequential read times on SSDs than they are with HDDs is going to be enough. What we want to know is whether the planner costs become more closely related to the execution time or not. From my experience, random_page_cost really only has a loose grasp on reality, so you might find that it's hard to prove this with any degree of confidence (just have a look at how inconsiderate index_pages_fetched() is to other queries running on the database, for example).

I suggest first identifying all the locations that use random_page_cost, then coming up with some test cases that run queries exercising those locations in a way that, for example, varies the actual selectivity of some value to make the planner switch plans, and then varying random_page_cost to show that the switchover point is more correct with the new value than with the old one. It would be nice to have this as a script so that other people could easily run it on their hardware, to ensure that the random_page_cost we choose as the new default is representative of the average hardware. You'll likely need to do this with varying index sizes.

I imagine that testing this properly, so that we'd have any useful degree of confidence that the new value is better than the old one, would likely require a benchmark that runs for several hours. At the upper end, you'd likely want the data sizes to exceed the size of RAM. Another dimension the tests should likely explore is varying data locality.

David
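(A rough sketch of the kind of crossover test David describes - the table name, row counts, and selectivities are arbitrary placeholders, and a real benchmark would sweep them and repeat each run.)

-- Build a table that is large relative to RAM, with an indexed column:
CREATE TABLE rpc_test (id int, val int, pad text);
INSERT INTO rpc_test
SELECT g, (random() * 100000)::int, repeat('x', 200)
FROM generate_series(1, 10000000) g;
CREATE INDEX ON rpc_test (val);
VACUUM ANALYZE rpc_test;

-- For a given random_page_cost, see which plan the planner picks near the
-- suspected switchover selectivity...
SET random_page_cost = 1.2;
EXPLAIN (ANALYZE, BUFFERS) SELECT sum(id) FROM rpc_test WHERE val < 1000;

-- ...then force the alternative and compare actual runtimes, to check whether
-- the planner switched plans at the right point.
SET enable_indexscan = off;
SET enable_bitmapscan = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT sum(id) FROM rpc_test WHERE val < 1000;
RESET ALL;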
David Rowley <dgrowleyml@gmail.com> writes:
> Yeah, I think any effort to change the default value for this setting
> would require some analysis to prove that the newly proposed default
> is a more suitable setting than the current default. I mean, why 1.2
> and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
> for this?

Yeah, that's been my main concern about this proposal too.

I recall that when we settled on 4.0 as a good number for spinning-rust drives, it came out of some experimentation that I'd done that involved multiple-day-long tests. I don't recall any more details than that sadly, but perhaps trawling the mailing list archives would yield useful info. It looks like the 4.0 value came in with b1577a7c7 of 2000-02-15, so late 1999/early 2000 would be the time frame to look in.

			regards, tom lane
I wrote:
> I recall that when we settled on 4.0 as a good number for
> spinning-rust drives, it came out of some experimentation that
> I'd done that involved multiple-day-long tests. I don't recall any
> more details than that sadly, but perhaps trawling the mailing list
> archives would yield useful info. It looks like the 4.0 value came
> in with b1577a7c7 of 2000-02-15, so late 1999/early 2000 would be the
> time frame to look in.

I tried asking https://www.postgresql.org/search/ about random_page_cost, and got nothing except search engine timeouts :-(. However, some digging in my own local archives yielded

https://www.postgresql.org/message-id/flat/25387.948414692%40sss.pgh.pa.us
https://www.postgresql.org/message-id/flat/14601.949786166%40sss.pgh.pa.us

That confirms my recollection of multiple-day test runs, but doesn't offer much more useful detail than that :-(. What I think I did, though, was to create some large tables (much bigger than the RAM on the machine I had) and actually measure the runtime of seq scans versus full-table index scans, repeating plenty 'o times to try to average out the noise.

There was some talk in those threads of reducing that to a publishable script, but it was never followed up on.

			regards, tom lane
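(A rough reconstruction of that kind of measurement, assuming a hypothetical table big_random(id int, val int) with an index on id, populated to well beyond RAM size and in physically random order so that index order does not match heap order.)

\timing on

-- One full pass over the heap, sequentially:
SELECT sum(val) FROM big_random;

-- The same pass forced through the index, so heap pages are fetched in
-- index order, i.e. effectively at random:
SET enable_seqscan = off;
SET enable_bitmapscan = off;
SELECT sum(val) FROM big_random WHERE id >= 0;
RESET ALL;

-- Repeating both many times and comparing the averaged runtimes gives a
-- crude estimate of the random-vs-sequential read ratio.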
On Mon, Oct 14, 2024 at 5:15 PM Bruce Momjian <bruce@momjian.us> wrote:
> I am not a fan of this patch. I don't see why _removing_ the magnetic
> part helps, because you then have no logic for why 1.2 was chosen.
Okay, but we have no documented logic on why 4.0 was chosen either. :)
> I would put the magnetic in a separate paragraph perhaps, and recommend
> 4.0 for it.
Sounds doable. Even in the pre-SSD age I recall lowering this as a fairly standard practice, but I'm fine with a recommendation of 4.0, partly because I doubt anyone will use it much.
> Also, per-tablespace makes sense because of physical media
> differences, but what purpose would per-database and per-role serve?
> Also, per-tablespace is not a connection-activated item like the other
> two.
Good point, I withdraw that part.
Cheers,
Greg
On Mon, Oct 14, 2024 at 10:20 PM David Rowley <dgrowleyml@gmail.com> wrote:
> Yeah, I think any effort to change the default value for this setting would require some analysis to prove that the newly proposed default
> is a more suitable setting than the current default. I mean, why 1.2 and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
> for this?
As I said, I was just throwing that 1.2 number out there. It felt right, although perhaps a tad high (which seems appropriate, since we keep things very conservative). I agree we should make a best effort to have an accurate, defensible default. We all know (I hope) that 4.0 is wrong for SSDs.
> I don't think just providing evidence that random read times are closer to sequential read times on SSDs than they are with
> HDDs is going to be enough.
...
> It would be nice to have this as a script so that other people could easily run it on their hardware, to ensure that the random_page_cost we
> choose as the new default is representative of the average hardware.
Heh, this is starting to feel like belling the cat (see https://fablesofaesop.com/belling-the-cat.html)
Remember this is still just a default, and we should encourage people to tweak it themselves based on their own workloads. I just want people to start in the right neighborhood. I'll see about working on some more research / generating a script, but help from others is more than welcome.
Cheers,
Greg
On Fri, 25 Oct 2024 at 13:14, Greg Sabino Mullane <htamfids@gmail.com> wrote:
> On Mon, Oct 14, 2024 at 10:20 PM David Rowley <dgrowleyml@gmail.com> wrote:
>> Yeah, I think any effort to change the default value for this setting would require some analysis to prove that the newly proposed default
>> is a more suitable setting than the current default. I mean, why 1.2 and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
>> for this?
>
> As I said, I was just throwing that 1.2 number out there. It felt right, although perhaps a tad high (which seems appropriate, since we keep things very conservative). I agree we should make a best effort to have an accurate, defensible default. We all know (I hope) that 4.0 is wrong for SSDs.

I don't think we're going to find the correct new value for this setting by throwing randomly chosen numbers at each other on an email thread. Unfortunately, someone is going to have to do some work to figure out what the number should be, and then hopefully someone else can verify that work to check that person is correct.

I'm not trying to be smart or funny here, but I just fail to comprehend why you think that offering a number, without any information about how you selected it, would be acceptable as the new default random_page_cost. Are you expecting someone else to go and do the work to prove that your selected number is the correct one? It's been 4 weeks since your first email and nobody has done that yet, so maybe you need to consider other ways to achieve your goal.

>> I don't think just providing evidence that random read times are closer to sequential read times on SSDs than they are with
>> HDDs is going to be enough.
>
> ...
>
>> It would be nice to have this as a script so that other people could easily run it on their hardware, to ensure that the random_page_cost we
>> choose as the new default is representative of the average hardware.
>
> Heh, this is starting to feel like belling the cat (see https://fablesofaesop.com/belling-the-cat.html)

I don't see the similarity. Changing the default random_page_cost requires analysis to find what the new default should be. The execution of the actual change in default is dead simple. With belling the cat, it seems like the execution is the hard part and nobody is debating the idea itself.

> Remember this is still just a default, and we should encourage people to tweak it themselves based on their own workloads. I just want people to start in the right neighborhood. I'll see about working on some more research / generating a script, but help from others is more than welcome.

You might be mistakenly thinking that the best random_page_cost is an exact ratio of how much slower a random seek and read is than a sequential read. There are unfortunately many other factors to consider. The correct setting is going to be the one where the chosen plan uses the scan method that's the fastest, and knowing the answer to that is going to take some benchmarks on PostgreSQL. Our cost model simply isn't perfect enough for you to assume that I/O is the only factor that changes between an Index Scan and a Seq Scan.

I'd say it's not overly difficult to come up with test cases that go to prove the value you select is "correct". I've done this before for CPU-related costs. I think with I/O the main difference will be that your tests should be much larger, and doing that will mean getting the results takes much more time.
Here's a link to some analysis I did to help solve a problem relating to partition-wise aggregates [1]. Maybe you can use a similar method to determine random_page_cost.

David

[1] https://www.postgresql.org/message-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B%3DZRh-rxy9qxfPA5Gw%40mail.gmail.com
Hi Greg Sabino Mullane,

Another thing: simply changing the configuration template is not enough to change the default; the DEFAULT_RANDOM_PAGE_COST value needs to be modified as well. The GUC table entry (in src/backend/utils/misc/guc_tables.c) takes its boot value from that constant:
    {
        {"random_page_cost", PGC_USERSET, QUERY_TUNING_COST,
            gettext_noop("Sets the planner's estimate of the cost of a "
                         "nonsequentially fetched disk page."),
            NULL,
            GUC_EXPLAIN
        },
        &random_page_cost,
        DEFAULT_RANDOM_PAGE_COST, 0, DBL_MAX,
        NULL, NULL, NULL
    },
src/include/optimizer/cost.h
/* defaults for costsize.c's Cost parameters */
/* NB: cost-estimation code should use the variables, not these constants! */
/* If you change these, update backend/utils/misc/postgresql.conf.sample */
#define DEFAULT_SEQ_PAGE_COST 1.0
#define DEFAULT_RANDOM_PAGE_COST 4.0
#define DEFAULT_CPU_TUPLE_COST 0.01
#define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
#define DEFAULT_CPU_OPERATOR_COST 0.0025
#define DEFAULT_PARALLEL_TUPLE_COST 0.1
#define DEFAULT_PARALLEL_SETUP_COST 1000.0
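(To confirm which default a given build is actually using, the compiled-in constant is visible from SQL as pg_settings.boot_val, independent of what postgresql.conf or the sample file says - a quick sanity check rather than anything specific to this proposal.)

SELECT name, setting, boot_val, source
FROM pg_settings
WHERE name = 'random_page_cost';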