Re: Should we update the random_page_cost default value? - Mailing list pgsql-hackers
From | Andres Freund
---|---
Subject | Re: Should we update the random_page_cost default value?
Date |
Msg-id | u52uxp6dcug3xjshe7qricmzs7cwjouuci5qlxesmqaaahng62@web4zmsl3rzw
In response to | Should we update the random_page_cost default value? (Tomas Vondra <tomas@vondra.me>)
Responses | Re: Should we update the random_page_cost default value?
 | Re: Should we update the random_page_cost default value?
List | pgsql-hackers
Hi,

On 2025-10-06 02:59:16 +0200, Tomas Vondra wrote:
> So I decided to try doing this on a couple different devices, and see
> what random_page_cost values that gives me. Attached is a script doing
> such benchmark:
>
> (1) initializes a new cluster, with a couple parameters adjusted
>
> (2) creates a random table (with uniform distribution)
>
> (3) runs a sequential scan
>
>     SELECT * FROM (SELECT * FROM test_table) OFFSET $nrows;
>
> (4) runs an index scan
>
>     SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET $nrows;

Why compare an unordered with an ordered scan? ISTM that if you want to
actually compare the cost of two different approaches to the same query,
you'd not want to change what the query does?

Yes, you say that CPU time never gets above 50%, but 50% isn't nothing. It
also seems that, due to the ordering inside the table (the ORDER BY random()
during table creation), you're going to see a vastly different number of page
accesses. While that's obviously something worth taking into account for
planning purposes, I don't think it'd be properly done by the
random_page_cost itself.

I think doing this kind of measurement via normal SQL query processing is
almost always going to have too many other influences. I'd measure using fio
or such instead. It'd be interesting to see fio numbers for your disks...

fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1

vs. --rw randread

gives me 51k/11k IOPS for sequential/random on one SSD and 92k/8.7k on
another.

> It obviously contradicts the advice to set the value closer to 1.0. But
> why is that? SSDs are certainly better with random I/O, even if the I/O
> is not concurrent and the SSD is not fully utilized. So the 4.0 seems
> off, the value should be higher than what we got for SSDs ...

I'd guess that the *vast* majority of PG workloads these days run on
networked block storage. For those, the actual latency at the storage level
is typically a rather small fraction of the overall IO latency, which is
instead dominated by network and other related costs (like the indirection to
decide which storage system to go to, and crossing VM/host boundaries).
Because the majority of the IO latency is determined not by the storage
latency but by the network latency, the random vs. non-random IO difference
plays less of a role.

> From a robustness point of view, wouldn't it be better to actually err
> on the side of using a higher random_page_cost value? That'd mean we
> flip to "more-sequential" scans sooner, with much "flatter" behavior.
> That doesn't need to be a seqscan (which is about as flat as it gets),
> but e.g. a bitmap scan - which probably silently "fixes" many cases
> where the index scan gets costed too low.

I think it's often the exact opposite - folks use a lower random_page_cost to
*prevent* the planner from switching to sequential (or bitmap heap) scans. In
many real-world queries our selectivity estimates aren't great, and the
performance penalties of switching from an index scan to a sequential scan
are really severe. As you note, this is heavily exacerbated by the hot data
often being cached while the cold data is not - and obviously the seqscan
will have to process the cold data too.

Greetings,

Andres Freund
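For reference, a rough shell sketch of the fio comparison above, scripted end
to end. The /srv/fio scratch directory, the 8GiB sizing, the 5s runtime and
the jq field path are assumptions to adjust for the machine under test; the
printed ratio is only a ballpark for how lopsided the device is, not a value
to copy into random_page_cost verbatim.

#!/bin/bash
# Compare sequential vs. random 8KB read IOPS with fio and print the ratio.
# DIR should point at the device under test; size it past RAM if you want to
# keep the page cache out of the picture.
set -euo pipefail

DIR=/srv/fio

common=(--directory="$DIR" --size=8GiB --name=test --invalidate=0
        --bs=8k --buffered=0 --time_based=1 --runtime=5
        --ioengine=pvsync --iodepth=1)

# Lay out the data file once so the timed runs below only measure reads.
fio "${common[@]}" --create_only=1 > /dev/null

seq_iops=$(fio "${common[@]}" --rw=read     --output-format=json | jq '.jobs[0].read.iops')
rnd_iops=$(fio "${common[@]}" --rw=randread --output-format=json | jq '.jobs[0].read.iops')

echo "sequential: $seq_iops IOPS, random: $rnd_iops IOPS"
awk -v s="$seq_iops" -v r="$rnd_iops" 'BEGIN { printf "seq/random ratio: %.1f\n", s / r }'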