Re: Should we update the random_page_cost default value? - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Should we update the random_page_cost default value?
Msg-id: d43bdf2f-aed3-46c2-8cfe-49383abde3a1@vondra.me
In response to: Re: Should we update the random_page_cost default value? (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

On 10/7/25 17:32, Andres Freund wrote:
> Hi,
> 
> On 2025-10-07 16:23:36 +0200, Tomas Vondra wrote:
>> On 10/7/25 14:08, Tomas Vondra wrote:
>>> ...
>>>>>>>> I think doing this kind of measurement via normal SQL query processing is
>>>>>>>> almost always going to have too many other influences. I'd measure using fio
>>>>>>>> or such instead.  It'd be interesting to see fio numbers for your disks...
>>>>>>>>
>>>>>>>> fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1
>>>>>>>> vs --rw randread
>>>>>>>>
>>>>>>>> gives me 51k/11k for sequential/rand on one SSD and 92k/8.7k for another.
>>>>>>>>
>>>>>>>
>>>>>>> I can give it a try. But do we really want to strip out "our" overhead
>>>>>>> of reading the data?
>>>
>>> I got this on the two RAID devices (NVMe and SATA):
>>>
>>> NVMe: 83.5k / 15.8k
>>> SATA: 28.6k /  8.5k
>>>
>>> So the same ballpark / ratio as your test. Not surprising, really.
>>>
>>
>> FWIW I do see roughly that number in iostat. There's a 500M test running
>> right now, and iostat reports this:
>>
>>   Device      r/s     rkB/s  ...  rareq-sz  ...  %util
>>   md1    15273.10 143512.80  ...      9.40  ...  93.64
>>
>> So it's not like we're issuing far fewer I/Os than the SSD can handle.
> 
> Not really related to this thread:
> 
> IME iostat's utilization is pretty much useless for anything other than "is
> something happening at all", and even that is not reliable. I don't know the
> full reason for it, but I long ago learned to just discount it.
> 
> I ran
> fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=100 --ioengine pvsync --iodepth 1 --rate_iops=40000
> 
> a few times in a row, while watching iostat. Sometimes utilization is 100%,
> sometimes it's 0.2%.  Whereas if I run without rate limiting, utilization
> never goes above 71%, despite doing more iops.
> 
> 
> And then it gets completely useless if you use a deeper iodepth, because
> there's just no good way to compute something like a utilization number once
> you take parallel IO processing into account.
> 
> fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=100 --ioengine io_uring --iodepth 1 --rw randread
> iodepth        util    iops
> 1               94%     9.3k
> 2               99.6%   18.4k
> 4               100%    35.9k
> 8               100%    68.0k
> 16              100%    123k
> 

Yeah. Interpreting %util is hard; the value on its own is borderline
useless. I only included it because it's the last thing on the line.
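
For anyone who wants to see it first-hand, the sweep is easy to reproduce
with a small loop around your fio command. This is just a rough sketch
(it assumes the same /srv/fio scratch directory, fio's JSON output and jq
to pull out read.iops):

  for depth in 1 2 4 8 16; do
      # same invocation as above, just with the iodepth varied and the
      # results emitted as JSON so the iops number is easy to extract
      iops=$(fio --directory /srv/fio --size=8GiB --name test --invalidate=0 \
                 --bs=$((8*1024)) --buffered 0 --time_based=1 --runtime=100 \
                 --ioengine io_uring --iodepth $depth --rw randread \
                 --output-format=json | jq '.jobs[0].read.iops')
      echo "iodepth=$depth iops=$iops"
  done

Watching "iostat -x 1" in a second terminal while that runs shows exactly
what you describe: iops keeps scaling with the queue depth while %util is
pinned at ~100%.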

AFAIK the reason it doesn't say much is that it only tells you "the
device is doing something", nothing about the bandwidth/throughput. It's
very obvious on RAID storage, where you can see 100% util on the md
device while the members are only at 25%. SSDs are similar internally,
except that the members are not visible.
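
It's easy to see with something like this, watching the array and its
members side by side (the member device names are just placeholders,
of course):

  # extended per-device stats, once per second, for the md device and
  # the drives backing it
  iostat -dx md1 nvme0n1 nvme1n1 nvme2n1 nvme3n1 1

The md device will happily report ~100% util while each member sits at a
fraction of that.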
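
FWIW, for the original question it's arguably the ratio of the two fio
numbers that we're after, and that's easy to script too. A rough sketch,
again assuming fio's JSON output, jq and bc (same caveats as above):

  # sequential vs. random read iops with the same single-depth pvsync
  # invocation; the ratio is a crude indicator of how much more a random
  # page read costs on this device
  seq=$(fio --directory /srv/fio --size=8GiB --name test --invalidate=0 \
            --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 \
            --runtime=100 --ioengine pvsync --iodepth 1 \
            --output-format=json | jq '.jobs[0].read.iops')
  rnd=$(fio --directory /srv/fio --size=8GiB --name test --invalidate=0 \
            --bs=$((8*1024)) --rw randread --buffered 0 --time_based=1 \
            --runtime=100 --ioengine pvsync --iodepth 1 \
            --output-format=json | jq '.jobs[0].read.iops')
  echo "seq=$seq rand=$rnd ratio=$(echo "$seq / $rnd" | bc -l)"

For the numbers I posted above that comes out to roughly 5.3 for the NVMe
RAID and 3.4 for the SATA one.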


regards

-- 
Tomas Vondra



