Re: Should we update the random_page_cost default value? - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Should we update the random_page_cost default value?
Date
Msg-id 3ec322cd-e368-4d55-b9ba-da252a390867@vondra.me
In response to Re: Should we update the random_page_cost default value?  (Andres Freund <andres@anarazel.de>)
Responses Re: Should we update the random_page_cost default value?
Re: Should we update the random_page_cost default value?
List pgsql-hackers

On 10/7/25 01:56, Andres Freund wrote:
> Hi,
> 
> On 2025-10-07 00:52:26 +0200, Tomas Vondra wrote:
>> On 10/6/25 23:27, Andres Freund wrote:
>>> On 2025-10-06 20:20:15 +0200, Tomas Vondra wrote:
>>>> On 10/6/25 17:14, Andres Freund wrote:
>>>>> It also seems that due to the ordering inside the table (the order by
>>>>> random()) during the table creation, you're going to see vastly different
>>>>> number of page accesses.  While that's obviously something worth taking into
>>>>> account for planning purposes, I don't think it'd be properly done by the
>>>>> random_page_cost itself.
>>>>>
>>>>
>>>> I don't understand your argument. Yes, the index scan does way more page
>>>> accesses. Essentially, each index tuple reads a new page - and without a
>>>> cache hit it's an I/O. With fillfactor=20 we have ~21 tuples per heap
>>>> page, which means we have to read it ~21x. In other words, it reads as
>>>> many pages as there are tuples.
>>>>
>>>> This is why my formulas divide the seqscan timing by number of pages,
>>>> but indexscan is divided by number of tuples. And AFAIK the costing does
>>>> exactly that too. So the random_page_cost doesn't need to do anything
>>>> special, we estimate the number of page reads. (And at least in the case
>>>> of random table it's pretty accurate.)
>>>
>>> The thing is that your queries have vastly different timings *regardless* of
>>> IO. Here's an example, fully cached:
>>>
>>> postgres[2718803][1]=# EXPLAIN ANALYZE SELECT * FROM (SELECT * FROM test_table) OFFSET 100000000;
>>>                                                         QUERY PLAN
>>> ----------------------------------------------------------------------------------------------------------------------------------
>>>  Limit  (cost=288096.16..288096.16 rows=1 width=41) (actual time=392.066..392.067 rows=0.00 loops=1)
>>>    Buffers: shared hit=238096
>>>    ->  Seq Scan on test_table  (cost=0.00..288096.16 rows=5000016 width=41) (actual time=0.032..269.031 rows=5000000.00 loops=1)
>>>          Buffers: shared hit=238096
>>>  Planning Time: 0.087 ms
>>>  Execution Time: 392.082 ms
>>> (6 rows)
>>>
>>> postgres[2759534][1]=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM (SELECT ctid,* FROM test_table ORDER BY id) offset 100000000;
>>>                                                         QUERY PLAN
>>> ----------------------------------------------------------------------------------------------------------------------------------
>>>  Limit  (cost=1082201.28..1082201.28 rows=1 width=47) (actual time=3025.902..3025.903 rows=0.00 loops=1)
>>>    Buffers: shared hit=5013643
>>>    ->  Index Scan using test_table_id_idx on test_table  (cost=0.43..1082201.28 rows=5000016 width=47) (actual time=0.035..2907.699 rows=5000000.00 loops=1)
>>>          Index Searches: 1
>>>          Buffers: shared hit=5013643
>>>  Planning Time: 0.131 ms
>>>  Execution Time: 3025.930 ms
>>> (7 rows)
>>>
>>> You have a ~8x difference in time, even without a single IO. A lot of that is
>>> probably just the 20x difference in buffer accesses, but also the index path
>>> just being a lot less predictable for the CPU.
>>>
>>
>> Right. And that's why the timing is divided by *pages* for sequential
>> scans, and by *rows* for index scan. Which for the timings you show ends
>> up doing this:
>>
>>      seqscan = 392 / 238096 = 0.0016
>>    indexscan = 3025 / 5M    = 0.0006
>>
>> So the per-page overhead with index scans is 1/3 of seqscans. AFAICS
>> this directly contradicts your argument that random_page_cost ends up
>> too high due to indexscans paying more per page.
> 
> I don't think that's contradicting it. The seqscan is doing ~21 tuples per
> page, the indexscan ~1. There is more work happening for each sequentially
> scanned page.  And of course the number of pages pinned isn't the only factor.
> 
> Just look at a profile of the index query in the cached case:
> 
> -   99.98%     0.00%  postgres  postgres           [.] standard_ExecutorRun
>      standard_ExecutorRun
>    - ExecProcNodeInstr
>       - 99.85% ExecLimit
>          - 99.26% ExecProcNodeInstr
>             - 98.06% ExecScan
>                - 92.13% IndexNext
>                   - 91.07% index_getnext_slot
>                      - 85.55% heapam_index_fetch_tuple
>                         - 47.69% ReleaseAndReadBuffer
>                            + 42.17% StartReadBuffer
>                              1.78% hash_bytes
>                              1.63% UnpinBufferNoOwner
>                           15.94% heap_hot_search_buffer
>                           13.22% heap_page_prune_opt
>                         + 3.70% ExecStoreBufferHeapTuple
>                           2.15% LWLockAcquire
>                           1.33% LWLockRelease
>                      - 3.96% index_getnext_tid
>                         + 3.44% btgettuple
>                + 4.50% ExecInterpExpr
> 
> and compare that with the seqscan:
> 
> -   99.93%     0.00%  postgres  postgres           [.] standard_ExecutorRun
>      standard_ExecutorRun
>    - ExecProcNodeInstr
>       - 98.64% ExecLimit
>          - 93.10% ExecProcNodeInstr
>             - 81.23% ExecSeqScan
>                - 70.30% heap_getnextslot
>                   - 56.28% heapgettup_pagemode
>                      - 30.99% read_stream_next_buffer
>                         + 29.16% StartReadBuffer
>                      - 10.99% heap_prepare_pagescan
>                           7.13% heap_page_prune_opt
>                           1.08% LWLockAcquire
>                        1.42% LWLockRelease
>                        1.13% ReleaseBuffer
>                   - 8.28% ExecStoreBufferHeapTuple
>                        2.00% UnpinBufferNoOwner
>                  2.26% MemoryContextReset
>                  1.31% heapgettup_pagemode
>                  1.27% ExecStoreBufferHeapTuple
>               3.16% InstrStopNode
>               2.82% InstrStartNode
>               1.37% heap_getnextslot
>            1.00% InstrStartNode
>         1.30% ExecProcNodeInstr
> 
> 

That's exactly my point. The index scan is doing *less* work per page.
It's accessing more pages (21x in total), but the per-page cost is
significantly lower.

> 
>> I'd bet these costs to be pretty negligible compared to the cost of
>> actual I/O. So I don't see how would this explain the timing difference
>> with cold data.
> 
> Of course there's an increased cost of random IO, I never doubted that! I'm
> just saying that your method of measurement seems to over-estimate it.
> 

If the per-page overhead is lower for random I/O (0.0006 vs. 0.0016),
why would it lead to over-estimation? AFAIK it leads to exactly the
opposite, i.e. under-estimation.

Let's try to eliminate this overhead from the calculations. I measured
a sequential and an index scan on a 10M-row table, with everything
cached (in shared buffers). I got 260ms for the sequential scan and
5,200ms for the index scan. Scaled up to the 500M-row table that's
~13,000ms and ~260,000ms. Let's assume all of that is the "unfair"
overhead and subtract it from the timings, leaving just the "clean"
I/O costs. That gives us this (for the NVMe RAID0):

                seqscan (s)      index scan (s)  random_page_cost
 -----------------------------------------------------------------
  NVMe/RAID0             11               25202             103.4

Which is about twice what it was before the adjustment:

                seqscan (s)      index scan (s)  random_page_cost
 -----------------------------------------------------------------
  NVMe/RAID0             24               25462             49.3
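
To spell out the arithmetic: the page and tuple counts are the same in
both tables, so they cancel out, and the adjusted estimate is simply the
old one scaled by the change in the two timings:

   (25202 / 25462) / (11 / 24) ~ 2.16      49.3 * 2.16 ~ 106

which matches the 103.4 above, up to rounding of the seqscan timings.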

So, what am I missing?

> 
>>> I don't think random/seq_page_cost should account for difference in processing
>>> costs that aren't actually related to it being a sequential or a random IO.
>>>
>>
>> Perhaps. If it's not directly related to randomness of I/O, then maybe
>> random_page_cost is not the right cost parameter.
>>
>> But it needs to be included *somewhere*, right? And it's per-page cost,
>> and per the experiments there's very clear difference between sequential
>> and random access. So why wouldn't random_page_cost be a good fit?
> 
> We are already accounting for the increased CPU cost for index scans etc to
> some degree, via cpu_* costs. If we under-estimate the CPU cost of index scans
> we should fix that, regardless of random_page_cost, as a well correlated index
> scan *still* is a lot more expensive than a sequential scan.
> 

Yeah, that makes sense. If this really is a CPU cost, then it should be
reflected in some cpu_ parameter. But which one? I don't think we have a
per-page CPU cost?
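
For reference, the closest things we have today are all per-tuple or
per-operator, none of them per-page:

    SELECT name, setting FROM pg_settings WHERE name LIKE 'cpu%';

which (with the defaults) gives cpu_index_tuple_cost = 0.005,
cpu_operator_cost = 0.0025 and cpu_tuple_cost = 0.01.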

> 
>>>>> I think doing this kind of measurement via normal SQL query processing is
>>>>> almost always going to have too much other influences. I'd measure using fio
>>>>> or such instead.  It'd be interesting to see fio numbers for your disks...
>>>>>
>>>>> fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1
>>>>> vs --rw randread
>>>>>
>>>>> gives me 51k/11k for sequential/rand on one SSD and 92k/8.7k for another.
>>>>>
>>>>
>>>> I can give it a try. But do we really want to strip "our" overhead
>>>> associated with reading the data?

I got this on the two RAID devices (NVMe and SATA):

NVMe: 83.5k / 15.8k
SATA: 28.6k /  8.5k

So the same ballpark / ratio as your test. Not surprising, really.
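
If I naively turn those into a seq/random ratio at 8kB requests (which
ignores that an actual seqscan issues larger reads), that's roughly
83.5/15.8 ~ 5.3x for the NVMe and 28.6/8.5 ~ 3.4x for the SATA RAID.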

>>>
>>> I think the relevant differences in efficiency here don't come from something
>>> inherent in postgres, it's that you're comparing completely different code
>>> paths that have different IO independent costs.  You could totally measure it
>>> in postgres, just not easily via "normal" SQL queries on tables/indexes.
>>>
>>
>> If the two paths for seq/index access have so vastly different
>> overheads, why shouldn't that be reflected in the seq_page_cost and
>> random_page_cost?
> 
> No, because a) random_page_cost is applied to other things than index access
> b) some index scans are not dominated by random accesses.
> 
> 
> 
>>> You can do a decent approximation of this via postgres too, by utilizing
>>> pg_prewarm and/or pageinspect. E.g. something like
>>>
>>> random IO:
>>> SELECT count(*) FROM (SELECT pg_prewarm('test_table', first_block=>blkno, last_block=>blkno) FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno) ORDER BY random(), blkno);
>>>
>>> and for sequential IO
>>> SELECT count(*) FROM (SELECT pg_prewarm('test_table', first_block=>blkno, last_block=>blkno) FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno) ORDER BY blkno, random());
>>>
>>> Time for sequential IO: 8066.720 ms
>>> Time for random IO: 31623.122 ms
>>>
>>> That's, almost eerily, close to 4x :)
>>>
>>>
>>> FWIW, when cached, the difference in runtime between the two variants is
>>> neglegible.
>>>
>>
>> The problem is the sequential query forces the I/O blocks to be 8kB
>> each, instead of issuing larger requests (like a regular seqscan). On my
>> laptops that makes the seqscan 10x slower, so for regular seqscan the
>> ratio is way higher than 4x.
>>
>> Maybe this is the kind of overhead you claim should not be considered when
>> setting random_page_cost, but I'm not convinced of that.
> 
> A correlated index scan today will not do IO combining, despite being
> accounted as seq_page_cost. So just doing individual 8kB IOs actually seems to
> be the appropriate comparison.  Even with table fetches in index scans doing
> IO combining as part by your work, the reads of the index data itself won't be
> combined. And I'm sure other things won't be either.
> 

But that's the point. If the sequential reads do I/O combining and index
scans don't (and I don't think that will change anytime soon), then that
makes sequential I/O much more efficient / cheaper. And we'd better
reflect that in the costs somehow. Maybe increasing random_page_cost is
not the right/best solution? That's possible.
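
A crude way to gauge how much the larger requests matter on a given
device is to rerun your fio command with a bigger block size - say
128kB, which I'm assuming roughly matches what io_combine_limit allows -
and compare it to the 8kB run:

  fio --directory /srv/fio --size=8GiB --name test --invalidate=0 \
      --bs=$((128*1024)) --rw read --buffered 0 --time_based=1 \
      --runtime=5 --ioengine pvsync --iodepth 1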

> 
> I'd be on-board with trying to improve the cost accounting to take IO
> combining into account in some form. I don't quite know how, off-the-cuff: The
> advantage of combining seems to quickly drop off after a few pages, but by how
> much and where exactly seems to very heavily depend on the specific
> hardware. Just dividing the number of sequential accesses by io_combine_limit
> seems like it'd be over-estimating the effect substantially.
>
> 
> I do wonder if we ought to split off the CPU cost associated with both
> sequential and random_page_cost into a cpu_page_cost or such. Right now we
> IIRC largely disregard the cost of accessing some pages repeatedly, just
> because we estimate that to not incur IO costs. But of course a plan that has
> fewer pages accesses while otherwise doing the same amount of work will be
> faster, the profiles further up clearly show that we spend a fair amount of
> time in buffer handling even when fully cached.
> 

Seems reasonable.
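
Just to illustrate how I read that (purely a sketch, not a concrete
proposal), something like:

    page_cost = cpu_page_cost + P(page needs I/O) * {seq|random}_page_cost

i.e. every page access pays the buffer-handling cost, and only the
expected fraction of cache misses pays the I/O part.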


regards

-- 
Tomas Vondra



