Re: Should we update the random_page_cost default value? - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Should we update the random_page_cost default value?
Date
Msg-id 1fbc1207-a738-46be-835b-c133df0df892@vondra.me
In response to Re: Should we update the random_page_cost default value?  (Andres Freund <andres@anarazel.de>)
Responses Re: Should we update the random_page_cost default value?
List pgsql-hackers
On 10/6/25 23:27, Andres Freund wrote:
> Hi,
> 
> On 2025-10-06 20:20:15 +0200, Tomas Vondra wrote:
>> On 10/6/25 17:14, Andres Freund wrote:
>>> It also seems that due to the ordering inside the table (the order by
>>> random()) during table creation, you're going to see a vastly different
>>> number of page accesses.  While that's obviously something worth taking into
>>> account for planning purposes, I don't think it'd be properly done by the
>>> random_page_cost itself.
>>>
>>
>> I don't understand your argument. Yes, the index scan does way more page
>> accesses. Essentially, each index tuple reads a new page - and without a
>> cache hit it's an I/O. With fillfactor=20 we have ~21 tuples per heap
>> page, which means we have to read it ~21x. In other words, it reads as
>> many pages as there are tuples.
>>
>> This is why my formulas divide the seqscan timing by number of pages,
>> but indexscan is divided by number of tuples. And AFAIK the costing does
>> exactly that too. So the random_page_cost doesn't need to do anything
>> special, we estimate the number of page reads. (And at least in the case
>> of random table it's pretty accurate.)
> 
> The thing is that your queries have vastly different timings *regardless* of
> IO. Here's an example, fully cached:
> 
> postgres[2718803][1]=# EXPLAIN ANALYZE SELECT * FROM (SELECT * FROM test_table) OFFSET 100000000;
>                                                            QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=288096.16..288096.16 rows=1 width=41) (actual time=392.066..392.067 rows=0.00 loops=1)
>    Buffers: shared hit=238096
>    ->  Seq Scan on test_table  (cost=0.00..288096.16 rows=5000016 width=41) (actual time=0.032..269.031 rows=5000000.00 loops=1)
>          Buffers: shared hit=238096
>  Planning Time: 0.087 ms
>  Execution Time: 392.082 ms
> (6 rows)
> 
> postgres[2759534][1]=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM (SELECT ctid,* FROM test_table ORDER BY id) offset 100000000;
>                                                                          QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=1082201.28..1082201.28 rows=1 width=47) (actual time=3025.902..3025.903 rows=0.00 loops=1)
>    Buffers: shared hit=5013643
>    ->  Index Scan using test_table_id_idx on test_table  (cost=0.43..1082201.28 rows=5000016 width=47) (actual time=0.035..2907.699 rows=5000000.00 loops=1)
>          Index Searches: 1
>          Buffers: shared hit=5013643
>  Planning Time: 0.131 ms
>  Execution Time: 3025.930 ms
> (7 rows)
> 
> You have a ~8x difference in time, even without a single IO. A lot of that is
> probably just the 20x difference in buffer accesses, but also the index path
> just being a lot less predictable for the CPU.
> 

Right. And that's why the timing is divided by *pages* for sequential
scans, and by *rows* for index scans. For the timings you show, that
works out to:

     seqscan = 392 / 238096 = 0.0016 ms/page
   indexscan = 3025 / 5M    = 0.0006 ms/row

So the per-page overhead with index scans is about 1/3 of that for
seqscans. AFAICS this directly contradicts your argument that
random_page_cost ends up too high due to indexscans paying more per
page.
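
Spelled out as a query, just redoing the arithmetic with the exact
numbers from your two plans:

    SELECT 392.082 / 238096   AS seqscan_ms_per_page,   -- ~0.0016
           3025.930 / 5000000 AS indexscan_ms_per_row;  -- ~0.0006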

I'd bet these costs are pretty negligible compared to the cost of
actual I/O. So I don't see how this would explain the timing difference
with cold data.
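
(FWIW, with track_io_timing enabled the plans report the time spent in
I/O directly, which would separate the CPU overhead from the I/O part:

    SET track_io_timing = on;
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_table;

adding "I/O Timings" lines to the output.)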

> I don't think random/seq_page_cost should account for difference in processing
> costs that aren't actually related to it being a sequential or a random IO.
> 

Perhaps. If it's not directly related to the randomness of the I/O,
then maybe random_page_cost is not the right cost parameter.

But it needs to be included *somewhere*, right? It's a per-page cost,
and per the experiments there's a very clear difference between
sequential and random access. So why wouldn't random_page_cost be a
good fit?
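
For reference, this is the knob in question - it can be set globally,
per session, or per tablespace:

    SET random_page_cost = 4.0;
    ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.5);

(fast_ssd being just a made-up tablespace name here).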

> 
>>> I think doing this kind of measurement via normal SQL query processing is
>>> almost always going to have too much other influences. I'd measure using fio
>>> or such instead.  It'd be interesting to see fio numbers for your disks...
>>>
>>> fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1
>>> vs --rw randread
>>>
>>> gives me 51k/11k for sequential/rand on one SSD and 92k/8.7k for another.
>>>
>>
>> I can give it a try. But do we really want to strip "our" overhead of
>> reading data?
> 
> I think the relevant differences in efficiency here don't come from something
> inherent in postgres, it's that you're comparing completely different code
> paths that have different IO-independent costs.  You could totally measure it
> in postgres, just not easily via "normal" SQL queries on tables/indexes.
> 

If the two paths for seq/index access have such vastly different
overheads, why shouldn't that be reflected in seq_page_cost and
random_page_cost?

But as shown above, the per-page timings IMHO contradict this.


> Maybe we should have a function that measures some basic IO "factors" for each
> tablespace and for pg_wal.
> 

I'm not opposed to having functions like this.
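
Just to make that concrete, I'd imagine an interface roughly like this
(the function name and columns are entirely made up, nothing like it
exists today):

    SELECT * FROM pg_measure_io_factors('pg_default');

     tablespace | seq_ms_per_page | random_ms_per_page
    ------------+-----------------+--------------------

i.e. something that measures per-page timings the cost parameters could
then be sanity-checked against.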

> 
> You can do a decent approximation of this via postgres too, by utilizing
> pg_prewarm and/or pageinspect. E.g. something like
> 
> random IO:
> SELECT count(*) FROM (SELECT pg_prewarm('test_table', first_block=>blkno, last_block=>blkno) FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno) ORDER BY random(), blkno);
> 
> and for sequential IO
> SELECT count(*) FROM (SELECT pg_prewarm('test_table', first_block=>blkno, last_block=>blkno) FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno) ORDER BY blkno, random());
> 
> Time for sequential IO: 8066.720 ms
> Time for random IO: 31623.122 ms
> 
> That's, almost eerily, close to 4x :)
> 
> 
> FWIW, when cached, the difference in runtime between the two variants is
> negligible.
> 

The problem is the sequential query forces the I/O to be done in 8kB
blocks, instead of issuing larger requests (like a regular seqscan
does). On my laptops that makes the seqscan 10x slower, so for a
regular seqscan the ratio is way higher than 4x.

Maybe this is the kind of overhead you claim should not be considered
when setting random_page_cost, but I'm not convinced of that.
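
The request-size effect is visible with fio alone, too - rerunning your
sequential test with a larger block size (same flags as your command,
only --bs changed; 128kB is just an arbitrary larger request size):

    fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((128*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1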

> 
> I guess it'd actually be better to use mode=>'read' (to avoid the constant
> cost of the memcpy into the buffer pool from reducing the impact of sequential
> vs random IO), in which case the difference in IO time increases to a more
> substantial 9x for the uncached case.
> 
> 
>>>> From a robustness point of view, wouldn't it be better to actually err
>>>> on the side of using a higher random_page_cost value? That'd mean we
>>>> flip to "more-sequential" scans sooner, with much "flatter" behavior.
>>>> That doesn't need to be a seqscan (which is about as flat as it gets),
>>>> but e.g. a bitmap scan - which probably silently "fixes" many cases
>>>> where the index scan gets costed too low.
>>>
>>> I think it's often the exact opposite - folks use a lower random page cost to
>>> *prevent* the planner from going to sequential (or bitmap heap) scans. In many
>>> real-world queries our selectivity estimates aren't great and the performance
>>> penalties of switching from an index scan to a sequential scan are really
>>> severe. As you note, this is heavily exacerbated by the hot data often being
>>> cached, but cold data not. Obviously the seqscan will process the cold data
>>> too.
>>>
>>
>> I understand your point, but I'm not convinced random_page_cost is the
>> right tool to fix incorrect estimates.
> 
> I don't think it is - but I do think that if we just increased
> random_page_cost substantially, without other accompanying changes, we'll
> cause a lot of problems for folks, due to the above issue.
> 

OK.


-- 
Tomas Vondra



