Re: Should we update the random_page_cost default value? - Mailing list pgsql-hackers

From:            Tomas Vondra
Subject:         Re: Should we update the random_page_cost default value?
Msg-id:          3ec322cd-e368-4d55-b9ba-da252a390867@vondra.me
In response to:  Re: Should we update the random_page_cost default value? (Andres Freund <andres@anarazel.de>)
Responses:       Re: Should we update the random_page_cost default value?
                 Re: Should we update the random_page_cost default value?
List:            pgsql-hackers
On 10/7/25 01:56, Andres Freund wrote:
> Hi,
>
> On 2025-10-07 00:52:26 +0200, Tomas Vondra wrote:
>> On 10/6/25 23:27, Andres Freund wrote:
>>> On 2025-10-06 20:20:15 +0200, Tomas Vondra wrote:
>>>> On 10/6/25 17:14, Andres Freund wrote:
>>>>> It also seems that due to the ordering inside the table (the order
>>>>> by random()) during the table creation, you're going to see vastly
>>>>> different numbers of page accesses. While that's obviously
>>>>> something worth taking into account for planning purposes, I don't
>>>>> think it'd be properly done by the random_page_cost itself.
>>>>>
>>>>
>>>> I don't understand your argument. Yes, the index scan does way more
>>>> page accesses. Essentially, each index tuple reads a new page - and
>>>> without a cache hit it's an I/O. With fillfactor=20 we have ~21
>>>> tuples per heap page, which means we have to read it ~21x. In other
>>>> words, it reads as many pages as there are tuples.
>>>>
>>>> This is why my formulas divide the seqscan timing by the number of
>>>> pages, but the indexscan timing by the number of tuples. And AFAIK
>>>> the costing does exactly that too. So random_page_cost doesn't need
>>>> to do anything special, we estimate the number of page reads. (And
>>>> at least in the case of the random table it's pretty accurate.)
>>>
>>> The thing is that your queries have vastly different timings
>>> *regardless* of IO. Here's an example, fully cached:
>>>
>>> postgres[2718803][1]=# EXPLAIN ANALYZE SELECT * FROM (SELECT * FROM test_table) OFFSET 100000000;
>>> ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
>>> │                                                            QUERY PLAN                                                            │
>>> ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
>>> │ Limit  (cost=288096.16..288096.16 rows=1 width=41) (actual time=392.066..392.067 rows=0.00 loops=1)                              │
>>> │   Buffers: shared hit=238096                                                                                                     │
>>> │   ->  Seq Scan on test_table  (cost=0.00..288096.16 rows=5000016 width=41) (actual time=0.032..269.031 rows=5000000.00 loops=1)  │
>>> │         Buffers: shared hit=238096                                                                                               │
>>> │ Planning Time: 0.087 ms                                                                                                          │
>>> │ Execution Time: 392.082 ms                                                                                                       │
>>> └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
>>> (6 rows)
>>>
>>> postgres[2759534][1]=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM (SELECT ctid,* FROM test_table ORDER BY id) offset 100000000;
>>> ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
>>> │                                                                           QUERY PLAN                                                                           │
>>> ├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
>>> │ Limit  (cost=1082201.28..1082201.28 rows=1 width=47) (actual time=3025.902..3025.903 rows=0.00 loops=1)                                                        │
>>> │   Buffers: shared hit=5013643                                                                                                                                  │
>>> │   ->  Index Scan using test_table_id_idx on test_table  (cost=0.43..1082201.28 rows=5000016 width=47) (actual time=0.035..2907.699 rows=5000000.00 loops=1)    │
>>> │         Index Searches: 1                                                                                                                                      │
>>> │         Buffers: shared hit=5013643                                                                                                                            │
>>> │ Planning Time: 0.131 ms                                                                                                                                        │
>>> │ Execution Time: 3025.930 ms                                                                                                                                    │
>>> └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
>>> (7 rows)
>>>
>>> You have a ~8x difference in time, even without a single IO. A lot
>>> of that is probably just the 20x difference in buffer accesses, but
>>> also the index path just being a lot less predictable for the CPU.
>>>
>>
>> Right. And that's why the timing is divided by *pages* for sequential
>> scans, and by *rows* for index scans. Which for the timings you show
>> ends up doing this:
>>
>>   seqscan   = 392 / 238096 = 0.0016
>>   indexscan = 3025 / 5M    = 0.0006
>>
>> So the per-page overhead with index scans is 1/3 of seqscans. AFAICS
>> this directly contradicts your argument that random_page_cost ends up
>> too high due to indexscans paying more per page.
>
> I don't think that's contradicting it. The seqscan is doing ~21 tuples
> per page, the indexscan ~1. There is more work happening for each
> sequentially scanned page. And of course the number of pages pinned
> isn't the only factor.
>
> Just look at a profile of the index query in the cached case:
>
> -   99.98%     0.00%  postgres  postgres  [.] standard_ExecutorRun
>      standard_ExecutorRun
>      - ExecProcNodeInstr
>         - 99.85% ExecLimit
>            - 99.26% ExecProcNodeInstr
>               - 98.06% ExecScan
>                  - 92.13% IndexNext
>                     - 91.07% index_getnext_slot
>                        - 85.55% heapam_index_fetch_tuple
>                           - 47.69% ReleaseAndReadBuffer
>                              + 42.17% StartReadBuffer
>                                1.78% hash_bytes
>                                1.63% UnpinBufferNoOwner
>                             15.94% heap_hot_search_buffer
>                             13.22% heap_page_prune_opt
>                           + 3.70% ExecStoreBufferHeapTuple
>                             2.15% LWLockAcquire
>                             1.33% LWLockRelease
>                        - 3.96% index_getnext_tid
>                           + 3.44% btgettuple
>                  + 4.50% ExecInterpExpr
>
> and compare that with the seqscan:
>
> -   99.93%     0.00%  postgres  postgres  [.] standard_ExecutorRun
>      standard_ExecutorRun
>      - ExecProcNodeInstr
>         - 98.64% ExecLimit
>            - 93.10% ExecProcNodeInstr
>               - 81.23% ExecSeqScan
>                  - 70.30% heap_getnextslot
>                     - 56.28% heapgettup_pagemode
>                        - 30.99% read_stream_next_buffer
>                           + 29.16% StartReadBuffer
>                        - 10.99% heap_prepare_pagescan
>                             7.13% heap_page_prune_opt
>                             1.08% LWLockAcquire
>                          1.42% LWLockRelease
>                          1.13% ReleaseBuffer
>                     - 8.28% ExecStoreBufferHeapTuple
>                          2.00% UnpinBufferNoOwner
>                       2.26% MemoryContextReset
>                       1.31% heapgettup_pagemode
>                       1.27% ExecStoreBufferHeapTuple
>                    3.16% InstrStopNode
>                    2.82% InstrStartNode
>                    1.37% heap_getnextslot
>                 1.00% InstrStartNode
>              1.30% ExecProcNodeInstr

That's exactly my point. The index scan is doing *less* work per page.
It's accessing more pages (21x in total), but the per-page cost is
significantly lower.
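To put numbers on that, dividing the cached timings by the buffer hit
counts from the two plans above:

    seqscan:    392 ms /  238096 hits ~ 0.0016 ms per page
    indexscan: 3025 ms / 5013643 hits ~ 0.0006 ms per page

So the index path touches ~21x as many pages, but each page visit is
roughly 3x cheaper - presumably because it processes a single tuple per
visit instead of ~21.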
>
>> I'd bet these costs are pretty negligible compared to the cost of
>> actual I/O. So I don't see how this would explain the timing
>> difference with cold data.
>
> Of course there's an increased cost of random IO, I never doubted
> that! I'm just saying that your method of measurement seems to
> over-estimate it.
>

If the per-page overhead is lower for random I/O (0.0006 vs. 0.0016),
why would it lead to over-estimation? AFAIK it leads to exactly the
opposite.

Let's try to eliminate this overhead from the calculations. I measured
sequential and index scan on a 10M table, with everything cached (in
shared buffers). I got 260ms for sequential, 5,200ms for index.
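(For the record, this kind of cached measurement can be done with the
same OFFSET trick as Andres' queries above, flipping the enable_* GUCs
to force each path - a rough sketch, not necessarily my exact script:

    -- seqscan timing
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM (SELECT * FROM test_table) OFFSET 100000000;

    -- indexscan timing
    SET enable_seqscan = off;
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET 100000000;

running each a couple of times first, so the data actually is in shared
buffers.)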
The 50M table would need ~13,000ms and 260,000ms. Let's assume all of
this is the "unfair" overhead, and subtract it from the timings,
leaving just the "clean" I/O costs. Using the same formulas as before
(seqscan timing per page, index scan timing per tuple), that gives us
this (for the NVMe RAID0):

                 seqscan (s)   index scan (s)   random_page_cost
    ----------------------------------------------------------------
    NVMe/RAID0            11            25202              103.4

Which is about twice what it used to be:

                 seqscan (s)   index scan (s)   random_page_cost
    ----------------------------------------------------------------
    NVMe/RAID0            24            25462               49.3

So, what am I missing?

>
>>> I don't think random/seq_page_cost should account for differences in
>>> processing costs that aren't actually related to it being a
>>> sequential or a random IO.
>>>
>>
>> Perhaps. If it's not directly related to the randomness of I/O, then
>> maybe random_page_cost is not the right cost parameter.
>>
>> But it needs to be included *somewhere*, right? And it's a per-page
>> cost, and per the experiments there's a very clear difference between
>> sequential and random access. So why wouldn't random_page_cost be a
>> good fit?
>
> We are already accounting for the increased CPU cost for index scans
> etc to some degree, via the cpu_* costs. If we under-estimate the CPU
> cost of index scans we should fix that, regardless of random_page_cost,
> as a well-correlated index scan *still* is a lot more expensive than a
> sequential scan.
>

Yeah, that makes sense. If this really is a CPU cost, then it should be
reflected in some cpu_* parameter. But which one? I don't think we have
a per-page CPU cost?
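For reference, these are the cost parameters we have today - none of
them is a per-page CPU cost (the page costs are meant to model I/O, the
cpu_* ones are per-tuple / per-operator):

    SELECT name, setting
      FROM pg_settings
     WHERE name IN ('seq_page_cost', 'random_page_cost',
                    'cpu_tuple_cost', 'cpu_index_tuple_cost',
                    'cpu_operator_cost');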
>
>>>>> I think doing this kind of measurement via normal SQL query
>>>>> processing is almost always going to have too many other
>>>>> influences. I'd measure using fio or such instead. It'd be
>>>>> interesting to see fio numbers for your disks...
>>>>>
>>>>> fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1
>>>>> vs --rw randread
>>>>>
>>>>> gives me 51k/11k for sequential/rand on one SSD and 92k/8.7k for
>>>>> another.
>>>>>
>>>>
>>>> I can give it a try. But do we really want to strip "our" overhead
>>>> with reading data? I got this on the two RAID devices (NVMe and
>>>> SATA):
>>>>
>>>>   NVMe: 83.5k / 15.8k
>>>>   SATA: 28.6k /  8.5k
>>>>
>>>> So the same ballpark / ratio as your test. Not surprising, really.
>>>
>>> I think the relevant differences in efficiency here don't come from
>>> something inherent in postgres, it's that you're comparing completely
>>> different code paths that have different IO-independent costs. You
>>> could totally measure it in postgres, just not easily via "normal"
>>> SQL queries on tables/indexes.
>>>
>>
>> If the two paths for seq/index access have such vastly different
>> overheads, why shouldn't that be reflected in seq_page_cost and
>> random_page_cost?
>
> No, because a) random_page_cost is applied to other things than index
> access b) some index scans are not dominated by random accesses.
>
>
>>> You can do a decent approximation of this via postgres too, by
>>> utilizing pg_prewarm and/or pageinspect. E.g. something like
>>>
>>> random IO:
>>> SELECT count(*) FROM (SELECT pg_prewarm('test_table', first_block=>blkno, last_block=>blkno) FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno) ORDER BY random(), blkno);
>>>
>>> and for sequential IO:
>>> SELECT count(*) FROM (SELECT pg_prewarm('test_table', first_block=>blkno, last_block=>blkno) FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno) ORDER BY blkno, random());
>>>
>>> Time for sequential IO: 8066.720 ms
>>> Time for random IO: 31623.122 ms
>>>
>>> That's, almost eerily, close to 4x :)
>>>
>>>
>>> FWIW, when cached, the difference in runtime between the two variants
>>> is negligible.
>>>
>>
>> The problem is the sequential query forces the I/O blocks to be 8kB
>> each, instead of issuing larger requests (like a regular seqscan
>> would). On my laptops that makes the seqscan 10x slower, so for a
>> regular seqscan the ratio is way higher than 4x.
>>
>> Maybe this is the kind of overhead you claim should not be considered
>> when setting random_page_cost, but I'm not convinced of that.
>
> A correlated index scan today will not do IO combining, despite being
> accounted as seq_page_cost. So just doing individual 8kB IOs actually
> seems to be the appropriate comparison. Even with table fetches in
> index scans doing IO combining as part of your work, the reads of the
> index data itself won't be combined. And I'm sure other things won't
> be either.
>

But that's the point. If sequential reads do I/O combining and index
scans don't (and I don't think that will change anytime soon), that
makes sequential I/O much more efficient / cheaper. And we'd better
reflect that in the costs somehow.

Maybe increasing random_page_cost is not the right/best solution?
That's possible.

>
> I'd be on-board with trying to improve the cost accounting to take IO
> combining into account in some form. I don't quite know how,
> off-the-cuff: the advantage of combining seems to quickly drop off
> after a few pages, but by how much and where exactly seems to very
> heavily depend on the specific hardware. Just dividing the number of
> sequential accesses by io_combine_limit seems like it'd be
> over-estimating the effect substantially.
>
>
> I do wonder if we ought to split off the CPU cost associated with both
> sequential and random_page_cost into a cpu_page_cost or such. Right now
> we IIRC largely disregard the cost of accessing some pages repeatedly,
> just because we estimate that to not incur IO costs. But of course a
> plan that has fewer page accesses while otherwise doing the same amount
> of work will be faster, the profiles further up clearly show that we
> spend a fair amount of time in buffer handling even when fully cached.
>

Seems reasonable.
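Just to illustrate the idea (with a purely hypothetical cpu_page_cost
GUC), the run costs might then decompose roughly like this, with the
page costs covering only the actual I/O:

    seqscan:   run_cost = npages * (seq_page_cost + cpu_page_cost)
                        + ntuples * cpu_tuple_cost

    indexscan: run_cost = npage_fetches * (random_page_cost + cpu_page_cost)
                        + ntuples * (cpu_tuple_cost + cpu_index_tuple_cost)
                        + ...

No idea what a sensible default for cpu_page_cost would be, though.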
regards

--
Tomas Vondra