Re: Should we update the random_page_cost default value? - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Should we update the random_page_cost default value?
Date:
Msg-id: 1fbc1207-a738-46be-835b-c133df0df892@vondra.me
In response to: Re: Should we update the random_page_cost default value? (Andres Freund <andres@anarazel.de>)
Responses: Re: Should we update the random_page_cost default value?
List: pgsql-hackers
On 10/6/25 23:27, Andres Freund wrote:
> Hi,
>
> On 2025-10-06 20:20:15 +0200, Tomas Vondra wrote:
>> On 10/6/25 17:14, Andres Freund wrote:
>>> It also seems that due to the ordering inside the table (the order by
>>> random()) during the table creation, you're going to see vastly different
>>> number of page accesses. While that's obviously something worth taking into
>>> account for planning purposes, I don't think it'd be properly done by the
>>> random_page_cost itself.
>>>
>>
>> I don't understand your argument. Yes, the index scan does way more page
>> accesses. Essentially, each index tuple reads a new page - and without a
>> cache hit it's an I/O. With fillfactor=20 we have ~21 tuples per heap
>> page, which means we have to read it ~21x. In other words, it reads as
>> many pages as there are tuples.
>>
>> This is why my formulas divide the seqscan timing by number of pages,
>> but indexscan is divided by number of tuples. And AFAIK the costing does
>> exactly that too. So the random_page_cost doesn't need to do anything
>> special, we estimate the number of page reads. (And at least in the case
>> of random table it's pretty accurate.)
>
> The thing is that your queries have vastly different timings *regardless* of
> IO. Here's an example, fully cached:
>
> postgres[2718803][1]=# EXPLAIN ANALYZE SELECT * FROM (SELECT * FROM test_table) OFFSET 100000000;
> ┌────────────────────────────────────────────────────────────────────────────────────────────┐
> │ QUERY PLAN │
> ├────────────────────────────────────────────────────────────────────────────────────────────┤
> │ Limit  (cost=288096.16..288096.16 rows=1 width=41) (actual time=392.066..392.067 rows=0.00 loops=1) │
> │   Buffers: shared hit=238096 │
> │   ->  Seq Scan on test_table  (cost=0.00..288096.16 rows=5000016 width=41) (actual time=0.032..269.031 rows=5000000.00 loops=1) │
> │         Buffers: shared hit=238096 │
> │ Planning Time: 0.087 ms │
> │ Execution Time: 392.082 ms │
> └────────────────────────────────────────────────────────────────────────────────────────────┘
> (6 rows)
>
> postgres[2759534][1]=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM (SELECT ctid,* FROM test_table ORDER BY id) offset 100000000;
> ┌────────────────────────────────────────────────────────────────────────────────────────────┐
> │ QUERY PLAN │
> ├────────────────────────────────────────────────────────────────────────────────────────────┤
> │ Limit  (cost=1082201.28..1082201.28 rows=1 width=47) (actual time=3025.902..3025.903 rows=0.00 loops=1) │
> │   Buffers: shared hit=5013643 │
> │   ->  Index Scan using test_table_id_idx on test_table  (cost=0.43..1082201.28 rows=5000016 width=47) (actual time=0.035..2907.699 rows=5000000.00 loops=1) │
> │         Index Searches: 1 │
> │         Buffers: shared hit=5013643 │
> │ Planning Time: 0.131 ms │
> │ Execution Time: 3025.930 ms │
> └────────────────────────────────────────────────────────────────────────────────────────────┘
> (7 rows)
>
> You have a ~8x difference in time, even without a single IO. A lot of that is
> probably just the 20x difference in buffer accesses, but also the index path
> just being a lot less predictable for the CPU.
>

Right.
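For reference, this is roughly the kind of table those plans come from - a sketch only, since the exact script is from earlier in the thread, so the column names and the padding column here are assumptions:

  -- hypothetical reconstruction of the test setup, not the original script
  CREATE TABLE test_table (id int, padding text) WITH (fillfactor = 20);

  INSERT INTO test_table
  SELECT i, md5(i::text)
    FROM generate_series(1, 5000000) AS i
   ORDER BY random();

  CREATE INDEX test_table_id_idx ON test_table (id);
  VACUUM ANALYZE test_table;

With fillfactor=20 that works out to ~21 tuples per 8kB heap page, which lines up with the ~238k buffer hits of the seqscan vs the ~5M buffer hits of the index scan above.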
And that's why the timing is divided by *pages* for sequential scans, and
by *rows* for index scans. For the timings you show, that works out to:

  seqscan   = 392 ms / 238096 pages = 0.0016 ms/page
  indexscan = 3025 ms / 5M rows     = 0.0006 ms/row

So the per-page overhead with index scans is about 1/3 of seqscans. AFAICS
this directly contradicts your argument that random_page_cost ends up too
high due to indexscans paying more per page. I'd bet these costs are pretty
negligible compared to the cost of actual I/O, so I don't see how this
would explain the timing difference with cold data.

> I don't think random/seq_page_cost should account for difference in processing
> costs that aren't actually related to it being a sequential or a random IO.
>

Perhaps. If it's not directly related to randomness of I/O, then maybe
random_page_cost is not the right cost parameter. But it needs to be
included *somewhere*, right? And it's a per-page cost, and per the
experiments there's a very clear difference between sequential and random
access. So why wouldn't random_page_cost be a good fit?

>
>>> I think doing this kind of measurement via normal SQL query processing is
>>> almost always going to have too much other influences. I'd measure using fio
>>> or such instead. It'd be interesting to see fio numbers for your disks...
>>>
>>> fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1
>>> vs --rw randread
>>>
>>> gives me 51k/11k for sequential/rand on one SSD and 92k/8.7k for another.
>>>
>>
>> I can give it a try. But do we really want to strip "our" overhead with
>> reading data?
>
> I think the relevant differences in efficiency here don't come from something
> inherent in postgres, it's that you're comparing completely different code
> paths that have different IO independent costs. You could totally measure it
> in postgres, just not easily via "normal" SQL queries on tables/indexes.
>

If the two paths for seq/index access have so vastly different overheads,
why shouldn't that be reflected in seq_page_cost and random_page_cost? But
as shown above, the per-page timings IMHO contradict this.

> Maybe we should have a function that measures some basic IO "factors" for each
> tablespace and for pg_wal.
>

I'm not opposed to having functions like this.

>
> You can do a decent approximation of this via postgres too, by utilizing
> pg_prewarm and/or pageinspect. E.g. something like
>
> random IO:
> SELECT count(*) FROM (SELECT pg_prewarm('test_table', first_block=>blkno, last_block=>blkno) FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno) ORDER BY random(), blkno);
>
> and for sequential IO:
> SELECT count(*) FROM (SELECT pg_prewarm('test_table', first_block=>blkno, last_block=>blkno) FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno) ORDER BY blkno, random());
>
> Time for sequential IO: 8066.720 ms
> Time for random IO: 31623.122 ms
>
> That's, almost eerily, close to 4x :)
>
>
> FWIW, when cached, the difference in runtime between the two variants is
> negligible.
>

The problem is that the sequential query forces the I/O requests to be 8kB
each, instead of issuing larger requests (like a regular seqscan does). On
my laptops that makes this sequential measurement about 10x slower than a
regular seqscan, so for a regular seqscan the ratio is way higher than 4x.
Maybe this is the kind of overhead you claim should not be considered when
setting random_page_cost, but I'm not convinced of that.
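One way to see that gap directly (a sketch only - it assumes the same test_table as above, the pg_prewarm extension, and a cold cache before each statement, e.g. a restart plus dropping the OS page cache):

  \timing on

  -- regular seqscan: sequential reads that can be issued as larger requests
  SELECT count(*) FROM test_table;

  -- per-block sequential pg_prewarm, same shape as the query quoted above:
  -- the blocks are read in order, but as one 8kB request per call
  SELECT count(*)
    FROM (SELECT pg_prewarm('test_table', first_block => blkno, last_block => blkno)
            FROM (SELECT generate_series(0, (pg_relation_size('test_table') - 1) / 8192) AS blkno) AS b
           ORDER BY blkno) AS s;

The difference between those two timings is the overhead that a regular seqscan avoids (larger sequential requests) but that the per-block measurement pays on every block.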
>
> I guess it'd actually be better to use mode=>'read' (to avoid the constant
> cost of the memcpy into the buffer pool from reducing the impact of sequential
> vs random IO), in which case the difference in IO time increases to a more
> substantial 9x for the uncached case.
>
>
>>>> From a robustness point of view, wouldn't it be better to actually err
>>>> on the side of using a higher random_page_cost value? That'd mean we
>>>> flip to "more-sequential" scans sooner, with much "flatter" behavior.
>>>> That doesn't need to be a seqscan (which is about as flat as it gets),
>>>> but e.g. a bitmap scan - which probably silently "fixes" many cases
>>>> where the index scan gets costed too low.
>>>
>>> I think it's often the exact opposite - folks use a lower random page cost to
>>> *prevent* the planner from going to sequential (or bitmap heap) scans. In many
>>> real-world queries our selectivity estimates aren't great and the performance
>>> penalties of switching from an index scan to a sequential scan are really
>>> severe. As you note, this is heavily exacerbated by the hot data often being
>>> cached, but cold data not. Obviously the seqscan will process the cold data
>>> too.
>>>
>>
>> I understand your point, but I'm not convinced random_page_cost is the
>> right tool to fix incorrect estimates.
>
> I don't think it is - but I do think that if we just increased
> random_page_cost substantially, without other accompanying changes, we'll
> cause a lot of problems for folks, due to the above issue.
>

OK.

--
Tomas Vondra
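As an aside on the "without other accompanying changes" point: the existing knobs already allow targeted overrides instead of a cluster-wide default change, for example per tablespace or per session (the tablespace name below is hypothetical):

  -- hypothetical tablespace on fast local storage; affects only objects stored there
  ALTER TABLESPACE fast_nvme SET (random_page_cost = 1.1, seq_page_cost = 1.0);

  -- or only for the current session, leaving the default untouched
  SET random_page_cost = 1.1;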