Re: Should we update the random_page_cost default value? - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Should we update the random_page_cost default value?
Date:
Msg-id: c7dryab6hx5bu2qob3vpledzfgcywibxyxlwlfhljpugzl7pen@ahxgegh4qvq2
In response to: Re: Should we update the random_page_cost default value? (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
Hi,

On 2025-10-06 01:53:05 -0400, Tom Lane wrote:
> Laurenz Albe <laurenz.albe@cybertec.at> writes:
> > On Mon, 2025-10-06 at 01:29 -0400, Tom Lane wrote:
> >> But if what we're trying to model is net resource demands, with an
> >> eye to minimizing the total system load not execution time of any
> >> one query, maybe we can continue to work with something close to
> >> what we've traditionally done.
>
> > Did anybody propose that?
>
> I just did ;-).  If we don't adopt a mindset along that line,
> then AIO is going to require some *radical* changes in the
> planner's I/O cost models.

FWIW, I think some vaguely AIO-related cost concepts ought to have been taken into account long ago, but weren't. I'm not trying to say that we don't need to do work, but that asynchronicity-related issues are bigger than explicit asynchronous IO.

E.g., despite there often being a >10x performance difference between a forward and a backward index scan when the index and table order are well correlated, we didn't cost them any differently. The reason for that is that the OS will do readahead for forward scans, but not for backward scans.
Here's an example explain analyze on a disk with an artificial 1ms latency (to simulate networked storage):

echo 3 | sudo tee /proc/sys/vm/drop_caches && psql -Xq -f ~/tmp/evict.sql && \
  /usr/bin/time -f '%e' psql -Xq -c 'explain analyze SELECT * FROM aiobench ORDER BY serial_val ASC LIMIT 1 OFFSET 100000;'

                                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4455.58..4455.62 rows=1 width=120) (actual time=49.820..49.821 rows=1.00 loops=1)
   Buffers: shared hit=553 read=2145
   I/O Timings: shared read=36.096
   ->  Index Scan using aiobench_serial_val_idx on aiobench  (cost=0.57..8908138.12 rows=199957824 width=120) (actual time=2.971..47.261 rows=100001.00 loops=1)
         Index Searches: 1
         Buffers: shared hit=553 read=2145
         I/O Timings: shared read=36.096
 Planning:
   Buffers: shared hit=113 read=23
   I/O Timings: shared read=10.618
 Planning Time: 10.981 ms
 Execution Time: 49.838 ms
(12 rows)

echo 3 | sudo tee /proc/sys/vm/drop_caches && psql -Xq -f ~/tmp/evict.sql && \
  /usr/bin/time -f '%e' psql -Xq -c 'explain analyze SELECT * FROM aiobench ORDER BY serial_val DESC LIMIT 1 OFFSET 100000;'

                                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4455.58..4455.62 rows=1 width=120) (actual time=2138.047..2138.049 rows=1.00 loops=1)
   Buffers: shared hit=739 read=2138
   I/O Timings: shared read=2123.844
   ->  Index Scan Backward using aiobench_serial_val_idx on aiobench  (cost=0.57..8908138.12 rows=199957824 width=120) (actual time=4.912..2135.519 rows=100001.00 loops=1)
         Index Searches: 1
         Buffers: shared hit=739 read=2138
         I/O Timings: shared read=2123.844
 Planning:
   Buffers: shared hit=112 read=23
   I/O Timings: shared read=10.446
 Planning Time: 10.816 ms
 Execution Time: 2138.067 ms
(12 rows)

36ms vs 2123ms in IO time, with the same cost.
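Back-of-envelope, the backward-scan number above is almost entirely the simulated latency: with no readahead, every read waits the full round trip, while readahead overlaps nearly all of the forward scan's reads. A toy model using only the figures from the plans (the 1 ms per-read wait is the simulated device latency, an assumption of the model):

```python
# Toy latency model for the forward vs. backward index scan above.
# Assumption: the simulated device adds ~1 ms to each synchronous read.
latency_ms = 1.0

# Backward scan: no OS readahead, so each of the "read=2138" buffer
# reads stalls for the full round trip.
backward_io_ms = 2138 * latency_ms
print(backward_io_ms)              # 2138.0 -- close to the observed 2123.844 ms

# Forward scan: readahead overlaps almost everything; the observed
# 36.096 ms over 2145 reads implies a tiny effective per-read wait.
effective_wait_ms = 36.096 / 2145
print(round(effective_wait_ms, 3)) # ~0.017 ms per read, ~60x less
```

The point being that the entire >50x runtime gap is explained by whether the IO waits are overlapped, something the identical plan costs know nothing about.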
Of course these queries *do* something different, but we often choose ordered index scans as a way of implementing a query (e.g. as part of fast-start plans) where there are other ways of implementing the same query.

Another example of bad costing in this area is that we do not take effective_io_concurrency into account when planning bitmap heap scans, despite it often making a *massive* difference in whether a bitmap heap scan is a good or a bad choice. E.g. on the system with the simulated 1ms IO latency, the difference between 1 and 64 is vast.

eic=1:

                                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=606243.51..606243.52 rows=1 width=8) (actual time=184218.775..184218.776 rows=1.00 loops=1)
   Buffers: shared hit=2 read=194846
   I/O Timings: shared read=183913.048
   ->  Bitmap Heap Scan on aiobench  (cost=3900.18..605784.10 rows=183767 width=8) (actual time=79.811..184202.181 rows=199465.00 loops=1)
         Recheck Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
         Heap Blocks: exact=194299
         Buffers: shared hit=2 read=194846
         I/O Timings: shared read=183913.048
         ->  Bitmap Index Scan on aiobench_random_val_idx  (cost=0.00..3854.24 rows=183767 width=0) (actual time=37.287..37.288 rows=199465.00 loops=1)
               Index Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
               Index Searches: 1
               Buffers: shared hit=2 read=547
               I/O Timings: shared read=4.972
 Planning:
   Buffers: shared hit=76 read=24
   I/O Timings: shared read=11.571
 Planning Time: 12.953 ms
 Execution Time: 184218.986 ms

eic=64:

                                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=606243.51..606243.52 rows=1 width=8) (actual time=2962.965..2962.966 rows=1.00 loops=1)
   Buffers: shared hit=2 read=194846
   I/O Timings: shared read=2316.070
   ->  Bitmap Heap Scan on aiobench  (cost=3900.18..605784.10 rows=183767 width=8) (actual time=82.871..2947.892 rows=199465.00 loops=1)
         Recheck Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
         Heap Blocks: exact=194299
         Buffers: shared hit=2 read=194846
         I/O Timings: shared read=2316.070
         ->  Bitmap Index Scan on aiobench_random_val_idx  (cost=0.00..3854.24 rows=183767 width=0) (actual time=38.526..38.526 rows=199465.00 loops=1)
               Index Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
               Index Searches: 1
               Buffers: shared hit=2 read=547
               I/O Timings: shared read=5.021
 Planning:
   Buffers: shared hit=76 read=24
   I/O Timings: shared read=11.485
 Planning Time: 12.890 ms
 Execution Time: 2963.154 ms
(18 rows)

A ~60x difference in query execution time that is not represented in the cost model at all. The former makes a bitmap heap scan a terrible choice, the latter a good one...

> > I was under the impression that PostgreSQL's cost model tries to
> > estimate and optimize execution time, not resource consumption.
>
> Yup, that's our traditional view of it.  But I wonder how we
> will make such estimates in a parallel-I/O world, especially
> if we don't try to account for concurrent query activity.
> (Which is a place I don't want to go, because it would render
> planning results utterly irreproducible.)

Another complicated aspect is that we don't take caching into account in any way that reflects real-world workloads. In common workloads the hotly accessed data is all in shared_buffers, but the cold portion of the data is not. In such a scenario, switching e.g. from an index scan to a sequential scan will be far worse than in a scenario where all the data is cold, since the sequential scan will likely eat up a good portion of the IO bandwidth and possibly pollute the buffer pool.
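Returning to the effective_io_concurrency example for a moment: the measured IO times line up with a simple latency-overlap model (assumption: eic=N keeps roughly N reads in flight, each paying the simulated 1 ms):

```python
# Toy model for the bitmap heap scan eic=1 vs eic=64 comparison above.
latency_ms = 1.0
reads = 194846                # "read=194846" in both plans

# eic=1: every heap-page read is a synchronous ~1 ms wait.
io_s_eic1 = reads * latency_ms / 1000
print(round(io_s_eic1, 1))    # ~194.8 s; observed IO time: 183.9 s

# eic=64: up to 64 reads in flight, waits overlap ~64-fold.
io_s_eic64 = reads * latency_ms / 64 / 1000
print(round(io_s_eic64, 1))   # ~3.0 s; observed IO time: 2.3 s
```

In other words, the ~60x gap is almost purely a function of how many IOs are in flight, a quantity the bitmap heap scan cost formula currently ignores.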
At the same time, leaving the difficulty of estimating it aside, making plans depend on the current state of the buffer pool has some fairly obvious, and massive, downsides: unstable plans, and never actually reaching the "high performance" state after a restart, due to choosing plans that only work well for an empty buffer pool. I do not have the *slightest* idea how to improve the situation here, even though I think it's fairly important.

> > But perhaps I misunderstood, or perhaps I am just too conservative.
>
> I'm normally pretty conservative also about changing planner
> behavior.  But in this context I think we need to be wary of
> thinking too small.  The fact that people keep coming out with
> different ideas of what random_page_cost needs to be suggests
> that there's something fundamentally wrong with the concept.

I think one of the big issues with random_page_cost is that it conflates two largely independent things:

1) the increased cost of doing a random IO over sequential IO

2) that random IOs very often are synchronous and hard to predict / unpredictable

But we have had support for readahead of some random IO for a long time (the posix_fadvise() calls within bitmap heap scans), just without taking that into account from a costing POV.

I suspect we'll continue to need to distinguish between random and non-random IO somehow; the differences at the storage level are simply too large to ignore. But we need to add accounting for whether IO is synchronous or not. If you have a bunch of random IOs but can cheaply know them ahead of time (say, in a bitmap heap scan), the query should be costed differently than if it does a bunch of random IOs that we cannot realistically predict (say, the IO for inner index pages in an ordered index scan, or all accesses in an index nested loop where the inner side is unique).
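To make that split concrete, a toy cost function along those lines might charge a latency penalty only for random IO that cannot be issued ahead of time. All constants here are invented for illustration; this is a sketch of the idea, not PostgreSQL's actual costing:

```python
# Hypothetical split of random_page_cost into two components:
#   - a bandwidth penalty: random IO transfers data less efficiently
#     than sequential IO even when fully prefetched
#   - a latency penalty: paid only when the block numbers aren't known
#     far enough ahead of time to prefetch, i.e. the IO is synchronous
SEQ_PAGE_COST = 1.0            # invented constant
RANDOM_BANDWIDTH_PENALTY = 1.5  # invented constant
SYNC_LATENCY_PENALTY = 2.5      # invented constant

def page_io_cost(npages: int, random: bool, prefetchable: bool) -> float:
    cost = npages * SEQ_PAGE_COST
    if random:
        cost += npages * RANDOM_BANDWIDTH_PENALTY
        if not prefetchable:
            # e.g. inner index pages of an ordered index scan, or the
            # inner side of a unique index nested loop
            cost += npages * SYNC_LATENCY_PENALTY
    return cost

# Random but known ahead of time (bitmap heap scan): cheaper.
print(page_io_cost(1000, random=True, prefetchable=True))   # 2500.0
# Random and unpredictable (ordered index scan): expensive.
print(page_io_cost(1000, random=True, prefetchable=False))  # 5000.0
```

Under such a scheme the two plans in the earlier examples would no longer receive identical IO costs despite wildly different synchronous-wait behavior.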
Unfortunately that will probably make it more problematic that we aren't modeling resource consumption: preferring a query that does 10x as many, but prefetchable, IOs over an ever-so-slightly more expensive query is probably not a tradeoff most want to make.

Greetings,

Andres Freund