Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers
From | Andres Freund
Subject | Re: BitmapHeapScan streaming read user and prelim refactoring
Date |
Msg-id | kwpndt7lqr4finrxc76p6cfgfin2o57e6ldg2eysmzchykdlxl@cijo5ahsosl4
In response to | Re: BitmapHeapScan streaming read user and prelim refactoring (Tomas Vondra <tomas@vondra.me>)
Responses | Re: BitmapHeapScan streaming read user and prelim refactoring
List | pgsql-hackers
Hi,

On 2025-02-14 18:36:37 +0100, Tomas Vondra wrote:
> All of this is true, ofc, but maybe it's better to have a tool providing
> at least some advice?

I agree, a tool like that would be useful! One difficulty is that the
relevant parameter space is really large, making it hard to keep the
runtime in a reasonable range...

> I'd imagine pg_test_fsync is affected by many of
> those issues too (considering both are about I/O).

I think pg_test_fsync is a bit less affected, because it doesn't have a high
queue depth, so it doesn't reach limits quite as quickly as something doing
higher queue depth IO.

Orthogonal aside: pg_test_fsync's numbers are not particularly helpful:

- It e.g. tests O_SYNC, O_DSYNC with O_DIRECT, while testing fsync/fdatasync
  with buffered IO. But all the sync methods apply for both buffered and
  direct IO. And of course it'd be helpful to note what's being used, since
  postgres can do either...

- Only tests O_SYNC, not O_DSYNC, with larger write sizes, even though that
  information is a lot more relevant for O_DSYNC (since O_SYNC is useless)

- Only tests write sizes of up to 16kB, even though larger writes are
  extremely common and performance critical...

There's more, but it's already a long enough aside.

> I'd definitely not want initdb to do this automatically, though. Getting
> good numbers is fairly expensive (in time and I/O), can be flaky, etc.

Yea.

> But maybe having a tool that gives you a bunch of numbers, as input for
> manual tuning, would be good enough?

I think it'd be useful. I'd perhaps make it an SQL callable tool though, so
it can be run in cloud environments.

> As you say, it's not just about the hardware (and how that changes over
> time because of "burst" credits etc.), but also about the workload.
> Would it be possible to track something, and adjust this dynamically
> over time? And then adjust the prefetch distance in some adaptive way?

Yes, I do think so! It's not trivial, but I think we eventually do want it.
Melanie has worked on this a fair bit, fwiw.

My current thinking is that we'd want something very roughly like TCP BBR.
Basically, it predicts the currently available bandwidth not just via lost
packets - the traditional approach - but also by building a continually
updated model of "bytes in flight" and latency, and uses that to predict
what the achievable bandwidth is. (A rough sketch of that kind of
bookkeeping is below, after the sign-off.)

There are two snags on the way there:

1) Timestamps aren't cheap, so we probably can't do this for every IO.
   Modern NICs can associate timestamps with packets on a hardware level;
   we don't have that luxury.

2) It's not always easy to know an accurate completion timestamp. E.g. if a
   backend fired off a bunch of reads via io_uring and then is busy doing
   CPU bound work, we don't know how long ago the requests already
   completed. We probably can approximate that though. Or we could use a
   background process or timer interrupt to add a timestamp to IOs.

> Perhaps it's a bit naive, but say we know what % of requests is handled
> from cache, how long it took, etc. We opportunistically increase the
> prefetch distance, and check the cache hit ratio after a while. Did it
> help (% increased, consumed less time), maybe try another step. If not,
> maybe try prefetching less?

I don't immediately understand how you can use the cache hit ratio here? We
will only read data if it was a cache miss, after all? And we keep buffers
pinned, so they can't be thrown out.

Greetings,

Andres Freund
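To make the BBR analogy a bit more concrete, here is a minimal sketch of the
kind of per-IO bookkeeping such a model might do. This is purely
illustrative: the struct and function names are invented, it is not actual
PostgreSQL code, and a real version would need windowed max/min tracking and
cheap timestamps, as discussed above.

#include <stdint.h>

typedef struct io_bw_model
{
	double		max_bw;			/* bytes/sec, max observed delivery rate */
	double		min_lat;		/* seconds, min observed completion latency */
	uint64_t	bytes_in_flight;	/* submitted but not yet completed */
} io_bw_model;

/* Account for a newly submitted IO of nbytes. */
static void
io_model_submit(io_bw_model *m, uint64_t nbytes)
{
	m->bytes_in_flight += nbytes;
}

/* Account for a completed IO, updating the bandwidth/latency estimates. */
static void
io_model_complete(io_bw_model *m, uint64_t nbytes,
				  double submit_ts, double complete_ts)
{
	double		lat = complete_ts - submit_ts;
	double		rate = (lat > 0) ? nbytes / lat : 0;

	m->bytes_in_flight -= nbytes;

	/* A real model would age these estimates out over a sliding window. */
	if (rate > m->max_bw)
		m->max_bw = rate;
	if (m->min_lat == 0 || lat < m->min_lat)
		m->min_lat = lat;
}

/*
 * Bandwidth-delay product: roughly how many bytes are worth keeping in
 * flight to saturate the estimated bandwidth without just queueing.
 */
static uint64_t
io_model_target_inflight(const io_bw_model *m)
{
	return (uint64_t) (m->max_bw * m->min_lat);
}

Dividing that target by a typical IO size would yield an adaptive prefetch
distance; the hard parts remain getting cheap, reasonably accurate
timestamps, i.e. snags 1) and 2) above.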