Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers

From Andres Freund
Subject Re: BitmapHeapScan streaming read user and prelim refactoring
Date
Msg-id kwpndt7lqr4finrxc76p6cfgfin2o57e6ldg2eysmzchykdlxl@cijo5ahsosl4
In response to Re: BitmapHeapScan streaming read user and prelim refactoring  (Tomas Vondra <tomas@vondra.me>)
Responses Re: BitmapHeapScan streaming read user and prelim refactoring
List pgsql-hackers
Hi,

On 2025-02-14 18:36:37 +0100, Tomas Vondra wrote:
> All of this is true, ofc, but maybe it's better to have a tool providing
> at least some advice

I agree, a tool like that would be useful!

One difficulty is that the relevant parameter space is really large, making it
hard to keep the runtime in a reasonable range...


> ? I'd imagine pg_test_fsync is affected by many of
> those issues too (considering both are about I/O).

I think pg_test_fsync is a bit less affected, because it doesn't have a high
queue depth, so it doesn't reach limits quite as quickly as something doing
higher-queue-depth IO.

Orthogonal aside: pg_test_fsync's numbers are not particularly helpful:

- It e.g. tests O_SYNC and O_DSYNC with O_DIRECT, while testing fsync/fdatasync
  with buffered IO.  But all of the sync methods apply to both buffered and
  direct IO (see the sketch at the end of this aside). And of course it'd be
  helpful to note which is being used, since postgres can do either...

- It tests larger write sizes only with O_SYNC, not with O_DSYNC, even though
  that information is a lot more relevant for O_DSYNC (since O_SYNC is useless)

- Only tests write sizes of up to 16kB, even though larger writes are
  extremely common and performance critical...

There's more, but it's already a long enough aside.
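
To make the first point concrete, here's a minimal sketch (not pg_test_fsync's
actual code; file name, sizes and alignment are arbitrary) showing that the
sync method and buffered vs. direct IO are independent choices that can all be
timed the same way:

/*
 * Minimal sketch (not pg_test_fsync code): the sync method (O_DSYNC vs.
 * fdatasync) and buffered vs. direct IO are independent choices, so any
 * combination can be timed the same way.
 */
#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void
time_one_write(int extra_flags, int use_fdatasync, size_t write_size)
{
    void       *buf;
    int         fd;

    /* O_DIRECT requires an aligned buffer; just always align it */
    if (posix_memalign(&buf, 4096, write_size) != 0)
        return;
    memset(buf, 0, write_size);

    fd = open("testfile", O_WRONLY | O_CREAT | extra_flags, 0600);
    if (fd < 0)
    {
        free(buf);
        return;
    }

    /* ... start timer ... */
    if (write(fd, buf, write_size) == (ssize_t) write_size && use_fdatasync)
        (void) fdatasync(fd);
    /* ... stop timer, report ... */

    close(fd);
    free(buf);
}

int
main(void)
{
    time_one_write(O_DSYNC, 0, 8192);            /* O_DSYNC, buffered */
    time_one_write(O_DSYNC | O_DIRECT, 0, 8192); /* O_DSYNC, direct */
    time_one_write(0, 1, 8192);                  /* fdatasync, buffered */
    time_one_write(O_DIRECT, 1, 8192);           /* fdatasync, direct */
    return 0;
}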



> I'd definitely not want initdb to do this automatically, though. Getting
> good numbers is fairly expensive (in time and I/O), can be flaky, etc.

Yea.


> But maybe having a tool that gives you a bunch of numbers, as input for
> manual tuning, would be good enough?

I think it'd be useful. I'd perhaps make it an SQL callable tool though, so
it can be run in cloud environments.
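
For what it's worth, a sketch of how that could look as an SQL-callable
function in a C extension; the function name pg_test_io_distance is made up
and the actual benchmark body is elided:

/* Hypothetical sketch of exposing an IO benchmark as an SQL-callable
 * function, so it can also be run in cloud environments where you can't
 * log into the host.  The name is made up. */
#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_test_io_distance);

Datum
pg_test_io_distance(PG_FUNCTION_ARGS)
{
    int32       max_distance = PG_GETARG_INT32(0);

    /* ... issue reads with increasing prefetch distance, up to
     * max_distance, and measure throughput / latency here ... */

    PG_RETURN_FLOAT8(0.0);      /* e.g. best observed throughput */
}

Something like that could then be invoked from any client, e.g.
SELECT pg_test_io_distance(256);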


> As you say, it's not just about the hardware (and how that changes over
> time because of "burst" credits etc.), but also about the workload.
> Would it be possible to track something, and adjust this dynamically
> over time? And then adjust the prefetch distance in some adaptive way?

Yes, I do think so!  It's not trivial, but I think we eventually do want it.

Melanie has worked on this a fair bit, fwiw.

My current thinking is that we'd want something very roughly like TCP
BBR. Basically, instead of estimating the available bandwidth just via lost
packets - the traditional approach - BBR builds a continually updated model
of "bytes in flight" and latency, and uses that model to predict what
bandwidth is achievable.
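
Very roughly, and purely as a sketch with made-up names (this isn't existing
postgres code, Melanie's or otherwise), the shape of such a model might be:

#include <stddef.h>

/*
 * Sketch of a BBR-style model: track the highest observed delivery rate and
 * the lowest observed per-IO latency, and aim to keep roughly a
 * bandwidth-delay product's worth of IO in flight.
 */
typedef struct IoRateModel
{
    double      max_bytes_per_sec;  /* highest observed delivery rate */
    double      min_latency_sec;    /* lowest observed IO completion latency */
} IoRateModel;

/* Feed one (sampled) completion into the model. */
static void
io_model_observe(IoRateModel *m, size_t bytes, double elapsed_sec,
                 double latency_sec)
{
    double      rate = bytes / elapsed_sec;

    /* real BBR ages these estimates out over a sliding window, elided here */
    if (rate > m->max_bytes_per_sec)
        m->max_bytes_per_sec = rate;
    if (m->min_latency_sec == 0 || latency_sec < m->min_latency_sec)
        m->min_latency_sec = latency_sec;
}

/* How many IOs of a given size should we try to keep in flight? */
static int
io_model_target_distance(const IoRateModel *m, size_t io_size)
{
    double      bdp = m->max_bytes_per_sec * m->min_latency_sec;

    return (int) (bdp / io_size) + 1;
}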

There are two snags on the way there:

1) Timestamps aren't cheap, so we probably can't do this for every IO.

   Modern NICs can associate timestamps with packets at the hardware level;
   we don't have that luxury.  (One way to cheapen this - only sampling every
   Nth IO - is sketched after this list.)

2) It's not always easy to know an accurate completion timestamp.

   E.g. if a backend fires off a bunch of reads via io_uring and then is busy
   doing CPU-bound work, we don't know how long ago the requests actually
   completed.

   We probably can approximate that though. Or we could use a background
   process or timer interrupt to add a timestamp to IOs.
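
For 1), one cheap-ish approach - purely a sketch, the names and the sampling
interval are made up - is to take a timestamp only for every Nth IO and feed
just those samples into the model:

/* Sketch of sampling timestamps for only every Nth IO, so the cost of
 * clock_gettime() isn't paid on every submission/completion. */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define IO_TIMING_SAMPLE_EVERY 64

static uint64_t io_counter = 0;

static inline bool
io_timing_sample(double *now_sec)
{
    struct timespec ts;

    if (io_counter++ % IO_TIMING_SAMPLE_EVERY != 0)
        return false;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    *now_sec = ts.tv_sec + ts.tv_nsec / 1e9;
    return true;
}

An IO would only carry start/completion times when io_timing_sample() returned
true at submission.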



> Perhaps it's a bit naive, but say we know what % of requests is handled
> from cache, how long it took, etc. We opportunistically increase the
> prefetch distance, and check the cache hit ratio after a while. Did it
> help (% increased, consumed less time), maybe try another step. If not,
> maybe try prefetching less?

I don't immediately understand how you can use the cache hit ratio here? We
will only read data if it was a cache miss, after all. And we keep the buffers
pinned, so they can't be thrown out.

Greetings,

Andres Freund


