Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers

From Andres Freund
Subject Re: BitmapHeapScan streaming read user and prelim refactoring
Date
Msg-id cg354qt5fuecp2gbdrervaf6psg3ycrwamvqrzfdlnvtgiwldy@dbz2zekqvawk
In response to Re: BitmapHeapScan streaming read user and prelim refactoring  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

On 2025-02-10 16:24:25 -0500, Robert Haas wrote:
> On Mon, Feb 10, 2025 at 1:11 PM Tomas Vondra <tomas@vondra.me> wrote:
> > Certainly for the "localized" regressions, and cases when bitmapheapscan
> > would not be picked. The eic=1 case makes me a bit more nervous, because
> > it's default and affects NVMe storage. Would be good to know why is
> > that, or perhaps consider bumping up eic default. Not sure.
> 
> I'm relatively upset by the fact that effective_io_concurrency is
> measured in units that are, AFAIUI, completely stupid. The original
> idea was that you would set it to the number of spindles you have. But
> from what I have heard, that didn't actually work: you needed to set
> it to a significantly higher number.

Which isn't too surprising.  I think there are a few main reasons #spindles
was way too low:

1) Decent interfaces to rotating media benefit very substantially from having
   multiple requests in flight per spindle. The disk can reorder the actual
   disk accesses so that each can be executed without needing, on average,
   half a rotation per request.  Similar reordering can also happen at the
   OS level, even if not quite to the same benefit.

2) Even if each spindle could only execute one IO request, you would still
   want to start those requests substantially earlier than
   effective_io_concurrency=#spindles permits, to leave time for the IOs to
   complete. With e.g. bitmap heap scans, there often are sequences of blocks
   that can be read together in one IO.

3) A single spindle commonly had multiple platters and a disk could read from
   multiple platters within a single rotation.


However, I'm not so sure that the unit of effective_io_concurrency is really
the main issue.  It IMO makes sense to limit the max number of requests one
scan can trigger, to prevent one scan from completely swamping the IO request
queue for the next several seconds (trivially possible without a limit on
smaller cloud storage disks with low iops).

It's plausible that we would want to separately control the "distance" from
the currently "consumed" position and the maximum-io-requests-in-flight.  But
I think we might be able to come up with a tuning algorithm that can tune the
distance based on the observed latency and "consumption" speed.


I think what we eventually should get rid of is the "effective_" prefix. My
understanding is that it's basically there to denote that we don't really
know what concurrency is actually achieved, as posix_fadvise() neither tells
us whether IO was necessary in the first place, nor whether multiple blocks
could be read in one IO, nor when the IO has completed.

The read_stream implementation already has redefined effective_io_concurrency
to only count one IO for a multi-block read.


Of course that doesn't mean we shouldn't change the default (we very clearly
should) or the documentation for the GUC.

Greetings,

Andres Freund
