Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: BitmapHeapScan streaming read user and prelim refactoring |
| Date | |
| Msg-id | cg354qt5fuecp2gbdrervaf6psg3ycrwamvqrzfdlnvtgiwldy@dbz2zekqvawk |
| In response to | Re: BitmapHeapScan streaming read user and prelim refactoring (Robert Haas <robertmhaas@gmail.com>) |
| List | pgsql-hackers |
Hi,

On 2025-02-10 16:24:25 -0500, Robert Haas wrote:
> On Mon, Feb 10, 2025 at 1:11 PM Tomas Vondra <tomas@vondra.me> wrote:
> > Certainly for the "localized" regressions, and cases when bitmapheapscan
> > would not be picked. The eic=1 case makes me a bit more nervous, because
> > it's default and affects NVMe storage. Would be good to know why is
> > that, or perhaps consider bumping up eic default. Not sure.
>
> I'm relatively upset by the fact that effective_io_concurrency is
> measured in units that are, AFAIUI, completely stupid. The original
> idea was that you would set it to the number of spindles you have. But
> from what I have heard, that didn't actually work: you needed to set
> it to a significantly higher number.

Which isn't too surprising. I think there are a few main reasons #spindles was way too low:

1) Decent interfaces to rotating media benefit very substantially from having multiple requests in flight for each spindle. The disk can reorder the actual disk accesses so that each can be executed without needing, on average, half a rotation per request. Similar reordering can also happen at the OS level, even if not quite to the same degree of benefit.

2) Even if each spindle could only execute one IO request at a time, you would still want to start requests substantially earlier than effective_io_concurrency=#spindles allows, to give the IOs time to complete. With e.g. bitmap heap scans, there often are sequences of blocks that can be read together in one IO.

3) A single spindle commonly had multiple platters, and a disk could read from multiple platters within a single rotation.

However, I'm not so sure that the unit of effective_io_concurrency is really the main issue. It IMO makes sense to limit the maximum number of requests one scan can trigger, to prevent a single scan from completely swamping the IO request queue for the next several seconds (trivially possible without a limit on smaller cloud storage disks with low IOPS). A toy sketch of such a per-scan cap is appended at the end of this mail.

It's plausible that we would want to separately control the "distance" from the currently "consumed" position and the maximum number of IO requests in flight. But I think we might be able to come up with a tuning algorithm that adjusts the distance based on the observed latency and "consumption" speed (a second sketch below illustrates that idea).

I think what we eventually should get rid of is the "effective_" prefix. My understanding is that it's basically there to denote that we don't really have an idea what concurrency is actually achieved, as posix_fadvise() tells us neither whether IO was necessary in the first place, nor whether multiple blocks could be read in one IO, nor when the IO has completed. The read_stream implementation has already redefined effective_io_concurrency to count only one IO for a multi-block read.

Of course that doesn't mean we shouldn't change the default (we very clearly should) or the documentation for the GUC.

Greetings,

Andres Freund
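A minimal, self-contained sketch of the two mechanisms discussed above: adjacent blocks are coalesced into one IO that counts as a single request (the read_stream-style counting), and a per-scan cap limits how many requests may be in flight at once. This is not PostgreSQL source; all names and constants here are invented for illustration:

```c
#include <stdio.h>

#define MAX_IOS_IN_FLIGHT 2     /* stand-in for a per-scan request cap */
#define MAX_COMBINE_BLOCKS 16   /* max blocks merged into a single IO */

typedef struct
{
    int ios_in_flight;          /* requests issued but not yet completed */
} ScanIoState;

/*
 * Issue one IO covering as many consecutive blocks as possible, starting
 * at blocks[i].  Returns the index just past the blocks this IO consumed.
 */
static int
start_coalesced_io(ScanIoState *scan, const int *blocks, int i, int nblocks)
{
    int start = blocks[i];
    int len = 1;

    while (i + len < nblocks &&
           blocks[i + len] == start + len &&
           len < MAX_COMBINE_BLOCKS)
        len++;

    scan->ios_in_flight++;      /* one request, regardless of block count */
    printf("IO: blocks %d..%d (%d in flight)\n",
           start, start + len - 1, scan->ios_in_flight);
    return i + len;
}

int
main(void)
{
    /* A bitmap-like block sequence with three contiguous runs. */
    int blocks[] = {10, 11, 12, 40, 41, 90, 91, 92, 93};
    int nblocks = (int) (sizeof(blocks) / sizeof(blocks[0]));
    ScanIoState scan = {0};
    int i = 0;

    while (i < nblocks)
    {
        if (scan.ios_in_flight >= MAX_IOS_IN_FLIGHT)
        {
            scan.ios_in_flight--;   /* pretend the oldest IO completed */
            continue;
        }
        i = start_coalesced_io(&scan, blocks, i, nblocks);
    }
    return 0;
}
```

Under this counting, the nine blocks above cost only three requests against the cap, which is closer to the post-read_stream meaning of the GUC than counting nine separate fadvises would be.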
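And a similarly hypothetical sketch of the distance-tuning idea: grow the read-ahead distance while the consumer stalls waiting on IO, and let it decay while reads are served from cache. Again, nothing below is PostgreSQL source; note_read_result() and the doubling/decrement constants are made up to illustrate the shape of such a heuristic:

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    int distance;               /* how far ahead of consumption to read */
    int max_distance;           /* upper bound, e.g. derived from a GUC */
} StreamDistance;

static void
note_read_result(StreamDistance *s, bool hit_cache, bool consumer_waited)
{
    if (hit_cache)
    {
        /* IO wasn't needed; decay toward reading just-in-time. */
        if (s->distance > 1)
            s->distance--;
    }
    else if (consumer_waited)
    {
        /* We ran out of completed IOs; look further ahead next time. */
        s->distance *= 2;
        if (s->distance > s->max_distance)
            s->distance = s->max_distance;
    }
}

int
main(void)
{
    StreamDistance s = {1, 64};

    /* A cold scan: the consumer keeps waiting on IO, distance ramps up. */
    for (int i = 0; i < 4; i++)
    {
        note_read_result(&s, false, true);
        printf("after stall %d: distance = %d\n", i + 1, s.distance);
    }

    /* A fully cached stretch: distance decays again. */
    note_read_result(&s, true, false);
    printf("after cache hit: distance = %d\n", s.distance);
    return 0;
}
```

The attraction of something along these lines is that a fully cached scan converges toward issuing no advance requests at all, while a cold scan on high-latency storage keeps widening its window until the consumer stops waiting.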