effective_io_concurrency's steampunk spindle maths

From
Thomas Munro
Date:
Hello,

I was reading through some old threads[1][2][3] while trying to figure
out how to add a new GUC to control I/O prefetching for new kinds of
things[4][5], and enjoyed Simon Riggs' reference to Jules Verne in the
context of RAID spindles.

On 2 Sep 2015 14:54, "Andres Freund" <andres(at)anarazel(dot)de> wrote:
> > On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote:
> > Maybe the best thing we can do is just completely abandon the "number of
> > spindles" idea, and just say "number of I/O requests to prefetch". Possibly
> > with an explanation of how to estimate it (devices * queue length).
>
> I think that'd be a lot better.

+many, though I doubt I could describe how to estimate it myself,
considering cloud storage, SANs, multi-lane NVMe etc.  You basically
have to experiment, and like most of our resource consumption limits,
it's a per-backend limit anyway, so it's pretty complicated, but I
don't see how the harmonic series helps anyone.
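
(For anyone who hasn't looked at the old code recently: the maths in
question turns the GUC value n into an actual prefetch distance of roughly
n * H(n), the nth harmonic number.  A C sketch of that translation,
paraphrased rather than copied from the tree, with an illustrative
function name:)

  /*
   * Paraphrase of the old effective_io_concurrency -> prefetch distance
   * translation: with n "spindles", the target number of in-flight
   * requests is n/1 + n/2 + ... + n/n = n * H(n).
   */
  static double
  old_prefetch_target(int n_spindles)
  {
      double      target = 0.0;
      int         i;

      for (i = 1; i <= n_spindles; i++)
          target += (double) n_spindles / i;

      return target;          /* 1 -> 1, 10 -> ~29.3, 42 -> ~182 */
  }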

Should we rename it?  Here are my first suggestions:

random_page_prefetch_degree
maintenance_random_page_prefetch_degree

Rationale for this naming pattern:
* "random_page" from "random_page_cost"
* leaves room for a different setting for sequential prefetching
* "degree" conveys the idea without using loaded words like "queue"
that might imply we know something about the I/O subsystem or that
it's system-wide like kernel and device queues
* "maintenance_" prefix is like other GUCs that establish (presumably
larger) limits for processes working on behalf of many user sessions

Whatever we call it, I don't think it makes sense to try to model the
details of any particular storage system.  Let's use a simple counter
of I/Os initiated but not yet known to have completed (for now: it has
definitely completed when the associated pread() completes; perhaps
something involving real async I/O completion notification in later
releases).
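
In other words, nothing fancier than a counter compared against the GUC.
Something like this, with made-up names purely to illustrate the
accounting:

  /*
   * Illustrative sketch only (hypothetical names, not patch code): the
   * GUC is simply a cap on the number of I/Os this operation has started
   * but not yet seen complete.
   */
  static int  ios_in_flight = 0;      /* per backend, per operation */

  static bool
  may_start_io(int io_concurrency)
  {
      return ios_in_flight < io_concurrency;
  }

  static void
  io_started(void)
  {
      ios_in_flight++;                /* prefetch request issued */
  }

  static void
  io_completed(void)
  {
      ios_in_flight--;                /* for now: the pread() returned */
  }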

[1]
https://www.postgresql.org/message-id/flat/CAHyXU0yaUG9R_E5%3D1gdXhD-MpWR%3DGr%3D4%3DEHFD_fRid2%2BSCQrqA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/Pine.GSO.4.64.0809220317320.20434%40westnet.com
[3] https://www.postgresql.org/message-id/flat/FDDBA24E-FF4D-4654-BA75-692B3BA71B97%40enterprisedb.com
[4]
https://www.postgresql.org/message-id/flat/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CA%2BTgmoZP-CTmEPZdmqEOb%2B6t_Tts2nuF7eoqxxvXEHaUoBDmsw%40mail.gmail.com



Re: effective_io_concurrency's steampunk spindle maths

From
Andres Freund
Date:
Hi,

On 2020-03-02 18:28:41 +1300, Thomas Munro wrote:
> I was reading through some old threads[1][2][3] while trying to figure
> out how to add a new GUC to control I/O prefetching for new kinds of
> things[4][5], and enjoyed Simon Riggs' reference to Jules Verne in the
> context of RAID spindles.
>
> On 2 Sep 2015 14:54, "Andres Freund" <andres(at)anarazel(dot)de> wrote:
> > > On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote:
> > > Maybe the best thing we can do is just completely abandon the "number of
> > > spindles" idea, and just say "number of I/O requests to prefetch". Possibly
> > > with an explanation of how to estimate it (devices * queue length).
> >
> > I think that'd be a lot better.
>
> +many, though I doubt I could describe how to estimate it myself,
> considering cloud storage, SANs, multi-lane NVMe etc.  You basically
> have to experiment, and like most of our resource consumption limits,
> it's a per-backend limit anyway, so it's pretty complicated, but I
> don't see how the harmonic series helps anyone.
>
> Should we rename it?  Here are my first suggestions:

Why rename? It's not like anybody knew how to infer a useful value for
effective_io_concurrency, given the math computing the actually
effective prefetch distance... I feel like we'll just unnecessarily
cause people difficulty by doing so.


> random_page_prefetch_degree
> maintenance_random_page_prefetch_degree

I don't like these names.


> Rationale for this naming pattern:
> * "random_page" from "random_page_cost"

I don't think we want to corner ourselves into only ever using these for
random I/O.


> * leaves room for a different setting for sequential prefetching

I think if we want to split those at some point, we ought to split them
when we have a good reason, not before. It's not at all clear to me why
you'd want a substantially different queue depth for the two.


> * "degree" conveys the idea without using loaded words like "queue"
> that might imply we know something about the I/O subsystem or that
> it's system-wide like kernel and device queues

Why is that good? Queue depth is a pretty well established term. You can
search for benchmarks of devices with it, you can correlate with OS
config, etc.


> * "maintenance_" prefix is like other GUCs that establish (presumably
> larger) limits for processes working on behalf of many user sessions

That part makes sense to me.


> Whatever we call it, I don't think it makes sense to try to model the
> details of any particular storage system.  Let's use a simple counter
> of I/Os initiated but not yet known to have completed (for now: it has
> definitely completed when the associated pread() completes; perhaps
> something involving real async I/O completion notification in later
> releases).

+1

Greetings,

Andres Freund



Re: effective_io_concurrency's steampunk spindle maths

From
Tomas Vondra
Date:
On Fri, Mar 06, 2020 at 10:05:13AM -0800, Andres Freund wrote:
>Hi,
>
>On 2020-03-02 18:28:41 +1300, Thomas Munro wrote:
>> I was reading through some old threads[1][2][3] while trying to figure
>> out how to add a new GUC to control I/O prefetching for new kinds of
>> things[4][5], and enjoyed Simon Riggs' reference to Jules Verne in the
>> context of RAID spindles.
>>
>> On 2 Sep 2015 14:54, "Andres Freund" <andres(at)anarazel(dot)de> wrote:
>> > > On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote:
>> > > Maybe the best thing we can do is just completely abandon the "number of
>> > > spindles" idea, and just say "number of I/O requests to prefetch". Possibly
>> > > with an explanation of how to estimate it (devices * queue length).
>> >
>> > I think that'd be a lot better.
>>
>> +many, though I doubt I could describe how to estimate it myself,
>> considering cloud storage, SANs, multi-lane NVMe etc.  You basically
>> have to experiment, and like most of our resource consumption limits,
>> it's a per-backend limit anyway, so it's pretty complicated, but I
>> don't see how the harmonic series helps anyone.
>>
>> Should we rename it?  Here are my first suggestions:
>
>Why rename? It's not like anybody knew how to infer a useful value for
>effective_io_concurrency, given the math computing the actually
>effective prefetch distance... I feel like we'll just unnecessarily
>cause people difficulty by doing so.
>

I think the main issue with keeping the current GUC name is that if you
had a value that worked, we'll silently interpret it differently. Which
is a bit annoying :-(

So I think we should either rename e_i_c or keep it as is, and then also
have a new GUC. And then translate the values between those (but that
might be overkill).

>
>> random_page_prefetch_degree
>> maintenance_random_page_prefetch_degree
>
>I don't like these names.
>

What about these names?

  * effective_io_prefetch_distance
  * effective_io_prefetch_queue
  * effective_io_queue_depth

>
>> Rationale for this naming pattern:
>> * "random_page" from "random_page_cost"
>
>I don't think we want to corner ourselves into only ever using these for
>random I/O.
>

+1

>
>> * leaves room for a different setting for sequential prefetching
>
>I think if we want to split those at some point, we ought to split them
>when we have a good reason, not before. It's not at all clear to me why
>you'd want a substantially different queue depth for the two.
>

+1

>
>> * "degree" conveys the idea without using loaded words like "queue"
>> that might imply we know something about the I/O subsystem or that
>> it's system-wide like kernel and device queues
>
>Why is that good? Queue depth is a pretty well established term. You can
>search for benchmarks of devices with it, you can correlate with OS
>config, etc.
>

I mostly agree. With "queue depth" people have a fairly good idea what
they're setting, while with "degree" that's pretty unlikely I think.

>
>> * "maintenance_" prefix is like other GUCs that establish (presumably
>> larger) limits for processes working on behalf of many user sessions
>
>That part makes sense to me.
>
>
>> Whatever we call it, I don't think it makes sense to try to model the
>> details of any particular storage system.  Let's use a simple counter
>> of I/Os initiated but not yet known to have completed (for now: it has
>> definitely completed when the associated pread() completes; perhaps
>> something involving real async I/O completion notification in later
>> releases).
>
>+1
>

Agreed.


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: effective_io_concurrency's steampunk spindle maths

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> So I think we should either rename e_i_c or keep it as is, and then also
> have a new GUC. And then translate the values between those (but that
> might be overkill).

Please DON'T try to have two interrelated GUCs for this.  We learned
our lesson about that years ago.

I think dropping the existing GUC is a perfectly sane thing to do,
if the new definition wouldn't be compatible.  In practice few
people will notice, because few will have set it.

            regards, tom lane



Re: effective_io_concurrency's steampunk spindle maths

From
Michael Banck
Date:
Hi,

On Mon, Mar 02, 2020 at 06:28:41PM +1300, Thomas Munro wrote:
> Should we rename it?  Here are my first suggestions:
> 
> maintenance_random_page_prefetch_degree
> 
> Rationale for this naming pattern:
[...]
> * "maintenance_" prefix is like other GUCs that establish (presumably
> larger) limits for processes working on behalf of many user sessions

I'm a bit skeptical about this - at least in V12 there are only two GUCs
with 'maintenance' in the name: maintenance_work_mem and
max_parallel_maintenance_workers. Both are used for utility commands and
do not apply to regular user queries, while (AFAICT) your proposal is not
limited to utility commands. So I think if you name it
'maintenance'-something, people will assume it only involves VACUUM or
so.


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID number: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Managing Directors: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the
following provisions: https://www.credativ.de/datenschutz



Re: effective_io_concurrency's steampunk spindle maths

From
Thomas Munro
Date:
On Sat, Mar 7, 2020 at 9:52 AM Michael Banck <michael.banck@credativ.de> wrote:
> On Mon, Mar 02, 2020 at 06:28:41PM +1300, Thomas Munro wrote:
> > * "maintenance_" prefix is like other GUCs that establish (presumably
> > larger) limits for processes working on behalf of many user sessions
>
> I'm a bit skeptical about this - at least in V12 there are only two GUCs
> with 'maintenance' in the name: maintenance_work_mem and
> max_parallel_maintenance_workers. Both are used for utility commands and
> do not apply to regular user queries, while (AFAICT) your proposal is not
> limited to utility commands. So I think if you name it
> 'maintenance'-something, people will assume it only involves VACUUM or
> so.

No, the proposal is not for the "maintenance" GUC to affect user
queries.  The idea is that the "maintenance" GUC would be used for WAL
prefetching during recovery[1], index prefetch during VACUUM[2] and
probably some other proposed things that are in development relating
to background "undo" processing.  What these things have in common, as
Andres first articulated on thread [2] is that they all deal with a
workload that is correlated with the activities of multiple user
backends running concurrently.  That's the basic idea of the WAL
prefetching patch: even though all backends suffer from I/O stalls due
to cache misses, usually that's happening concurrently in many
backends.  A streaming replica that is trying to follow along
replaying the write-workload of the primary has to suffer all those
stalls sequentially, so I'm trying to recreate some I/O parallelism by
looking ahead in the WAL. The theory with the two GUCs is that a user
backend should be able to use some I/O parallelism, but a
"maintenance" job like the WAL prefetcher should be allowed to use a
lot more.  That's why the existing VACUUM code mentioned in thread [2]
already does "+ 10".

Maybe "maintenance" isn't the best word for this, but that's the idea.

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CA%2BTgmoZP-CTmEPZdmqEOb%2B6t_Tts2nuF7eoqxxvXEHaUoBDmsw%40mail.gmail.com



Re: effective_io_concurrency's steampunk spindle maths

From
Thomas Munro
Date:
On Sat, Mar 7, 2020 at 8:00 AM Andres Freund <andres@anarazel.de> wrote:
> On 2020-03-02 18:28:41 +1300, Thomas Munro wrote:
> > * leaves room for a different setting for sequential prefetching
>
> I think if we want to split those at some point, we ought to split them
> when we have a good reason, not before. It's not at all clear to me why
> you'd want a substantially different queue depth for the two.

Alright, I retract that part.  It's known that at least on some
systems you might want to suppress that (due to some kind of bad
interaction with kernel readahead heuristics).  But that isn't really
an argument for having a different queue size, it's an argument for
having a separate on/off switch.

> > * "degree" conveys the idea without using loaded words like "queue"
> > that might imply we know something about the I/O subsystem or that
> > it's system-wide like kernel and device queues
>
> Why is that good? Queue depth is a pretty well established term. You can
> search for benchmarks of devices with it, you can correlate with OS
> config, etc.

Queue depth is the standard term for an I/O queue that is shared by
all users.  What we're talking about here is undeniably also a queue
with a depth, but it's a limit on the amount of concurrent I/O that
*each operator in a query* will try to initiate (for example: each
bitmap heap scan in the query, in future perhaps btree scans and other
things), so I was thinking that we might want a different name.

The more I think about this the more I appreciate the current vague GUC name!



Re: effective_io_concurrency's steampunk spindle maths

From
Thomas Munro
Date:
On Sat, Mar 7, 2020 at 8:35 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I think the main issue with keeping the current GUC name is that if you
> had a value that worked, we'll silently interpret it differently. Which
> is a bit annoying :-(

Yeah.  Perhaps we should just give the formula for translating v12
settings to v13 settings in the release notes.  If we don't rename the
GUC, you won't be forced to contemplate this when you upgrade, so the
amount of prefetching we do will go down a bit given the same value.
That is indeed what led me to start thinking about what a good new
name would be.  Now that I've been talked out of the "random_page"
part, your names look like sensible candidates, but I wonder if there
is some way to capture that it's "per operation"...



Re: effective_io_concurrency's steampunk spindle maths

From
Evgeniy Shishkin
Date:

> On Mar 7, 2020, at 00:33, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> That is indeed what led me to start thinking about what a good new
> name would be.

MySQL has a term io_capacity.
https://dev.mysql.com/doc/refman/8.0/en/innodb-configuring-io-capacity.html
> The innodb_io_capacity variable defines the overall I/O capacity
> available to InnoDB. It should be set to approximately the number of I/O
> operations that the system can perform per second (IOPS). When
> innodb_io_capacity is set, InnoDB estimates the I/O bandwidth available
> for background tasks based on the set value.

Perhaps we can have maintenance_io_capacity as well.


Re: effective_io_concurrency's steampunk spindle maths

From
Thomas Munro
Date:
On Sat, Mar 7, 2020 at 11:54 PM Evgeniy Shishkin <itparanoia@gmail.com> wrote:
> > On Mar 7, 2020, at 00:33, Thomas Munro <thomas.munro@gmail.com> wrote:
> > That is indeed what led me to start thinking about what a good new
> > name would be.
>
> MySQL has a term io_capacity.
> https://dev.mysql.com/doc/refman/8.0/en/innodb-configuring-io-capacity.html
> > The innodb_io_capacity variable defines the overall I/O capacity
> > available to InnoDB. It should be set to approximately the number of
> > I/O operations that the system can perform per second (IOPS). When
> > innodb_io_capacity is set, InnoDB estimates the I/O bandwidth
> > available for background tasks based on the set value.
>
> Perhaps we can have maintenance_io_capacity as well.

That sounds like total I/O capacity for your system that will be
shared out for various tasks, which would definitely be nice to have,
but here we're talking about a simpler per-operation setting.  What
we have is a bit like work_mem (a memory limit used for each
individual hash, sort, tuplestore, ...), compared to a hypothetical
whole-system memory budget (which would definitely also be nice to
have).



Re: effective_io_concurrency's steampunk spindle maths

From
Thomas Munro
Date:
On Sat, Mar 7, 2020 at 9:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> > So I think we should either rename e_i_c or keep it as is, and then also
> > have a new GUC. And then translate the values between those (but that
> > might be overkill).
>
> Please DON'T try to have two interrelated GUCs for this.  We learned
> our lesson about that years ago.

Ack.

> I think dropping the existing GUC is a perfectly sane thing to do,
> if the new definition wouldn't be compatible.  In practice few
> people will notice, because few will have set it.

That's what I thought too, but if Andres is right that "it's not like
anybody knew how to infer a useful value", I'm wondering if it's enough
if we just provide an explanation of the change in the release notes.
The default doesn't change (1 goes to 1), so most people will
experience no change, but if you had it set to (say) 42 after careful
experimentation, you might like to consider updating it to the result
of:

  select round(sum(42 / n::float)) as new_setting
  from generate_series(1, 42) s(n);

Here's a patch set to remove the spindle stuff, add a maintenance
variant, and use the maintenance one in
heap_compute_xid_horizon_for_tuples().


Re: effective_io_concurrency's steampunk spindle maths

From
Thomas Munro
Date:
On Tue, Mar 10, 2020 at 12:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a patch set to remove the spindle stuff, add a maintenance
> variant, and use the maintenance one in
> heap_compute_xid_horizon_for_tuples().

Pushed.



Re: effective_io_concurrency's steampunk spindle maths

From
Peter Geoghegan
Date:
On Sun, Mar 15, 2020 at 9:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Mar 10, 2020 at 12:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Here's a patch set to remove the spindle stuff, add a maintenance
> > variant, and use the maintenance one in
> > heap_compute_xid_horizon_for_tuples().
>
> Pushed.

Shouldn't you close out the "Should we rename
effective_io_concurrency?" Postgres 13 open item now?

-- 
Peter Geoghegan



Re: effective_io_concurrency's steampunk spindle maths

From
Thomas Munro
Date:
On Wed, May 13, 2020 at 6:58 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Shouldn't you close out the "Should we rename
> effective_io_concurrency?" Postgres 13 open item now?

Yeah, that doesn't really seem worth the churn.  I'll move it to the
resolved list in a day or two if no one shows up to argue for a
rename.