Thread: effective_io_concurrency's steampunk spindle maths
Hello,

I was reading through some old threads[1][2][3] while trying to figure
out how to add a new GUC to control I/O prefetching for new kinds of
things[4][5], and enjoyed Simon Riggs' reference to Jules Verne in the
context of RAID spindles.

On 2 Sep 2015 14:54, "Andres Freund" <andres(at)anarazel(dot)de> wrote:
> > On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote:
> > Maybe the best thing we can do is just completely abandon the "number of
> > spindles" idea, and just say "number of I/O requests to prefetch". Possibly
> > with an explanation of how to estimate it (devices * queue length).
>
> I think that'd be a lot better.

+many, though I doubt I could describe how to estimate it myself,
considering cloud storage, SANs, multi-lane NVMe etc. You basically
have to experiment, and like most of our resource consumption limits,
it's a per-backend limit anyway, so it's pretty complicated, but I
don't see how the harmonic series helps anyone.

Should we rename it? Here are my first suggestions:

random_page_prefetch_degree
maintenance_random_page_prefetch_degree

Rationale for this naming pattern:

* "random_page" from "random_page_cost"
* leaves room for a different setting for sequential prefetching
* "degree" conveys the idea without using loaded words like "queue"
  that might imply we know something about the I/O subsystem or that
  it's system-wide like kernel and device queues
* "maintenance_" prefix is like other GUCs that establish (presumably
  larger) limits for processes working on behalf of many user sessions

Whatever we call it, I don't think it makes sense to try to model the
details of any particular storage system.  Let's use a simple counter
of I/Os initiated but not yet known to have completed (for now: it has
definitely completed when the associated pread() completes; perhaps
something involving real async I/O completion notification in later
releases).

[1] https://www.postgresql.org/message-id/flat/CAHyXU0yaUG9R_E5%3D1gdXhD-MpWR%3DGr%3D4%3DEHFD_fRid2%2BSCQrqA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/Pine.GSO.4.64.0809220317320.20434%40westnet.com
[3] https://www.postgresql.org/message-id/flat/FDDBA24E-FF4D-4654-BA75-692B3BA71B97%40enterprisedb.com
[4] https://www.postgresql.org/message-id/flat/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CA%2BTgmoZP-CTmEPZdmqEOb%2B6t_Tts2nuF7eoqxxvXEHaUoBDmsw%40mail.gmail.com
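For reference, the "spindle maths" being discussed amounts to turning a
drive count n into a prefetch distance of roughly n * H(n), where H(n) is
the nth harmonic number.  A rough C sketch of that conversion follows; the
function name is made up and this is illustrative only, not the actual
server code:

#include <math.h>

/*
 * Old-style interpretation: treat the GUC value as a number of drives n
 * and derive a prefetch distance of n/1 + n/2 + ... + n/n, rounded.
 */
static int
spindles_to_prefetch_distance(int spindles)
{
    double      target = 0.0;

    for (int i = 1; i <= spindles; i++)
        target += (double) spindles / (double) i;

    return (int) rint(target);  /* e.g. 1 -> 1, 2 -> 3, 4 -> 8 */
}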
Hi,

On 2020-03-02 18:28:41 +1300, Thomas Munro wrote:
> I was reading through some old threads[1][2][3] while trying to figure
> out how to add a new GUC to control I/O prefetching for new kinds of
> things[4][5], and enjoyed Simon Riggs' reference to Jules Verne in the
> context of RAID spindles.
>
> On 2 Sep 2015 14:54, "Andres Freund" <andres(at)anarazel(dot)de> wrote:
> > > On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote:
> > > Maybe the best thing we can do is just completely abandon the "number of
> > > spindles" idea, and just say "number of I/O requests to prefetch". Possibly
> > > with an explanation of how to estimate it (devices * queue length).
> >
> > I think that'd be a lot better.
>
> +many, though I doubt I could describe how to estimate it myself,
> considering cloud storage, SANs, multi-lane NVMe etc. You basically
> have to experiment, and like most of our resource consumption limits,
> it's a per-backend limit anyway, so it's pretty complicated, but I
> don't see how the harmonic series helps anyone.
>
> Should we rename it? Here are my first suggestions:

Why rename? It's not like anybody knew how to infer a useful value for
effective_io_concurrency, given the math computing the actually
effective prefetch distance... I feel like we'll just unnecessarily
cause people difficulty by doing so.

> random_page_prefetch_degree
> maintenance_random_page_prefetch_degree

I don't like these names.

> Rationale for this naming pattern:
> * "random_page" from "random_page_cost"

I don't think we want to corner ourselves into only ever using these for
random I/O.

> * leaves room for a different setting for sequential prefetching

I think if we want to split those at some point, we ought to split it if
we have a good reason, not before. It's not at all clear to me why you'd
want a substantially different queue depth for both.

> * "degree" conveys the idea without using loaded words like "queue"
> that might imply we know something about the I/O subsystem or that
> it's system-wide like kernel and device queues

Why is that good? Queue depth is a pretty well established term. You can
search for benchmarks of devices with it, you can correlate with OS
config, etc.

> * "maintenance_" prefix is like other GUCs that establish (presumably
> larger) limits for processes working on behalf of many user sessions

That part makes sense to me.

> Whatever we call it, I don't think it makes sense to try to model the
> details of any particular storage system. Let's use a simple counter
> of I/Os initiated but not yet known to have completed (for now: it has
> definitely completed when the associated pread() completes; perhaps
> something involving real async I/O completion notification in later
> releases).

+1

Greetings,

Andres Freund
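To make the counter model that both posts agree on concrete, here is a
minimal C sketch of a per-operation in-flight counter.  The type and
function names are hypothetical, not server code; the only point is that
the setting becomes a plain cap on I/Os issued but not yet observed to
complete:

#include <stdbool.h>

/*
 * Hypothetical sketch: nothing is modelled about the storage system, the
 * GUC is simply an upper bound on in-flight requests per operation.
 */
typedef struct PrefetchState
{
    int         inflight;       /* prefetches issued, completion not yet seen */
    int         max_inflight;   /* the GUC value, used directly */
} PrefetchState;

static inline bool
can_issue_prefetch(const PrefetchState *ps)
{
    return ps->inflight < ps->max_inflight;
}

static inline void
prefetch_issued(PrefetchState *ps)
{
    ps->inflight++;
}

/* For now, "completed" means the corresponding pread() has returned. */
static inline void
prefetch_completed(PrefetchState *ps)
{
    ps->inflight--;
}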
On Fri, Mar 06, 2020 at 10:05:13AM -0800, Andres Freund wrote:
>Hi,
>
>On 2020-03-02 18:28:41 +1300, Thomas Munro wrote:
>> I was reading through some old threads[1][2][3] while trying to figure
>> out how to add a new GUC to control I/O prefetching for new kinds of
>> things[4][5], and enjoyed Simon Riggs' reference to Jules Verne in the
>> context of RAID spindles.
>>
>> On 2 Sep 2015 14:54, "Andres Freund" <andres(at)anarazel(dot)de> wrote:
>> > > On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote:
>> > > Maybe the best thing we can do is just completely abandon the "number of
>> > > spindles" idea, and just say "number of I/O requests to prefetch". Possibly
>> > > with an explanation of how to estimate it (devices * queue length).
>> >
>> > I think that'd be a lot better.
>>
>> +many, though I doubt I could describe how to estimate it myself,
>> considering cloud storage, SANs, multi-lane NVMe etc. You basically
>> have to experiment, and like most of our resource consumption limits,
>> it's a per-backend limit anyway, so it's pretty complicated, but I
>> don't see how the harmonic series helps anyone.
>>
>> Should we rename it? Here are my first suggestions:
>
>Why rename? It's not like anybody knew how to infer a useful value for
>effective_io_concurrency, given the math computing the actually
>effective prefetch distance... I feel like we'll just unnecessarily
>cause people difficulty by doing so.
>

I think the main issue with keeping the current GUC name is that if you
had a value that worked, we'll silently interpret it differently. Which
is a bit annoying :-(

So I think we should either rename e_i_c or keep it as is, and then also
have a new GUC. And then translate the values between those (but that
might be overkill).

>
>> random_page_prefetch_degree
>> maintenance_random_page_prefetch_degree
>
>I don't like these names.
>

What about these names?

 * effective_io_prefetch_distance
 * effective_io_prefetch_queue
 * effective_io_queue_depth

>
>> Rationale for this naming pattern:
>> * "random_page" from "random_page_cost"
>
>I don't think we want to corner ourselves into only ever using these for
>random I/O.
>

+1

>
>> * leaves room for a different setting for sequential prefetching
>
>I think if we want to split those at some point, we ought to split it if
>we have a good reason, not before. It's not at all clear to me why you'd
>want a substantially different queue depth for both.
>

+1

>
>> * "degree" conveys the idea without using loaded words like "queue"
>> that might imply we know something about the I/O subsystem or that
>> it's system-wide like kernel and device queues
>
>Why is that good? Queue depth is a pretty well established term. You can
>search for benchmarks of devices with it, you can correlate with OS
>config, etc.
>

I mostly agree. With "queue depth" people have a fairly good idea what
they're setting, while with "degree" that's pretty unlikely, I think.

>
>> * "maintenance_" prefix is like other GUCs that establish (presumably
>> larger) limits for processes working on behalf of many user sessions
>
>That part makes sense to me.
>
>
>> Whatever we call it, I don't think it makes sense to try to model the
>> details of any particular storage system. Let's use a simple counter
>> of I/Os initiated but not yet known to have completed (for now: it has
>> definitely completed when the associated pread() completes; perhaps
>> something involving real async I/O completion notification in later
>> releases).
>
>+1
>

Agreed.
--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> So I think we should either rename e_i_c or keep it as is, and then also
> have a new GUC. And then translate the values between those (but that
> might be overkill).

Please DON'T try to have two interrelated GUCs for this.  We learned
our lesson about that years ago.

I think dropping the existing GUC is a perfectly sane thing to do,
if the new definition wouldn't be compatible.  In practice few
people will notice, because few will have set it.

			regards, tom lane
Hi,

On Mon, Mar 02, 2020 at 06:28:41PM +1300, Thomas Munro wrote:
> Should we rename it? Here are my first suggestions:
>
> maintenance_random_page_prefetch_degree
>
> Rationale for this naming pattern:

[...]

> * "maintenance_" prefix is like other GUCs that establish (presumably
> larger) limits for processes working on behalf of many user sessions

I'm a bit skeptical about this - at least in V12 there are only two GUCs
with 'maintenance' in the name: maintenance_work_mem and
max_parallel_maintenance_workers. Both are used for utility commands and
do not apply to regular user queries, while (AFAICT) your proposal is not
limited to utility commands. So I think if you name it
'maintenance'-something, people will assume it only involves VACUUM or
so.


Michael

--
Michael Banck
Project Lead / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Managing Directors: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the following terms:
https://www.credativ.de/datenschutz
On Sat, Mar 7, 2020 at 9:52 AM Michael Banck <michael.banck@credativ.de> wrote:
> On Mon, Mar 02, 2020 at 06:28:41PM +1300, Thomas Munro wrote:
> > * "maintenance_" prefix is like other GUCs that establish (presumably
> > larger) limits for processes working on behalf of many user sessions
>
> I'm a bit skeptical about this - at least in V12 there are only two GUCs
> with 'maintenance' in the name: maintenance_work_mem and
> max_parallel_maintenance_workers. Both are used for utility commands and
> do not apply to regular user queries, while (AFAICT) your proposal is not
> limited to utility commands. So I think if you name it
> 'maintenance'-something, people will assume it only involves VACUUM or
> so.

No, the proposal is not for the "maintenance" GUC to affect user
queries.  The idea is that the "maintenance" GUC would be used for WAL
prefetching during recovery[1], index prefetching during VACUUM[2] and
probably some other proposed things that are in development relating
to background "undo" processing.  What these things have in common, as
Andres first articulated on thread [2], is that they all deal with a
workload that is correlated with the activities of multiple user
backends running concurrently.

That's the basic idea of the WAL prefetching patch: even though all
backends suffer from I/O stalls due to cache misses, usually that's
happening concurrently in many backends.  A streaming replica that is
trying to follow along replaying the write workload of the primary has
to suffer all those stalls sequentially, so I'm trying to recreate
some I/O parallelism by looking ahead in the WAL.

The theory with the two GUCs is that a user backend should be able to
use some I/O parallelism, but a "maintenance" job like the WAL
prefetcher should be allowed to use a lot more.  That's why the
existing VACUUM code mentioned in thread [2] already does "+ 10".
Maybe "maintenance" isn't the best word for this, but that's the idea.

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CA%2BTgmoZP-CTmEPZdmqEOb%2B6t_Tts2nuF7eoqxxvXEHaUoBDmsw%40mail.gmail.com
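A hypothetical C sketch of that split, assuming a per-query setting named
effective_io_concurrency and a maintenance variant along the lines of
maintenance_io_concurrency; the function, the variable declarations and
the example defaults below are made up for illustration, not taken from
any patch:

#include <stdbool.h>

/* Example values only; the real settings come from the GUC machinery. */
static int  effective_io_concurrency = 1;       /* per-query operators */
static int  maintenance_io_concurrency = 10;    /* work done for many sessions */

/*
 * Sketch: pick the in-flight I/O cap for an operation.  Recovery
 * prefetching, VACUUM-style index prefetching and similar tasks act on
 * behalf of many user backends at once, so they get the (presumably
 * larger) maintenance limit instead of an ad-hoc bump such as
 * "effective_io_concurrency + 10".
 */
static int
prefetch_limit_for(bool for_maintenance_work)
{
    return for_maintenance_work ? maintenance_io_concurrency
                                : effective_io_concurrency;
}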
On Sat, Mar 7, 2020 at 8:00 AM Andres Freund <andres@anarazel.de> wrote:
> On 2020-03-02 18:28:41 +1300, Thomas Munro wrote:
> > * leaves room for a different setting for sequential prefetching
>
> I think if we want to split those at some point, we ought to split it if
> we have a good reason, not before. It's not at all clear to me why you'd
> want a substantially different queue depth for both.

Alright, I retract that part.  It's known that at least on some
systems you might want to suppress sequential prefetching (due to some
kind of bad interaction with kernel readahead heuristics).  But that
isn't really an argument for having a different queue size, it's an
argument for having a separate on/off switch.

> > * "degree" conveys the idea without using loaded words like "queue"
> > that might imply we know something about the I/O subsystem or that
> > it's system-wide like kernel and device queues
>
> Why is that good? Queue depth is a pretty well established term. You can
> search for benchmarks of devices with it, you can correlate with OS
> config, etc.

Queue depth is the standard term for an I/O queue that is shared by
all users.  What we're talking about here is undeniably also a queue
with a depth, but it's a limit on the amount of concurrent I/O that
*each operator in a query* will try to initiate (for example: each
bitmap heap scan in the query, and in future perhaps btree scans and
other things), so I was thinking that we might want a different name.
The more I think about this the more I appreciate the current vague
GUC name!
On Sat, Mar 7, 2020 at 8:35 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I think the main issue with keeping the current GUC name is that if you
> had a value that worked, we'll silently interpret it differently. Which
> is a bit annoying :-(

Yeah.  Perhaps we should just give the formula for translating v12
settings to v13 settings in the release notes.  If we don't rename the
GUC, you won't be forced to contemplate this when you upgrade, so the
amount of prefetching we do will go down a bit given the same value.
That is indeed what led me to start thinking about what a good new
name would be.

Now that I've been talked out of the "random_page" part, your names
look like sensible candidates, but I wonder if there is some way to
capture that it's "per operation"...
> On Mar 7, 2020, at 00:33, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> That is indeed what led me to start thinking about what a good new
> name would be.

MySQL has a term io_capacity:
https://dev.mysql.com/doc/refman/8.0/en/innodb-configuring-io-capacity.html

> The innodb_io_capacity variable defines the overall I/O capacity
> available to InnoDB. It should be set to approximately the number of
> I/O operations that the system can perform per second (IOPS). When
> innodb_io_capacity is set, InnoDB estimates the I/O bandwidth available
> for background tasks based on the set value.

Perhaps we can have maintenance_io_capacity as well.
On Sat, Mar 7, 2020 at 11:54 PM Evgeniy Shishkin <itparanoia@gmail.com> wrote:
> > On Mar 7, 2020, at 00:33, Thomas Munro <thomas.munro@gmail.com> wrote:
> > That is indeed what led me to start thinking about what a good new
> > name would be.
>
> MySQL has a term io_capacity:
> https://dev.mysql.com/doc/refman/8.0/en/innodb-configuring-io-capacity.html
>
> > The innodb_io_capacity variable defines the overall I/O capacity
> > available to InnoDB. It should be set to approximately the number of
> > I/O operations that the system can perform per second (IOPS). When
> > innodb_io_capacity is set, InnoDB estimates the I/O bandwidth available
> > for background tasks based on the set value.
>
> Perhaps we can have maintenance_io_capacity as well.

That sounds like total I/O capacity for your system that will be
shared out for various tasks, which would definitely be nice to have,
but here we're talking about a simpler per-operation setting.  What we
have is a bit like work_mem (a memory limit used for each individual
hash, sort, tuplestore, ...), compared to a hypothetical whole-system
memory budget (which would definitely also be nice to have).
On Sat, Mar 7, 2020 at 9:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> > So I think we should either rename e_i_c or keep it as is, and then also
> > have a new GUC. And then translate the values between those (but that
> > might be overkill).
>
> Please DON'T try to have two interrelated GUCs for this.  We learned
> our lesson about that years ago.

Ack.

> I think dropping the existing GUC is a perfectly sane thing to do,
> if the new definition wouldn't be compatible.  In practice few
> people will notice, because few will have set it.

That's what I thought too, but if Andres is right that "it's not like
anybody knew how to infer a useful value", I'm wondering if it's enough
to just provide an explanation of the change in the release notes.  The
default doesn't change (1 goes to 1), so most people will experience no
change, but if you had it set to (say) 42 after careful experimentation,
you might like to consider updating it to the result of:

  select round(sum(42 / n::float)) as new_setting
    from generate_series(1, 42) s(n)

Here's a patch set to remove the spindle stuff, add a maintenance
variant, and use the maintenance one in
heap_compute_xid_horizon_for_tuples().
On Tue, Mar 10, 2020 at 12:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a patch set to remove the spindle stuff, add a maintenance
> variant, and use the maintenance one in
> heap_compute_xid_horizon_for_tuples().

Pushed.
On Sun, Mar 15, 2020 at 9:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Mar 10, 2020 at 12:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Here's a patch set to remove the spindle stuff, add a maintenance
> > variant, and use the maintenance one in
> > heap_compute_xid_horizon_for_tuples().
>
> Pushed.

Shouldn't you close out the "Should we rename
effective_io_concurrency?" Postgres 13 open item now?

--
Peter Geoghegan
On Wed, May 13, 2020 at 6:58 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Shouldn't you close out the "Should we rename
> effective_io_concurrency?" Postgres 13 open item now?

Yeah, a rename doesn't really seem worth the churn.  I'll move it to the
resolved list in a day or two if no one shows up to argue for a rename.