Thread: cost based vacuum (parallel)
For parallel vacuum [1], we were discussing the best way to divide the cost among parallel workers, but we didn't get many inputs apart from people who are very actively involved in patch development. I feel that we need some more inputs before we finalize anything, so I am starting a new thread.

The initial version of the patch has a very rudimentary way of doing it, which means each parallel vacuum worker operates independently w.r.t. vacuum delay and cost. This will lead to more I/O in the system than the user intended. Assume that the overall I/O allowed for a vacuum operation is X, after which it will sleep for some time, reset the balance, and continue. In the patch, each worker is allowed to perform X of I/O before it needs to sleep, and there is also no coordination with the master backend, which would have done some I/O for the heap. So, in the worst-case scenario, there can be n times more I/O, where n is the number of workers doing the parallel operation.

This is somewhat similar to a memory usage problem with a parallel query, where each worker is allowed to use up to work_mem of memory. We could say that users running a parallel operation can expect more system resources to be used because they want the operation done faster, so we are fine with this. However, I am not sure that is the right thing, so we should try to come up with some solution, and if the solution is too complex, then we can probably think of documenting such behavior.

The two approaches to solve this problem being discussed in that thread [1] are as follows:

(a) Allow the parallel workers and master backend to have a shared view of vacuum cost related parameters (mainly VacuumCostBalance) and allow each worker to update it and then decide based on that whether it needs to sleep. Sawada-san has done a POC for this approach; see v32-0004-PoC-shared-vacuum-cost-balance in email [2]. One drawback of this approach could be that we allow a worker to sleep even though the I/O has been performed by some other worker.

(b) The other idea is to split the I/O among workers, similar to what we do for autovacuum workers (see autovac_balance_cost). The basic idea would be that before launching workers, we compute the remaining I/O (the heap operation would already have used some), after which we need to sleep, and split it equally across workers. Here, we are primarily thinking of dividing the VacuumCostBalance and VacuumCostLimit parameters. Once the workers are finished, they need to let the master backend know how much I/O they have consumed, and then the master backend can add it to its current I/O consumed. I think we also need to rebalance the cost of the remaining workers once some of the workers exit. Dilip has prepared a POC patch for this; see 0002-POC-divide-vacuum-cost-limit in email [3].

I think approach-2 is better in throttling the system as it doesn't have the drawback of the first approach, but it might be a bit tricky to implement. As of now, POCs for both approaches have been developed and we see similar results for both, but we have only tested simpler cases where each worker has a similar amount of I/O to perform.

Thoughts?

[1] - https://commitfest.postgresql.org/25/1774/
[2] - https://www.postgresql.org/message-id/CAD21AoAqT17QwKJ_sWOqRxNvg66wMw1oZZzf9Rt-E-zD%2BXOh_Q%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAFiTN-thU-z8f04jO7xGMu5yUUpTpsBTvBrFW6EhRf-jGvEz%3Dg%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
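For concreteness, a minimal sketch of what approach (a) could look like at the worker's delay point, assuming a shared-memory struct protected by a spinlock. All names here are hypothetical illustrations, not the actual PoC code from [2]:

/* Hypothetical sketch of approach (a); not the actual PoC code. */
typedef struct LVShared
{
    slock_t     mutex;
    int         cost_balance;   /* shared equivalent of VacuumCostBalance */
} LVShared;

static void
vacuum_delay_point_shared(LVShared *shared)
{
    bool        need_sleep = false;

    SpinLockAcquire(&shared->mutex);
    shared->cost_balance += VacuumCostBalance;  /* publish my local I/O cost */
    if (shared->cost_balance >= VacuumCostLimit)
    {
        shared->cost_balance = 0;   /* reset, as in the single-process case */
        need_sleep = true;
    }
    SpinLockRelease(&shared->mutex);

    VacuumCostBalance = 0;
    if (need_sleep)
        pg_usleep((long) (VacuumCostDelay * 1000));     /* ms -> us */
}

The drawback mentioned above is visible here: whichever worker happens to tip the shared balance over the limit is the one that sleeps, regardless of which worker contributed most of the cost.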
> This is somewhat similar to a memory usage problem with a
> parallel query where each worker is allowed to use up to work_mem of
> memory. We can say that the users using parallel operation can expect
> more system resources to be used as they want to get the operation
> done faster, so we are fine with this. However, I am not sure if that
> is the right thing, so we should try to come up with some solution for
> it and if the solution is too complex, then probably we can think of
> documenting such behavior.
In cloud environments (Amazon + gp2) there's a budget on input/output operations. If you exceed it for a long time, everything starts looking like you're working with a floppy disk.

For ease of configuration, I would need a "max_vacuum_disk_iops" that would limit the number of input/output operations performed by all of the vacuums in the system. If I set it to less than the budget refill rate, I can be sure that no vacuum runs fast enough to impact any sibling query.

There's also value in a non-throttled VACUUM for smaller tables. On gp2 such things will be consumed out of the surge budget, whose size is known to the sysadmin. Let's call it "max_vacuum_disk_surge_iops": if a relation has fewer blocks than this value and the situation is blocking in any way (antiwraparound, interactive console, ...), go ahead and run without throttling.

As for how to balance the cost: if we know the number of vacuum processes that were running in the previous second, we can just divide this iteration's budget slot by that number.

To correct for overshoots, we can subtract the previous second's overshoot from the next one's budget. That would also allow accounting for surge budget usage and letting it refill, pausing all autovacuum for some time after a manual one.

Accounting for and limiting the operation count more often than once a second isn't beneficial for this use case.

Please don't forget that processing one page can turn into several IOPS (read, write, WAL).
Does this make sense? :)
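A rough sketch of the once-per-second accounting described above. Everything here is hypothetical: the "max_vacuum_disk_iops" GUC is the one proposed in the message, and the helper functions are illustrative placeholders:

/* Hypothetical once-per-second budget refill; not PostgreSQL code. */
static double overshoot = 0.0;      /* ops consumed beyond the last slot */

static void
refill_vacuum_iops_budget(void)
{
    int         nvacuums = count_running_vacuums();     /* hypothetical helper */
    double      slot;

    if (nvacuums <= 0)
        nvacuums = 1;

    /* split the global budget by the number of vacuums seen last second */
    slot = (double) max_vacuum_disk_iops / nvacuums;

    /* subtract last second's overshoot before handing out new budget */
    slot = (slot > overshoot) ? slot - overshoot : 0.0;

    /* hypothetical helper: distributes the slot, returns ops used over it */
    overshoot = hand_out_budget(slot);
}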
On Mon, Nov 4, 2019 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I think approach-2 is better in throttling the system as it doesn't
> have the drawback of the first approach, but it might be a bit tricky
> to implement.

I might be missing something, but I think the drawback of approach-1 could also exist in approach-2, depending on which index pages are loaded in shared buffers and on the vacuum delay setting. Is that right?

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 4, 2019 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Nov 4, 2019 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I think approach-2 is better in throttling the system as it doesn't
> > have the drawback of the first approach, but it might be a bit tricky
> > to implement.
>
> I might be missing something, but I think the drawback of approach-1
> could also exist in approach-2, depending on which index pages are
> loaded in shared buffers and on the vacuum delay setting.
>

Can you be a bit more specific about this?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 4, 2019 at 1:03 PM Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
>
> In cloud environments (Amazon + gp2) there's a budget on input/output
> operations. If you exceed it for a long time, everything starts looking
> like you're working with a floppy disk.
>
> For ease of configuration, I would need a "max_vacuum_disk_iops" that
> would limit the number of input/output operations performed by all of
> the vacuums in the system. If I set it to less than the budget refill
> rate, I can be sure that no vacuum runs fast enough to impact any
> sibling query.
>
> There's also value in a non-throttled VACUUM for smaller tables. On gp2
> such things will be consumed out of the surge budget, whose size is
> known to the sysadmin. Let's call it "max_vacuum_disk_surge_iops": if a
> relation has fewer blocks than this value and the situation is blocking
> in any way (antiwraparound, interactive console, ...), go ahead and run
> without throttling.
>

I think the need for these things can be addressed by the current cost-based vacuum parameters; see the docs [1]. For example, if you set vacuum_cost_delay to zero, it will allow the operation to be performed without throttling.

> As for how to balance the cost: if we know the number of vacuum
> processes that were running in the previous second, we can just divide
> this iteration's budget slot by that number.
>
> To correct for overshoots, we can subtract the previous second's
> overshoot from the next one's budget. That would also allow accounting
> for surge budget usage and letting it refill, pausing all autovacuum
> for some time after a manual one.
>
> Accounting for and limiting the operation count more often than once a
> second isn't beneficial for this use case.
>

I think it is better if we find a way to rebalance the cost when a worker exits rather than every second, as it won't change unless a worker exits anyway.

[1] - https://www.postgresql.org/docs/devel/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-VACUUM-COST

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, 4 Nov 2019 at 19:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 4, 2019 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I might be missing something, but I think the drawback of approach-1
> > could also exist in approach-2, depending on which index pages are
> > loaded in shared buffers and on the vacuum delay setting.
> >
>
> Can you be a bit more specific about this?

Suppose there are two indexes: one index is entirely loaded in shared buffers while the other isn't. One vacuum worker, which processes the former index, hits all pages in shared buffers, while another worker, which processes the latter index, reads all pages from either the OS page cache or disk. Even if both the cost limit and the cost balance are split evenly among workers, because the costs of page hits and page misses differ, it's possible that one vacuum worker sleeps while other workers are doing I/O.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
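To put numbers on this asymmetry, using the default cost parameters (vacuum_cost_page_hit = 1, vacuum_cost_page_miss = 10, vacuum_cost_limit = 200): if the limit is split evenly between two workers (100 each), the worker whose index is fully cached must process about 100 pages (100 hits x 1) before it sleeps, while the worker reading from disk sleeps after only 10 pages (10 misses x 10). With a shared balance, the same asymmetry means the cached worker can be the one that happens to trip the limit and sleep, even though most of the accumulated cost came from the other worker's reads.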
On Mon, Nov 4, 2019 at 1:54 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> For parallel vacuum [1], we were discussing the best way to
> divide the cost among parallel workers, but we didn't get many inputs
> apart from people who are very actively involved in patch development.
> I feel that we need some more inputs before we finalize anything, so
> I am starting a new thread.
Maybe I just don't have experience with the type of system that parallel vacuum is needed for, but if any meaningful IO throttling is active, then what is the point of doing the vacuum in parallel in the first place?
Cheers,
Jeff
Hi,

On 2019-11-04 12:24:35 +0530, Amit Kapila wrote:
> The initial version of the patch has a very rudimentary way of doing
> it, which means each parallel vacuum worker operates independently
> w.r.t. vacuum delay and cost.

Yea, that seems not ok for cases where vacuum delay is active. There's also the question of when/why it is beneficial to use parallelism when you're going to encounter IO limits in all likelihood.

> This will lead to more I/O in the system than the user intended. [...]
> So, in the worst-case scenario, there can be n times more I/O, where n
> is the number of workers doing the parallel operation. This is somewhat
> similar to a memory usage problem with a parallel query, where each
> worker is allowed to use up to work_mem of memory.

I mean for parallel query the problem wasn't really introduced in parallel query, it existed before - and does still - for non-parallel queries. And there's a complex underlying planning issue. I don't think this is a good excuse for VACUUM, where none of the complex "number of paths considered" issues etc. apply.

> (a) Allow the parallel workers and master backend to have a shared
> view of vacuum cost related parameters (mainly VacuumCostBalance) and
> allow each worker to update it and then decide based on that whether
> it needs to sleep. [...] One drawback of this approach could be that we
> allow a worker to sleep even though the I/O has been performed by some
> other worker.

I don't understand this drawback.

> (b) The other idea is to split the I/O among workers, similar to what
> we do for autovacuum workers (see autovac_balance_cost). [...]

(b) doesn't strike me as advantageous. It seems quite possible that you end up with one worker that has a lot more IO than others, leading to unnecessary sleeps, even though the actually available IO budget has not been used up. Quite easy to see how that'd lead to parallel VACUUM having a lower throughput than a single-threaded one.

Greetings,

Andres Freund
Hi,

On 2019-11-04 12:59:02 -0500, Jeff Janes wrote:
> Maybe I just don't have experience with the type of system that parallel
> vacuum is needed for, but if any meaningful IO throttling is active,
> then what is the point of doing the vacuum in parallel in the first
> place?

I am wondering the same - but to be fair, it's pretty easy to run into cases where VACUUM is CPU bound. E.g. because most pages are in shared_buffers, and, compared to the size of the indexes, the number of tids that need to be pruned is fairly small (also [1]). That means a lot of pages need to be scanned, without a whole lot of IO going on. The problem with that is just that the defaults for vacuum throttling will also apply here; I've never seen anybody tune vacuum_cost_page_hit = 0, vacuum_cost_page_dirty = 0 or such (in contrast, the latter is the highest cost currently). Nor do we reduce the cost of vacuum_cost_page_dirty for unlogged tables.

So while it doesn't seem unreasonable to want to use cost limiting to protect against vacuum unexpectedly causing too much, especially read, IO, I'm doubtful it has current practical relevance.

I'm wondering how much of the benefit of parallel vacuum really is just to work around vacuum ringbuffers often massively hurting performance (see e.g. [2]). Surely not all, but I'd be very unsurprised if it were a large fraction.

Greetings,

Andres Freund

[1] I don't think the patch addresses this, IIUC it's only running index vacuums in parallel, but it's very easy to run into being CPU bottlenecked when vacuuming a busily updated table. heap_hot_prune can be really expensive, especially with longer update chains (I think it may even have an O(n^2) worst case).

[2] https://www.postgresql.org/message-id/20160406105716.fhk2eparljthpzp6%40alap3.anarazel.de
Greetings,

* Jeff Janes (jeff.janes@gmail.com) wrote:
> Maybe I just don't have experience with the type of system that parallel
> vacuum is needed for, but if any meaningful IO throttling is active,
> then what is the point of doing the vacuum in parallel in the first
> place?

With parallelization across indexes, you could have a situation where the individual indexes are on different tablespaces with independent i/o, and therefore the parallelization ends up giving you an increase in i/o throughput, not just additional CPU time.

Thanks,

Stephen
Hi,

On 2019-11-04 14:06:19 -0500, Stephen Frost wrote:
> With parallelization across indexes, you could have a situation where
> the individual indexes are on different tablespaces with independent
> i/o, and therefore the parallelization ends up giving you an increase
> in i/o throughput, not just additional CPU time.

How's that related to IO throttling being active or not?

Greetings,

Andres Freund
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2019-11-04 14:06:19 -0500, Stephen Frost wrote:
> > With parallelization across indexes, you could have a situation where
> > the individual indexes are on different tablespaces with independent
> > i/o, and therefore the parallelization ends up giving you an increase
> > in i/o throughput, not just additional CPU time.
>
> How's that related to IO throttling being active or not?

You might find that you have to throttle the IO down when operating exclusively against one IO channel, but if you have multiple IO channels then the acceptable IO utilization could be higher, as it would be spread across the different IO channels.

In other words, the overall i/o allowance for a given operation might be able to be higher if it's spread across multiple i/o channels, as it wouldn't completely consume the i/o resources of any of them, whereas with a higher allowance and a single i/o channel, there would likely be an impact on other operations.

Whether this is really relevant only for parallel operations is an interesting question: these considerations might not require actual parallel operations, as a single process might be able to go through multiple indexes concurrently and still hit the i/o limit that was set for it overall across the tablespaces. I don't know that it would actually be interesting or useful to spend the effort to make that work though, so, from a practical perspective, it's probably only interesting to think about this when talking about parallel vacuum.

I've been wondering if the accounting system should consider the cost per tablespace when there are multiple tablespaces involved, instead of throttling the overall process without consideration for the per-tablespace utilization.

Thanks,

Stephen
Hi,

On 2019-11-04 14:33:41 -0500, Stephen Frost wrote:
> You might find that you have to throttle the IO down when operating
> exclusively against one IO channel, but if you have multiple IO channels
> then the acceptable IO utilization could be higher, as it would be
> spread across the different IO channels.
> [...]

But you could just apply different budgets for different tablespaces? That's quite doable independent of parallelism, as we don't have tables or indexes spanning more than one tablespace. True, you could then make the processing of an individual vacuum faster by allowing it to utilize multiple tablespace budgets at the same time.

> I've been wondering if the accounting system should consider the cost
> per tablespace when there are multiple tablespaces involved, instead of
> throttling the overall process without consideration for the
> per-tablespace utilization.

This all seems like a feature proposal, or two, independent of the patch/question at hand. I think there's a good argument to be had that we should severely overhaul the current vacuum cost limiting - it's way way too hard to understand the bandwidth that it's allowed to consume. But unless one of the proposals makes that measurably harder or easier, I think we don't gain anything by entangling an already complex patchset with something new.

Greetings,

Andres Freund
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> But you could just apply different budgets for different tablespaces?

Yes, that would be one approach to addressing this, though it would change the existing meaning of those cost parameters. I'm not sure if we think that's an issue or not - if we only have this in the case of a parallel vacuum then it's probably fine; I'm less sure it'd be alright to change that on an upgrade.

> That's quite doable independent of parallelism, as we don't have tables
> or indexes spanning more than one tablespace. True, you could then make
> the processing of an individual vacuum faster by allowing it to utilize
> multiple tablespace budgets at the same time.

Yes, it's possible to do this independent of parallelism, but what I was trying to get at above is that it might not be worth the effort. When it comes to parallel vacuum though, I'm not sure that you can just punt on this question, since you'll naturally end up spanning multiple tablespaces concurrently, at least if the heap+indexes are spread across multiple tablespaces and you're operating against more than one of those relations at a time (which, I admit, I'm not 100% sure is actually happening with this proposed patch set - if it isn't, then this isn't really an issue, though that would be pretty unfortunate, as then you can't leverage multiple i/o channels concurrently, and therefore Jeff's question about why you'd be doing parallel vacuum with IO throttling is a pretty good one).

Thanks,

Stephen
On Mon, Nov 4, 2019 at 11:42 PM Andres Freund <andres@anarazel.de> wrote:
>
> > (a) Allow the parallel workers and master backend to have a shared
> > view of vacuum cost related parameters (mainly VacuumCostBalance) and
> > allow each worker to update it and then decide based on that whether
> > it needs to sleep. [...] One drawback of this approach could be that we
> > allow a worker to sleep even though the I/O has been performed by some
> > other worker.
>
> I don't understand this drawback.
>

I think the problem could be that the system is not properly throttled when it is supposed to be. Let me try with a simple example: say we have two workers, w-1 and w-2. w-2 is doing most of the I/O and w-1 is doing very little, but unfortunately whenever w-1 checks, it finds that the cost_limit has been exceeded and it goes to sleep, while w-2 continues. In such a situation, even though we have made one of the workers sleep for the required time, ideally the worker which was doing the I/O should have slept. The aim is to make the system stop doing I/O whenever the limit has been exceeded, and that might not happen in the above situation.

> > (b) The other idea is to split the I/O among workers, similar to what
> > we do for autovacuum workers (see autovac_balance_cost). [...]
>
> (b) doesn't strike me as advantageous. It seems quite possible that you
> end up with one worker that has a lot more IO than others, leading to
> unnecessary sleeps, even though the actually available IO budget has not
> been used up.
>

Yeah, this is possible, but to an extent this is possible in the current design as well, where we balance the cost among autovacuum workers. Now, it is quite possible that the current design itself is not good and we don't want to do the same thing in another place, but at least we will be consistent and can explain the overall behavior.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 4, 2019 at 11:58 PM Andres Freund <andres@anarazel.de> wrote:
>
> I am wondering the same - but to be fair, it's pretty easy to run into
> cases where VACUUM is CPU bound. [...]
>
> So while it doesn't seem unreasonable to want to use cost limiting to
> protect against vacuum unexpectedly causing too much, especially read,
> IO, I'm doubtful it has current practical relevance.
>

IIUC, you mean to say that it is not of much practical use to do a parallel vacuum if I/O throttling is enabled for an operation, is that right?

> I'm wondering how much of the benefit of parallel vacuum really is just
> to work around vacuum ringbuffers often massively hurting performance
> (see e.g. [2]).
>

Yeah, it is a good thing to check, but if anything, I think a parallel vacuum will further improve performance with larger ring buffers, as it will make it more CPU bound.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 5, 2019 at 1:12 AM Andres Freund <andres@anarazel.de> wrote:
> On 2019-11-04 14:33:41 -0500, Stephen Frost wrote:
> > I've been wondering if the accounting system should consider the cost
> > per tablespace when there are multiple tablespaces involved, instead of
> > throttling the overall process without consideration for the
> > per-tablespace utilization.
>
> This all seems like a feature proposal, or two, independent of the
> patch/question at hand. [...]
>

+1. I think even if we want something like per-tablespace costing for (parallel) vacuum, it should be done as a separate patch. It is a whole new area where we need to define the appropriate way to achieve it. It would change the current vacuum costing system in a big way, which I don't think is reasonable to do as part of a parallel vacuum patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 5, 2019 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 4, 2019 at 11:58 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > I'm wondering how much of the benefit of parallel vacuum really is just
> > to work around vacuum ringbuffers often massively hurting performance
> > (see e.g. [2]).
> >
>
> Yeah, it is a good thing to check, but if anything, I think a parallel
> vacuum will further improve performance with larger ring buffers, as it
> will make it more CPU bound.

I have tested the same, and the results show that by increasing the ring buffer size we can see a performance gain. And the gain is much larger with the parallel vacuum.

Test case:

create table test(a int, b int, c int, d int, e int, f int, g int, h int);
create index idx1 on test(a);
create index idx2 on test(b);
create index idx3 on test(c);
create index idx4 on test(d);
create index idx5 on test(e);
create index idx6 on test(f);
create index idx7 on test(g);
create index idx8 on test(h);
insert into test select i,i,i,i,i,i,i,i from generate_series(1,1000000) as i;
delete from test where a < 300000;

(I have tested the parallel vacuum and non-parallel vacuum with different ring buffer sizes.)

8 indexes
ring buffer size 246kB -> non-parallel: 7.6 seconds, parallel (2 workers): 3.9 seconds
ring buffer size 256MB -> non-parallel: 6.1 seconds, parallel (2 workers): 3.2 seconds

4 indexes
ring buffer size 246kB -> non-parallel: 4.8 seconds, parallel (2 workers): 3.2 seconds
ring buffer size 256MB -> non-parallel: 3.8 seconds, parallel (2 workers): 2.6 seconds

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,

On November 5, 2019 7:16:41 AM PST, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have tested the same, and the results show that by increasing the ring
> buffer size we can see a performance gain. And the gain is much larger
> with the parallel vacuum.
> [...]

Thanks!

> 8 indexes
> ring buffer size 246kB -> non-parallel: 7.6 seconds, parallel (2 workers): 3.9 seconds
> ring buffer size 256MB -> non-parallel: 6.1 seconds, parallel (2 workers): 3.2 seconds
>
> 4 indexes
> ring buffer size 246kB -> non-parallel: 4.8 seconds, parallel (2 workers): 3.2 seconds
> ring buffer size 256MB -> non-parallel: 3.8 seconds, parallel (2 workers): 2.6 seconds

What about the case of just disabling the ring buffer logic?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Tue, Nov 5, 2019 at 8:49 PM Andres Freund <andres@anarazel.de> wrote:
>
> What about the case of just disabling the ring buffer logic?
>

Repeated the same test by disabling the ring buffer logic. The results are almost the same as with increasing the ring buffer size. Tested with 4GB shared buffers:

8 indexes
use shared buffers -> non-parallel: 6.2 seconds, parallel (2 workers): 3.3 seconds

4 indexes
use shared buffers -> non-parallel: 3.8 seconds, parallel (2 workers): 2.7 seconds

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 5, 2019 at 1:42 AM Stephen Frost <sfrost@snowman.net> wrote:
> * Andres Freund (andres@anarazel.de) wrote:
> > That's quite doable independent of parallelism, as we don't have tables
> > or indexes spanning more than one tablespace. True, you could then make
> > the processing of an individual vacuum faster by allowing it to utilize
> > multiple tablespace budgets at the same time.
>
> Yes, it's possible to do this independent of parallelism, but what I was
> trying to get at above is that it might not be worth the effort. When
> it comes to parallel vacuum though, I'm not sure that you can just punt
> on this question, since you'll naturally end up spanning multiple
> tablespaces concurrently, at least if the heap+indexes are spread across
> multiple tablespaces and you're operating against more than one of those
> relations at a time
>

Each parallel worker operates on a separate index. It might be worth exploring per-tablespace vacuum throttling, but that should not be a requirement for the currently proposed patch. As per feedback in this thread, it seems that for now it is better if we allow a parallel vacuum only when I/O throttling is not enabled. We can later extend it based on feedback from the field once the feature starts getting used.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi,

On 2019-11-06 07:53:09 +0530, Amit Kapila wrote:
> As per feedback in this thread, it seems that for now it is better if
> we allow a parallel vacuum only when I/O throttling is not enabled. We
> can later extend it based on feedback from the field once the feature
> starts getting used.

That's not my read on this thread. I don't think we should introduce this feature without a solution for the throttling.

Greetings,

Andres Freund
On Wed, Nov 6, 2019 at 7:55 AM Andres Freund <andres@anarazel.de> wrote:
>
> That's not my read on this thread. I don't think we should introduce
> this feature without a solution for the throttling.
>

Okay, then I misunderstood your response to Jeff's email [1]. Anyway, we have already explored two different approaches, as mentioned in the initial email, which have somewhat similar results in initial tests. So we can explore more along those lines. Do you have any preference or any other idea?

[1] - https://www.postgresql.org/message-id/20191104182829.57bkz64qn5k3uwc3%40alap3.anarazel.de

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Greetings,

* Amit Kapila (amit.kapila16@gmail.com) wrote:
> Each parallel worker operates on a separate index. It might be worth
> exploring per-tablespace vacuum throttling, but that should not be a
> requirement for the currently proposed patch.

Right, that each operates on a separate index in parallel is what I had figured was probably happening, and that's why I brought up the question of "well, what does IO throttling mean when you've got multiple tablespaces involved with presumably independent IO channels...?" (or, at least, that's what I was trying to go for).

This isn't a question with the current system and the way the code works within a single vacuum operation, as we're never operating on more than one relation concurrently in that case.

Of course, we don't currently do anything to manage IO utilization across tablespaces when there are multiple autovacuum workers running concurrently, which I suppose goes to Andres' point that we aren't really doing anything to deal with this today, and therefore this is perhaps not all that new of an issue just with the addition of parallel vacuum. I'd still argue that it becomes a lot more apparent when you're talking about one parallel vacuum, but ultimately we should probably be thinking about how to manage the resources across all the vacuums and tablespaces and queries and such.

In an ideal world, we'd track the i/o from front-end queries, have some idea of the total i/o possible for each IO channel, and allow vacuum and whatever other background processes need to run to scale up and down, with enough buffer to avoid ever being maxed out on i/o, but keeping up a consistent rate of i/o that lets everything finish as quickly as possible.

Thanks,

Stephen
On Tue, Nov 5, 2019 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I think the problem could be that the system is not properly throttled
> when it is supposed to be. Let me try with a simple example: say we
> have two workers, w-1 and w-2. w-2 is doing most of the I/O and w-1 is
> doing very little, but unfortunately whenever w-1 checks, it finds that
> the cost_limit has been exceeded and it goes to sleep, while w-2
> continues. In such a situation, even though we have made one of the
> workers sleep for the required time, ideally the worker which was doing
> the I/O should have slept. The aim is to make the system stop doing I/O
> whenever the limit has been exceeded, and that might not happen in the
> above situation.
>

One idea to fix this drawback is that if we somehow avoid letting the workers that have done less or no I/O (compared to other workers) sleep, then we can to a good extent ensure that the workers doing more I/O will be throttled more. What we can do is allow a worker to sleep only if it has performed I/O above a certain threshold and the overall balance is more than the cost_limit set by the system. Then we allow the worker to sleep proportionally to the work done by it and reduce the VacuumSharedCostBalance by the amount consumed by the current worker. Something like:

if (VacuumSharedCostBalance >= VacuumCostLimit &&
    MyCostBalance > threshold * VacuumCostLimit / workers)
{
    VacuumSharedCostBalance -= MyCostBalance;
    Sleep(delay * MyCostBalance / VacuumSharedCostBalance);
}

Assume the threshold to be 0.5; what that means is, if the worker has done more than 50% of the work expected from it and the overall shared cost balance is exceeded, then we will consider this worker for sleeping.

What do you guys think?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
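A minimal sketch of how the above could look as a worker's delay point, assuming a shared-memory struct protected by a spinlock and a worker-local MyCostBalance counter. All names, the locking scheme, and the 0.5 threshold are assumptions taken from the example above, not actual patch code:

/* Hypothetical sketch of the proposal above; not actual patch code. */
#define SLEEP_THRESHOLD 0.5         /* assumed, per the example above */

static int  MyCostBalance = 0;      /* I/O cost accrued by this worker */

static void
parallel_vacuum_delay_point(LVShared *shared, int nworkers)
{
    double      msec = 0;

    /* fold the cost accrued since the last check into both balances */
    MyCostBalance += VacuumCostBalance;

    SpinLockAcquire(&shared->mutex);
    shared->cost_balance += VacuumCostBalance;
    if (shared->cost_balance >= VacuumCostLimit &&
        MyCostBalance > SLEEP_THRESHOLD * VacuumCostLimit / nworkers)
    {
        /* give my consumption back and sleep in proportion to my share */
        shared->cost_balance -= MyCostBalance;
        msec = VacuumCostDelay * MyCostBalance / Max(shared->cost_balance, 1);
        MyCostBalance = 0;
    }
    SpinLockRelease(&shared->mutex);

    VacuumCostBalance = 0;
    if (msec > 0)
        pg_usleep((long) (msec * 1000));    /* ms -> us */
}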
On Wed, Nov 6, 2019 at 9:21 AM Stephen Frost <sfrost@snowman.net> wrote:
>
> In an ideal world, we'd track the i/o from front-end queries, have some
> idea of the total i/o possible for each IO channel, and allow vacuum and
> whatever other background processes need to run to scale up and down,
> with enough buffer to avoid ever being maxed out on i/o, but keeping up
> a consistent rate of i/o that lets everything finish as quickly as
> possible.

IMHO, suppose in the future we improve the I/O throttling for each tablespace, maybe by maintaining an independent balance for the relation and each of its indexes, or perhaps a combined balance for the indexes that are on the same tablespace, with each balance checked against that tablespace's I/O limit. If we get such a mechanism, it seems it would extend easily to parallel vacuum, wouldn't it? Because across workers, too, we can track a tablespace-wise shared balance (if we go with the shared costing approach, for example).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, 6 Nov 2019 at 15:45, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Nov 5, 2019 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Nov 4, 2019 at 11:42 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > > > > > The two approaches to solve this problem being discussed in that > > > > thread [1] are as follows: > > > > (a) Allow the parallel workers and master backend to have a shared > > > > view of vacuum cost related parameters (mainly VacuumCostBalance) and > > > > allow each worker to update it and then based on that decide whether > > > > it needs to sleep. Sawada-San has done the POC for this approach. > > > > See v32-0004-PoC-shared-vacuum-cost-balance in email [2]. One > > > > drawback of this approach could be that we allow the worker to sleep > > > > even though the I/O has been performed by some other worker. > > > > > > I don't understand this drawback. > > > > > > > I think the problem could be that the system is not properly throttled > > when it is supposed to be. Let me try by a simple example, say we > > have two workers w-1 and w-2. The w-2 is primarily doing the I/O and > > w-1 is doing very less I/O but unfortunately whenever w-1 checks it > > finds that cost_limit has exceeded and it goes for sleep, but w-1 > > still continues. > > > > Typo in the above sentence. /but w-1 still continues/but w-2 still continues. > > > Now in such a situation even though we have made one > > of the workers slept for a required time but ideally the worker which > > was doing I/O should have slept. The aim is to make the system stop > > doing I/O whenever the limit has exceeded, so that might not work in > > the above situation. > > > > One idea to fix this drawback is that if we somehow avoid letting the > workers sleep which has done less or no I/O as compared to other > workers, then we can to a good extent ensure that workers which are > doing more I/O will be throttled more. What we can do is to allow any > worker sleep only if it has performed the I/O above a certain > threshold and the overall balance is more than the cost_limit set by > the system. Then we will allow the worker to sleep proportional to > the work done by it and reduce the VacuumSharedCostBalance by the > amount which is consumed by the current worker. Something like: > > If ( VacuumSharedCostBalance >= VacuumCostLimit && > MyCostBalance > (threshold) VacuumCostLimit / workers) > { > VacuumSharedCostBalance -= MyCostBalance; > Sleep (delay * MyCostBalance/VacuumSharedCostBalance) > } > > Assume threshold be 0.5, what that means is, if it has done work more > than 50% of what is expected from this worker and the overall share > cost balance is exceeded, then we will consider this worker to sleep. > > What do you guys think? I think the idea that workers consuming more I/O sleep for a longer time seems good. It doesn't seem to have the drawback of approach (b), which can unnecessarily delay vacuum if some indexes are very small or if bulk-deletion of an index does almost nothing, as with brin. But on the other hand, it's possible that workers don't sleep even though the shared cost balance already exceeds the limit, because sleeping requires that the local balance exceed the limit divided by the number of workers. For example, one worker is scheduled, does I/O, and exceeds the limit substantially while the other 2 workers do less I/O. And then the 2 workers are scheduled and consume I/O.
The total cost balance already exceeds the limit but the workers will not sleep as long as the local balance is less than (limit / # of workers). Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Nov 8, 2019 at 8:18 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Wed, 6 Nov 2019 at 15:45, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Nov 5, 2019 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Nov 4, 2019 at 11:42 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > > > > > > > > The two approaches to solve this problem being discussed in that > > > > > thread [1] are as follows: > > > > > (a) Allow the parallel workers and master backend to have a shared > > > > > view of vacuum cost related parameters (mainly VacuumCostBalance) and > > > > > allow each worker to update it and then based on that decide whether > > > > > it needs to sleep. Sawada-San has done the POC for this approach. > > > > > See v32-0004-PoC-shared-vacuum-cost-balance in email [2]. One > > > > > drawback of this approach could be that we allow the worker to sleep > > > > > even though the I/O has been performed by some other worker. > > > > > > > > I don't understand this drawback. > > > > > > > > > > I think the problem could be that the system is not properly throttled > > > when it is supposed to be. Let me try by a simple example, say we > > > have two workers w-1 and w-2. The w-2 is primarily doing the I/O and > > > w-1 is doing very less I/O but unfortunately whenever w-1 checks it > > > finds that cost_limit has exceeded and it goes for sleep, but w-1 > > > still continues. > > > > > > > Typo in the above sentence. /but w-1 still continues/but w-2 still continues. > > > > > Now in such a situation even though we have made one > > > of the workers slept for a required time but ideally the worker which > > > was doing I/O should have slept. The aim is to make the system stop > > > doing I/O whenever the limit has exceeded, so that might not work in > > > the above situation. > > > > > > > One idea to fix this drawback is that if we somehow avoid letting the > > workers sleep which has done less or no I/O as compared to other > > workers, then we can to a good extent ensure that workers which are > > doing more I/O will be throttled more. What we can do is to allow any > > worker sleep only if it has performed the I/O above a certain > > threshold and the overall balance is more than the cost_limit set by > > the system. Then we will allow the worker to sleep proportional to > > the work done by it and reduce the VacuumSharedCostBalance by the > > amount which is consumed by the current worker. Something like: > > > > If ( VacuumSharedCostBalance >= VacuumCostLimit && > > MyCostBalance > (threshold) VacuumCostLimit / workers) > > { > > VacuumSharedCostBalance -= MyCostBalance; > > Sleep (delay * MyCostBalance/VacuumSharedCostBalance) > > } > > > > Assume threshold be 0.5, what that means is, if it has done work more > > than 50% of what is expected from this worker and the overall share > > cost balance is exceeded, then we will consider this worker to sleep. > > > > What do you guys think? > > I think the idea that the more consuming I/O they sleep more longer > time seems good. There seems not to be the drawback of approach(b) > that is to unnecessarily delay vacuum if some indexes are very small > or bulk-deletions of indexes does almost nothing such as brin. But on > the other hand it's possible that workers don't sleep even if shared > cost balance already exceeds the limit because it's necessary for > sleeping that local balance exceeds the worker's limit divided by the > number of workers. 
For example, a worker is scheduled doing I/O and > exceeds the limit substantially while other 2 workers do less I/O. And > then the 2 workers are scheduled and consume I/O. The total cost > balance already exceeds the limit but the workers will not sleep as > long as the local balance is less than (limit / # of workers). > > Right, this is the reason I suggested keeping some threshold for the local balance (say 50% of (limit / # of workers)). I think we need to do some experiments to see what works best. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
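To put numbers on that trade-off (the values here are made up for illustration): with vacuum_cost_limit = 200 and 3 workers, each worker's share is about 66, and a 50% threshold puts the local bar at about 33. In the scenario above, if the heavy worker has a local balance of 160 while the two light workers sit at 20 each, the shared balance (200) has hit the limit; the light workers skip the sleep (20 < 33) and the heavy worker sleeps, which is the intended behaviour. The residual risk is the window where the shared balance has been exceeded but no worker has yet crossed its local bar, which is what the experiments below try to quantify.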
On Fri, Nov 8, 2019 at 8:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Fri, Nov 8, 2019 at 8:18 AM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > On Wed, 6 Nov 2019 at 15:45, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Nov 5, 2019 at 11:28 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Mon, Nov 4, 2019 at 11:42 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > > > > > > > > > > > The two approaches to solve this problem being discussed in that > > > > > > thread [1] are as follows: > > > > > > (a) Allow the parallel workers and master backend to have a shared > > > > > > view of vacuum cost related parameters (mainly VacuumCostBalance) and > > > > > > allow each worker to update it and then based on that decide whether > > > > > > it needs to sleep. Sawada-San has done the POC for this approach. > > > > > > See v32-0004-PoC-shared-vacuum-cost-balance in email [2]. One > > > > > > drawback of this approach could be that we allow the worker to sleep > > > > > > even though the I/O has been performed by some other worker. > > > > > > > > > > I don't understand this drawback. > > > > > > > > > > > > > I think the problem could be that the system is not properly throttled > > > > when it is supposed to be. Let me try by a simple example, say we > > > > have two workers w-1 and w-2. The w-2 is primarily doing the I/O and > > > > w-1 is doing very less I/O but unfortunately whenever w-1 checks it > > > > finds that cost_limit has exceeded and it goes for sleep, but w-1 > > > > still continues. > > > > > > > > > > Typo in the above sentence. /but w-1 still continues/but w-2 still continues. > > > > > > > Now in such a situation even though we have made one > > > > of the workers slept for a required time but ideally the worker which > > > > was doing I/O should have slept. The aim is to make the system stop > > > > doing I/O whenever the limit has exceeded, so that might not work in > > > > the above situation. > > > > > > > > > > One idea to fix this drawback is that if we somehow avoid letting the > > > workers sleep which has done less or no I/O as compared to other > > > workers, then we can to a good extent ensure that workers which are > > > doing more I/O will be throttled more. What we can do is to allow any > > > worker sleep only if it has performed the I/O above a certain > > > threshold and the overall balance is more than the cost_limit set by > > > the system. Then we will allow the worker to sleep proportional to > > > the work done by it and reduce the VacuumSharedCostBalance by the > > > amount which is consumed by the current worker. Something like: > > > > > > If ( VacuumSharedCostBalance >= VacuumCostLimit && > > > MyCostBalance > (threshold) VacuumCostLimit / workers) > > > { > > > VacuumSharedCostBalance -= MyCostBalance; > > > Sleep (delay * MyCostBalance/VacuumSharedCostBalance) > > > } > > > > > > Assume threshold be 0.5, what that means is, if it has done work more > > > than 50% of what is expected from this worker and the overall share > > > cost balance is exceeded, then we will consider this worker to sleep. > > > > > > What do you guys think? > > > > I think the idea that the more consuming I/O they sleep more longer > > time seems good. There seems not to be the drawback of approach(b) > > that is to unnecessarily delay vacuum if some indexes are very small > > or bulk-deletions of indexes does almost nothing such as brin. 
But on > > the other hand it's possible that workers don't sleep even if shared > > cost balance already exceeds the limit because it's necessary for > > sleeping that local balance exceeds the worker's limit divided by the > > number of workers. For example, a worker is scheduled doing I/O and > > exceeds the limit substantially while other 2 workers do less I/O. And > > then the 2 workers are scheduled and consume I/O. The total cost > > balance already exceeds the limit but the workers will not sleep as > > long as the local balance is less than (limit / # of workers). > > > > Right, this is the reason I told to keep some threshold for local > balance(say 50% of (limit / # of workers)). I think we need to do > some experiments to see what is the best thing to do. >

I have done some experiments along this line. I have first produced a case where we can show the problem with the existing shared costing patch (a worker which is doing less I/O might pay the penalty on behalf of the worker which is doing more I/O). I have also hacked the shared costing patch of Sawada-san so that a worker only goes to sleep if the shared balance has crossed the limit and its local balance has crossed some threshold[1].

Test setup: I have created 4 indexes on the table. Out of these, 3 indexes have a lot of pages to process but need to dirty only a few pages, whereas the 4th index has to process very few pages but needs to dirty all of them. I have attached the test script along with the mail. For each worker I have shown the delay time, the total I/O[1], and the page hit, page miss and page dirty counts.
[1] total I/O = _nhit * VacuumCostPageHit + _nmiss * VacuumCostPageMiss + _ndirty * VacuumCostPageDirty

patch 1: Shared costing patch: (delay condition -> VacuumSharedCostBalance > VacuumCostLimit)
worker 0 delay=80.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=40.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=110.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=120.98 total I/O=16378 hit=4318 miss=0 dirty=603

Observation1: I think it is clearly visible here that worker 3 is doing the least total I/O but delaying for the maximum amount of time. OTOH, worker 1 is delaying for very little time compared to how much I/O it is doing. To solve this problem, I have added a small tweak to the patch, wherein a worker will only sleep if its local balance has crossed some threshold. And we can see that with that change the problem is solved to quite an extent.

patch 2: Shared costing patch: (delay condition -> VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance > VacuumCostLimit/number of workers)
worker 0 delay=100.12 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=90.00 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=80.06 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.72 total I/O=16378 hit=4318 miss=0 dirty=603

Observation2: This patch solves the problem discussed with patch1, but in some extreme cases there is a possibility that the shared balance can become twice as much as the limit and still no worker goes for the delay. To solve that, there could be multiple ideas: a) Set a max limit on the shared balance, e.g. 1.5 * VacuumCostLimit, after which we apply the delay to whoever tries to do the I/O, irrespective of its local balance.
b) Set a somewhat lower value for the local threshold, e.g. 50% of the local limit.

Here I have changed patch2 as per (b): if the local balance reaches 50% of the local limit and the shared balance hits the vacuum cost limit, then go for the delay.

patch 3: Shared costing patch: (delay condition -> VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance > 0.5 * VacuumCostLimit/number of workers)
worker 0 delay=70.03 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=100.14 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=80.01 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=101.03 total I/O=16378 hit=4318 miss=0 dirty=603

Observation3: I think patch3 doesn't completely solve the issue discussed with patch1, but it is far better than patch1. However, patch2 might have the other problem discussed in observation2. I think I need to do some more analysis and experiments before we can reach a conclusion, but one point is clear: we need to do something to solve the problem observed with patch1 if we are going with the shared costing approach. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
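For quick reference, the three delay conditions being compared above reduce to the following C-style checks (a restatement of the experiment descriptions, not code taken from any of the patches; nworkers and the sleep itself are elided):

    /* patch 1: throttle on the shared balance alone */
    if (VacuumSharedCostBalance >= VacuumCostLimit)
        sleep();

    /* patch 2: additionally require a full per-worker share of local I/O */
    if (VacuumSharedCostBalance >= VacuumCostLimit &&
        VacuumLocalBalance > VacuumCostLimit / nworkers)
        sleep();

    /* patch 3: relax the local requirement to half a per-worker share */
    if (VacuumSharedCostBalance >= VacuumCostLimit &&
        VacuumLocalBalance > 0.5 * VacuumCostLimit / nworkers)
        sleep();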
On Fri, Nov 8, 2019 at 9:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have done some experiments on this line. I have first produced a > case where we can show the problem with the existing shared costing > patch (worker which is doing less I/O might pay the penalty on behalf > of the worker who is doing more I/O). I have also hacked the shared > costing patch of Swada-san so that worker only go for sleep if the > shared balance has crossed the limit and it's local balance has > crossed some threadshold[1]. > > Test setup: I have created 4 indexes on the table. Out of which 3 > indexes will have a lot of pages to process but need to dirty a few > pages whereas the 4th index will have to process a very less number of > pages but need to dirty all of them. I have attached the test script > along with the mail. I have shown what is the delay time each worker > have done. What is total I/O[1] each worker and what is the page hit, > page miss and page dirty count? > [1] total I/O = _nhit * VacuumCostPageHit + _nmiss * > VacuumCostPageMiss + _ndirty * VacuumCostPageDirty > > patch 1: Shared costing patch: (delay condition -> > VacuumSharedCostBalance > VacuumCostLimit) > worker 0 delay=80.00 total I/O=17931 hit=17891 miss=0 dirty=2 > worker 1 delay=40.00 total I/O=17931 hit=17891 miss=0 dirty=2 > worker 2 delay=110.00 total I/O=17931 hit=17891 miss=0 dirty=2 > worker 3 delay=120.98 total I/O=16378 hit=4318 miss=0 dirty=603 > > Observation1: I think here it's clearly visible that worker 3 is > doing the least total I/O but delaying for maximum amount of time. > OTOH, worker 1 is delaying for very little time compared to how much > I/O it is doing. So for solving this problem, I have add a small > tweak to the patch. Wherein the worker will only sleep if its local > balance has crossed some threshold. And, we can see that with that > change the problem is solved up to quite an extent. > > patch 2: Shared costing patch: (delay condition -> > VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance > > VacuumCostLimit/number of workers) > worker 0 delay=100.12 total I/O=17931 hit=17891 miss=0 dirty=2 > worker 1 delay=90.00 total I/O=17931 hit=17891 miss=0 dirty=2 > worker 2 delay=80.06 total I/O=17931 hit=17891 miss=0 dirty=2 > worker 3 delay=80.72 total I/O=16378 hit=4318 miss=0 dirty=603 > > Observation2: This patch solves the problem discussed with patch1 but > in some extreme cases there is a possibility that the shared limit can > become twice as much as local limit and still no worker goes for the > delay. For solving that there could be multiple ideas a) Set the max > limit on shared balance e.g. 1.5 * VacuumCostLimit after that we will > give the delay whoever tries to do the I/O irrespective of its local > balance. > b) Set a little lower value for the local threshold e.g 50% of the local limit > > Here I have changed the patch2 as per (b) If local balance reaches to > 50% of the local limit and shared balance hit the vacuum cost limit > then go for the delay. 
> > patch 3: Shared costing patch: (delay condition -> > VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance > 0.5 > * VacuumCostLimit/number of workers) > worker 0 delay=70.03 total I/O=17931 hit=17891 miss=0 dirty=2 > worker 1 delay=100.14 total I/O=17931 hit=17891 miss=0 dirty=2 > worker 2 delay=80.01 total I/O=17931 hit=17891 miss=0 dirty=2 > worker 3 delay=101.03 total I/O=16378 hit=4318 miss=0 dirty=603 > > Observation3: I think patch3 doesn't completely solve the issue > discussed in patch1 but its far better than patch1. > Yeah, I think it is difficult to get the exact balance, but we can try to be as close as possible. We can try to play with the threshold, and another possibility is to sleep in proportion to the amount of I/O done by the worker. Thanks for doing these experiments, but I think it is better if you can share the modified patches so that others can also reproduce what you are seeing. There is no need to post the entire parallel vacuum patch-set; the costing-related patch can be posted with a reference to which patches from the parallel vacuum thread it requires. Another option is to move this discussion to the parallel vacuum thread, but I think it is better to decide the costing model here. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 8, 2019 at 9:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have done some experiments on this line. I have first produced a > > case where we can show the problem with the existing shared costing > > patch (worker which is doing less I/O might pay the penalty on behalf > > of the worker who is doing more I/O). I have also hacked the shared > > costing patch of Swada-san so that worker only go for sleep if the > > shared balance has crossed the limit and it's local balance has > > crossed some threadshold[1]. > > > > Test setup: I have created 4 indexes on the table. Out of which 3 > > indexes will have a lot of pages to process but need to dirty a few > > pages whereas the 4th index will have to process a very less number of > > pages but need to dirty all of them. I have attached the test script > > along with the mail. I have shown what is the delay time each worker > > have done. What is total I/O[1] each worker and what is the page hit, > > page miss and page dirty count? > > [1] total I/O = _nhit * VacuumCostPageHit + _nmiss * > > VacuumCostPageMiss + _ndirty * VacuumCostPageDirty > > > > patch 1: Shared costing patch: (delay condition -> > > VacuumSharedCostBalance > VacuumCostLimit) > > worker 0 delay=80.00 total I/O=17931 hit=17891 miss=0 dirty=2 > > worker 1 delay=40.00 total I/O=17931 hit=17891 miss=0 dirty=2 > > worker 2 delay=110.00 total I/O=17931 hit=17891 miss=0 dirty=2 > > worker 3 delay=120.98 total I/O=16378 hit=4318 miss=0 dirty=603 > > > > Observation1: I think here it's clearly visible that worker 3 is > > doing the least total I/O but delaying for maximum amount of time. > > OTOH, worker 1 is delaying for very little time compared to how much > > I/O it is doing. So for solving this problem, I have add a small > > tweak to the patch. Wherein the worker will only sleep if its local > > balance has crossed some threshold. And, we can see that with that > > change the problem is solved up to quite an extent. > > > > patch 2: Shared costing patch: (delay condition -> > > VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance > > > VacuumCostLimit/number of workers) > > worker 0 delay=100.12 total I/O=17931 hit=17891 miss=0 dirty=2 > > worker 1 delay=90.00 total I/O=17931 hit=17891 miss=0 dirty=2 > > worker 2 delay=80.06 total I/O=17931 hit=17891 miss=0 dirty=2 > > worker 3 delay=80.72 total I/O=16378 hit=4318 miss=0 dirty=603 > > > > Observation2: This patch solves the problem discussed with patch1 but > > in some extreme cases there is a possibility that the shared limit can > > become twice as much as local limit and still no worker goes for the > > delay. For solving that there could be multiple ideas a) Set the max > > limit on shared balance e.g. 1.5 * VacuumCostLimit after that we will > > give the delay whoever tries to do the I/O irrespective of its local > > balance. > > b) Set a little lower value for the local threshold e.g 50% of the local limit > > > > Here I have changed the patch2 as per (b) If local balance reaches to > > 50% of the local limit and shared balance hit the vacuum cost limit > > then go for the delay. 
> > > patch 3: Shared costing patch: (delay condition -> > > VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance > 0.5 > > * VacuumCostLimit/number of workers) > > worker 0 delay=70.03 total I/O=17931 hit=17891 miss=0 dirty=2 > > worker 1 delay=100.14 total I/O=17931 hit=17891 miss=0 dirty=2 > > worker 2 delay=80.01 total I/O=17931 hit=17891 miss=0 dirty=2 > > worker 3 delay=101.03 total I/O=16378 hit=4318 miss=0 dirty=603 > > > > Observation3: I think patch3 doesn't completely solve the issue > > discussed in patch1 but its far better than patch1. > > > > Yeah, I think it is difficult to get the exact balance, but we can try > to be as close as possible. We can try to play with the threshold and > another possibility is to try to sleep in proportion to the amount of > I/O done by the worker.

I have done another experiment where I made 2 more changes on top of patch3:
a) Only reduce the local balance from the total shared balance whenever the worker is applying a delay.
b) Compute the delay based on the local balance.

patch4:
worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2
worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603

I think with this approach the delay is divided among the workers quite well compared to the other approaches.

> > Thanks for doing these experiments, but I think it is better if you > can share the modified patches so that others can also reproduce what > you are seeing. There is no need to post the entire parallel vacuum > patch-set, but the costing related patch can be posted with a > reference to what all patches are required from parallel vacuum > thread. Another option is to move this discussion to the parallel > vacuum thread, but I think it is better to decide the costing model > here.

I have attached the POC patches I have for testing. Steps for testing:
1. First, apply the parallel vacuum base patch and the shared costing patch [1].
2. Apply 0001-vacuum_costing_test.patch, attached to this mail.
3. Run the script shared in the previous mail [2] -- this will give the results for patch 1 shared upthread [2].
4. Apply patch shared_costing_plus_patch [2] or [3] or [4] to see the results with the different approaches explained in the mail.

[1] https://www.postgresql.org/message-id/CAD21AoAqT17QwKJ_sWOqRxNvg66wMw1oZZzf9Rt-E-zD%2BXOh_Q%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAFiTN-tFLN%3Dvdu5Ra-23E9_7Z1JXkk5MkRY3Bkj2zAoWK7fULA%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
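A minimal sketch of what the patch-4 behaviour above amounts to, assuming the same illustrative names as before (VacuumCostBalanceLocal for the per-worker balance, an atomic shared balance), with the 0.5 threshold from patch3; this follows changes (a) and (b) as described, though the exact formula in the posted POC may differ:

    if (pg_atomic_read_u32(VacuumSharedCostBalance) >= VacuumCostLimit &&
        VacuumCostBalanceLocal > 0.5 * (VacuumCostLimit / nworkers))
    {
        double  msec;

        /* (a) subtract only this worker's own consumption ... */
        pg_atomic_sub_fetch_u32(VacuumSharedCostBalance, VacuumCostBalanceLocal);

        /* (b) ... and sleep in proportion to that local balance */
        msec = VacuumCostDelay * VacuumCostBalanceLocal / VacuumCostLimit;
        pg_usleep((long) (msec * 1000));

        VacuumCostBalanceLocal = 0;
    }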
On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Nov 8, 2019 at 9:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I have done some experiments on this line. I have first produced a > > > case where we can show the problem with the existing shared costing > > > patch (worker which is doing less I/O might pay the penalty on behalf > > > of the worker who is doing more I/O). I have also hacked the shared > > > costing patch of Swada-san so that worker only go for sleep if the > > > shared balance has crossed the limit and it's local balance has > > > crossed some threadshold[1]. > > > > > > Test setup: I have created 4 indexes on the table. Out of which 3 > > > indexes will have a lot of pages to process but need to dirty a few > > > pages whereas the 4th index will have to process a very less number of > > > pages but need to dirty all of them. I have attached the test script > > > along with the mail. I have shown what is the delay time each worker > > > have done. What is total I/O[1] each worker and what is the page hit, > > > page miss and page dirty count? > > > [1] total I/O = _nhit * VacuumCostPageHit + _nmiss * > > > VacuumCostPageMiss + _ndirty * VacuumCostPageDirty > > > > > > patch 1: Shared costing patch: (delay condition -> > > > VacuumSharedCostBalance > VacuumCostLimit) > > > worker 0 delay=80.00 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 1 delay=40.00 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 2 delay=110.00 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 3 delay=120.98 total I/O=16378 hit=4318 miss=0 dirty=603 > > > > > > Observation1: I think here it's clearly visible that worker 3 is > > > doing the least total I/O but delaying for maximum amount of time. > > > OTOH, worker 1 is delaying for very little time compared to how much > > > I/O it is doing. So for solving this problem, I have add a small > > > tweak to the patch. Wherein the worker will only sleep if its local > > > balance has crossed some threshold. And, we can see that with that > > > change the problem is solved up to quite an extent. > > > > > > patch 2: Shared costing patch: (delay condition -> > > > VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance > > > > VacuumCostLimit/number of workers) > > > worker 0 delay=100.12 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 1 delay=90.00 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 2 delay=80.06 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 3 delay=80.72 total I/O=16378 hit=4318 miss=0 dirty=603 > > > > > > Observation2: This patch solves the problem discussed with patch1 but > > > in some extreme cases there is a possibility that the shared limit can > > > become twice as much as local limit and still no worker goes for the > > > delay. For solving that there could be multiple ideas a) Set the max > > > limit on shared balance e.g. 1.5 * VacuumCostLimit after that we will > > > give the delay whoever tries to do the I/O irrespective of its local > > > balance. > > > b) Set a little lower value for the local threshold e.g 50% of the local limit > > > > > > Here I have changed the patch2 as per (b) If local balance reaches to > > > 50% of the local limit and shared balance hit the vacuum cost limit > > > then go for the delay. 
> > > > patch 3: Shared costing patch: (delay condition -> > > > VacuumSharedCostBalance > VacuumCostLimit && VacuumLocalBalance > 0.5 > > > * VacuumCostLimit/number of workers) > > > worker 0 delay=70.03 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 1 delay=100.14 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 2 delay=80.01 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 3 delay=101.03 total I/O=16378 hit=4318 miss=0 dirty=603 > > > > > > Observation3: I think patch3 doesn't completely solve the issue > > > discussed in patch1 but its far better than patch1. > > > > > > > Yeah, I think it is difficult to get the exact balance, but we can try > > to be as close as possible. We can try to play with the threshold and > > another possibility is to try to sleep in proportion to the amount of > > I/O done by the worker. > I have done another experiment where I have done another 2 changes on > top op patch3 > a) Only reduce the local balance from the total shared balance > whenever it's applying delay > b) Compute the delay based on the local balance. > > patch4: > worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2 > worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2 > worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2 > worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603 > > I think with this approach the delay is divided among the worker quite > well compared to other approaches > > > > Thanks for doing these experiments, but I think it is better if you > can share the modified patches so that others can also reproduce what > you are seeing. There is no need to post the entire parallel vacuum > patch-set, but the costing related patch can be posted with a > reference to what all patches are required from parallel vacuum > thread. Another option is to move this discussion to the parallel > vacuum thread, but I think it is better to decide the costing model > here. > > I have attached the POC patches I have for testing. Step for testing > 1. First, apply the parallel vacuum base patch and the shared costing patch[1]. > 2. Apply 0001-vacuum_costing_test.patch attached in the mail > 3. Run the script shared in previous mail [2]. --> this will give the > results for patch 1 shared upthread[2] > 4. Apply patch shared_costing_plus_patch[2] or [3] or [4] to see the > results with different approaches explained in the mail. > > [1] https://www.postgresql.org/message-id/CAD21AoAqT17QwKJ_sWOqRxNvg66wMw1oZZzf9Rt-E-zD%2BXOh_Q%40mail.gmail.com > [2] https://www.postgresql.org/message-id/CAFiTN-tFLN%3Dvdu5Ra-23E9_7Z1JXkk5MkRY3Bkj2zAoWK7fULA%40mail.gmail.com > I have tested the same with some other workload (test file attached). I can see the same behaviour with this workload as well: with patch4 the distribution of the delay is better compared to the other patches, i.e. workers with more I/O have more delay and workers with equal I/O have almost equal delay. The only thing is that the total delay with patch4 is slightly less compared to the other patches.
patch1:
worker 0 delay=120.000000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=170.000000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=210.000000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=263.400000 total io=44322 hit=8352 miss=1199 dirty=1199

patch2:
worker 0 delay=190.645000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=160.090000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=170.775000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=243.180000 total io=44322 hit=8352 miss=1199 dirty=1199

patch3:
worker 0 delay=191.765000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=180.935000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=201.305000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=192.770000 total io=44322 hit=8352 miss=1199 dirty=1199

patch4:
worker 0 delay=175.290000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=174.135000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=175.560000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=212.100000 total io=44322 hit=8352 miss=1199 dirty=1199

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 11, 2019 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Yeah, I think it is difficult to get the exact balance, but we can try > > > to be as close as possible. We can try to play with the threshold and > > > another possibility is to try to sleep in proportion to the amount of > > > I/O done by the worker. > > I have done another experiment where I have done another 2 changes on > > top op patch3 > > a) Only reduce the local balance from the total shared balance > > whenever it's applying delay > > b) Compute the delay based on the local balance. > > > > patch4: > > worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2 > > worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2 > > worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2 > > worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603 > > > > I think with this approach the delay is divided among the worker quite > > well compared to other approaches > > > > > .. > I have tested the same with some other workload(test file attached). > I can see the same behaviour with this workload as well that with the > patch 4 the distribution of the delay is better compared to other > patches i.e. worker with more I/O have more delay and with equal IO > have alsomost equal delay. Only thing is that the total delay with > the patch 4 is slightly less compared to other pacthes. >

I see one problem with the formula you have used in the patch; maybe that is what causes the total delay to go down.

- if (new_balance >= VacuumCostLimit)
+ VacuumCostBalanceLocal += VacuumCostBalance;
+ if ((new_balance >= VacuumCostLimit) &&
+     (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker)))

As per the discussion, the second part of the condition should be "VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". I think you can change this and try again. Also, please try with different values of the threshold (0.3, 0.5, 0.7, etc.). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
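To spell out the difference with example numbers (nworker = 4 and VacuumCostLimit = 200, chosen for illustration): dividing by (0.5 * nworker) makes the local bar VacuumCostLimit / 2 = 100, while the intended expression makes it 0.5 * (200 / 4) = 25, so the condition as written triggers the delay far less often. Side by side:

    /* As written in the patch: local bar = VacuumCostLimit / (0.5 * nworker) = 100 */
    if ((new_balance >= VacuumCostLimit) &&
        (VacuumCostBalanceLocal > VacuumCostLimit / (0.5 * nworker)))

    /* As intended: local bar = 0.5 * (VacuumCostLimit / nworker) = 25 */
    if ((new_balance >= VacuumCostLimit) &&
        (VacuumCostBalanceLocal > 0.5 * (VacuumCostLimit / nworker)))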
On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Nov 11, 2019 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > Yeah, I think it is difficult to get the exact balance, but we can try > > > > to be as close as possible. We can try to play with the threshold and > > > > another possibility is to try to sleep in proportion to the amount of > > > > I/O done by the worker. > > > I have done another experiment where I have done another 2 changes on > > > top op patch3 > > > a) Only reduce the local balance from the total shared balance > > > whenever it's applying delay > > > b) Compute the delay based on the local balance. > > > > > > patch4: > > > worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603 > > > > > > I think with this approach the delay is divided among the worker quite > > > well compared to other approaches > > > > > > > > .. > > I have tested the same with some other workload(test file attached). > > I can see the same behaviour with this workload as well that with the > > patch 4 the distribution of the delay is better compared to other > > patches i.e. worker with more I/O have more delay and with equal IO > > have alsomost equal delay. Only thing is that the total delay with > > the patch 4 is slightly less compared to other pacthes. > > > > I see one problem with the formula you have used in the patch, maybe > that is causing the value of total delay to go down. > > - if (new_balance >= VacuumCostLimit) > + VacuumCostBalanceLocal += VacuumCostBalance; > + if ((new_balance >= VacuumCostLimit) && > + (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker))) > > As per discussion, the second part of the condition should be > "VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". My bad. I think > you can once change this and try again. Also, please try with the > different values of threshold (0.3, 0.5, 0.7, etc.). > Okay, I will retest with both patch3 and patch4 for both scenarios. I will also try different multipliers. > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 11, 2019 at 5:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > .. > > > I have tested the same with some other workload(test file attached). > > > I can see the same behaviour with this workload as well that with the > > > patch 4 the distribution of the delay is better compared to other > > > patches i.e. worker with more I/O have more delay and with equal IO > > > have alsomost equal delay. Only thing is that the total delay with > > > the patch 4 is slightly less compared to other pacthes. > > > > > > > I see one problem with the formula you have used in the patch, maybe > > that is causing the value of total delay to go down. > > > > - if (new_balance >= VacuumCostLimit) > > + VacuumCostBalanceLocal += VacuumCostBalance; > > + if ((new_balance >= VacuumCostLimit) && > > + (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker))) > > > > As per discussion, the second part of the condition should be > > "VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". > My Bad > I think > > you can once change this and try again. Also, please try with the > > different values of threshold (0.3, 0.5, 0.7, etc.). > > > Okay, I will retest with both patch3 and path4 for both the scenarios. > I will also try with different multipliers. > One more thing: I think we should also test these cases with a varying number of indexes (say 2, 6, 8, etc.), and then probably we should test with a varying number of workers, where the number of workers is less than the number of indexes. You can do these after finishing your previous experiments. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Nov 11, 2019 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > Yeah, I think it is difficult to get the exact balance, but we can try > > > > to be as close as possible. We can try to play with the threshold and > > > > another possibility is to try to sleep in proportion to the amount of > > > > I/O done by the worker. > > > I have done another experiment where I have done another 2 changes on > > > top op patch3 > > > a) Only reduce the local balance from the total shared balance > > > whenever it's applying delay > > > b) Compute the delay based on the local balance. > > > > > > patch4: > > > worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603 > > > > > > I think with this approach the delay is divided among the worker quite > > > well compared to other approaches > > > > > > > > .. > > I have tested the same with some other workload(test file attached). > > I can see the same behaviour with this workload as well that with the > > patch 4 the distribution of the delay is better compared to other > > patches i.e. worker with more I/O have more delay and with equal IO > > have alsomost equal delay. Only thing is that the total delay with > > the patch 4 is slightly less compared to other pacthes. > > > > I see one problem with the formula you have used in the patch, maybe > that is causing the value of total delay to go down. > > - if (new_balance >= VacuumCostLimit) > + VacuumCostBalanceLocal += VacuumCostBalance; > + if ((new_balance >= VacuumCostLimit) && > + (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker))) > > As per discussion, the second part of the condition should be > "VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". I think > you can once change this and try again. Also, please try with the > different values of threshold (0.3, 0.5, 0.7, etc.). > I have modified patch4 and ran it with different values, but I don't see much difference in the results. In fact, I removed the condition for the local balancing check completely and the delays are still the same. I think this is because with patch4 workers only reduce their own balance and also delay in proportion to their local balance, so the second condition may not have much impact.
Patch4 (test.sh)

threshold 0:
worker 0 delay=82.380000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.370000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=89.645000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=79.150000 total io=16378 hit=4318 miss=0 dirty=603

threshold 0.1:
worker 0 delay=89.295000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=89.230000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=89.675000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=81.840000 total io=16378 hit=4318 miss=0 dirty=603

threshold 0.3:
worker 0 delay=85.915000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=85.180000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=88.760000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=81.975000 total io=16378 hit=4318 miss=0 dirty=603

threshold 0.5:
worker 0 delay=81.635000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=87.490000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=89.425000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=82.050000 total io=16378 hit=4318 miss=0 dirty=603

threshold 0.7:
worker 0 delay=85.185000 total io=17931 hit=17891 miss=0 dirty=2
worker 1 delay=88.835000 total io=17931 hit=17891 miss=0 dirty=2
worker 2 delay=86.005000 total io=17931 hit=17891 miss=0 dirty=2
worker 3 delay=76.160000 total io=16378 hit=4318 miss=0 dirty=603

Patch4 (test1.sh)

threshold 0:
worker 0 delay=179.005000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=179.010000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=179.010000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=221.900000 total io=44322 hit=8352 miss=1199 dirty=1199

threshold 0.1:
worker 0 delay=177.840000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=179.465000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=179.255000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=222.695000 total io=44322 hit=8352 miss=1199 dirty=1199

threshold 0.3:
worker 0 delay=178.295000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=178.720000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=178.270000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=220.420000 total io=44322 hit=8352 miss=1199 dirty=1199

threshold 0.5:
worker 0 delay=178.415000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=178.385000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=173.805000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=221.605000 total io=44322 hit=8352 miss=1199 dirty=1199

threshold 0.7:
worker 0 delay=175.330000 total io=35828 hit=35788 miss=0 dirty=2
worker 1 delay=177.890000 total io=35828 hit=35788 miss=0 dirty=2
worker 2 delay=167.540000 total io=35828 hit=35788 miss=0 dirty=2
worker 3 delay=216.725000 total io=44322 hit=8352 miss=1199 dirty=1199

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 12, 2019 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Nov 11, 2019 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > Yeah, I think it is difficult to get the exact balance, but we can try > > > > > to be as close as possible. We can try to play with the threshold and > > > > > another possibility is to try to sleep in proportion to the amount of > > > > > I/O done by the worker. > > > > I have done another experiment where I have done another 2 changes on > > > > top op patch3 > > > > a) Only reduce the local balance from the total shared balance > > > > whenever it's applying delay > > > > b) Compute the delay based on the local balance. > > > > > > > > patch4: > > > > worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > > worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > > worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > > worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603 > > > > > > > > I think with this approach the delay is divided among the worker quite > > > > well compared to other approaches > > > > > > > > > > > .. > > > I have tested the same with some other workload(test file attached). > > > I can see the same behaviour with this workload as well that with the > > > patch 4 the distribution of the delay is better compared to other > > > patches i.e. worker with more I/O have more delay and with equal IO > > > have alsomost equal delay. Only thing is that the total delay with > > > the patch 4 is slightly less compared to other pacthes. > > > > > > > I see one problem with the formula you have used in the patch, maybe > > that is causing the value of total delay to go down. > > > > - if (new_balance >= VacuumCostLimit) > > + VacuumCostBalanceLocal += VacuumCostBalance; > > + if ((new_balance >= VacuumCostLimit) && > > + (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker))) > > > > As per discussion, the second part of the condition should be > > "VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". I think > > you can once change this and try again. Also, please try with the > > different values of threshold (0.3, 0.5, 0.7, etc.). > > > I have modified the patch4 and ran with different values. But, I > don't see much difference in the values with the patch4. Infact I > removed the condition for the local balancing check completely still > the delays are the same, I think this is because with patch4 worker > are only reducing their own balance and also delaying as much as their > local balance. So maybe the second condition will not have much > impact. 
> > Patch4 (test.sh) > 0 > worker 0 delay=82.380000 total io=17931 hit=17891 miss=0 dirty=2 > worker 1 delay=89.370000 total io=17931 hit=17891 miss=0 dirty=2 > worker 2 delay=89.645000 total io=17931 hit=17891 miss=0 dirty=2 > worker 3 delay=79.150000 total io=16378 hit=4318 miss=0 dirty=603 > > 0.1 > worker 0 delay=89.295000 total io=17931 hit=17891 miss=0 dirty=2 > worker 1 delay=89.230000 total io=17931 hit=17891 miss=0 dirty=2 > worker 2 delay=89.675000 total io=17931 hit=17891 miss=0 dirty=2 > worker 3 delay=81.840000 total io=16378 hit=4318 miss=0 dirty=603 > > 0.3 > worker 0 delay=85.915000 total io=17931 hit=17891 miss=0 dirty=2 > worker 1 delay=85.180000 total io=17931 hit=17891 miss=0 dirty=2 > worker 2 delay=88.760000 total io=17931 hit=17891 miss=0 dirty=2 > worker 3 delay=81.975000 total io=16378 hit=4318 miss=0 dirty=603 > > 0.5 > worker 0 delay=81.635000 total io=17931 hit=17891 miss=0 dirty=2 > worker 1 delay=87.490000 total io=17931 hit=17891 miss=0 dirty=2 > worker 2 delay=89.425000 total io=17931 hit=17891 miss=0 dirty=2 > worker 3 delay=82.050000 total io=16378 hit=4318 miss=0 dirty=603 > > 0.7 > worker 0 delay=85.185000 total io=17931 hit=17891 miss=0 dirty=2 > worker 1 delay=88.835000 total io=17931 hit=17891 miss=0 dirty=2 > worker 2 delay=86.005000 total io=17931 hit=17891 miss=0 dirty=2 > worker 3 delay=76.160000 total io=16378 hit=4318 miss=0 dirty=603 > > Patch4 (test1.sh) > 0 > worker 0 delay=179.005000 total io=35828 hit=35788 miss=0 dirty=2 > worker 1 delay=179.010000 total io=35828 hit=35788 miss=0 dirty=2 > worker 2 delay=179.010000 total io=35828 hit=35788 miss=0 dirty=2 > worker 3 delay=221.900000 total io=44322 hit=8352 miss=1199 dirty=1199 > > 0.1 > worker 0 delay=177.840000 total io=35828 hit=35788 miss=0 dirty=2 > worker 1 delay=179.465000 total io=35828 hit=35788 miss=0 dirty=2 > worker 2 delay=179.255000 total io=35828 hit=35788 miss=0 dirty=2 > worker 3 delay=222.695000 total io=44322 hit=8352 miss=1199 dirty=1199 > > 0.3 > worker 0 delay=178.295000 total io=35828 hit=35788 miss=0 dirty=2 > worker 1 delay=178.720000 total io=35828 hit=35788 miss=0 dirty=2 > worker 2 delay=178.270000 total io=35828 hit=35788 miss=0 dirty=2 > worker 3 delay=220.420000 total io=44322 hit=8352 miss=1199 dirty=1199 > > 0.5 > worker 0 delay=178.415000 total io=35828 hit=35788 miss=0 dirty=2 > worker 1 delay=178.385000 total io=35828 hit=35788 miss=0 dirty=2 > worker 2 delay=173.805000 total io=35828 hit=35788 miss=0 dirty=2 > worker 3 delay=221.605000 total io=44322 hit=8352 miss=1199 dirty=1199 > > 0.7 > worker 0 delay=175.330000 total io=35828 hit=35788 miss=0 dirty=2 > worker 1 delay=177.890000 total io=35828 hit=35788 miss=0 dirty=2 > worker 2 delay=167.540000 total io=35828 hit=35788 miss=0 dirty=2 > worker 3 delay=216.725000 total io=44322 hit=8352 miss=1199 dirty=1199 > I have revised patch4 so that it doesn't depend upon a fixed number of workers; instead, I have dynamically updated the worker count. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
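One plausible shape for that revision, purely as a sketch (the struct and field names are hypothetical, not taken from the posted patch), is to keep the live worker count next to the shared balance and read it whenever the per-worker share is computed:

    /* Hypothetical shared state for the parallel vacuum cost machinery. */
    typedef struct ParallelVacuumShared
    {
        pg_atomic_uint32 cost_balance;      /* shared vacuum cost balance */
        pg_atomic_uint32 active_nworkers;   /* workers currently doing work */
    } ParallelVacuumShared;

    /* Worker start: announce ourselves before doing any costed I/O. */
    pg_atomic_add_fetch_u32(&shared->active_nworkers, 1);

    /* Delay check: divide by the live count, not a launch-time constant. */
    nworkers = pg_atomic_read_u32(&shared->active_nworkers);

    /* Worker exit: drop out, so the remaining workers' shares grow. */
    pg_atomic_sub_fetch_u32(&shared->active_nworkers, 1);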
On Tue, Nov 12, 2019 at 3:03 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Nov 12, 2019 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Nov 11, 2019 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > > > Yeah, I think it is difficult to get the exact balance, but we can try > > > > > > to be as close as possible. We can try to play with the threshold and > > > > > > another possibility is to try to sleep in proportion to the amount of > > > > > > I/O done by the worker. > > > > > I have done another experiment where I have done another 2 changes on > > > > > top op patch3 > > > > > a) Only reduce the local balance from the total shared balance > > > > > whenever it's applying delay > > > > > b) Compute the delay based on the local balance. > > > > > > > > > > patch4: > > > > > worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > > > worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > > > worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > > > worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603 > > > > > > > > > > I think with this approach the delay is divided among the worker quite > > > > > well compared to other approaches > > > > > > > > > > > > > > .. > > > > I have tested the same with some other workload(test file attached). > > > > I can see the same behaviour with this workload as well that with the > > > > patch 4 the distribution of the delay is better compared to other > > > > patches i.e. worker with more I/O have more delay and with equal IO > > > > have alsomost equal delay. Only thing is that the total delay with > > > > the patch 4 is slightly less compared to other pacthes. > > > > > > > > > > I see one problem with the formula you have used in the patch, maybe > > > that is causing the value of total delay to go down. > > > > > > - if (new_balance >= VacuumCostLimit) > > > + VacuumCostBalanceLocal += VacuumCostBalance; > > > + if ((new_balance >= VacuumCostLimit) && > > > + (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker))) > > > > > > As per discussion, the second part of the condition should be > > > "VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". I think > > > you can once change this and try again. Also, please try with the > > > different values of threshold (0.3, 0.5, 0.7, etc.). > > > > > I have modified the patch4 and ran with different values. But, I > > don't see much difference in the values with the patch4. Infact I > > removed the condition for the local balancing check completely still > > the delays are the same, I think this is because with patch4 worker > > are only reducing their own balance and also delaying as much as their > > local balance. So maybe the second condition will not have much > > impact. > > Yeah, but I suspect the condition (when the local balance exceeds a certain threshold, then only try to perform delay) you mentioned can have an impact in some other scenarios. So, it is better to retain the same. I feel the overall results look sane and the approach seems reasonable to me. 
> > > I have revised the patch4 so that it doesn't depent upon the fix > number of workers, instead I have dynamically updated the worker > count. > Thanks. Sawada-San, by any chance, can you try some of the tests done by Dilip or some similar tests just to rule out any sort of machine-specific dependency? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, 12 Nov 2019 at 19:08, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Nov 12, 2019 at 3:03 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Nov 12, 2019 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Mon, Nov 11, 2019 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > Yeah, I think it is difficult to get the exact balance, but we can try > > > > > > > to be as close as possible. We can try to play with the threshold and > > > > > > > another possibility is to try to sleep in proportion to the amount of > > > > > > > I/O done by the worker. > > > > > > I have done another experiment where I have done another 2 changes on > > > > > > top op patch3 > > > > > > a) Only reduce the local balance from the total shared balance > > > > > > whenever it's applying delay > > > > > > b) Compute the delay based on the local balance. > > > > > > > > > > > > patch4: > > > > > > worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > > > > worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > > > > worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > > > > worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603 > > > > > > > > > > > > I think with this approach the delay is divided among the worker quite > > > > > > well compared to other approaches > > > > > > > > > > > > > > > > > .. > > > > > I have tested the same with some other workload(test file attached). > > > > > I can see the same behaviour with this workload as well that with the > > > > > patch 4 the distribution of the delay is better compared to other > > > > > patches i.e. worker with more I/O have more delay and with equal IO > > > > > have alsomost equal delay. Only thing is that the total delay with > > > > > the patch 4 is slightly less compared to other pacthes. > > > > > > > > > > > > > I see one problem with the formula you have used in the patch, maybe > > > > that is causing the value of total delay to go down. > > > > > > > > - if (new_balance >= VacuumCostLimit) > > > > + VacuumCostBalanceLocal += VacuumCostBalance; > > > > + if ((new_balance >= VacuumCostLimit) && > > > > + (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker))) > > > > > > > > As per discussion, the second part of the condition should be > > > > "VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". I think > > > > you can once change this and try again. Also, please try with the > > > > different values of threshold (0.3, 0.5, 0.7, etc.). > > > > > > > I have modified the patch4 and ran with different values. But, I > > > don't see much difference in the values with the patch4. Infact I > > > removed the condition for the local balancing check completely still > > > the delays are the same, I think this is because with patch4 worker > > > are only reducing their own balance and also delaying as much as their > > > local balance. So maybe the second condition will not have much > > > impact. > > > > > Yeah, but I suspect the condition (when the local balance exceeds a > certain threshold, then only try to perform delay) you mentioned can > have an impact in some other scenarios. 
So, it is better to retain > the same. I feel the overall results look sane and the approach seems > reasonable to me. > > > > > > I have revised the patch4 so that it doesn't depend upon a fixed > > number of workers; instead I have dynamically updated the worker > > count. > > > > Thanks. Sawada-San, by any chance, can you try some of the tests done > by Dilip or some similar tests just to rule out any sort of > machine-specific dependency? Sure. I'll try it tomorrow. -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
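To make the threshold discussion above concrete, here is a minimal compilable sketch of the corrected eligibility check (an illustration only, not the actual patch; the names follow the diff quoted above). It also shows why the posted form was wrong: VacuumCostLimit/(0.5 * nworker) equals 2 * VacuumCostLimit/nworker, i.e. four times the intended per-worker threshold of 0.5 * VacuumCostLimit/nworker, so workers became eligible to sleep much later than intended.

#include <stdbool.h>

/*
 * Sketch only: decide whether this worker may sleep.  The worker must
 * itself have accumulated a threshold share of the cost (so it is not
 * throttled for I/O done by other workers), and the shared balance
 * must have crossed the limit.
 */
static bool
worker_may_sleep(double new_balance, double VacuumCostBalanceLocal,
                 double VacuumCostLimit, int nworker)
{
    return (new_balance >= VacuumCostLimit) &&
           (VacuumCostBalanceLocal > 0.5 * VacuumCostLimit / nworker);
}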
On Tue, 12 Nov 2019 at 20:22, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Tue, 12 Nov 2019 at 19:08, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Nov 12, 2019 at 3:03 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Tue, Nov 12, 2019 at 10:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Mon, Nov 11, 2019 at 12:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > On Mon, Nov 11, 2019 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > > > On Fri, Nov 8, 2019 at 11:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > Yeah, I think it is difficult to get the exact balance, but we can try > > > > > > > > to be as close as possible. We can try to play with the threshold and > > > > > > > > another possibility is to try to sleep in proportion to the amount of > > > > > > > > I/O done by the worker. > > > > > > > I have done another experiment where I have made another 2 changes on > > > > > > > top of patch3 > > > > > > > a) Only reduce the local balance from the total shared balance > > > > > > > whenever it's applying delay > > > > > > > b) Compute the delay based on the local balance. > > > > > > > > > > > > > > patch4: > > > > > > > worker 0 delay=84.130000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > > > > > worker 1 delay=89.230000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > > > > > worker 2 delay=88.680000 total I/O=17931 hit=17891 miss=0 dirty=2 > > > > > > > worker 3 delay=80.790000 total I/O=16378 hit=4318 miss=0 dirty=603 > > > > > > > > > > > > > > I think with this approach the delay is divided among the workers quite > > > > > > > well compared to other approaches > > > > > > > > > > > > > > > > > > > > .. > > > > > > I have tested the same with some other workload (test file attached). > > > > > > I can see the same behaviour with this workload as well: with > > > > > > patch 4 the distribution of the delay is better compared to other > > > > > > patches, i.e. workers with more I/O have more delay and those with > > > > > > equal I/O have almost equal delay. The only thing is that the total > > > > > > delay with patch 4 is slightly less compared to other patches. > > > > > > > > > > > > > > > > I see one problem with the formula you have used in the patch, maybe > > > > > that is causing the value of total delay to go down. > > > > > > > > > > - if (new_balance >= VacuumCostLimit) > > > > > + VacuumCostBalanceLocal += VacuumCostBalance; > > > > > + if ((new_balance >= VacuumCostLimit) && > > > > > + (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker))) > > > > > > > > > > As per discussion, the second part of the condition should be > > > > > "VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". I think > > > > > you can once change this and try again. Also, please try with the > > > > > different values of threshold (0.3, 0.5, 0.7, etc.). > > > > > > > > > I have modified the patch4 and ran it with different values. But I > > > > don't see much difference in the values with the patch4. In fact, I > > > > removed the condition for the local balancing check completely; still > > > > the delays are the same. I think this is because with patch4 workers > > > > are only reducing their own balance and also delaying as much as their > > > > local balance. So maybe the second condition will not have much > > > > impact.
> > > > > > > > Yeah, but I suspect the condition (when the local balance exceeds a > > certain threshold, then only try to perform delay) you mentioned can > > have an impact in some other scenarios. So, it is better to retain > > the same. I feel the overall results look sane and the approach seems > > reasonable to me. > > > > > > > > > I have revised the patch4 so that it doesn't depend upon a fixed > > > number of workers; instead I have dynamically updated the worker > > > count. > > > > > > > Thanks. Sawada-San, by any chance, can you try some of the tests done > > by Dilip or some similar tests just to rule out any sort of > > machine-specific dependency? > > Sure. I'll try it tomorrow. I've done some tests while changing shared buffer size, delays and number of workers. The overall results have a similar tendency to the results shared by Dilip and look reasonable to me. * test.sh shared_buffers = '4GB'; max_parallel_maintenance_workers = 6; vacuum_cost_delay = 1; worker 0 delay=89.315000 total io=17931 hit=17891 miss=0 dirty=2 worker 1 delay=88.860000 total io=17931 hit=17891 miss=0 dirty=2 worker 2 delay=89.290000 total io=17931 hit=17891 miss=0 dirty=2 worker 3 delay=81.805000 total io=16378 hit=4318 miss=0 dirty=603 shared_buffers = '1GB'; max_parallel_maintenance_workers = 6; vacuum_cost_delay = 1; worker 0 delay=89.210000 total io=17931 hit=17891 miss=0 dirty=2 worker 1 delay=89.325000 total io=17931 hit=17891 miss=0 dirty=2 worker 2 delay=88.870000 total io=17931 hit=17891 miss=0 dirty=2 worker 3 delay=81.735000 total io=16378 hit=4318 miss=0 dirty=603 shared_buffers = '512MB'; max_parallel_maintenance_workers = 6; vacuum_cost_delay = 1; worker 0 delay=88.480000 total io=17931 hit=17891 miss=0 dirty=2 worker 1 delay=88.635000 total io=17931 hit=17891 miss=0 dirty=2 worker 2 delay=88.600000 total io=17931 hit=17891 miss=0 dirty=2 worker 3 delay=81.660000 total io=16378 hit=4318 miss=0 dirty=603 shared_buffers = '512MB'; max_parallel_maintenance_workers = 6; vacuum_cost_delay = 5; worker 0 delay=447.725000 total io=17931 hit=17891 miss=0 dirty=2 worker 1 delay=445.850000 total io=17931 hit=17891 miss=0 dirty=2 worker 2 delay=445.125000 total io=17931 hit=17891 miss=0 dirty=2 worker 3 delay=409.025000 total io=16378 hit=4318 miss=0 dirty=603 shared_buffers = '512MB'; max_parallel_maintenance_workers = 2; vacuum_cost_delay = 5; worker 0 delay=854.750000 total io=34309 hit=22209 miss=0 dirty=605 worker 1 delay=446.500000 total io=17931 hit=17891 miss=0 dirty=2 worker 2 delay=444.175000 total io=17931 hit=17891 miss=0 dirty=2 --- * test1.sh shared_buffers = '4GB'; max_parallel_maintenance_workers = 6; vacuum_cost_delay = 1; worker 0 delay=178.205000 total io=35828 hit=35788 miss=0 dirty=2 worker 1 delay=178.550000 total io=35828 hit=35788 miss=0 dirty=2 worker 2 delay=178.660000 total io=35828 hit=35788 miss=0 dirty=2 worker 3 delay=221.280000 total io=44322 hit=8352 miss=1199 dirty=1199 shared_buffers = '1GB'; max_parallel_maintenance_workers = 6; vacuum_cost_delay = 1; worker 0 delay=178.035000 total io=35828 hit=35788 miss=0 dirty=2 worker 1 delay=178.535000 total io=35828 hit=35788 miss=0 dirty=2 worker 2 delay=178.585000 total io=35828 hit=35788 miss=0 dirty=2 worker 3 delay=221.465000 total io=44322 hit=8352 miss=1199 dirty=1199 shared_buffers = '512MB'; max_parallel_maintenance_workers = 6; vacuum_cost_delay = 1; worker 0 delay=1795.900000 total io=357911 hit=1 miss=35787 dirty=2 worker 1 delay=1790.700000 total io=357911 hit=1 miss=35787 dirty=2 worker 2 delay=179.000000
total io=35828 hit=35788 miss=0 dirty=2 worker 3 delay=221.355000 total io=44322 hit=8352 miss=1199 dirty=1199 shared_buffers = '512MB'; max_parallel_maintenance_workers = 6; vacuum_cost_delay = 5; worker 0 delay=8958.500000 total io=357911 hit=1 miss=35787 dirty=2 worker 1 delay=8950.000000 total io=357911 hit=1 miss=35787 dirty=2 worker 2 delay=894.150000 total io=35828 hit=35788 miss=0 dirty=2 worker 3 delay=1106.400000 total io=44322 hit=8352 miss=1199 dirty=1199 shared_buffers = '512MB'; max_parallel_maintenance_workers = 2; vacuum_cost_delay = 5; worker 0 delay=8956.500000 total io=357911 hit=1 miss=35787 dirty=2 worker 1 delay=8955.050000 total io=357893 hit=3 miss=35785 dirty=2 worker 2 delay=2002.825000 total io=80150 hit=44140 miss=1199 dirty=1201 Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
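As a quick cross-check of these numbers: for cost-based vacuum, the expected sleep time is roughly vacuum_cost_delay * total_cost / vacuum_cost_limit. Assuming the default vacuum_cost_limit of 200 (the settings above only change vacuum_cost_delay, so the default limit is an assumption), the 512MB test1.sh rows match this closely; a small sketch:

#include <stdio.h>

/* Assumes vacuum_cost_limit = 200 (the default); illustration only. */
int
main(void)
{
    double cost_delay = 1.0;    /* vacuum_cost_delay in ms */
    double cost_limit = 200.0;  /* assumed default vacuum_cost_limit */

    /* worker 0 above: total io=357911 -> ~1789.6 ms vs. 1795.9 observed */
    printf("%.1f ms\n", cost_delay * 357911 / cost_limit);
    /* worker 2 above: total io=35828 -> ~179.1 ms vs. 179.0 observed */
    printf("%.1f ms\n", cost_delay * 35828 / cost_limit);
    return 0;
}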
On Mon, 11 Nov 2019 at 17:56, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 11, 2019 at 5:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > ..
> > > > I have tested the same with some other workload (test file attached).
> > > > I can see the same behaviour with this workload as well: with
> > > > patch 4 the distribution of the delay is better compared to other
> > > > patches, i.e. workers with more I/O have more delay and those with
> > > > equal I/O have almost equal delay. The only thing is that the total
> > > > delay with patch 4 is slightly less compared to other patches.
> > > >
> > >
> > > I see one problem with the formula you have used in the patch, maybe
> > > that is causing the value of total delay to go down.
> > >
> > > - if (new_balance >= VacuumCostLimit)
> > > + VacuumCostBalanceLocal += VacuumCostBalance;
> > > + if ((new_balance >= VacuumCostLimit) &&
> > > + (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker)))
> > >
> > > As per discussion, the second part of the condition should be
> > > "VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker".
> > My Bad
> > I think
> > > you can once change this and try again. Also, please try with the
> > > different values of threshold (0.3, 0.5, 0.7, etc.).
> > >
> > Okay, I will retest with both patch3 and patch4 for both scenarios.
> > I will also try with different multipliers.
> >
>
> One more thing, I think we should also test these cases with a varying
> number of indexes (say 2, 6, 8, etc.), and then probably we should test
> with a varying number of workers where the number of workers is less
> than the number of indexes. You can do these after finishing your previous
> experiments.
On top of the parallel vacuum patch, I applied Dilip's patch (0001-vacuum_costing_test.patch). I have tested by varying the number of indexes and the number of workers. I compared shared costing (0001-vacuum_costing_test.patch) vs the latest shared costing patch (shared_costing_plus_patch4_v1.patch).
With the shared costing base patch, I can see that the delay is not in sync with the I/O, which is resolved by applying shared_costing_plus_patch4_v1.patch. I have also observed that the total delay is slightly reduced with the shared_costing_plus_patch4_v1.patch patch.
Below is the full testing summary:
Test setup:
step1) Apply the parallel vacuum patch
step2) Apply the 0001-vacuum_costing_test.patch patch (on top of this patch, the delay is not in sync with the I/O)
step3) Apply shared_costing_plus_patch4_v1.patch (the delay is in sync with the I/O)
Configuration settings:
autovacuum = off
max_parallel_workers = 30
shared_buffers = 2GB
max_parallel_maintenance_workers = 20
vacuum_cost_limit = 2000
vacuum_cost_delay = 10
Test 1: Vary indexes (2, 4, 6, 8) with parallel workers fixed at 4:
Case 1) When indexes are 2:
Without shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=120.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=60.000000 total io=17931 hit=17891 miss=0 dirty=2
With shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=87.780000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=87.995000 total io=17931 hit=17891 miss=0 dirty=2
Case 2) When indexes are 4:
Without shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=120.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=80.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=60.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=100.000000 total io=17931 hit=17891 miss=0 dirty=2
With shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=87.430000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=87.175000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=86.340000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=88.020000 total io=17931 hit=17891 miss=0 dirty=2
Case 3) When indexes are 6:
Without shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=110.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 1 delay=100.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=160.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 3 delay=90.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 4 delay=80.000000 total io=17931 hit=17891 miss=0 dirty=2
With shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=173.195000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 1 delay=88.715000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=87.710000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=86.460000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 4 delay=89.435000 total io=17931 hit=17891 miss=0 dirty=2
Case 4) When indexes are 8:
Without shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=170.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 1 delay=120.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 2 delay=130.000000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=190.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 4 delay=110.000000 total io=35862 hit=35782 miss=0 dirty=4
With shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=174.700000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 1 delay=177.880000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 2 delay=89.460000 total io=17931 hit=17891 miss=0 dirty=2
WARNING: worker 3 delay=177.320000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 4 delay=86.810000 total io=17931 hit=17891 miss=0 dirty=2
Test 2: Indexes are 16 with 2, 4, and 8 parallel workers:
Case 1) When 2 parallel workers:
Without shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=1513.230000 total io=307197 hit=85167 miss=22179 dirty=12
WARNING: worker 1 delay=1543.385000 total io=326553 hit=63133 miss=26322 dirty=10
WARNING: worker 2 delay=1633.625000 total io=302199 hit=65839 miss=23616 dirty=10
With shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=1539.475000 total io=308175 hit=65175 miss=24280 dirty=10
WARNING: worker 1 delay=1251.200000 total io=250692 hit=71562 miss=17893 dirty=10
WARNING: worker 2 delay=1143.690000 total io=228987 hit=93857 miss=13489 dirty=12
Case 2) When 4 parallel workers:
Without shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=1182.430000 total io=213567 hit=16037 miss=19745 dirty=4
WARNING: worker 1 delay=1202.710000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 2 delay=210.000000 total io=89655 hit=89455 miss=0 dirty=10
WARNING: worker 3 delay=270.000000 total io=71724 hit=71564 miss=0 dirty=8
WARNING: worker 4 delay=851.825000 total io=188229 hit=58619 miss=12945 dirty=8
With shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=1136.875000 total io=227679 hit=14469 miss=21313 dirty=4
WARNING: worker 1 delay=973.745000 total io=196881 hit=17891 miss=17891 dirty=4
WARNING: worker 2 delay=447.410000 total io=89655 hit=89455 miss=0 dirty=10
WARNING: worker 3 delay=833.235000 total io=168228 hit=40958 miss=12715 dirty=6
WARNING: worker 4 delay=683.200000 total io=136488 hit=64368 miss=7196 dirty=8
Case 3) When 8 parallel workers:
Without shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=1022.300000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 1 delay=1072.770000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 2 delay=170.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 3 delay=170.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 4 delay=140.035000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 5 delay=200.000000 total io=53802 hit=53672 miss=1 dirty=6
WARNING: worker 6 delay=130.000000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 7 delay=150.000000 total io=53793 hit=53673 miss=0 dirty=6
With shared_costing_plus_patch4_v1.patch:
WARNING: worker 0 delay=872.800000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 1 delay=885.950000 total io=178941 hit=1 miss=17890 dirty=2
WARNING: worker 2 delay=175.680000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 3 delay=259.560000 total io=53793 hit=53673 miss=0 dirty=6
WARNING: worker 4 delay=169.945000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 5 delay=613.845000 total io=125100 hit=45750 miss=7923 dirty=6
WARNING: worker 6 delay=171.895000 total io=35862 hit=35782 miss=0 dirty=4
WARNING: worker 7 delay=176.505000 total io=35862 hit=35782 miss=0 dirty=4
On Wed, Nov 13, 2019 at 10:02 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > I've done some tests while changing shared buffer size, delays and > number of workers. The overall results have a similar tendency to the > results shared by Dilip and look reasonable to me. > Thanks, Sawada-san, for repeating the tests. I can see from yours, Dilip's, and Mahendra's testing that the delay is distributed depending upon the I/O done by a particular worker and the total I/O is also as expected in various kinds of scenarios. So, I think this is a better approach. Do you agree, or do you think we should still investigate the other approach further as well? I would like to summarize this approach. The basic idea for parallel vacuum is to allow the parallel workers and master backend to have a shared view of vacuum cost related parameters (mainly VacuumCostBalance) and allow each worker to update it and then based on that decide whether it needs to sleep. With this basic idea, we found that in some cases the throttling is not accurate, as explained with an example in my email above [1] and then in the tests performed by Dilip and others in the following emails (in short, the workers doing more I/O can be throttled less). Then, as discussed in a later email [2], we tried a way to avoid letting workers sleep which have done less or no I/O compared to other workers. This ensured that workers who are doing more I/O got throttled more. The idea is to allow any worker to sleep only if it has performed I/O above a certain threshold and the overall balance is more than the cost_limit set by the system. Then we allow the worker to sleep in proportion to the work done by it and reduce the VacuumSharedCostBalance by the amount consumed by the current worker. This scheme leads to the desired throttling of different workers based on the work done by the individual worker. We have tested this idea with various kinds of workloads, e.g. by varying shared buffer size, delays and number of workers. We have also tried with different numbers of indexes and workers. In all the tests, we found that the workers are throttled in proportion to the I/O done by each particular worker. [1] - https://www.postgresql.org/message-id/CAA4eK1JvxBTWTPqHGx1X7in7j42ZYwuKOZUySzH3YMwTNRE-2Q%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAA4eK1K9kCqLKbVA9KUuuarjj%2BsNYqrmf6UAFok5VTgZ8evWoA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
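To tie the summary above together, here is a minimal sketch of what the delay point looks like under this scheme. This is an illustration, not the committed patch: pg_usleep is PostgreSQL's existing sleep primitive, the plain static variables stand in for state that the real patch keeps in shared memory and updates atomically, and nworkers stands for the current (dynamically updated) number of running workers.

/* Illustrative sketch of the scheme summarized above; not the patch. */
static double VacuumSharedCostBalance;  /* in shared memory in the patch */
static double VacuumCostBalanceLocal;   /* this worker's unslept share */
static double VacuumCostLimit = 2000;
static double VacuumCostDelay = 10;     /* milliseconds */
static int    nworkers = 4;             /* updated dynamically in patch4 */

static void
parallel_vacuum_delay_point(double cost)
{
    VacuumSharedCostBalance += cost;    /* an atomic add in the patch */
    VacuumCostBalanceLocal += cost;

    /*
     * Sleep only when the system as a whole has crossed the limit AND
     * this worker has itself done at least a threshold share of the
     * I/O, so it is not throttled for work done by other workers.
     */
    if (VacuumSharedCostBalance >= VacuumCostLimit &&
        VacuumCostBalanceLocal > 0.5 * VacuumCostLimit / nworkers)
    {
        /* Sleep in proportion to this worker's own work ... */
        double msec = VacuumCostDelay * VacuumCostBalanceLocal / VacuumCostLimit;

        pg_usleep((long) (msec * 1000));

        /* ... and give back only what this worker consumed. */
        VacuumSharedCostBalance -= VacuumCostBalanceLocal;
        VacuumCostBalanceLocal = 0;
    }
}

Because a worker that has done little I/O rarely passes the local-balance test, it is almost never throttled for others' work, which is the behaviour the test results above demonstrate.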
On Thu, Nov 14, 2019 at 5:02 PM Mahendra Singh <mahi6run@gmail.com> wrote: > > On Mon, 11 Nov 2019 at 17:56, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Nov 11, 2019 at 5:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, Nov 11, 2019 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > .. > > > > > I have tested the same with some other workload (test file attached). > > > > > I can see the same behaviour with this workload as well: with > > > > > patch 4 the distribution of the delay is better compared to other > > > > > patches, i.e. workers with more I/O have more delay and those with > > > > > equal I/O have almost equal delay. The only thing is that the total > > > > > delay with patch 4 is slightly less compared to other patches. > > > > > > > > > > > > > I see one problem with the formula you have used in the patch, maybe > > > > that is causing the value of total delay to go down. > > > > > > > > - if (new_balance >= VacuumCostLimit) > > > > + VacuumCostBalanceLocal += VacuumCostBalance; > > > > + if ((new_balance >= VacuumCostLimit) && > > > > + (VacuumCostBalanceLocal > VacuumCostLimit/(0.5 * nworker))) > > > > > > > > As per discussion, the second part of the condition should be > > > > "VacuumCostBalanceLocal > (0.5) * VacuumCostLimit/nworker". > > > My Bad > > > I think > > > > you can once change this and try again. Also, please try with the > > > > different values of threshold (0.3, 0.5, 0.7, etc.). > > > > > > > Okay, I will retest with both patch3 and patch4 for both scenarios. > > > I will also try with different multipliers. > > > > > > > One more thing, I think we should also test these cases with a varying > > number of indexes (say 2, 6, 8, etc.), and then probably we should test > > with a varying number of workers where the number of workers is less > > than the number of indexes. You can do these after finishing your previous > > experiments. > > On top of the parallel vacuum patch, I applied Dilip's patch (0001-vacuum_costing_test.patch). I have tested by varying the number of indexes and the number of workers. I compared shared costing (0001-vacuum_costing_test.patch) vs the latest shared costing patch (shared_costing_plus_patch4_v1.patch). > With the shared costing base patch, I can see that the delay is not in sync with the I/O, which is resolved by applying shared_costing_plus_patch4_v1.patch. I have also observed that the total delay is slightly reduced with the shared_costing_plus_patch4_v1.patch patch.
> > Below is the full testing summary: > Test setup: > step1) Apply the parallel vacuum patch > step2) Apply the 0001-vacuum_costing_test.patch patch (on top of this patch, the delay is not in sync with the I/O) > step3) Apply shared_costing_plus_patch4_v1.patch (the delay is in sync with the I/O) > > Configuration settings: > autovacuum = off > max_parallel_workers = 30 > shared_buffers = 2GB > max_parallel_maintenance_workers = 20 > vacuum_cost_limit = 2000 > vacuum_cost_delay = 10 > > Test 1: Vary indexes (2, 4, 6, 8) with parallel workers fixed at 4: > > Case 1) When indexes are 2: > Without shared_costing_plus_patch4_v1.patch: > WARNING: worker 0 delay=120.000000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 1 delay=60.000000 total io=17931 hit=17891 miss=0 dirty=2 > > With shared_costing_plus_patch4_v1.patch: > WARNING: worker 0 delay=87.780000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 1 delay=87.995000 total io=17931 hit=17891 miss=0 dirty=2 > > Case 2) When indexes are 4: > Without shared_costing_plus_patch4_v1.patch: > WARNING: worker 0 delay=120.000000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 1 delay=80.000000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 2 delay=60.000000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 3 delay=100.000000 total io=17931 hit=17891 miss=0 dirty=2 > > With shared_costing_plus_patch4_v1.patch: > WARNING: worker 0 delay=87.430000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 1 delay=87.175000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 2 delay=86.340000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 3 delay=88.020000 total io=17931 hit=17891 miss=0 dirty=2 > > Case 3) When indexes are 6: > Without shared_costing_plus_patch4_v1.patch: > WARNING: worker 0 delay=110.000000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 1 delay=100.000000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 2 delay=160.000000 total io=35862 hit=35782 miss=0 dirty=4 > WARNING: worker 3 delay=90.000000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 4 delay=80.000000 total io=17931 hit=17891 miss=0 dirty=2 > > With shared_costing_plus_patch4_v1.patch: > WARNING: worker 0 delay=173.195000 total io=35862 hit=35782 miss=0 dirty=4 > WARNING: worker 1 delay=88.715000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 2 delay=87.710000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 3 delay=86.460000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 4 delay=89.435000 total io=17931 hit=17891 miss=0 dirty=2 > > Case 4) When indexes are 8: > Without shared_costing_plus_patch4_v1.patch: > WARNING: worker 0 delay=170.000000 total io=35862 hit=35782 miss=0 dirty=4 > WARNING: worker 1 delay=120.000000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 2 delay=130.000000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 3 delay=190.000000 total io=35862 hit=35782 miss=0 dirty=4 > WARNING: worker 4 delay=110.000000 total io=35862 hit=35782 miss=0 dirty=4 > > With shared_costing_plus_patch4_v1.patch: > WARNING: worker 0 delay=174.700000 total io=35862 hit=35782 miss=0 dirty=4 > WARNING: worker 1 delay=177.880000 total io=35862 hit=35782 miss=0 dirty=4 > WARNING: worker 2 delay=89.460000 total io=17931 hit=17891 miss=0 dirty=2 > WARNING: worker 3 delay=177.320000 total io=35862 hit=35782 miss=0 dirty=4 > WARNING: worker 4 delay=86.810000 total io=17931 hit=17891 miss=0 dirty=2 > > Test 2: Indexes are 16 with
2, 4, and 8 parallel workers: > > Case 1) When 2 parallel workers: > Without shared_costing_plus_patch4_v1.patch: > WARNING: worker 0 delay=1513.230000 total io=307197 hit=85167 miss=22179 dirty=12 > WARNING: worker 1 delay=1543.385000 total io=326553 hit=63133 miss=26322 dirty=10 > WARNING: worker 2 delay=1633.625000 total io=302199 hit=65839 miss=23616 dirty=10 > > With shared_costing_plus_patch4_v1.patch: > WARNING: worker 0 delay=1539.475000 total io=308175 hit=65175 miss=24280 dirty=10 > WARNING: worker 1 delay=1251.200000 total io=250692 hit=71562 miss=17893 dirty=10 > WARNING: worker 2 delay=1143.690000 total io=228987 hit=93857 miss=13489 dirty=12 > > Case 2) When 4 parallel workers: > Without shared_costing_plus_patch4_v1.patch: > WARNING: worker 0 delay=1182.430000 total io=213567 hit=16037 miss=19745 dirty=4 > WARNING: worker 1 delay=1202.710000 total io=178941 hit=1 miss=17890 dirty=2 > WARNING: worker 2 delay=210.000000 total io=89655 hit=89455 miss=0 dirty=10 > WARNING: worker 3 delay=270.000000 total io=71724 hit=71564 miss=0 dirty=8 > WARNING: worker 4 delay=851.825000 total io=188229 hit=58619 miss=12945 dirty=8 > > With shared_costing_plus_patch4_v1.patch: > WARNING: worker 0 delay=1136.875000 total io=227679 hit=14469 miss=21313 dirty=4 > WARNING: worker 1 delay=973.745000 total io=196881 hit=17891 miss=17891 dirty=4 > WARNING: worker 2 delay=447.410000 total io=89655 hit=89455 miss=0 dirty=10 > WARNING: worker 3 delay=833.235000 total io=168228 hit=40958 miss=12715 dirty=6 > WARNING: worker 4 delay=683.200000 total io=136488 hit=64368 miss=7196 dirty=8 > > Case 3) When 8 parallel workers: > Without shared_costing_plus_patch4_v1.patch: > WARNING: worker 0 delay=1022.300000 total io=178941 hit=1 miss=17890 dirty=2 > WARNING: worker 1 delay=1072.770000 total io=178941 hit=1 miss=17890 dirty=2 > WARNING: worker 2 delay=170.000000 total io=35862 hit=35782 miss=0 dirty=4 > WARNING: worker 3 delay=170.000000 total io=35862 hit=35782 miss=0 dirty=4 > WARNING: worker 4 delay=140.035000 total io=35862 hit=35782 miss=0 dirty=4 > WARNING: worker 5 delay=200.000000 total io=53802 hit=53672 miss=1 dirty=6 > WARNING: worker 6 delay=130.000000 total io=35862 hit=35782 miss=0 dirty=4 > WARNING: worker 7 delay=150.000000 total io=53793 hit=53673 miss=0 dirty=6 > > With shared_costing_plus_patch4_v1.patch: > WARNING: worker 0 delay=872.800000 total io=178941 hit=1 miss=17890 dirty=2 > WARNING: worker 1 delay=885.950000 total io=178941 hit=1 miss=17890 dirty=2 > WARNING: worker 2 delay=175.680000 total io=35862 hit=35782 miss=0 dirty=4 > WARNING: worker 3 delay=259.560000 total io=53793 hit=53673 miss=0 dirty=6 > WARNING: worker 4 delay=169.945000 total io=35862 hit=35782 miss=0 dirty=4 > WARNING: worker 5 delay=613.845000 total io=125100 hit=45750 miss=7923 dirty=6 > WARNING: worker 6 delay=171.895000 total io=35862 hit=35782 miss=0 dirty=4 > WARNING: worker 7 delay=176.505000 total io=35862 hit=35782 miss=0 dirty=4 It seems that the bigger delay difference (8% - 9%), which is observed with the higher number of indexes, is due to the I/O difference; for example, in case 3 the total page misses without the patch are 35780 whereas with the patch they are 43703. So it seems that with more indexes your data is not fitting in the shared buffers, so the page hits/misses vary from run to run, and that causes variance in the total delay.
Another problem, where the delay with the patch is 2-3% less, is basically a problem of the "0001-vacuum_costing_test" patch, because that patch only displays the delay during the index vacuuming phase, not the total delay. So if we observe the total delay then it should be the same. The modified version of 0001-vacuum_costing_test is attached to print the total delay. In my test.sh, I can see the total delay is almost the same. Non-parallel vacuum WARNING: VacuumCostTotalDelay = 11332.170000 Parallel vacuum with shared_costing_plus_patch4_v1.patch: WARNING: worker 0 delay=89.230000 total io=17931 hit=17891 miss=0 dirty=2 WARNING: worker 1 delay=85.205000 total io=17931 hit=17891 miss=0 dirty=2 WARNING: worker 2 delay=87.290000 total io=17931 hit=17891 miss=0 dirty=2 WARNING: worker 3 delay=78.365000 total io=16378 hit=4318 miss=0 dirty=603 WARNING: VacuumCostTotalDelay = 11331.690000 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
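The instrumentation change described above can be pictured as the following sketch (illustrative only; VacuumCostTotalDelay matches the WARNING output quoted above, while the helper function and its placement are assumptions): every sleep, in whichever phase it happens, is added to a single running total that is reported at the end of the operation.

/* Sketch of the instrumentation: count every sleep, in every phase. */
static double VacuumCostTotalDelay;     /* ms over the entire VACUUM */

static void
vacuum_sleep(double msec)
{
    pg_usleep((long) (msec * 1000));
    VacuumCostTotalDelay += msec;       /* not just index vacuuming */
}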
On Fri, 15 Nov 2019 at 11:54, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Nov 13, 2019 at 10:02 AM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > I've done some tests while changing shared buffer size, delays and > > number of workers. The overall results have a similar tendency to the > > results shared by Dilip and look reasonable to me. > > > > Thanks, Sawada-san, for repeating the tests. I can see from yours, > Dilip's, and Mahendra's testing that the delay is distributed depending > upon the I/O done by a particular worker and the total I/O is also as > expected in various kinds of scenarios. So, I think this is a better > approach. Do you agree, or do you think we should still investigate > the other approach further as well? > > I would like to summarize this approach. The basic idea for parallel > vacuum is to allow the parallel workers and master backend to have a > shared view of vacuum cost related parameters (mainly > VacuumCostBalance) and allow each worker to update it and then based > on that decide whether it needs to sleep. With this basic idea, we > found that in some cases the throttling is not accurate, as explained > with an example in my email above [1] and then in the tests performed by > Dilip and others in the following emails (in short, the workers doing > more I/O can be throttled less). Then, as discussed in a later email > [2], we tried a way to avoid letting workers sleep which have done > less or no I/O compared to other workers. This ensured that > workers who are doing more I/O got throttled more. The idea is to > allow any worker to sleep only if it has performed I/O above a > certain threshold and the overall balance is more than the cost_limit > set by the system. Then we allow the worker to sleep > in proportion to the work done by it and reduce the > VacuumSharedCostBalance by the amount consumed by the current > worker. This scheme leads to the desired throttling of different > workers based on the work done by the individual worker. > > We have tested this idea with various kinds of workloads, e.g. by > varying shared buffer size, delays and number of workers. We have also > tried with different numbers of indexes and workers. In all > the tests, we found that the workers are throttled in proportion to the > I/O done by each particular worker. Thank you for summarizing! I agree with this approach. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services