Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From: Amit Kapila
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: CAA4eK1Kny1RyzuxL4-OmEFFYPRiHD+xf8Jy4DkZ1bbGV4fhHgA@mail.gmail.com
In response to: Re: POC: Cleaning up orphaned files using undo logs (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: POC: Cleaning up orphaned files using undo logs
List: pgsql-hackers
On Wed, Jun 19, 2019 at 8:25 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jun 19, 2019 at 2:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The reason for the same is that currently, the undo worker keeps on
> > executing the requests if there are any.  I think this is good when
> > there are different requests, but getting the same request from the
> > error queue and doing it again and again doesn't seem to be good and
> > I think it will not help either.
>
> Even if there are multiple requests involved, you don't want a tight
> loop like this.
>

Okay, one reason that comes to mind is that we don't want to choke the
system, as applying undo can consume CPU and generate a lot of I/O.  Is
that what you have in mind, or something else?

I see an advantage in having some sort of throttling here, so we can have
some wait time (say 100ms) between processing requests.  Do we see any
need for a GUC here?  On the one hand, it seems a good idea to have
multiple GUCs for tuning the undo worker machinery, because whatever
default values we pick might not be good for some users.  OTOH, giving
too many GUCs can also make the system difficult to understand and can
confuse users, or at the least they won't know exactly how to use them.
It seems to me that we should first complete the entire patch and then
decide which of these things need a separate GUC.

> > > I assume we want some kind of cool-down between retries.
> > > 10 seconds?  A minute?  Some kind of back-off algorithm that
> > > gradually increases the retry time up to some maximum?
> >
> > Yeah, something on these lines would be good.  How about if we add a
> > failure_count to each request in the error queue?  It will get
> > incremented on each retry and we can wait in proportion to it, say
> > 10s after the first retry, 20s after the second, and so on; a
> > maximum of 10 failures (100s) will be allowed, after which the
> > worker will exit, considering it has no more work to do.
> >
> > Actually, we also need to think about what we should do with such
> > requests, because even if the undo worker exits after retrying some
> > threshold number of times, the undo launcher will again launch a new
> > worker for this request unless we have some special handling for it.
> >
> > We can issue some WARNING once a particular request reaches the
> > maximum number of retries, but I am not sure that is enough, because
> > the user might not notice it or might not take any action.  Do we
> > want to PANIC at some point of time, and if so, when?  Or the other
> > alternative is that we keep trying at regular intervals till we
> > succeed.
>
> PANIC is a terrible idea.  How would that fix anything?  You'll very
> possibly still have the same problem after restarting, and so you'll
> just keep on hitting the PANIC.  That will mean that in addition to
> whatever problem with undo you already had, you now have a system that
> you can't use for anything at all, because it keeps restarting.
>
> The design goal here should be that if undo for a transaction fails,
> we keep retrying periodically, but with minimal adverse impact on the
> rest of the system.  That means you can't retry in a loop.  It also
> means that the system needs to provide fairness: that is, it shouldn't
> be possible to create a system where one or more transactions for
> which undo keeps failing cause other transactions that could have been
> undone to get starved.
>

Agreed.
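To make the proportional wait I described above a bit more concrete, here
is a minimal sketch in C of how the worker could decide whether a failed
request is due for another attempt.  The struct and constant names
(UndoErrorRequest, UNDO_RETRY_BASE_MS, UNDO_MAX_FAILURES) are just for
illustration and are not taken from the patch:

#include "postgres.h"
#include "utils/timestamp.h"

/* Illustrative constants, not from the patch. */
#define UNDO_RETRY_BASE_MS  10000   /* wait grows by 10s per failure */
#define UNDO_MAX_FAILURES   10      /* give up after this many retries */

/* Hypothetical per-request bookkeeping kept in the error queue. */
typedef struct UndoErrorRequest
{
    TimestampTz error_occurred_at;  /* time of the last failed attempt */
    int         failure_count;      /* number of failed attempts so far */
} UndoErrorRequest;

/*
 * Return true if enough time has passed since the last failure for the
 * worker to retry this request.  The wait is proportional to the number
 * of failures: 10s, 20s, ... up to 100s, after which the worker stops
 * retrying (what the launcher does then is a separate question).
 */
static bool
undo_request_is_due(UndoErrorRequest *req, TimestampTz now)
{
    if (req->failure_count >= UNDO_MAX_FAILURES)
        return false;
    return TimestampDifferenceExceeds(req->error_occurred_at, now,
                                      req->failure_count * UNDO_RETRY_BASE_MS);
}

The same check works whether it is the worker or the launcher that picks
up the request, which fits with treating them as one integrated system.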
> It seems to me that thinking of this in terms of what the undo worker
> does and what the undo launcher does is probably not the right
> approach.  We need to think of it more as an integrated system.
> Instead of storing a failure_count with each request in the error
> queue, how about storing a next retry time?
>

I think both failure_count and next_retry_time can work in a similar way.

I think incrementing the next retry time in multiples will be a bit
tricky.  Say the error first occurs at time X.  We can say that
next_retry_time will be X+10s=Y and error_occurred_at will be X.  When it
fails the second time, how will we know that we need to set
next_retry_time to Y+20s?  Maybe we can compute Y-X, add 10s to it, and
add the result to the current time.  Now, whenever the worker or launcher
finds this request, it can check whether current_time is greater than or
equal to next_retry_time; if so, it can pick that request, otherwise it
checks the next queue for a request.

The failure_count can work in a somewhat similar fashion.  Basically, we
can use error_occurred_at and failure_count to compute the required time.
So, if the error occurred at, say, X hours and the failure count is 3,
then we check whether current_time is greater than X + (3 * 10s); if so,
we allow the entry to be processed, otherwise the worker checks the other
queues for work.

> I think the error queue needs to be
> ordered by database_id, then by next_retry_time, and then by order of
> insertion.  (The last part is important because next_retry_time is
> going to be prone to having ties, and we need to break those ties in
> the right way.)
>

I think it makes sense to order requests by next_retry_time and then by
error_occurred_at (this will ensure the order of insertion); a rough
sketch of what I mean is at the end of this mail.  However, I am not sure
there is a need to club the requests by database id, as that can starve
the error requests from other databases.  Moreover, we already have
functionality wherein, if the undo worker doesn't encounter the next
request from the database on which it is operating for a certain amount
of time, it will peek ahead (a few entries) in each queue to get a
request for the same database.  We don't sort by db_id in the other
queues either, so it will be consistent for this queue if we just sort by
next_retry_time and error_occurred_at.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
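PS: Here is the rough illustrative sketch of the ordering mentioned
above, i.e. next_retry_time first with error_occurred_at as the
tie-breaker.  The struct and function names are made up for illustration
and are not part of the patch:

#include "postgres.h"
#include "utils/timestamp.h"

/* Hypothetical error-queue entry, for illustration only. */
typedef struct UndoErrorQueueEntry
{
    TimestampTz next_retry_time;    /* earliest time a retry is allowed */
    TimestampTz error_occurred_at;  /* when the request failed; acts as
                                     * an insertion-order tie-breaker */
    Oid         dbid;               /* not used for ordering here */
} UndoErrorQueueEntry;

/*
 * qsort-style comparator: earlier next_retry_time first; ties broken by
 * error_occurred_at, which preserves insertion order without grouping
 * requests by database.
 */
static int
undo_error_queue_cmp(const void *a, const void *b)
{
    const UndoErrorQueueEntry *ea = (const UndoErrorQueueEntry *) a;
    const UndoErrorQueueEntry *eb = (const UndoErrorQueueEntry *) b;

    if (ea->next_retry_time != eb->next_retry_time)
        return (ea->next_retry_time < eb->next_retry_time) ? -1 : 1;
    if (ea->error_occurred_at != eb->error_occurred_at)
        return (ea->error_occurred_at < eb->error_occurred_at) ? -1 : 1;
    return 0;
}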