From: Robert Haas
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: CA+TgmoYdswm3vtfL3X6h3-hsTSqDSc2hPdziBv6h8avaoNRA7Q@mail.gmail.com
In response to: Re: POC: Cleaning up orphaned files using undo logs (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Wed, Jun 19, 2019 at 2:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> The reason for the same is that currently, the undo worker keep on
> executing the requests if there are any. I think this is good when
> there are different requests, but getting the same request from error
> queue and doing it, again and again, doesn't seem to be good and I
> think it will not help either.

Even if there are multiple requests involved, you don't want a tight
loop like this.

> > I assume we want some kind of cool-down between retries.
> > 10 seconds? A minute? Some kind of back-off algorithm that gradually
> > increases the retry time up to some maximum?
>
> Yeah, something on these lines would be good. How about if we add
> failure_count with each request in error queue? Now, it will get
> incremented on each retry and we can wait in proportion to that, say
> 10s after the first retry, 20s after second and so on and maximum up
> to 10 failure_count (100s) will be allowed after which worker will
> exit considering it has no more work to do.
>
> Actually, we also need to think about what we should with such
> requests because even if undo worker exits after retrying for some
> threshold number of times, undo launcher will again launch a new
> worker for this request unless we have some special handling for the
> same.
>
> We can issue some WARNING once any particular request reached the
> maximum number of retries but not sure if that is enough because the
> user might not notice the same or didn't take any action. Do we want
> to PANIC at some point of time, if so, when or the other alternative
> is we can try at regular intervals till we succeed?

PANIC is a terrible idea. How would that fix anything? You'll very
possibly still have the same problem after restarting, and so you'll
just keep on hitting the PANIC. That will mean that in addition to
whatever problem with undo you already had, you now have a system
that you can't use for anything at all, because it keeps restarting.

The design goal here should be that if undo for a transaction fails,
we keep retrying periodically, but with minimal adverse impact on the
rest of the system. That means you can't retry in a loop. It also
means that the system needs to provide fairness: that is, it
shouldn't be possible to create a system where one or more
transactions for which undo keeps failing cause other transactions
that could have been undone to get starved.

It seems to me that thinking of this in terms of what the undo worker
does and what the undo launcher does is probably not the right
approach. We need to think of it more as an integrated system.
Instead of storing a failure_count with each request in the error
queue, how about storing a next retry time? I think the error queue
needs to be ordered by database_id, then by next_retry_time, and then
by order of insertion. (The last part is important because
next_retry_time is going to be prone to having ties, and we need to
break those ties in the right way.)

So, when a per-database worker starts up, it's pulling from the
queues in alternation, ignoring items that are not for the current
database. When it pulls from the error queue, it looks at the item
for the current database that has the lowest retry time - if that's
still in the future, then it ignores the queue until something new
(perhaps with a lower retry_time) is added, or until the first
next_retry_time arrives. If the item that it pulls again fails, it
gets inserted back into the error queue but with a higher next retry
time.
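To make that ordering and reinsertion rule concrete, here is a minimal,
self-contained C sketch. Every name in it (UndoErrorQueueItem,
undo_error_queue_cmp, undo_error_queue_requeue, the retry_delay field,
and the 10s/100s backoff numbers) is hypothetical and used only for
illustration; none of it comes from the patch set, and whether anything
beyond a next_retry_time actually needs to be stored per item is still
an open question in this thread.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical layout of an error-queue entry (not from the patch). */
typedef struct UndoErrorQueueItem
{
    uint32_t    database_id;      /* database the failed undo belongs to */
    time_t      next_retry_time;  /* do not retry before this time */
    uint64_t    insertion_seq;    /* monotonic tie-breaker for equal times */
    uint64_t    full_xid;         /* transaction whose undo must be applied */
    int         retry_delay;      /* current backoff in seconds (assumed field) */
} UndoErrorQueueItem;

/*
 * Ordering: database_id first, then next_retry_time, then insertion
 * order, so ties on the retry time are broken by arrival order.
 */
int
undo_error_queue_cmp(const UndoErrorQueueItem *a, const UndoErrorQueueItem *b)
{
    if (a->database_id != b->database_id)
        return a->database_id < b->database_id ? -1 : 1;
    if (a->next_retry_time != b->next_retry_time)
        return a->next_retry_time < b->next_retry_time ? -1 : 1;
    if (a->insertion_seq != b->insertion_seq)
        return a->insertion_seq < b->insertion_seq ? -1 : 1;
    return 0;
}

/*
 * After another failure, push the item back with a higher retry time.
 * The 10s/20s/... progression and the 100-second cap simply mirror the
 * numbers floated upthread; they are placeholders, not agreed values.
 */
void
undo_error_queue_requeue(UndoErrorQueueItem *item)
{
    item->retry_delay += 10;
    if (item->retry_delay > 100)
        item->retry_delay = 100;
    item->next_retry_time = time(NULL) + item->retry_delay;
    /* ... reinsert into the ordered queue using undo_error_queue_cmp() ... */
}

int
main(void)
{
    UndoErrorQueueItem a = {1, 0, 1, 1000, 0};
    UndoErrorQueueItem b = {1, 0, 2, 1001, 0};

    undo_error_queue_requeue(&a);
    undo_error_queue_requeue(&b);
    printf("a sorts %s b\n",
           undo_error_queue_cmp(&a, &b) < 0 ? "before" : "after");
    return 0;
}

With equal retry times, the insertion_seq field is what keeps the queue
stable, which is the tie-breaking behavior described above.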
This might not be exactly right, but the point is that there should
probably be NO logic that causes a worker to retry the same
transaction immediately afterward, with or without a delay. It should
all be driven off what gets pulled out of the error queue.

In the above sketch, if a worker gets to the point where there's
nothing in the error queue for the current database with a timestamp
that is <= the current time, then it can't pull anything else from
that queue; if there's no other work to do, it exits. If there is
other work to do, it does that, and then maybe enough time will have
passed to allow something to be pulled from the error queue, or maybe
not. Meanwhile, some other worker running in the same database might
pull the item before the original worker gets back to it.

Meanwhile, if the worker exits because there's nothing more to do in
that database, the launcher can also see the error queue. When enough
time has passed, it can notice that there is an item (or items) that
could be pulled from the error queue for that database and launch a
worker for that database if necessary (or else let an existing worker
take care of it).
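As a rough illustration of that pull loop, here is a hedged,
self-contained C sketch. The names undo_worker_main,
pull_and_process_regular_request, peek_error_queue, and
pull_and_process_error_request are made-up stand-ins (stubbed below so
the sketch compiles), not APIs from the patch, and a real worker would
interact with shared queues rather than these dummies.

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

typedef struct ErrorQueueHead
{
    bool        valid;            /* is there any item for this database? */
    time_t      next_retry_time;  /* retry time of the earliest such item */
} ErrorQueueHead;

/* Stubs standing in for the real queue operations. */
static bool
pull_and_process_regular_request(uint32_t dbid)
{
    (void) dbid;
    return false;               /* pretend the regular queues are empty */
}

static ErrorQueueHead
peek_error_queue(uint32_t dbid)
{
    ErrorQueueHead head = {false, 0};

    (void) dbid;
    return head;                /* pretend the error queue is empty too */
}

static bool
pull_and_process_error_request(uint32_t dbid)
{
    (void) dbid;
    return false;
}

/*
 * Alternate between the regular queues and the error queue.  An error
 * item is eligible only once its next_retry_time has passed; when
 * nothing at all is eligible, the worker exits rather than spinning,
 * leaving it to the launcher (or another worker) to come back later.
 */
static void
undo_worker_main(uint32_t dbid)
{
    for (;;)
    {
        bool            did_work = false;
        ErrorQueueHead  head;

        did_work |= pull_and_process_regular_request(dbid);

        head = peek_error_queue(dbid);
        if (head.valid && head.next_retry_time <= time(NULL))
            did_work |= pull_and_process_error_request(dbid);

        if (!did_work)
            break;              /* nothing eligible: exit, don't busy-loop */
    }
}

/*
 * The launcher side (not shown) would watch the same error queue and,
 * once the earliest next_retry_time for a database has passed, launch
 * a worker for that database if one isn't already running.
 */
int
main(void)
{
    undo_worker_main(1);        /* e.g., some database OID */
    return 0;
}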
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company