Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From | Robert Haas
Subject | Re: POC: Cleaning up orphaned files using undo logs
Date |
Msg-id | CA+TgmoZH1EvxqwSCkfJ=nXSO1aasPZuuyaaMtrcLWNbSwK0-WQ@mail.gmail.com
In response to | Re: POC: Cleaning up orphaned files using undo logs (Amit Kapila <amit.kapila16@gmail.com>)
Responses | Re: POC: Cleaning up orphaned files using undo logs
List | pgsql-hackers
On Thu, Jun 20, 2019 at 11:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> This delay is for *not* choking the system by constantly performing
> undo requests that consume a lot of CPU and I/O as discussed in above
> point. For holding off the same error request to be re-tried, we need
> next_retry_time type of method as discussed below.

Oh. That's not what I thought we were talking about. It's not
unreasonable to think about trying to rate limit undo application just
like we do for vacuum, but a fixed delay between requests would be a
completely inadequate way of attacking that problem. If the individual
requests are short, it will create too much delay, and if they are
long, it will not create enough. We would need delays within a
transaction, not just between transactions, similar to how the vacuum
cost delay stuff works.

I suggest that we leave that to one side for now. It seems like
something that could be added later, maybe in a more general way, and
not something that needs to be or should be closely connected to the
logic for deciding the order in which we're going to process different
transactions in undo.

> > Meh. Don't get stuck on one particular method of calculating the next
> > retry time. We want to be able to change that easily if whatever we
> > try first doesn't work out well. I am not convinced that we need
> > anything more complex than a fixed retry time, probably controlled by
> > a GUC (undo_failure_retry_time = 10s?).
>
> IIRC, then you only seem to have suggested that we need a kind of
> back-off algorithm that gradually increases the retry time up to some
> maximum [1]. I think that is a good way to de-prioritize requests
> that are repeatedly failing. Say, there is a request that has already
> failed for 5 times and the worker queues it to get executed after 10s.
> Immediately after that, another new request has failed for the first
> time for the same database and it also got queued to get executed
> after 10s. In this scheme the request that has already failed for 5
> times will get a chance before the request that has failed for the
> first time.

Sure, that's an advantage of increasing back-off times -- you can keep
the stuff that looks hopeless from interfering too much with the stuff
that is more likely to work out.

However, I don't think we've actually done enough testing to know for
sure what algorithm will work out best. Do we want linear back-off
(10s, 20s, 30s, ...)? Exponential back-off (1s, 2s, 4s, 8s, ...)? No
back-off (10s, 10s, 10s, 10s)? Some algorithm that depends on the size
of the failed transaction, so that big things get retried less often?
I think it's important to design the code in such a way that the
algorithm can be changed easily later, because I don't think we can be
confident that whatever we pick for the first attempt will prove to be
best. I'm pretty sure that storing the failure count INSTEAD OF the
next retry time is going to make it harder to experiment with
different algorithms later.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
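[Editor's note: the thread above refers to the vacuum cost delay mechanism without showing it. The following is a minimal, hypothetical sketch of what that style of throttling might look like if applied to undo processing -- accumulating a per-record cost and sleeping *within* a transaction once a limit is crossed, rather than pausing a fixed amount between requests. The names undo_cost_balance, undo_cost_limit, undo_cost_delay, and undo_apply_delay, and the numbers used, are invented for illustration and are not part of the proposed patch.]

```c
#include "postgres.h"

/*
 * Hypothetical cost-based throttling for undo apply, modeled loosely on
 * how vacuum_cost_delay works: each undo record applied adds to a running
 * balance, and once the balance exceeds a limit the worker naps briefly
 * and resets the balance.  All names and values here are illustrative.
 */
static int	undo_cost_balance = 0;		/* work done since last nap */
static int	undo_cost_limit = 200;		/* work units before napping */
static double undo_cost_delay = 2.0;	/* nap length in milliseconds */

static void
undo_apply_delay(int cost_of_last_record)
{
	undo_cost_balance += cost_of_last_record;

	if (undo_cost_balance >= undo_cost_limit)
	{
		/* pg_usleep() takes microseconds */
		pg_usleep((long) (undo_cost_delay * 1000.0));
		undo_cost_balance = 0;
	}
}
```

A scheme like this throttles long-running undo work smoothly regardless of how many records each request contains, which is the property a fixed between-request delay cannot provide.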
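[Editor's note: to make the "keep the algorithm easy to change" point concrete, here is a minimal sketch of isolating the back-off policy behind a single function that yields the next retry time, which is what would be stored with the failed request. RetryPolicy, compute_next_retry_time, and all parameters are hypothetical names invented here; the failure count appears only as an input some policies might need, not as the stored state.]

```c
#include "postgres.h"
#include "utils/timestamp.h"

/* Hypothetical back-off policies; only this enum and the switch below change. */
typedef enum RetryPolicy
{
	RETRY_FIXED,				/* 10s, 10s, 10s, ... */
	RETRY_LINEAR,				/* 10s, 20s, 30s, ... */
	RETRY_EXPONENTIAL			/* 1s, 2s, 4s, 8s, ... */
} RetryPolicy;

/*
 * Compute the absolute time at which a failed request should next be
 * retried.  The caller stores the returned timestamp with the request,
 * so the queue can be ordered by retry time no matter which policy is
 * in effect, and the policy can be swapped out later.
 */
static TimestampTz
compute_next_retry_time(RetryPolicy policy, int failures_so_far,
						int base_retry_secs)
{
	int			delay_secs;

	switch (policy)
	{
		case RETRY_LINEAR:
			delay_secs = base_retry_secs * Max(failures_so_far, 1);
			break;
		case RETRY_EXPONENTIAL:
			delay_secs = 1 << Min(failures_so_far, 16);
			break;
		case RETRY_FIXED:
		default:
			delay_secs = base_retry_secs;
			break;
	}

	return TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
									   (int64) delay_secs * 1000);
}
```

Because callers only ever see a timestamp, experimenting with a different policy (or one that also weighs the size of the failed transaction) means changing this one function, not the queue layout.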