Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: POC: Cleaning up orphaned files using undo logs
Msg-id CAA4eK1J1RprUznBoKS-quvAGOjOD5q7U4FCvUm4jv71zn0mJ9A@mail.gmail.com
In response to Re: POC: Cleaning up orphaned files using undo logs  (Robert Haas <robertmhaas@gmail.com>)
On Thu, Jun 20, 2019 at 8:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 20, 2019 at 2:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Okay, one reason that comes to mind is we don't want to choke the
> > system as applying undo can consume CPU and generate a lot of I/O.  Is
> > that what you have in mind, or something else?
>
> Yeah, mainly that, but also things like log spam, and even pressure on
> the lock table.  If we are trying over and over again to take useless
> locks, it can affect other things on the system. The main thing,
> however, is the CPU and I/O consumption.
>
> > I see an advantage in having some sort of throttling here, so we can
> > have some wait time (say 100ms) between processing requests.  Do we
> > see any need for a GUC here?
>
> I don't think that is the right approach. As I said in my previous
> reply, we need a way of holding off the retry of the same error for a
> certain amount of time, probably measured in seconds or tens of
> seconds. Introducing a delay before processing every request is an
> inferior alternative:
>

This delay is for *not* choking the system by constantly performing
undo requests that consume a lot of CPU and I/O, as discussed in the
point above.  For holding off retries of the same failed request, we
need a next_retry_time type of method, as discussed below.
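
To make the distinction clearer, here is a rough sketch of the kind of
worker loop I have in mind (all of the names below are invented purely
for illustration and are not from the patch):

#include "postgres.h"
#include "miscadmin.h"

typedef struct UndoRequest UndoRequest;     /* opaque, illustrative */

/*
 * Hypothetical: returns the next request whose next_retry_time has
 * already passed, or NULL if no request is currently eligible.
 */
extern UndoRequest *GetNextUndoRequest(void);
extern void ProcessUndoRequest(UndoRequest *req);

int         undo_worker_delay = 100;    /* ms between requests, could be a GUC */

static void
UndoWorkerMainLoop(void)
{
    UndoRequest *req;

    while ((req = GetNextUndoRequest()) != NULL)
    {
        ProcessUndoRequest(req);

        /* throttle overall CPU/I/O by pausing between requests */
        pg_usleep(undo_worker_delay * 1000L);

        CHECK_FOR_INTERRUPTS();
    }
}

The pg_usleep() only throttles overall undo apply; it is
GetNextUndoRequest() skipping entries whose next_retry_time is still in
the future that prevents the same failed request from being retried
immediately.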

> if there are a lot of rollbacks, it can cause
> the system to lag; and in the case where there's just one rollback
> that's failing, it will still be way too much log spam (and probably
> CPU time too).  Nobody wants 10 failure messages per second in the
> log.
>
> > > It seems to me that thinking of this in terms of what the undo worker
> > > does and what the undo launcher does is probably not the right
> > > approach. We need to think of it more as an integrated system. Instead
> > > of storing a failure_count with each request in the error queue, how
> > > about storing a next retry time?
> >
> > I think both failure_count and next_retry_time can work in a similar way.
> >
> > I think incrementing the next retry time in multiples will be a bit
> > tricky.  Say the first-time error occurs at X hours.  We can say that
> > next_retry_time will be X+10s=Y and error_occurred_at will be X.  The
> > second time it fails, how will we know that we need to set
> > next_retry_time as Y+20s?  Maybe we can do something like Y-X, then
> > add 10s to it, and add the result to the current time.  Now, whenever
> > the worker or launcher finds this request, they can check if
> > current_time is greater than or equal to next_retry_time; if so, they
> > can pick that request, otherwise, they check the request in the next queue.
> >
> > The failure_count can also work in a somewhat similar fashion.
> > Basically, we can use error_occurred_at and failure_count to compute
> > the required time.  So, if the error occurred at, say, X hours and the
> > failure count is 3, then we can check whether current_time is greater
> > than X+(3 * 10s); if so, we will allow the entry to be processed,
> > otherwise, it will check other queues for work.
>
> Meh.  Don't get stuck on one particular method of calculating the next
> retry time.  We want to be able to change that easily if whatever we
> try first doesn't work out well.  I am not convinced that we need
> anything more complex than a fixed retry time, probably controlled by
> a GUC (undo_failure_retry_time = 10s?).
>

IIRC, you yourself seemed to suggest that we need a kind of back-off
algorithm that gradually increases the retry time up to some
maximum [1].  I think that is a good way to de-prioritize requests
that are repeatedly failing.  Say there is a request that has already
failed 5 times, and the worker queues it to get executed after 10s.
Immediately after that, another request fails for the first time for
the same database and also gets queued to be executed after 10s.  In
such a fixed-retry-time scheme, the request that has already failed 5
times will get a chance before the request that has failed for the
first time.
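
To make that concrete, the scheduling side of such a back-off could be
as simple as the sketch below (the struct fields and GUC names are made
up for illustration; the actual increments are easy to change later):

#include "postgres.h"
#include "utils/timestamp.h"

/* illustrative only -- not the actual error-queue entry */
typedef struct UndoErrorQueueEntry
{
    int         failure_count;
    TimestampTz next_retry_time;
} UndoErrorQueueEntry;

int         undo_failure_retry_time = 10;   /* seconds, could be a GUC */
int         undo_failure_retry_max = 600;   /* cap on the back-off, seconds */

static void
schedule_retry(UndoErrorQueueEntry *entry, TimestampTz now)
{
    int         delay_secs;

    entry->failure_count++;

    /* 10s after the first failure, 20s after the second, ..., capped */
    delay_secs = Min(entry->failure_count * undo_failure_retry_time,
                     undo_failure_retry_max);

    entry->next_retry_time =
        TimestampTzPlusMilliseconds(now, (int64) delay_secs * 1000);
}

The dequeue side then simply skips any entry whose next_retry_time is
still in the future, so a request that keeps failing naturally falls
behind newer requests instead of being retried ahead of them.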


[1] - https://www.postgresql.org/message-id/CA%2BTgmoYHBkm7M8tNk6Z9G_aEOiw3Bjdux7v9%2BUzmdNTdFmFzjA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


