From: Robert Haas
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: CA+TgmoYdswm3vtfL3X6h3-hsTSqDSc2hPdziBv6h8avaoNRA7Q@mail.gmail.com
In response to: Re: POC: Cleaning up orphaned files using undo logs (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers

On Wed, Jun 19, 2019 at 2:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> The reason for the same is that currently the undo worker keeps on
> executing the requests if there are any.  I think this is good when
> there are different requests, but getting the same request from the
> error queue and doing it again and again doesn't seem to be good, and
> I don't think it will help either.

Even if there are multiple requests involved, you don't want a tight
loop like this.

> > I assume we want some kind of cool-down between retries.
> > 10 seconds?  A minute?  Some kind of back-off algorithm that gradually
> > increases the retry time up to some maximum?
>
> Yeah, something along these lines would be good.  How about if we add
> a failure_count to each request in the error queue?  It would get
> incremented on each retry, and we could wait in proportion to it: say,
> 10s after the first retry, 20s after the second, and so on, up to a
> maximum failure_count of 10 (100s), after which the worker will exit,
> considering it has no more work to do.
>
> Actually, we also need to think about what we should do with such
> requests, because even if the undo worker exits after retrying some
> threshold number of times, the undo launcher will again launch a new
> worker for this request unless we have some special handling for it.
>
> We can issue a WARNING once a particular request has reached the
> maximum number of retries, but I'm not sure that is enough, because
> the user might not notice it or might not take any action.  Do we want
> to PANIC at some point, and if so, when?  Or, as an alternative, do we
> keep trying at regular intervals until we succeed?

PANIC is a terrible idea.  How would that fix anything?  You'll very
possibly still have the same problem after restarting, and so you'll
just keep on hitting the PANIC. That will mean that in addition to
whatever problem with undo you already had, you now have a system that
you can't use for anything at all, because it keeps restarting.

The design goal here should be that if undo for a transaction fails,
we keep retrying periodically, but with minimal adverse impact on the
rest of the system.  That means you can't retry in a loop.  It also
means that the system needs to provide fairness: it shouldn't be
possible for one or more transactions whose undo keeps failing to
starve other transactions that could otherwise have been undone.

It seems to me that thinking of this in terms of what the undo worker
does and what the undo launcher does is probably not the right
approach. We need to think of it more as an integrated system. Instead
of storing a failure_count with each request in the error queue, how
about storing a next retry time?  I think the error queue needs to be
ordered by database_id, then by next_retry_time, and then by order of
insertion.  (The last part is important because next_retry_time is
going to be prone to having ties, and we need to break those ties in
the right way.) So, when a per-database worker starts up, it's pulling
from the queues in alternation, ignoring items that are not for the
current database.  When it pulls from the error queue, it looks at the
item for the current database that has the lowest retry time - if
that's still in the future, then it ignores the queue until something
new (perhaps with a lower next_retry_time) is added, or until the first
next_retry_time arrives.  If the item that it pulls again fails, it
gets inserted back into the error queue but with a higher next retry
time.
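
To make that ordering concrete, here is a rough sketch in plain C.  To
be clear, the struct and function names below are invented purely for
illustration; they aren't taken from the patch set:

/*
 * Illustrative only: order error-queue entries by database, then by
 * next retry time, then by insertion order as a tie-breaker.
 */
#include <stdint.h>
#include <time.h>

typedef struct UndoErrorQueueEntry
{
    uint32_t    database_id;        /* database of the failed transaction */
    time_t      next_retry_time;    /* don't retry before this time */
    uint64_t    insertion_seq;      /* monotonically increasing counter */
    uint64_t    full_xid;           /* transaction whose undo failed */
} UndoErrorQueueEntry;

/* qsort-style comparator implementing the proposed ordering */
static int
undo_error_queue_cmp(const void *a, const void *b)
{
    const UndoErrorQueueEntry *x = a;
    const UndoErrorQueueEntry *y = b;

    if (x->database_id != y->database_id)
        return (x->database_id < y->database_id) ? -1 : 1;
    if (x->next_retry_time != y->next_retry_time)
        return (x->next_retry_time < y->next_retry_time) ? -1 : 1;
    if (x->insertion_seq != y->insertion_seq)
        return (x->insertion_seq < y->insertion_seq) ? -1 : 1;
    return 0;
}

The insertion_seq is just a counter assigned when the entry is added;
it's what preserves insertion order when two entries end up with the
same next_retry_time.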

This might not be exactly right, but the point is that there should
probably be NO logic that causes a worker to retry the same
transaction immediately afterward, with or without a delay.  It should
all be driven off what gets pulled out of the error queue.  In the
above sketch, if a worker gets to the point where there's nothing in
the error queue for the current database with a timestamp that is <=
the current time, then it can't pull anything else from that queue; if
there's no other work to do, it exits.  If there is other work to do,
it does that and then maybe enough time will have passed to allow
something to be pulled from the error queue, or maybe not.  Meanwhile,
some other worker running in the same database might pull the item
before the original worker gets back to it.  And if the worker
exits because there's nothing more to do in that database, the
launcher can also see the error queue.  When enough time has passed,
it can notice that there is an item (or items) that could be pulled
from the error queue for that database and launch a worker for that
database if necessary (or else let an existing worker take care of
it).
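
Roughly, the worker-side behavior I have in mind looks like the sketch
below.  Again, this is only illustrative and builds on the struct from
the earlier sketch; the helper functions are made-up stand-ins for
whatever the real patch provides:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helpers -- assumed for the sketch, not real APIs. */
extern bool process_one_regular_request(uint32_t dbid);
extern UndoErrorQueueEntry *error_queue_peek_for_db(uint32_t dbid);
extern void error_queue_remove(UndoErrorQueueEntry *entry);
extern void error_queue_insert(UndoErrorQueueEntry *entry);
extern bool apply_undo_actions(uint64_t full_xid);
extern time_t backoff_interval(const UndoErrorQueueEntry *entry);

static void
undo_worker_main_sketch(uint32_t my_database_id)
{
    for (;;)
    {
        UndoErrorQueueEntry *entry;
        bool        did_work = false;

        /* One request from the ordinary (non-error) queues, if any. */
        if (process_one_regular_request(my_database_id))
            did_work = true;

        /* One entry from the error queue, but only if it's due. */
        entry = error_queue_peek_for_db(my_database_id);
        if (entry != NULL && entry->next_retry_time <= time(NULL))
        {
            error_queue_remove(entry);
            if (!apply_undo_actions(entry->full_xid))
            {
                /* Failed again: reinsert with a later retry time. */
                entry->next_retry_time = time(NULL) + backoff_interval(entry);
                error_queue_insert(entry);
            }
            did_work = true;
        }

        /*
         * Nothing to do and nothing due yet: exit.  The launcher can
         * start another worker once an error-queue entry becomes due.
         */
        if (!did_work)
            break;
    }
}

The point is that all of the pacing lives in the queue itself, via
next_retry_time; the worker never has any per-transaction retry loop
of its own.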

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


