From: Robert Haas
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: CA+TgmoYsUw22=zYcKQrtC_Udi3-Ukw0qZU0zgccFyUcZ6_5FtA@mail.gmail.com
In response to: Re: POC: Cleaning up orphaned files using undo logs (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers

On Thu, Jun 20, 2019 at 2:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, one reason that comes to mind is we don't want to choke the
> system, as applying undo can consume CPU and generate a lot of I/O.  Is
> that what you have in mind, or something else?

Yeah, mainly that, but also things like log spam, and even pressure on
the lock table.  If we are trying over and over again to take useless
locks, it can affect other things on the system. The main thing,
however, is the CPU and I/O consumption.

> I see an advantage in having some sort of throttling here, so we can
> have some wait time (say 100ms) between processing requests.  Do we
> see any need for a GUC here?

I don't think that is the right approach. As I said in my previous
reply, we need a way of holding off the retry of the same error for a
certain amount of time, probably measured in seconds or tens of
seconds. Introducing a delay before processing every request is an
inferior alternative: if there are a lot of rollbacks, it can cause
the system to lag; and in the case where there's just one rollback
that's failing, it will still be way too much log spam (and probably
CPU time too).  Nobody wants 10 failure messages per second in the
log.
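
To make the distinction concrete, here's a rough sketch of the
worker-side logic I have in mind (names like "entry" and
"next_retry_time" are invented for illustration; none of this is in
the patch):

    /* Not this: an unconditional delay stalls the whole queue. */
    pg_usleep(100 * 1000L);         /* 100ms before every request */

    /* Instead: hold off only the request that failed, until its retry
     * time arrives, and keep processing everything else. */
    if (GetCurrentTimestamp() < entry->next_retry_time)
        continue;                   /* too soon to retry this one */

pg_usleep() and GetCurrentTimestamp() are the existing server
primitives; the point is just that the delay attaches to the failing
request, not to the loop.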

> > It seems to me that thinking of this in terms of what the undo worker
> > does and what the undo launcher does is probably not the right
> > approach. We need to think of it more as an integrated system. Instead
> > of storing a failure_count with each request in the error queue, how
> > about storing a next retry time?
>
> I think both failure_count and next_retry_time can work in a similar way.
>
> I think incrementing the next retry time in multiples will be a bit
> tricky.  Say the first error occurs at time X.  We can say that
> next_retry_time will be X+10s=Y and error_occurred_at will be X.  The
> second time it fails, how will we know that we need to set
> next_retry_time to Y+20s?  Maybe we can compute Y-X, add 10s to it,
> and add the result to the current time.  Now, whenever the worker or
> launcher finds this request, it can check whether the current time is
> greater than or equal to next_retry_time; if so, it can pick that
> request; otherwise, it checks the next queue.
>
> The failure_count can also work in a somewhat similar fashion.
> Basically, we can use error_occurred_at and failure_count to compute
> the required time.  So, if the error occurred at time X and the
> failure count is 3, we can check whether the current time is greater
> than X+(3 * 10s); if so, we allow the entry to be processed;
> otherwise, we check other queues for work.

Meh.  Don't get stuck on one particular method of calculating the next
retry time.  We want to be able to change that easily if whatever we
try first doesn't work out well.  I am not convinced that we need
anything more complex than a fixed retry time, probably controlled by
a GUC (undo_failure_retry_time = 10s?).  An escalating time between
retries would be important and advantageous if we expected the sizes
of these queues to grow into the millions, but the current design
seems to be contemplating something more in the tens-of-thousands
range and I am not sure we're going to need it at that level.  We
should try simple things first and then see where we need to make it
more complex.
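
If we go that route, the GUC itself is trivial -- something like this
guc.c-style entry (purely illustrative; neither the variable nor the
entry exists anywhere yet):

    int undo_failure_retry_time = 10;       /* seconds */

    {
        {"undo_failure_retry_time", PGC_SIGHUP, RESOURCES,
            gettext_noop("Time to wait before retrying a failed undo request."),
            NULL,
            GUC_UNIT_S
        },
        &undo_failure_retry_time,
        10, 0, INT_MAX,
        NULL, NULL, NULL
    },

and then scheduling the retry is just:

    entry->next_retry_time =
        TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                    undo_failure_retry_time * 1000L);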

At some basic level, the queue needs to be ordered by increasing retry
time.  You can do that with your design, but you have to recompute the
next retry time from the error_occurred_at and failure_count values
every time you examine an entry.  It's almost certainly better to
store the next_retry_time explicitly.  That way, if for example we
change the logic for computing the next_retry_time to something really
complicated, it doesn't have any effect on the code that keeps the
queue in order -- it just looks at the computed value.  If we end up
with something very simple, like error_occurred_at + constant, it may
end up seeming a little silly, but I think that's a price well worth
paying for code maintainability.  If we end up with error_occurred_at
+ Min(failure_count * 10, 100) or something of that sort, then we can
also store failure_count in each record, but it will just be part of
the payload, not the sort key, so adding it or removing it won't
affect the code that maintains the queue ordering.
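
A sketch of what I mean, with invented field names (the queue
machinery itself is whatever the patch already uses):

    /* Hypothetical error-queue entry; only the first two fields
     * participate in the ordering. */
    typedef struct UndoErrorQueueEntry
    {
        TimestampTz next_retry_time;    /* sort key: when to try again */
        TimestampTz error_occurred_at;  /* tiebreaker: insertion order */
        int         failure_count;      /* payload only, not a sort key */
        /* ... whatever identifies the request itself ... */
    } UndoErrorQueueEntry;

    /* qsort-style comparator.  However we decide to compute
     * next_retry_time, this function never has to change. */
    static int
    undo_error_queue_cmp(const void *a, const void *b)
    {
        const UndoErrorQueueEntry *ea = (const UndoErrorQueueEntry *) a;
        const UndoErrorQueueEntry *eb = (const UndoErrorQueueEntry *) b;

        if (ea->next_retry_time != eb->next_retry_time)
            return (ea->next_retry_time < eb->next_retry_time) ? -1 : 1;
        if (ea->error_occurred_at != eb->error_occurred_at)
            return (ea->error_occurred_at < eb->error_occurred_at) ? -1 : 1;
        return 0;
    }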

> >  I think the error queue needs to be
> > ordered by database_id, then by next_retry_time, and then by order of
> > insertion.  (The last part is important because next_retry_time is
> > going to be prone to having ties, and we need to break those ties in
> > the right way.)
>
> I think it makes sense to order requests by next_retry_time and then
> error_occurred_at (this will preserve the order of insertion).
> However, I am not sure there is a need to group the requests by
> database id.  It can starve the error requests from other databases.
> Moreover, we already have functionality wherein, if the undo worker
> doesn't encounter the next request from the database on which it is
> operating for a certain amount of time, it will peek ahead (a few
> entries) in each queue to find a request for the same database.  We
> don't sort by db_id in the other queues either, so it will be
> consistent for this queue if we just sort by next_retry_time and
> error_occurred_at.

You're misunderstanding my point. We certainly do not wish to always
pick the request from the database with the lowest OID, or anything
like that.  However, we do need a worker for a particular database to
find the work pending for that database efficiently.  Peeking ahead a
few requests is a version of that, but I'm not sure it's going to be
good enough.  Suppose we look ahead 3 requests but there are 10
databases.  Then, if all 10 databases have requests pending, it is
likely that we won't find the next request for our particular database
even though it exists -- the first 3 may easily all be for other
databases.  (If requests are spread evenly across the 10 databases,
the chance that none of the first 3 is for ours is (9/10)^3, roughly
73%.)  Looking ahead more requests doesn't really fix it; it just
means you need more databases for the problem to become likely.  And
note that this problem happens even if every database contains a
worker.  Some of those workers will erroneously conclude that they
should exit.

I'm not sure exactly what to do about this.  My first thought was that
for all of the queues we might need to have a queue per database (or
something equivalent) rather than just one big queue.  But that has
problems too: it will mean that a database worker will never exit as
long as there is any work at all to be done in that database, even if
some other database is getting starved. Somehow we need to balance the
efficiency of having a worker for a particular database process many
requests before exiting against the need to ensure fairness across
databases, and it doesn't sound to me like we quite know what exactly
we ought to do there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


