
From: Robert Haas
Subject: should there be a hard-limit on the number of transactions pending undo?
Date:
Msg-id: CA+Tgmoafc_RZNW7G2w1CpkDUjaUzTRJaE_Uh6dx_sDLGJFsKKg@mail.gmail.com
In response to: Re: POC: Cleaning up orphaned files using undo logs  (Andres Freund <andres@anarazel.de>)
Responses: Re: should there be a hard-limit on the number of transactions pending undo?  (Andres Freund <andres@anarazel.de>)
           Re: should there be a hard-limit on the number of transactions pending undo?  (Peter Geoghegan <pg@bowt.ie>)
           Re: should there be a hard-limit on the number of transactions pending undo?  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Tue, Jul 16, 2019 at 6:23 PM Andres Freund <andres@anarazel.de> wrote:
> Yea, that seems like a question independent of the "completeness"
> requirement. If desirable, it seems trivial to either have
> RollbackHashEntry have per-persistence level status (for one entry per
> xid), or not (for per-persistence entries).

I want to talk more about the "completeness" issue, which is basically
a question of whether we should (a) put a hard limit on the number of
transactions that have unprocessed undo and that are either aborted or
in progress (and thus capable of aborting) as proposed by Andres or
(b) not have such a hard limit, as originally proposed by me.  I think
everyone who is not working on this code has installed an automatic
filter rule to send the original thread to /dev/null, so I'm changing
the subject line in the hopes of getting some of those people to pay
attention.  If it doesn't work, at least the concerns will be
memorialized in case it comes up later.

I originally proposed (b) because if undo application is failing for
some reason, it seemed better not to bring the whole system to a
screeching halt, but rather just cause incremental performance
degradation or something. However, Andres has pointed out this may
postpone remedial action and let bugs go undetected, so it might not
actually be better. Also, some of the things we need to do, like
computing the oldest XID whose undo has not been retired, are
tricky/expensive if you don't have complete data in memory, and you
can't be sure of having complete data in shared memory without a hard
limit.  No matter which way we go, failures to apply undo had better
be really rare, or we're going to be in really serious trouble, so
we're only concerned here with how to handle what is hopefully a very
rare scenario, not the common case.

I want to consider three specific scenarios that could cause undo
application to fail, and then offer some observations about them.

Scenario #1:

1. Sessions 1..N each begin a transaction and write a bunch of data to
a table (at least enough that we'll try to perform undo in the
background).
2. Session N+1 begins a transaction and tries to lock the same table.
It blocks.
3. Sessions 1..N abort, successfully pushing the undo work into the background.
4. Session N+1 now acquires the lock and sits on it.
5. Optionally, repeat steps 1-4 K times, each time for a different table.

Scenario #2:

1. Any number of sessions begin a transaction, write a bunch of data,
and then abort.
2. They all try to perform undo in the foreground.
3. They get killed using pg_terminate_backend().

Scenario #3:

1. A transaction begins, does some work, and then aborts.
2. When undo processing occurs, 1% of such transactions fail during
undo apply because of a bug in the table AM.
3. When undo processing retries after a failure, it fails again
because the bug is triggered by something about the contents of the
undo record, rather than by, say, concurrency.

In scenario one, the problem is mostly self-correcting. When we decide
that we've got too many things queued up for background processing,
and start to force undo processing to happen in the foreground, it
will start succeeding, because the foreground process will have
retained the lock that it took before writing any data and can
therefore undo those writes without having to wait for the lock.
However, this will do nothing to help the requests that are already in
the background, which will just sit there until they can get the lock.
I think there is a good argument that they should not actually wait
for the lock at all, or should wait only for a certain time period, and then give up and
put the transaction on the error queue for reprocessing at a later
time. Otherwise, we're pinning down undo workers, which could easily
lead to starvation, just as it does for autovacuum. On the whole, this
doesn't sound too bad. We shouldn't be able to fill up the queue with
small transactions, because of the restriction that we only push undo
work into the background when the transaction is big enough, and if we
fill it up with big transactions, then (1) back-pressure will prevent
the problem from crippling the system and (2) eventually the problem
will be self-correcting, because when the transaction in session N+1
ends, the undo will all go through and everything will be fine. The
only real user impact of this scenario is that unrelated work on the
system might notice that large rollbacks are now happening in the
foreground rather than the background, and if that causes a problem,
the DBA can fix it by terminating session N+1. Even if she doesn't,
you shouldn't ever hit the hard cap.
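
To make that back-pressure rule a bit more concrete, here is a minimal
stand-alone C sketch of the sort of decision I have in mind. This is not
actual PostgreSQL code; the function name and the thresholds
(handle_abort, MAX_BACKGROUND_UNDO, BACKGROUND_THRESHOLD) are made up
purely for illustration.

/*
 * Hedged sketch, not real PostgreSQL code: undo for a large aborted
 * transaction is pushed into the background only while the number of
 * transactions pending undo is below some threshold; once the queue is
 * "full", the aborting backend performs the undo itself, in the
 * foreground, while it still holds the locks it took before writing any
 * data.  All names and constants here are invented.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_BACKGROUND_UNDO   64      /* hypothetical queue limit */
#define BACKGROUND_THRESHOLD  8192    /* hypothetical "big enough" size, bytes */

static int pending_undo_xacts = 0;    /* stand-in for a shared-memory counter */

/* Decide how to process undo for an aborting transaction. */
static void
handle_abort(int xid, long undo_bytes)
{
    bool push_to_background =
        undo_bytes >= BACKGROUND_THRESHOLD &&
        pending_undo_xacts < MAX_BACKGROUND_UNDO;

    if (push_to_background)
    {
        /* Queue the request; an undo worker must (re)acquire locks later. */
        pending_undo_xacts++;
        printf("xid %d: %ld bytes of undo pushed to background (%d pending)\n",
               xid, undo_bytes, pending_undo_xacts);
    }
    else
    {
        /*
         * Foreground undo: the backend still holds its locks, so it cannot
         * get stuck waiting behind session N+1's table lock.
         */
        printf("xid %d: %ld bytes of undo applied in foreground\n",
               xid, undo_bytes);
    }
}

int
main(void)
{
    for (int xid = 100; xid < 180; xid++)
        handle_abort(xid, 1024L * 1024L);   /* 1 MB of undo each */
    return 0;
}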

However, if prepared transactions are in use, we could have a variant
of scenario #1 in which each transaction is first prepared, and then
the prepared transaction is rolled back.  Unlike the ordinary case,
this can lead to a nearly-unbounded growth in the number of
transactions that are pending undo, because we don't have a way to
transfer the locks held by the PGPROC used for the prepare to some running
session that could perform the undo.  It's not necessary to have a
large value for max_prepared_transactions; it only has to be greater
than 0, because we can keep reusing the same slots with different
tables.  That is, let N = max_prepared_xacts, and let K be anything at
all; session N+1 can just stay in the same transaction and keep on
taking new locks one at a time until the lock table fills up; not sure
exactly how long that will take, but it's probably a five digit number
of transactions, or maybe six. In this case, we can't force undo into
the foreground, so we can exceed the number of transactions that are
supposed to be backgrounded.  We'll eventually have to just start
refusing new transactions permission to attach to an undo log, and
they'll error out.  Although unpleasant, I don't think that this
scenario is a death sentence for the idea of having a hard cap on the
table size, because if the cap is 100k or so, you shouldn't
really hit it unless you specifically make it your goal to do so.  At
least, not this way. But if you have a lower cap, like 1k, it doesn't
seem crazy to think that you could hit this in a non-artificial
scenario; you just need lots of rolled-back prepared transactions plus
some long-running DDL.
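
For illustration, here is a rough stand-alone C sketch of what refusing
new transactions permission to attach to an undo log could look like once
a hard cap exists. The cap value, the counter, and the function names are
all hypothetical; in the real system the refusal would presumably be an
ERROR raised when the transaction first tries to write undo.

/*
 * Illustration only, not real PostgreSQL code.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_XACTS_PENDING_UNDO 100000   /* hypothetical hard cap */

static int xacts_pending_undo;          /* stand-in for a shared-memory count */

/* Returns true if the transaction may attach to an undo log. */
static bool
attach_to_undo_log(int xid)
{
    if (xacts_pending_undo >= MAX_XACTS_PENDING_UNDO)
    {
        fprintf(stderr,
                "ERROR: cannot write undo for transaction %d: "
                "%d transactions already pending undo\n",
                xid, xacts_pending_undo);
        return false;
    }
    xacts_pending_undo++;
    return true;
}

/* Called when a transaction's undo has been fully applied or discarded. */
static void
undo_retired(void)
{
    xacts_pending_undo--;
}

int
main(void)
{
    /* Pretend rolled-back prepared transactions have filled the queue. */
    xacts_pending_undo = MAX_XACTS_PENDING_UNDO;
    if (!attach_to_undo_log(42))
    {
        undo_retired();                 /* some pending undo completes... */
        (void) attach_to_undo_log(42);  /* ...and the retry succeeds */
    }
    return 0;
}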

We could mitigate the prepared transaction scenario by providing a way
to transfer locks from the prepared transaction to the backend doing
the ROLLBACK PREPARED and then make it try to execute the undo
actions.  I think that would bring this scenario into parity with the
non-prepared case.  We could still try to background large rollbacks,
but if the queue gets too full then ROLLBACK PREPARED would do the
work instead, and, with the hypothetical lock transfer mechanism, that
would dodge the locking issues.
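
Sketched in stand-alone C, with completely made-up structures (nothing
here corresponds to real PostgreSQL data structures), the idea is just to
move the lock entries from the prepared transaction's dummy PGPROC to the
backend running ROLLBACK PREPARED before executing the undo actions:

/*
 * Hedged sketch of the hypothetical lock-transfer mechanism; all names
 * and structures are invented for illustration.
 */
#include <stdio.h>

#define MAX_LOCKS 16

typedef struct ProcStandIn
{
    const char *name;
    int         nlocks;
    int         locks[MAX_LOCKS];   /* stand-in lock identifiers */
} ProcStandIn;

/* Move every lock held by "from" so that "to" holds it instead. */
static void
transfer_locks(ProcStandIn *from, ProcStandIn *to)
{
    for (int i = 0; i < from->nlocks; i++)
        to->locks[to->nlocks++] = from->locks[i];
    from->nlocks = 0;
}

int
main(void)
{
    ProcStandIn prepared = {"prepared xact", 2, {1001, 1002}};
    ProcStandIn backend = {"ROLLBACK PREPARED backend", 0, {0}};

    transfer_locks(&prepared, &backend);

    /*
     * The backend now holds the locks and can apply the undo in the
     * foreground without any risk of blocking behind another session.
     */
    printf("%s holds %d locks\n", backend.name, backend.nlocks);
    return 0;
}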

In scenario #2, the undo work is going to have to be retried in the
background, and perforce that means reacquiring locks that have been
released, and so there is a chance of long lock waits and/or deadlock
that cannot really be avoided. I think there is basically no way at
all to avoid an unbounded accumulation of transactions requiring undo
in this case, just as in the similar case where the cluster is
repeatedly shut down or repeatedly crashes. Eventually, if you have a
hard cap on the number of transactions requiring undo, you're going to
hit it, and have to start refusing new undo-using transactions. As
Thomas pointed out, that might still be better than some other systems
which use undo, where the system doesn't open for any transactions at
all after a restart until all undo is retired, and/or where undo is
never processed in the background. But it's a possible concern. On the
other hand, if you don't have a hard cap, the system may just get
further and further behind until it eventually melts, and that's also
a possible concern.

How plausible is this scenario? For most users, cluster restarts and
crashes are uncommon, so that variant isn't likely to happen unless
something else is going badly wrong. As to the scenario as written,
it's not crazy to think that a DBA might try to kill off sessions that
are sitting there stuck in undo processing for long periods of time, but
that doesn't make it a good idea. Whatever problems it causes are
analogous to the problems you get if you keep killing off autovacuum
processes: the system is trying to make you do the right thing, and if
you fight it, you will have some kind of trouble no matter what design
decisions we make.

In scenario #3, the hard limit is likely to bring things to a
screeching halt pretty quickly; you'll just run out of space in the
in-memory data structures. Otherwise, the problem will not be obvious
unless you're keeping an eye on error messages in your logs; the first
sign of trouble may be that the undo logs fill up the disk. It's not
really clear which is better. There is value in knowing about the
problem sooner (because then you can file a bug report right away and
get a fix sooner) but there is also value in having the system limp
along instead of grinding to a halt (because then you might not be
totally down while you're waiting for that bug fix to become
available).

One other thing that seems worth noting is that we have to consider
what happens after a restart.  After a crash, and depending on exactly
how we design it perhaps also after a non-crash restart, we won't
immediately know how many outstanding transactions need undo; we'll
have to grovel through the undo logs to find out. If we've got a hard
cap, we can't allow new undo-using transactions to start until we
finish that work.  It's possible that, at the moment of the crash, the
maximum number of items had already been pushed into the background,
and every foreground session was busy trying to undo an abort as well.
If so, we're already up against the limit.  We'll have to scan through
all of the undo logs and examine each transaction to get a count of
how many transactions are already in a needs-undo-work state; only
once we have that value do we know whether it's OK to admit new
transactions to using the undo machinery, and how many we can admit.
In typical cases, that won't take long at all, because there won't be
any pending undo work, or not much, and we'll very quickly read the
handful of transaction headers that we need to consult and away we go.
However, if the hard limit is pretty big, and we're pretty close to
it, counting might take a long time. It seems bothersome to have this
interval between when we start accepting transactions and when we can
accept transactions that use undo. Instead of throwing an ERROR, we
can probably just teach the system to wait for the background process
to finish doing the counting; that's what Amit's patch does currently.
Or, we could not even open for connections until the counting has been
completed.
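
As a rough illustration, here is a stand-alone C sketch of that startup
accounting, with the undo logs simulated by an in-memory array. The
structure and function names are invented, and the real scan would of
course read transaction headers from the undo logs on disk rather than
from an array.

/*
 * Hedged sketch, not PostgreSQL code: before any new transaction is
 * allowed to use the undo machinery, walk the undo logs, look at each
 * transaction header, and count the ones still in a needs-undo-work
 * state.  Only once that count is known can the hard cap be enforced.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_XACTS_PENDING_UNDO 100000   /* hypothetical hard cap */

typedef struct UndoXactHeader
{
    unsigned xid;
    bool     undo_applied;   /* has this transaction's undo been retired? */
} UndoXactHeader;

/* Simulated contents of the undo logs found after a crash. */
static const UndoXactHeader undo_logs[] =
{
    {1001, true}, {1002, false}, {1003, false}, {1004, true},
};

/*
 * Count transactions whose undo still needs to be applied.  Cheap when
 * little undo is pending, but potentially slow if we crashed while close
 * to a large limit.
 */
static int
count_xacts_pending_undo(void)
{
    int pending = 0;

    for (size_t i = 0; i < sizeof(undo_logs) / sizeof(undo_logs[0]); i++)
        if (!undo_logs[i].undo_applied)
            pending++;
    return pending;
}

int
main(void)
{
    int pending = count_xacts_pending_undo();

    printf("%d transactions pending undo; %d undo-using transactions may start\n",
           pending, MAX_XACTS_PENDING_UNDO - pending);
    return 0;
}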

When I first thought about this, I was really concerned about the idea
of a hard limit, but the more I think about it the less problematic it
seems. I think in the end it boils down to a question of: when things
break, what behavior would users prefer? You can either have a fairly
quick, hard breakage which will definitely get your attention, or you
can have a long, slow process of gradual degradation that doesn't
actually stop the system until, say, the XIDs stuck in the undo
processing queue become old enough to threaten wraparound, or the disk
fills up.  Which is less evil?

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


