Re: should there be a hard-limit on the number of transactions pending undo? - Mailing list pgsql-hackers

From: Thomas Munro
Subject: Re: should there be a hard-limit on the number of transactions pending undo?
Date:
Msg-id: CA+hUKGLyNm_5mtWkyt8h4m02Yo7n9n2SG8mdZRPZw47Vq3ikZQ@mail.gmail.com
On Sat, Jul 20, 2019 at 11:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jul 19, 2019 at 4:14 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I don't think this matters here at all. As long as there's only DML
> > involved, there won't be any lock conflicts anyway - everybody's
> > taking RowExclusiveLock or less, and it's all fine. If you update a
> > row in zheap, abort, and then try to update again before the rollback
> > happens, we'll do a page-at-a-time rollback in the foreground, and
> > proceed with the update; when we get around to applying the undo,
> > we'll notice that page has already been handled and skip the undo
> > records that pertain to it.  To get the kinds of problems I'm on about
> > here, somebody's got to be taking some more serious locks.
>
> If I'm not mistaken, you're tacitly assuming that you'll always be
> using zheap, or something sufficiently similar to zheap. It'll
> probably never be possible to UNDO changes to something like a GIN
> index on a zheap table, because you can never do that with sensible
> concurrency/deadlock behavior.
>
> I don't necessarily have a problem with that. I don't pretend to
> understand how much of a problem it is. Obviously it partially depends
> on what your ambitions are for this infrastructure. Still, assuming
> that I have it right, ISTM that UNDO/zheap/whatever should explicitly
> own this restriction.

I had a similar thought: you might regret that choice if you were
wanting to implement an AM with lock table-based concurrency control
(meaning that there are lock ordering concerns for row and page locks,
for DML statements, not just DDL).  That seemed a bit too far-fetched
to mention before, but are you saying the same sort of concerns might
come up with indexes that support true undo (as opposed to indexes
that still need VACUUM)?

For comparison, ARIES[1] has no-deadlock rollbacks as a basic property
and reacquires locks during restart before new transactions are allowed
to execute.  In its model, the locks in question can be on things like
rows and pages.  We don't even use our lock table for those (except
for non-blocking SIREAD locks, irrelevant here).  After crash
recovery, if zheap encounters a row with pending rollback from an
aborted transaction, as usual it either needs to read an older version
from an undo log (for reads) or help execute the rollback before
updating (for writes).  That only requires page-at-a-time LWLocks
("latching"), so it's deadlock-free.  The only deadlock risk comes
from the need to acquire heavyweight locks on relations, which
typically only conflict when you run DDL, so yeah, it's tempting to
worry a lot less about those than about the fine-grained lock traffic
from DML statements that DB2 and others have to deal with.
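
To make the zheap behaviour concrete, here is a toy model of it in
plain C (not actual zheap code; all names are hypothetical, and a
pthread mutex stands in for the page LWLock):

#include <pthread.h>
#include <stdbool.h>

typedef struct ToyPage
{
    pthread_mutex_t lwlock;        /* stands in for the page LWLock ("latch") */
    bool            pending_undo;  /* aborted xact's changes still on the page */
    int             current_value; /* latest (aborted) tuple version */
    int             undo_value;    /* prior version, as found in the undo log */
} ToyPage;

/*
 * Fetch a tuple from a page that may carry pending rollback.  Readers
 * follow the undo pointer to the older version; writers apply the
 * page's undo in the foreground and then proceed.  Either way only the
 * page lock is held, so this cannot deadlock.
 */
static int
toy_fetch(ToyPage *page, bool for_update)
{
    int         result;

    pthread_mutex_lock(&page->lwlock);
    if (page->pending_undo)
    {
        if (for_update)
        {
            /* Writer: page-at-a-time foreground rollback.  The deferred
             * undo worker will later see this page was already handled
             * and skip its records. */
            page->current_value = page->undo_value;
            page->pending_undo = false;
        }
        else
        {
            /* Reader: just use the older version from the undo log. */
            result = page->undo_value;
            pthread_mutex_unlock(&page->lwlock);
            return result;
        }
    }
    result = page->current_value;
    pthread_mutex_unlock(&page->lwlock);
    return result;
}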

So, to spell out the two options again:

A.  Rollback can't deadlock.  You have to make sure you reliably hold
locks until rollback is completed (including some tricky new lock
transfer magic), and then reacquire them after recovery before new
transactions are allowed.  You could trivially achieve the restart
part by simply waiting until all rollback is executed before you allow
new transactions, but other systems including DB2 first acquire all
the locks in an earlier scan through the log, then allow new
connections, and then execute the rollback.  Acquiring them before new
transactions are allowed means that they must all fit in the lock
table; there can be no conflicts among them, since they were all
granted simultaneously at the moment you crashed or shut down.
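
For illustration, a toy model of that DB2-style restart sequence (not
PostgreSQL or DB2 code; the lock manager and undo helpers are
hypothetical stubs):

#include <stdio.h>

#define MAX_LOCKS 1024

static int  lock_table[MAX_LOCKS];
static int  nheld = 0;

static void
acquire_lock(int lock_id)
{
    /* Toy stand-in for the real lock manager: always succeeds, because
     * the logged locks were all granted simultaneously at crash time. */
    lock_table[nheld++] = lock_id;
}

static void
allow_new_transactions(void)
{
    printf("accepting new transactions\n");
}

static void
execute_pending_undo(void)
{
    printf("rolling back under %d reacquired locks\n", nheld);
}

static void
restart_recovery(const int *logged_locks, int nlocks)
{
    /* Phase 1: reacquire every lock recorded in the log before anyone
     * else can connect; the only hard requirement is that they fit in
     * the lock table (nlocks <= MAX_LOCKS). */
    for (int i = 0; i < nlocks; i++)
        acquire_lock(logged_locks[i]);

    /* Phase 2: open for business.  New transactions queue behind the
     * reacquired locks rather than deadlocking with rollback. */
    allow_new_transactions();

    /* Phase 3: execute the rollback, then release everything. */
    execute_pending_undo();
    nheld = 0;
}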

B.  Rollback can deadlock or exhaust the lock table, because we release
locks and then reacquire them some arbitrary time later.  There is no
choice but to keep retrying if anything goes wrong; rollback is
theoretically not guaranteed to complete, and you can contrive a
workload that never makes progress.  This amounts to betting that such
problems will be rare enough not to matter, that rollback will
eventually make progress in practice, and that when it stalls it will
be fairly clear what's happening and why.
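
Here's a toy version of option B's retry loop, just to illustrate the
shape of it (hypothetical names throughout; pthread_mutex_trylock
stands in for a heavyweight lock request that can fail):

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t relation_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Apply the undo actions for one aborted transaction.  Because the
 * original session's locks were released at abort, we must reacquire
 * the relation lock here, and that request can fail (deadlock, lock
 * table full); all we can do is back off and retry, with no
 * theoretical guarantee of ever completing.
 */
static void
execute_undo_with_retry(void (*apply_undo)(void))
{
    for (unsigned attempt = 1;; attempt++)
    {
        if (pthread_mutex_trylock(&relation_lock) == 0)
        {
            apply_undo();
            pthread_mutex_unlock(&relation_lock);
            return;
        }
        /* Couldn't get the lock: sleep and try again later.  A workload
         * that always holds a conflicting lock starves this forever. */
        usleep(1000u * attempt);
    }
}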

I might as well put the quote marks on now:  "Perhaps we could
implement A later."

[1] https://cs.stanford.edu/people/chrismre/cs345/rl/aries.pdf


--
Thomas Munro
https://enterprisedb.com


