Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From Robert Haas
Subject Re: POC: Cleaning up orphaned files using undo logs
Date
Msg-id CA+Tgmob_Zf8rsDoHY8U=1=yc49CsAZ862DnM920QDm_gsAdqJA@mail.gmail.com
Whole thread Raw
In response to Re: POC: Cleaning up orphaned files using undo logs  (Amit Khandekar <amitdkhan.pg@gmail.com>)
Responses Re: POC: Cleaning up orphaned files using undo logs  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Tue, Jul 23, 2019 at 10:42 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I think, even though there might not be a correctness issue with the
> current code as it stands, we should still use a local variable.
> Updating MyUndoWorker is a big side-effect, which the caller is not
> supposed to be aware of, because all that function should do is just
> get the slot info.

Absolutely right.  It's just routine good practice to avoid using
global variables when there is no compelling reason to do otherwise.
The reason you state here is one of several good ones.

> Yes, I also think that the function would error out only because of
> can't-happen cases, like "too many locks taken" or "out of binary heap
> slots" or "out of memory" (this last one is not such a can't happen
> case). These cases happen probably due to some bugs, I suppose. But I
> was wondering : Generally when the code errors out with such
> can't-happen elog() calls, worst thing that happens is that the
> transaction gets aborted. Whereas, in this case, the worst thing that
> could happen is : the undo action would never get executed, which
> means selects for this tuple will keep on accessing the undo log ?
> This does not sound like any data consistency issue, so we should be
> fine after all ?

I don't think so.  Every XID present in undo has to be something we
can look up in CLOG to figure out which transactions are aborted and
which transactions are committed, so that we know which transactions
need undo.  If we forget to undo the transaction, we can't discard it,
which means we can't advance the CLOG transaction horizon, which means
we'll eventually start failing to assign XIDs, leading to a refusal of
all write transactions.  Oops.

More generally, it's not OK for the generic undo layer to make
assumptions about whether the operations performed by the undo
handlers are essential or not.  We don't want to impose a design
constraint the undo can only be used for things that are not actually
critical, because that will make it hard to write AMs that use it.
And there's no reason to live with such a design constraint anyway,
because, as noted above, CLOG truncation requires it.

More generally still, some can't-happen situations should be checked
via Assert() and others via elog(). For example, consider some code
that looks up a syscache tuple and pulls data from the returned tuple.
If the code that handles DDL is written in such a way that the tuple
should always exist, then this is a can't-happen situation, but
generally the code checks this via elog(), not Assert(), because it
could also happen due to the catalog contents being corrupted.  If
Assert() were used, the checks would not run in production builds, and
a corrupt catalog would lead to a seg fault. An elog() is much
friendlier. As a general principle, when a certain thing ought to
always be true, but it being true depends on a whole lot of
assumptions elsewhere in the code, and especially if it also depends
on assumptions like "the database is not corrupted," I think elog() is
preferable.  Assert() is better for things that are more localized and
that really can't go wrong for any reason other than a bug.  In this
case, I think I would tend towards elog(PANIC), but it's arguable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: minor fixes after pgindent prototype fixes
Next
From: Robert Haas
Date:
Subject: Re: should there be a hard-limit on the number of transactionspending undo?