Re: PANIC: wrong buffer passed to visibilitymap_clear - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: PANIC: wrong buffer passed to visibilitymap_clear
Date:
Msg-id: 20210412204046.p3mh7wwfs44f2qid@alap3.anarazel.de
In response to: Re: PANIC: wrong buffer passed to visibilitymap_clear (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
Hi,

On 2021-04-11 13:55:30 -0400, Tom Lane wrote:
> Either way, it's hard to argue that heap_update hasn't crossed the
> complexity threshold where it's impossible to maintain safely.  We
> need to simplify it.

Yea, I think we're well beyond that point. I can see a few possible
steps to wrangle the existing complexity into an easier-to-understand
shape:

- Rename heapam.c's goto labels - their current names are useless for
  understanding what is happening.

- Move HeapTupleSatisfiesUpdate() call and the related branches
  afterwards into its own function.

- Move "temporarily mark it locked" branch into its own function. It's a
  minimal implementation of tuple locking, so it seems more than
  separate enough.

- Move the "store the new tuple" part into its own function (pretty much
  the critical section).

- Probably worth unifying the exit paths - there's a fair bit of
  duplication by now...
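
Very roughly, and with function names invented purely for illustration
(the real split would obviously need to thread a lot more state
through), the decomposition could end up as:

    /* HeapTupleSatisfiesUpdate() plus the branches reacting to its result */
    static TM_Result
    heap_update_check_old_tuple(Relation relation, Buffer buffer,
                                HeapTuple oldtup, CommandId cid,
                                bool wait, TM_FailureData *tmfd);

    /* the "temporarily mark it locked" branch - a minimal tuple lock */
    static void
    heap_update_lock_old_tuple(Relation relation, Buffer buffer,
                               HeapTuple oldtup, TransactionId xid);

    /* the "store the new tuple" part, i.e. the critical section + WAL */
    static void
    heap_update_store_new_tuple(Relation relation, Buffer buffer,
                                Buffer newbuf, HeapTuple oldtup,
                                HeapTuple newtup);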

Half related:

- I think we might also need to do something about the proliferation of
  bitmaps in heap_update(). We now separately allocate 5 bitmapsets -
  that strikes me as fairly insane.
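
  One conceivable direction (purely illustrative - no such relcache API
  exists today): instead of fetching hot_attrs, key_attrs and id_attrs
  separately and building interesting_attrs on top, the relcache could
  hand back one cached struct, leaving only modified_attrs to be
  computed per call:

      typedef struct RelationUpdateAttrs
      {
          Bitmapset  *hot_attrs;   /* columns used by HOT-relevant indexes */
          Bitmapset  *key_attrs;   /* columns of key indexes */
          Bitmapset  *id_attrs;    /* replica identity columns */
          Bitmapset  *interesting; /* precomputed union of the above */
      } RelationUpdateAttrs;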


However, these steps would not really address the complexity itself,
just make it easier to manage.

ISTM that a lot of the complexity is related to needing to retry (and
avoiding doing so unnecessarily), which in turn is related to avoiding
deadlocks. We actually know how to not need that to the same degree -
the (need_toast || newtupsize > pagefree) path locks the tuple and
afterwards has a lot more freedom. We obviously can't just always do
that, due to the WAL logging overhead.
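
In (heavily simplified) control-flow terms, that existing path is:

    if (need_toast || newtupsize > pagefree)
    {
        /* mark oldtup as locked by our xid - this is what gets WAL logged */
        /* release the buffer's content lock */
        /* toast the tuple and/or find a page with enough free space, at
         * leisure - no concurrent updater can invalidate our work */
        /* re-acquire the lock, re-check, fall through to the update */
    }

(Not actual code, obviously - just the shape of it.) Once the tuple is
marked locked, there's no need for the goto-l2 style retry dance.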

I wonder if we could make that path avoid the WAL logging overhead. We
don't actually need a full-blown tuple lock, potentially even with its
own multixact, here.

The relevant comment (in heap_lock_tuple()) says:
    /*
     * XLOG stuff.  You might think that we don't need an XLOG record because
     * there is no state change worth restoring after a crash.  You would be
     * wrong however: we have just written either a TransactionId or a
     * MultiXactId that may never have been seen on disk before, and we need
     * to make sure that there are XLOG entries covering those ID numbers.
     * Else the same IDs might be re-used after a crash, which would be
     * disastrous if this page made it to disk before the crash.  Essentially
     * we have to enforce the WAL log-before-data rule even in this case.
     * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
     * entries for everything anyway.)
     */

But I don't really think that full-blown tuple-locking WAL logging is
the right solution here.

- The "next xid" concerns are at least as easily solved by WAL logging a
  distinct "highest xid assigned" record when necessary - either by
  having a shared memory variable like "latestLoggedXid", or by having
  end-of-recovery advance nextXid to beyond what ExtendCLOG() extended
  to. That reduces the overhead to at most once per xact (and commonly
  less) or to nothing, respectively. (A rough sketch of the first
  variant follows this list.)

- While there's obviously a good bit of simplicity in ensuring that a
  replica is exactly the same ("Also, in a PITR log-shipping or 2PC
  environment ..."), we don't actually enforce that strictly anyway -
  so I am not sure why it's necessary to pay the price here.
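
For the shared-memory variant, the fast path could be about as cheap as
the following sketch. To be clear, this is entirely hypothetical -
XLogCtl->latestLoggedXid and XLOG_XID_ASSIGNED don't exist, and the
unlocked read would in reality need to be suitably atomic:

    static void
    EnsureXidWALLogged(TransactionId xid)
    {
        /* common case: some earlier record already covers this xid */
        if (TransactionIdPrecedesOrEquals(xid, XLogCtl->latestLoggedXid))
            return;

        /* emit a tiny record whose only purpose is to cover the xid */
        XLogBeginInsert();
        XLogRegisterData((char *) &xid, sizeof(TransactionId));
        (void) XLogInsert(RM_XLOG_ID, XLOG_XID_ASSIGNED);
        /* XLogInsert() would advance latestLoggedXid as a side effect */
    }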

But maybe I'm all wet here, I certainly haven't had enough coffee.

Greetings,

Andres Freund


