logical decoding bug when mapped relation with toast contents isrewritten repeatedly - Mailing list pgsql-hackers

From Andres Freund
Subject logical decoding bug when mapped relation with toast contents isrewritten repeatedly
Date
Msg-id 20180914021046.oi7dm4ra3ot2g2kt@alap3.anarazel.de
Whole thread Raw
Responses Re: logical decoding bug when mapped relation with toast contents isrewritten repeatedly
Re: logical decoding bug when mapped relation with toast contents isrewritten repeatedly
List pgsql-hackers
Hi,

(Tomas, CCing you because you IIRC mentioned encountered an issue like
this)

I just spent quite a while debugging an issue where running logical
decoding yielded a:
ERROR:  could not map filenode "base/$X/$Y" to relation OID
error.

After discarding like 30 different theories, I have found the cause:

During rewrites (i.e. VACUUM FULL / CLUSTER) of a mapped relation with a
toast table with actual live toasted tuples (pg_proc in my case and
henceforth) heap inserts with the toast data happen into the new toast
relation, triggered by:

static void
raw_heap_insert(RewriteState state, HeapTuple tup)
...
    /*
     * If the new tuple is too big for storage or contains already toasted
     * out-of-line attributes from some other relation, invoke the toaster.
     *
     * Note: below this point, heaptup is the data we actually intend to store
     * into the relation; tup is the caller's original untoasted data.
     */
    if (state->rs_new_rel->rd_rel->relkind == RELKIND_TOASTVALUE)
    {
        /* toast table entries should never be recursively toasted */
        Assert(!HeapTupleHasExternal(tup));
        heaptup = tup;
    }
    else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
        heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
                                         HEAP_INSERT_SKIP_FSM |
                                         (state->rs_use_wal ?
                                          0 : HEAP_INSERT_SKIP_WAL));
    else
        heaptup = tup;


At that point the new toast relation does *NOT* appear to be a system
catalog, it's just appears as an "independent" table.  Therefore we do
not trigger, in heap_insert():

/*
 * RelationIsLogicallyLogged
 *        True if we need to log enough information to extract the data from the
 *        WAL stream.
 *
 * We don't log information for unlogged tables (since they don't WAL log
 * anyway) and for system tables (their content is hard to make sense of, and
 * it would complicate decoding slightly for little gain). Note that we *do*
 * log information for user defined catalog tables since they presumably are
 * interesting to the user...
 */
#define RelationIsLogicallyLogged(relation) \
    (XLogLogicalInfoActive() && \
     RelationNeedsWAL(relation) && \
     !IsCatalogRelation(relation))

        /*
         * For logical decoding, we need the tuple even if we're doing a full
         * page write, so make sure it's included even if we take a full-page
         * image. (XXX We could alternatively store a pointer into the FPW).
         */
        if (RelationIsLogicallyLogged(relation))
        {
            xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
            bufflags |= REGBUF_KEEP_DATA;
        }

i.e. the inserted toast tuple will be marked as
XLH_INSERT_CONTAINS_NEW_TUPLE - which it shouldn't, because it's a
system table. Which we currently do not allow do be logically decoded.

That normally ends up being harmless, because ReorderBufferCommit() has the
following check:
                    if (!RelationIsLogicallyLogged(relation))
                        goto change_done;

but to reach that check, we first have to map the relfilenode from the
WAL to the corresponding OID:
                    reloid = RelidByRelfilenode(change->data.tp.relnode.spcNode,
                                                change->data.tp.relnode.relNode);

That works correctly if there's only one rewrite - the relmapper
contains the data for the new toast table.  But if there's been *two*
consecutive rewrites, the relmapper *does not* contain the intermediary
relfilenode of pg_proc.  There's no such problem for non-mapped tables,
because historic snapshots allow us to access the relevant data, but the
relmapper isn't mvcc.

Therefore the catalog-rewrite escape hatch of:
                    /*
                     * Catalog tuple without data, emitted while catalog was
                     * in the process of being rewritten.
                     */
                    if (reloid == InvalidOid &&
                        change->data.tp.newtuple == NULL &&
                        change->data.tp.oldtuple == NULL)
                        goto change_done;
does not trigger and we run into:
                    else if (reloid == InvalidOid)
                        elog(ERROR, "could not map filenode \"%s\" to relation OID",
                             relpathperm(change->data.tp.relnode,
                                         MAIN_FORKNUM));


commenting out this error / converting it into a warning makes this case
harmless, but could obviously be problematic in other scenarios.


I suspect the proper fix would be to have a new HEAP_INSERT_NO_LOGICAL
option, and specify that in raw_heap_insert() iff
RelationIsLogicallyLogged(state->rs_old_rel) or something like that.

Attached is a *prototype* patch of that approach.  Without the code
level changes the addition to test_decoding's rewrite.sql trigger the
bug, after it they're fixed.


The only reason the scenario I was debugging hit this was that there was
a cluster wide VACUUM FULL a couple times a day, and replication was
several hours behind due to slow network / receiving side.


Now I'm having a beer outside.


Greetings,

Andres Freund

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: stat() on Windows might cause error if target file is largerthan 4GB
Next
From: Michael Paquier
Date:
Subject: Re: Cache lookup errors with functions manipulation object addresses