Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6
Date
Msg-id 20190425181648.lgffj43btacmxszu@alap3.anarazel.de
In response to Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6
List pgsql-hackers
Hi,

On 2019-04-25 12:29:16 -0400, Tom Lane wrote:
> I wrote:
> > The problem here is that RelationSetNewRelfilenode is aggressively
> > changing the index's relcache entry before it's written out the
> > updated tuple, so that the tuple update tries to make an index
> > entry in the new storage which isn't filled yet.  I think we can
> > fix it by *not* doing that, but leaving it to the relcache inval
> > during the CommandCounterIncrement call to update the relcache
> > entry.  However, it looks like that will take some API refactoring,
> > because the storage-creation functions expect to get the new
> > relfilenode out of the relcache entry, and they'll have to be
> > changed to not do it that way.
> 
> So looking at that, it seems like the table_relation_set_new_filenode
> API is pretty darn ill-designed.  It assumes that it's passed an
> already-entirely-valid relcache entry, but it also supposes that
> it can pass back information that needs to go into the relation's
> pg_class entry.  One or the other side of that has to give, unless
> you want to doom everything to updating pg_class twice.

I'm not super happy about it either - but I think that's somewhat of an
outgrowth of how this worked before.  I mean, there are two differences:

1) Previously RelationCreateStorage() was called unconditionally; now it's:

        case RELKIND_INDEX:
        case RELKIND_SEQUENCE:
            RelationCreateStorage(relation->rd_node, persistence);
            RelationOpenSmgr(relation);
            break;

        case RELKIND_RELATION:
        case RELKIND_TOASTVALUE:
        case RELKIND_MATVIEW:
            table_relation_set_new_filenode(relation, persistence,
                                            &freezeXid, &minmulti);
            break;
    }

That seems pretty obviously necessary.
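
For comparison, the pre-12 path did roughly the following - quoting from
memory, so the details may differ slightly:

    /* pre-12, roughly: storage created the same way for every relkind */
    RelationCreateStorage(newrnode, persistence);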


2) Previously AddNewRelationTuple() determined the initial horizons for
table-like things:
    /* Initialize relfrozenxid and relminmxid */
    if (relkind == RELKIND_RELATION ||
        relkind == RELKIND_MATVIEW ||
        relkind == RELKIND_TOASTVALUE)
    {
        /*
         * Initialize to the minimum XID that could put tuples in the table.
         * We know that no xacts older than RecentXmin are still running, so
         * that will do.
         */
        new_rel_reltup->relfrozenxid = RecentXmin;

        /*
         * Similarly, initialize the minimum Multixact to the first value that
         * could possibly be stored in tuples in the table.  Running
         * transactions could reuse values from their local cache, so we are
         * careful to consider all currently running multis.
         *
         * XXX this could be refined further, but is it worth the hassle?
         */
        new_rel_reltup->relminmxid = GetOldestMultiXactId();
    }

and inserted that.  Now it's determined earlier, below heap_create(), and
passed as an argument to AddNewRelationTuple().
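
Concretely, on master those horizons now come out of the heapam callback,
roughly like this (from memory, so names and details may be slightly off):

    static void
    heapam_relation_set_new_filenode(Relation rel, char persistence,
                                     TransactionId *freezeXid,
                                     MultiXactId *minmulti)
    {
        /* same horizons AddNewRelationTuple used to compute */
        *freezeXid = RecentXmin;
        *minmulti = GetOldestMultiXactId();

        RelationCreateStorage(rel->rd_node, persistence);
        /* ... smgr / init fork handling elided ... */
    }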

And similarly, the caller of RelationSetNewRelfilenode() determined the new
horizons, but they also were just written into the relcache entry and then
updated:

11:
    classform->relfrozenxid = freezeXid;
    classform->relminmxid = minmulti;
    classform->relpersistence = persistence;

    CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);
master:
    classform->relfrozenxid = freezeXid;
    classform->relminmxid = minmulti;
    classform->relpersistence = persistence;

    CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);


I'm not quite sure why the current situation is any worse?


Perhaps that's because I don't quite understand what you mean by "It
assumes that it's passed an already-entirely-valid relcache entry". What
do you mean by that / where does it assume that?  I guess we could warn
a bit more about the underlying tuple not necessarily existing yet in
the callback's docs, but other than that?  Previously heap.c also was
dealing with a relcache entry without a backing pg_class entry but with
existing storage, no?
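
For context, the rough ordering in heap.c (heap_create_with_catalog()) is,
from memory - the relcache entry and the storage exist before the pg_class
row does:

    /* rough ordering, from memory: */
    new_rel_desc = heap_create(relname, ...);   /* relcache entry built, storage created */
    ...
    AddNewRelationTuple(pg_class_desc, new_rel_desc, ...);  /* pg_class row inserted only here */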


> I'm not really sure what's the point of giving the tableam control
> of relfrozenxid+relminmxid at all

Well, because not every AM is going to need those. It'd make very little
sense to e.g. set them for an undo-based design like zheap's - there is no
need to freeze ever. The need for each page to be rewritten multiple times
(original write, hint bit sets, freezing for heap) imo is one of the
major reasons people are working on alternative AMs.  That seems to
fundamentally require AMs having control over relfrozenxid.
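
E.g. for an undo-based AM the callback could plausibly just report that it
never needs freezing - a purely hypothetical sketch, not actual zheap code:

    static void
    undoam_relation_set_new_filenode(Relation rel, char persistence,
                                     TransactionId *freezeXid,
                                     MultiXactId *minmulti)
    {
        /* hypothetical: an undo-based AM never needs to freeze */
        *freezeXid = InvalidTransactionId;
        *minmulti = InvalidMultiXactId;

        RelationCreateStorage(rel->rd_node, persistence);
    }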


> and I notice that index_create for one is just Asserting that constant
> values are returned.

Well, that's not going to call into tableam at all?  Those asserts
previously were in RelationSetNewRelfilenode() itself:

    /* Indexes, sequences must have Invalid frozenxid; other rels must not */
    Assert((relation->rd_rel->relkind == RELKIND_INDEX ||
            relation->rd_rel->relkind == RELKIND_SEQUENCE) ?
           freezeXid == InvalidTransactionId :
           TransactionIdIsNormal(freezeXid));

but given that e.g. not every tableam is going to have those values, that
assertion can't stay there.

Greetings,

Andres Freund


