Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS
Date
Msg-id 20141215154929.GI5023@alap3.anarazel.de
Whole thread Raw
In response to Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Something is broken in logical decoding with CLOBBER_CACHE_ALWAYS  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2014-12-15 10:15:30 -0500, Tom Lane wrote:
> The CLOBBER_CACHE_ALWAYS buildfarm members occasionally fail in
> contrib/test_decoding with
>
> TRAP: FailedAssertion("!(((bool)((relation)->rd_refcnt == 0)))", File: "relcache.c", Line: 1981)
>
> for example here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-14%2005%3A50%3A09
> and here:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=markhor&dt=2014-12-13%2021%3A26%3A05

Yes, I've been trying to debug that. I've finally managed to reproduce
locally. Unfortunately it's not directly reproducible on my laptop...

A fair amount of tinkering later I've found out that it's not actually
CLOBBER_CACHE_ALWAYS itself that triggers the problem but catchup
interrupts. It triggers with CLOBBER_CACHE_ALWAYS because that happens
to make parts of the code slow enough to not reach a ordinary
AcceptInvalidationMessages() in time.

The problem is that in the added regression test there can be a
situation where there's a relcache entry that's *not* currently visible
but still referenced. Consider:

SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
CREATE TABLE somechange(id serial primary key);
INSERT INTO changeresult   SELECT data FROM pg_logical_slot_get_changes(...);

While get_changes stores it data into a tuplestore there can be a moment
where 'changeresult' has a refcnt > 0 due to the INSERT, but is
invisible because we're using a historic snapshot from when changeresult
doesn't exist.

Without catchup invalidations and/or an outside reference to a relation
that's normally not a problem because it won't get reloaded from the
caches at an inappropriate time while invisible. Until a few weeks ago
there was no regression test covering that case which is why these
crashes are only there now.

It triggers via:
RelationCacheInvalidate() -> RelationClearRelation(relation, true) ->       /* Build temporary entry, but don't link it
intohashtable */       newrel = RelationBuildDesc(save_relid, false);       if (newrel == NULL)       {           /*
Shouldonly get here if relation was deleted */           RelationCacheDelete(relation);
RelationDestroyRelation(relation,false);           elog(ERROR, "relation %u deleted while still in use", save_relid);
   }
 

That path is only hit while refcnt > 0

RelationDestroyRelation() has   Assert(RelationHasReferenceCountZero(relation));

So that path doesn't really work... Without assertions we'd "just" get a
transient ERROR. I think that path should generally loose the
RelationDestroyRelation() - if it's referenced that's surely not the
right thing to do.  I'm not actually convinced logical decoding is the
only way that assert can be reached.

Since logical decoding happens inside a subtransaction (when called via
SQL) and invalidate caches one relatively straightforward way to fix
this would be to add something roughly akin to:    Assert(!relation->rd_isvalid)    /*     * This should only happen
whenwe're doing logical decoding where     * it can happen when [above].     */    if (HistoricSnapshotActive())
return;

There's no chance that such a entry could survive too long in an invalid
state (minus preexisting bugs independent of decoding) since we hit the
same paths that subtransaction abort hits.

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Join push-down support for foreign tables
Next
From: Bruce Momjian
Date:
Subject: Re: Fractions in GUC variables