Re: Some pgq table rewrite incompatibility with logical decoding? - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Some pgq table rewrite incompatibility with logical decoding?
Msg-id 19de930f-044f-2ee8-44b5-503c47450a35@2ndquadrant.com
In response to Re: Some pgq table rewrite incompatibility with logical decoding?  (Jeremy Finzel <finzelj@gmail.com>)
Responses Re: Some pgq table rewrite incompatibility with logical decoding?
List pgsql-hackers
On 06/25/2018 07:48 PM, Jeremy Finzel wrote:
> 
> 
> On Mon, Jun 25, 2018 at 12:41 PM, Andres Freund <andres@anarazel.de> wrote:
> 
>     Hi,
> 
>     On 2018-06-25 10:37:18 -0500, Jeremy Finzel wrote:
>     > I am hoping someone here can shed some light on this issue - I apologize if
>     > this isn't the right place to ask this but I'm almost sure some of you all were
>     > involved in pgq's dev and might be able to answer this.
>     > 
>     > We are actually running 2 replication technologies on a few of our dbs,
>     > skytools and pglogical.  Although we are moving towards only using logical
>     > decoding-based replication, right now we have both for different purposes.
>     > 
>     > There seems to be a table rewrite happening on table pgq.event_58_1 that
>     > has happened twice, and it ends up in the decoding stream, resulting in the
>     > following error:
>     > 
>     > ERROR,XX000,"could not map filenode ""base/16418/1173394526"" to relation
>     > OID"
>     > 
>     > In retracing what happened, we discovered that this relfilenode was
>     > rewritten.  But somehow, it is ending up in the logical decoding stream as
>     > is "undecodable".  This is pretty disastrous because the only way to fix it
>     > really is to advance the replication slot and lose data.
>     > 
>     > The only obvious table rewrite I can find in the pgq codebase is a truncate
>     > in pgq.maint_rotate_tables.sql.  But there isn't anything surprising
>     > there.  If anyone has any ideas as to what might cause this so that we
>     > could somehow mitigate the possibility of this happening again until we
>     > move off pgq, that would be much appreciated.
> 
>     I suspect the issue might be that pgq does some updates to catalog
>     tables. Is that indeed the case?
> 
> 
> I also suspected this.  The only case I found of this is that it is 
> doing deletes and inserts to pg_autovacuum.  I could not find anything 
> quickly otherwise but I'm not sure if I'm missing something in some of 
> the C code.
> 

I don't think that's true, for two reasons.

Firstly, I don't think pgq updates catalogs directly; it simply
truncates the queue tables when rotating them (which updates the
relfilenode in pg_class, of course).

Secondly, we're occasionally seeing this on systems that do not use pgq, 
but that do VACUUM FULL on custom "queue" tables. The symptoms are 
exactly the same ("ERROR: could not map filenode"). It's pretty damn 
rare and we don't have direct access to the systems, so investigation is 
difficult :-( Our current hypothesis is that it's somewhat related to 
subtransactions (because of triggers with exception blocks).

Jeremy, are you able to reproduce the issue locally, using pgq? That 
would be very valuable.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

