Thread: Some pgq table rewrite incompatibility with logical decoding?

Some pgq table rewrite incompatibility with logical decoding?

From
Jeremy Finzel
Date:
I am hoping someone here can shed some light on this issue - I apologize if this isn't the right place to ask, but I'm almost certain some of you were involved in pgq's development and might be able to answer this.

We are actually running 2 replication technologies on a few of our dbs, skytools and pglogical.  Although we are moving towards only using logical decoding-based replication, right now we have both for different purposes.

A table rewrite seems to be happening on table pgq.event_58_1 (it has happened twice now), and it ends up in the decoding stream, resulting in the following error:

ERROR,XX000,"could not map filenode ""base/16418/1173394526"" to relation OID"

In retracing what happened, we discovered that this relfilenode was rewritten.  But somehow, it is ending up in the logical decoding stream and is "undecodable".  This is pretty disastrous because the only real way to fix it is to advance the replication slot and lose data.

The only obvious table rewrite I can find in the pgq codebase is a truncate in pgq.maint_rotate_tables.sql.  But there isn't anything surprising there.  If anyone has any ideas as to what might cause this so that we could somehow mitigate the possibility of this happening again until we move off pgq, that would be much appreciated.
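For context, the part of a truncate that matters here is that the table gets a brand new relfilenode. A minimal sketch on a throwaway table (nothing pgq-specific, the table name is made up):

  CREATE TABLE scratch_queue (ev_id bigint, ev_data text);

  -- note the current physical file backing the table
  SELECT relfilenode FROM pg_class WHERE relname = 'scratch_queue';

  TRUNCATE scratch_queue;

  -- the relfilenode has changed; the old file is unlinked once the
  -- truncating transaction commits
  SELECT relfilenode FROM pg_class WHERE relname = 'scratch_queue';

So every queue rotation leaves behind a filenode that no longer maps to anything, which is why I suspect the rotation even though the truncate itself looks unremarkable.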

Thanks,
Jeremy


Re: Some pgq table rewrite incompatibility with logical decoding?

From
Andres Freund
Date:
Hi,

On 2018-06-25 10:37:18 -0500, Jeremy Finzel wrote:
> I am hoping someone here can shed some light on this issue - I apologize if
> this isn't the right place to ask, but I'm almost certain some of you were
> involved in pgq's development and might be able to answer this.
> 
> We are actually running 2 replication technologies on a few of our dbs,
> skytools and pglogical.  Although we are moving towards only using logical
> decoding-based replication, right now we have both for different purposes.
> 
> A table rewrite seems to be happening on table pgq.event_58_1 (it has
> happened twice now), and it ends up in the decoding stream, resulting in the
> following error:
> 
> ERROR,XX000,"could not map filenode ""base/16418/1173394526"" to relation
> OID"
> 
> In retracing what happened, we discovered that this relfilenode was
> rewritten.  But somehow, it is ending up in the logical decoding stream and
> is "undecodable".  This is pretty disastrous because the only real way to
> fix it is to advance the replication slot and lose data.
> 
> The only obvious table rewrite I can find in the pgq codebase is a truncate
> in pgq.maint_rotate_tables.sql.  But there isn't anything surprising
> there.  If anyone has any ideas as to what might cause this so that we
> could somehow mitigate the possibility of this happening again until we
> move off pgq, that would be much appreciated.

I suspect the issue might be that pgq does some updates to catalog
tables. Is that indeed the case?

Greetings,

Andres Freund


Re: Some pgq table rewrite incompatibility with logical decoding?

From
Jeremy Finzel
Date:


On Mon, Jun 25, 2018 at 12:41 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-06-25 10:37:18 -0500, Jeremy Finzel wrote:
> I am hoping someone here can shed some light on this issue - I apologize if
> this isn't the right place to ask, but I'm almost certain some of you were
> involved in pgq's development and might be able to answer this.
>
> We are actually running 2 replication technologies on a few of our dbs,
> skytools and pglogical.  Although we are moving towards only using logical
> decoding-based replication, right now we have both for different purposes.
>
> A table rewrite seems to be happening on table pgq.event_58_1 (it has
> happened twice now), and it ends up in the decoding stream, resulting in the
> following error:
>
> ERROR,XX000,"could not map filenode ""base/16418/1173394526"" to relation
> OID"
>
> In retracing what happened, we discovered that this relfilenode was
> rewritten.  But somehow, it is ending up in the logical decoding stream and
> is "undecodable".  This is pretty disastrous because the only real way to
> fix it is to advance the replication slot and lose data.
>
> The only obvious table rewrite I can find in the pgq codebase is a truncate
> in pgq.maint_rotate_tables.sql.  But there isn't anything surprising
> there.  If anyone has any ideas as to what might cause this so that we
> could somehow mitigate the possibility of this happening again until we
> move off pgq, that would be much appreciated.

I suspect the issue might be that pgq does some updates to catalog
tables. Is that indeed the case?

I also suspected this.  The only case of this I found is that it does deletes and inserts on pg_autovacuum.  A quick look did not turn up anything else, but I'm not sure if I'm missing something in some of the C code.

Thanks,
Jeremy

Re: Some pgq table rewrite incompatibility with logical decoding?

From
Tomas Vondra
Date:
On 06/25/2018 07:48 PM, Jeremy Finzel wrote:
> 
> 
> On Mon, Jun 25, 2018 at 12:41 PM, Andres Freund <andres@anarazel.de 
> <mailto:andres@anarazel.de>> wrote:
> 
>     Hi,
> 
>     On 2018-06-25 10:37:18 -0500, Jeremy Finzel wrote:
>     > I am hoping someone here can shed some light on this issue - I apologize if
>     > this isn't the right place to ask, but I'm almost certain some of you were
>     > involved in pgq's development and might be able to answer this.
>     > 
>     > We are actually running 2 replication technologies on a few of our dbs,
>     > skytools and pglogical.  Although we are moving towards only using logical
>     > decoding-based replication, right now we have both for different purposes.
>     > 
>     > A table rewrite seems to be happening on table pgq.event_58_1 (it has
>     > happened twice now), and it ends up in the decoding stream, resulting in the
>     > following error:
>     > 
>     > ERROR,XX000,"could not map filenode ""base/16418/1173394526"" to relation
>     > OID"
>     > 
>     > In retracing what happened, we discovered that this relfilenode was
>     > rewritten.  But somehow, it is ending up in the logical decoding stream and
>     > is "undecodable".  This is pretty disastrous because the only real way to
>     > fix it is to advance the replication slot and lose data.
>     > 
>     > The only obvious table rewrite I can find in the pgq codebase is a truncate
>     > in pgq.maint_rotate_tables.sql.  But there isn't anything surprising
>     > there.  If anyone has any ideas as to what might cause this so that we
>     > could somehow mitigate the possibility of this happening again until we
>     > move off pgq, that would be much appreciated.
> 
>     I suspect the issue might be that pgq does some updates to catalog
>     tables. Is that indeed the case?
> 
> 
> I also suspected this.  The only case of this I found is that it does
> deletes and inserts on pg_autovacuum.  A quick look did not turn up
> anything else, but I'm not sure if I'm missing something in some of the
> C code.
> 

I don't think that's true, for two reasons.

Firstly, I don't think pgq updates catalogs directly; it simply 
truncates the queue tables when rotating them (which updates the 
relfilenode in pg_class, of course).

Secondly, we're occasionally seeing this on systems that do not use pgq, 
but that do VACUUM FULL on custom "queue" tables. The symptoms are 
exactly the same ("ERROR: could not map filenode"). It's pretty damn 
rare and we don't have direct access to the systems, so investigation is 
difficult :-( Our current hypothesis is that it's somewhat related to 
subtransactions (because of triggers with exception blocks).

Jeremy, are you able to reproduce the issue locally, using pgq? That 
would be very valuable.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Some pgq table rewrite incompatibility with logical decoding?

From
Jeremy Finzel
Date:
I don't think that's true, for two reasons.

Firstly, I don't think pgq updates catalogs directly; it simply truncates the queue tables when rotating them (which updates the relfilenode in pg_class, of course).

Secondly, we're occasionally seeing this on systems that do not use pgq, but that do VACUUM FULL on custom "queue" tables. The symptoms are exactly the same ("ERROR: could not map filenode"). It's pretty damn rare and we don't have direct access to the systems, so investigation is difficult :-( Our current hypothesis is that it's somewhat related to subtransactions (because of triggers with exception blocks).

Jeremy, are you able to reproduce the issue locally, using pgq? That would be very valuable.

We have tried but have been unable to reproduce it.  If we do encounter it again, we plan to report back and see if we can do some deep debugging.

Thanks,
Jeremy 

Re: Some pgq table rewrite incompatibility with logical decoding?

From
Jeremy Finzel
Date:
Jeremy, are you able to reproduce the issue locally, using pgq? That would be very valuable.

Tomas et al:

We have hit this error again, and we plan to snapshot the database so as to be able to do whatever troubleshooting we can.  If someone could provide me guidance as to what exactly you would like me to do, please let me know.  I am able to provide an xlog dump and also debugging information upon request.

This is actually a different database system that also uses skytools, and the exact same table (pgq.event_58_1) is again the cause of the relfilenode error.  I did a point-in-time recovery to a point after this relfilenode appears using pg_xlogdump, and verified this was the table that appeared, then disappeared.
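For reference, a filenode from the error message can be checked against the current catalog with queries like these (generic, nothing specific to our setup; the number is the one from the log below):

  -- no rows back means the filenode no longer belongs to any relation
  SELECT oid, relname
    FROM pg_class
   WHERE relfilenode = 2800559918;

  -- same check via the helper function (first argument 0 = the
  -- database's default tablespace); it returns NULL when nothing matches
  SELECT pg_filenode_relation(0, 2800559918);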

Here is the exact issue again I am having:

Provider logs:
2018-08-28 08:29:30.334 CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,2,"authentication",2018-08-28 08:29:30 CDT,289/78900643,0,LOG,00000,"replication connection authorized: user=foo_logical SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)",,,,,,,,,""
2018-08-28 08:29:30.366 CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,3,"idle",2018-08-28 08:29:30 CDT,289/0,0,LOG,00000,"starting logical decoding for slot ""pgl_foo_prod_providerb97b25d_foo336ddc1""","streaming transactions committing after C0A5/EEBBD4E8, reading WAL from C0A5/EC723AC8",,,,,,,,"bar_foo_foo_a"
2018-08-28 08:29:30.366 CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,4,"idle",2018-08-28 08:29:30 CDT,289/0,0,LOG,00000,"logical decoding found consistent point at C0A5/EC723AC8","There are no running transactions.",,,,,,,,"bar_foo_foo_a"
2018-08-28 08:29:30.448 CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,5,"idle",2018-08-28 08:29:30 CDT,289/0,0,ERROR,XX000,"could not map filenode ""base/16418/2800559918"" to relation OID",,,,,,,,,"bar_foo_foo_a"
2018-08-28 08:29:30.463 CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,6,"idle",2018-08-28 08:29:30 CDT,289/0,0,LOG,08006,"could not receive data from client: Connection reset by peer",,,,,,,,,"bar_foo_foo_a"
2018-08-28 08:29:30.463 CDT,"foo_logical","foo_prod",70236,"0.0.0.0:48314",5b854e3a.1125c,7,"idle",2018-08-28 08:29:30 CDT,,0,LOG,00000,"disconnection: session time: 0:00:00.135 user=foo_logical database=foo_prod host=0.0.0.0 port=48314",,,,,,,,,"bar_foo_foo_a"


Subscriber logs:
2018-08-28 08:32:10.040 CDT,,,71810,,5b854eda.11882,1,,2018-08-28 08:32:10 CDT,7/0,0,LOG,00000,"starting apply for subscription bar_foo_foo_a",,,,,,,,,"pglogical apply 16418:862837778"
2018-08-28 08:32:10.175 CDT,,,71810,,5b854eda.11882,2,,2018-08-28 08:32:10 CDT,7/242099,123886525,ERROR,XX000,"data stream ended",,,,,,,,,"pglogical apply 16418:862837778"
2018-08-28 08:32:10.175 CDT,,,71810,,5b854eda.11882,3,,2018-08-28 08:32:10 CDT,7/0,0,LOG,00000,"apply worker [71810] at slot 1 generation 366 exiting with error",,,,,,,,,"pglogical apply 16418:862837778"
2018-08-28 08:32:10.179 CDT,,,27173,,5b61d336.6a25,373,,2018-08-01 10:35:18 CDT,,0,LOG,00000,"worker process: pglogical apply 16418:862837778 (PID 71810) exited with exit code 1",,,,,,,,,""


Thanks,
Jeremy




Re: Some pgq table rewrite incompatibility with logical decoding?

From
Jeremy Finzel
Date:
We have hit this error again, and we plan to snapshot the database so as to be able to do whatever troubleshooting we can.

I am happy to report that we were able to get replication working again by running snapshots of the systems in question on servers running the latest point release 9.6.10, and replication simply works and skips over these previously erroring relfilenodes.  So whatever fixes were made in this point release to logical decoding seem to have fixed the issue.

Thanks,
Jeremy

Re: Some pgq table rewrite incompatibility with logical decoding?

From
Tomas Vondra
Date:
Hi Jeremy,

On 08/28/2018 10:46 PM, Jeremy Finzel wrote:
>     We have hit this error again, and we plan to snapshot the database
>     so as to be able to do whatever troubleshooting we can.
> 
> 
> I am happy to report that we were able to get replication working again
> by running snapshots of the systems in question on servers running the
> latest point release 9.6.10, and replication simply works and skips over
> these previously erroring relfilenodes.  So whatever fixes were made in
> this point release to logical decoding seem to have fixed the issue.
> 

Interesting.

So you were running 9.6.9 before, it triggered the issue (and was not
able to recover). You took a filesystem snapshot, started a 9.6.10 on
the snapshot, and it recovered without hitting the issue?

I quickly went through the commits in 9.6 branch between 9.6.9 and
9.6.10, looking for stuff that might be related, and these three commits
seem possibly related (usually because of invalidations, vacuum, ...):

  6a46aba1cd6dd7c5af5d52111a8157808cbc5e10
  Fix bugs in vacuum of shared rels, by keeping their relcache entries
  current.

  da10d6a8a94eec016fa072d007bced9159a28d39
  Fix "base" snapshot handling in logical decoding

  0a60a291c9a5b8ecdf44cbbfecc4504e3c21ef49
  Add table relcache invalidation to index builds.

But it's hard to say if/which of those commits did the trick, without
more information.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Some pgq table rewrite incompatibility with logical decoding?

From
Tomas Vondra
Date:
On 08/28/2018 07:41 PM, Jeremy Finzel wrote:
>         Jeremy, are you able to reproduce the issue locally, using pgq?
>         That would be very valuable.
> 
> 
> Tomas et al:
> 
> We have hit this error again, and we plan to snapshot the database so as
> to be able to do whatever troubleshooting we can.  If someone could provide
> me guidance as to what exactly you would like me to do, please let me
> know.  I am able to provide an xlog dump and also debugging information
> upon request.
> 
> This is actually a different database system that also uses skytools,
> and the exact same table (pgq.event_58_1) is again the cause of the
> relfilenode error.  I did a point-in-time recovery to a point after this
> relfilenode appears using pg_xlogdump, and verified this was the table
> that appeared, then disappeared.
> 

Great!

Can you attach to the decoding process using gdb, set a breakpoint
on the elog(ERROR) at reorderbuffer.c:1599, and find out at which LSN /
record it fails?

https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/replication/logical/reorderbuffer.c#L1599

If it fails too fast, making it difficult to attach gdb before the
crash, adding the LSN to the log message might be easier.

Once we have the LSN, it would be useful to see the pg_xlogdump
before/around that position.

Another interesting piece of information would be to know the contents
of the relmapper cache (which essentially means stepping through
RelationMapFilenodeToOid or something like that).
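If it helps to narrow down where to point pg_xlogdump, the slot's current positions can be read directly (a plain catalog query; the slot name is taken from the provider log you posted):

  SELECT slot_name, restart_lsn, confirmed_flush_lsn
    FROM pg_replication_slots
   WHERE slot_name = 'pgl_foo_prod_providerb97b25d_foo336ddc1';

The "reading WAL from ..." value in your provider log should roughly correspond to restart_lsn, so the failing record should be somewhere between that and the current insert position.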

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Some pgq table rewrite incompatibility with logical decoding?

From
Tomas Vondra
Date:
Hi,

On 08/29/2018 12:01 AM, Tomas Vondra wrote:
> On 08/28/2018 07:41 PM, Jeremy Finzel wrote:
>>         Jeremy, are you able to reproduce the issue locally, using pgq?
>>         That would be very valuable.
>>
>>
>> Tomas et al:
>>
>> We have hit this error again, and we plan to snapshot the database so as
>> to be able to do whatever troubleshooting we can.  If someone could provide
>> me guidance as to what exactly you would like me to do, please let me
>> know.  I am able to provide an xlog dump and also debugging information
>> upon request.
>>
>> This is actually a different database system that also uses skytools,
>> and the exact same table (pgq.event_58_1) is again the cause of the
>> relfilenode error.  I did a point-in-time recovery to a point after this
>> relfilenode appears using pg_xlogdump, and verified this was the table
>> that appeared, then disappeared.
>>
> 
> Great!
> 
> Can you attach to the decoding process using gdb, set a breakpoint
> on the elog(ERROR) at reorderbuffer.c:1599, and find out at which LSN /
> record it fails?
> 
> https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/replication/logical/reorderbuffer.c#L1599
> 
> If it fails too fast, making it difficult to attach gdb before the
> crash, adding the LSN to the log message might be easier.
> 
> Once we have the LSN, it would be useful to see the pg_xlogdump
> before/around that position.
> 
> Another interesting piece of information would be to know the contents
> of the relmapper cache (which essentially means stepping through
> RelationMapFilenodeToOid or something like that).
> 

Have you managed to get the backtrace, or investigate where exactly it
fails (which LSN etc.)? We've managed to get a backtrace for "our"
failure, and it'd be interesting to compare those.

Attached is a subset of pg_waldump, for the two relevant transactions.
2554301859 is the transaction doing VACUUM FULL on the user table (so
essentially the pgq table), and 2554301862 (with a single 2554301862
subxact, likely due to exception handled in plpgsql function) is the
transaction that fails during decoding - on the very first WAL record
after the 2554301859 commit.

In reality there are records from additional transactions intermixed,
but those are irrelevant I believe. It's 2554301862 that fails, because
it commits first (and thus gets decoded first).
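To make the subxact part concrete: any plpgsql block with an EXCEPTION clause is executed inside an internal subtransaction, so a trigger along these lines (entirely made-up names, just to illustrate the shape) produces the toplevel-plus-one-subxact pattern seen above:

  -- hypothetical objects, not taken from pgq
  CREATE TABLE my_queue_table (payload text);
  CREATE TABLE my_source_table (id int, data text);

  CREATE FUNCTION enqueue_row() RETURNS trigger AS $$
  BEGIN
      -- the EXCEPTION clause makes this whole block run inside an
      -- internal subtransaction, which gets its own xid once it writes
      INSERT INTO my_queue_table (payload) VALUES (NEW::text);
      RETURN NEW;
  EXCEPTION WHEN others THEN
      RAISE NOTICE 'enqueue failed: %', SQLERRM;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER enqueue_on_insert
      AFTER INSERT ON my_source_table
      FOR EACH ROW EXECUTE PROCEDURE enqueue_row();

  -- a single insert now appears in the WAL as a toplevel transaction
  -- with one subtransaction for the trigger's protected block
  INSERT INTO my_source_table VALUES (1, 'x');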

At first I thought this might be related to the "could not read block 3"
issue discussed in another thread, but if you say the upgrade to 9.6.10
fixes this, that seems unlikely (because that issue is still there).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Some pgq table rewrite incompatibility with logical decoding?

From
Jeremy Finzel
Date:

Interesting.

So you were running 9.6.9 before, it triggered the issue (and was not
able to recover). You took a filesystem snapshot, started a 9.6.10 on
the snapshot, and it recovered without hitting the issue?

I am reposting this to the list and not only to Tomas. Tomas, I can’t promise just yet to delve into this because, given that the patch fixes the issue, it’s obviously much lower priority for our team. Are you hoping for me to confirm the exact scenario in which the 9.6.10 patch fixes the bug?

Actually, there were more things changed than that so I'm not positive it was the last patch:

BEFORE:
Provider - 9.6.8-1.pgdg16.04+1, pglogical 2.1.1-1.xenial+1
Subscriber - 9.6.9-2.pgdg16.04+1, pglogical 2.1.1-1.xenial+1

AFTER:
Provider - 9.6.10-1.pgdg16.04+1, pglogical 2.2.0-3.xenial+1
Subscriber - 9.6.10-1.pgdg16.04+1, pglogical 2.2.0-3.xenial+1
 
I quickly went through the commits in 9.6 branch between 9.6.9 and
9.6.10, looking for stuff that might be related, and these three commits
seem possibly related (usually because of invalidations, vacuum, ...):

  6a46aba1cd6dd7c5af5d52111a8157808cbc5e10
  Fix bugs in vacuum of shared rels, by keeping their relcache entries
  current.

  da10d6a8a94eec016fa072d007bced9159a28d39
  Fix "base" snapshot handling in logical decoding

  0a60a291c9a5b8ecdf44cbbfecc4504e3c21ef49
  Add table relcache invalidation to index builds.

But it's hard to say if/which of those commits did the trick, without
more information.

Let me know if that info gives you any more insight - actually 2 point version jumps for provider, 1 for subscriber.

Thanks,
Jeremy 

Re: Some pgq table rewrite incompatibility with logical decoding?

From
Petr Jelinek
Date:
On 01/09/18 18:25, Jeremy Finzel wrote:
> 
>     Interesting.
> 
>     So you were running 9.6.9 before, it triggered the issue (and was not
>     able to recover). You took a filesystem snapshot, started a 9.6.10 on
>     the snapshot, and it recovered without hitting the issue?
> 
> 
> I am reposting this to the list and not only to Tomas. Tomas, I can’t
> promise just yet to delve into this because, given that the patch fixes
> the issue, it’s obviously much lower priority for our team. Are you hoping
> for me to confirm the exact scenario in which the 9.6.10 patch fixes the
> bug?
> 
> Actually, there were more things changed than that so I'm not positive
> it was the last patch:
> 
> BEFORE:
> Provider - 9.6.8-1.pgdg16.04+1, pglogical 2.1.1-1.xenial+1
> Subscriber - 9.6.9-2.pgdg16.04+1, pglogical 2.1.1-1.xenial+1
> 
> AFTER:
> Provider - 9.6.10-1.pgdg16.04+1, pglogical 2.2.0-3.xenial+1
> Subscriber - 9.6.10-1.pgdg16.04+1, pglogical 2.2.0-3.xenial+1
>  

I finally managed to reproduce this and it's indeed fixed in 9.6.10
by commit da10d6a. It was caused by the issue with subtransactions
and SnapBuilder snapshots described there.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services