Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker - Mailing list pgsql-bugs

From Tomas Vondra
Subject Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker
Msg-id 20191121165752.dffge6bh756xlfdg@development
In response to Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker  (Ondřej Jirman <ienieghapheoghaiwida@xff.cz>)
Responses Re: BUG #16129: Segfault in tts_virtual_materialize in logicalreplication worker
List pgsql-bugs
On Thu, Nov 21, 2019 at 05:15:02PM +0100, Ondřej Jirman wrote:
>On Thu, Nov 21, 2019 at 04:57:07PM +0100, Ondřej Jirman wrote:
>>
>> Maybe it has something to do with my upgrade method. I
>> dumped/restored the replica with pg_dumpall, and then just proceeded
>> to enable subscription and refresh publication with (copy_data=false)
>> for all my subscriptions.
>
>OTOH, it may not. There are 2 more databases replicated the same way
>from the same database cluster, and they don't crash the replica
>server, and continue replicating. One of the other databases also
>has bytea columns in some of the tables.
>
>It really just seems related to the machine restart (a regular one)
>that I did on the primary. Minutes later the replica crashed, and it
>has kept crashing ever since whenever it connects to the primary for
>the hometv database.
>

Hmmm. A restart of the primary certainly should not cause any such
damage, that'd be a bug too. And it'd be a bit strange that it correctly
sends the data and it crashes the replica. How exactly did you restart
the primary? What mode - smart/fast/immediate?

>So maybe something's wrong with the replica database (maybe because the
>connection got killed by the walsender at an unfortunate time), rather
>than the original database, because I can replicate the original DB
>afresh into a new copy just fine and other databases continue
>replicating just fine if I disable the crashing subscription.
>

Possibly, but what would be the damaged bit? The only thing I can think
of is the replication slot info (i.e. snapshot), and I know there were
some timing issues in the serialization.

How far is the failing change from the restart point of the slot (the
restart_lsn visible in pg_replication_slots)? If there are many changes
between the two, a corrupted snapshot is unlikely.
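
To make "how far" concrete: an LSN is printed as two hexadecimal numbers
separated by a slash, the high and low 32 bits of a 64-bit byte position
in the WAL. The sketch below converts two such values and subtracts them.
The helper names and the sample LSNs are hypothetical, chosen only for
illustration; the real values would come from restart_lsn in
pg_replication_slots and from the position of the change that crashes
the worker.

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to a 64-bit byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after is the low 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_distance(later: str, earlier: str) -> int:
    """Number of WAL bytes between two LSNs (negative if 'later' is older)."""
    return parse_lsn(later) - parse_lsn(earlier)


if __name__ == "__main__":
    # Hypothetical values, not taken from the reported system.
    restart_lsn = "0/3000000"
    failing_change_lsn = "0/3000060"
    print(lsn_distance(failing_change_lsn, restart_lsn), "bytes of WAL")
```

On the server itself the same subtraction can be done with
pg_wal_lsn_diff(), which exists in PostgreSQL 10 and later.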

There are a lot of moving parts in this - you're replicating between
major versions, and from ARM to x86. All of that should work, of course,
but maybe there's a bug somewhere. So it might take time to investigate
and fix. Thanks for your patience ;-)

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


