Re: Assertion failure in SnapBuildInitialSnapshot() - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Assertion failure in SnapBuildInitialSnapshot()
Date
Msg-id 20221115023837.cxbawee6pbtdg5yp@awork3.anarazel.de
Whole thread Raw
In response to Re: Assertion failure in SnapBuildInitialSnapshot()  (Andres Freund <andres@anarazel.de>)
Responses Re: Assertion failure in SnapBuildInitialSnapshot()
Re: Assertion failure in SnapBuildInitialSnapshot()
List pgsql-hackers
Hi,

On 2022-11-14 17:25:31 -0800, Andres Freund wrote:
> Hm, also, shouldn't the patch adding CRS_USE_SNAPSHOT have copied more of
> SnapBuildExportSnapshot()? Why aren't the error checks for
> SnapBuildExportSnapshot() needed? Why don't we need to set XactReadOnly? Which
> transactions are we even in when we import the snapshot (cf.
> SnapBuildExportSnapshot() doing a StartTransactionCommand()).

Most of the checks for that are in CreateReplicationSlot() - but not al,
e.g. XactReadOnly isn't set, nor do we enforce in an obvious place that we
don't already hold a snapshot.

I first thought this might directly explain the problem, due to the
MyProc->xmin assignment in SnapBuildInitialSnapshot() overwriting a value that
could influence the return value for GetOldestSafeDecodingTransactionId(). But
that happens later, and we check that MyProc->xmin is invalid at the start.

But it still seems supicious. This will e.g. influence whether
StartupDecodingContext() sets PROC_IN_LOGICAL_DECODING. Which probably is
fine, but...


Another theory: I dimly remember Thomas mentioning that there's some different
behaviour of xlogreader during shutdown as part of the v15 changes. I don't
quite remember what the scenario leading up to that was. Thomas?


It's certainly interesting that we see stuff like:

2022-11-08 00:20:23.255 GMT [2012][walsender] [pg_16400_sync_16395_7163433409941550636][8/0:0] ERROR:  could not find
recordwhile sending logically-decoded data: missing contrecord at 0/1D3B710
 
2022-11-08 00:20:23.255 GMT [2012][walsender] [pg_16400_sync_16395_7163433409941550636][8/0:0] STATEMENT:
START_REPLICATIONSLOT "pg_16400_sync_16395_7163433409941550636" LOGICAL 0/1D2B650 (proto_version '3', origin 'any',
publication_names'"testpub"')
 
ERROR:  could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710
2022-11-08 00:20:23.255 GMT [248][logical replication worker] ERROR:  error while shutting down streaming COPY: ERROR:
couldnot find record while sending logically-decoded data: missing contrecord at 0/1D3B710
 

It could entirely be caused by postmaster slowly killing processes after the
assertion failure and that that is corrupting shared memory state though. But
it might also be related.


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: List of Bitmapset (was Re: ExecRTCheckPerms() and many prunable partitions)
Next
From: Andres Freund
Date:
Subject: Re: Direct I/O