Assertion failure in SnapBuildInitialSnapshot() - Mailing list pgsql-hackers

From Amit Kapila
Subject Assertion failure in SnapBuildInitialSnapshot()
Date
Msg-id CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
Whole thread Raw
Responses Re: Assertion failure in SnapBuildInitialSnapshot()
List pgsql-hackers
Hi,

Thomas has reported this failure in an email [1] and shared the
following links offlist with me:
https://cirrus-ci.com/task/5311549010083840

https://api.cirrus-ci.com/v1/artifact/task/5311549010083840/testrun/build/testrun/subscription/100_bugs/log/100_bugs_twoways.log

https://api.cirrus-ci.com/v1/artifact/task/5311549010083840/crashlog/crashlog-postgres.exe_1c40_2022-11-08_00-20-28-110.txt

The call stack is as follows:
00000063`4edff670 00007ff7`1922fcdf     postgres!ExceptionalCondition(
char * conditionName = 0x00007ff7`198f8050
"TransactionIdPrecedesOrEquals(safeXid, snap->xmin)",
char * fileName = 0x00007ff7`198f8020
"../src/backend/replication/logical/snapbuild.c",
int lineNumber = 0n600)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
00000063`4edff6b0 00007ff7`192106df     postgres!SnapBuildInitialSnapshot(
struct SnapBuild * builder = 0x00000251`5b95bce8)+0x20f
[c:\cirrus\src\backend\replication\logical\snapbuild.c @ 600]
00000063`4edff730 00007ff7`1920d9f6     postgres!CreateReplicationSlot(
struct CreateReplicationSlotCmd * cmd = 0x00000251`5b94d828)+0x40f
[c:\cirrus\src\backend\replication\walsender.c @ 1152]
00000063`4edff870 00007ff7`192bc9c4     postgres!exec_replication_command(
char * cmd_string = 0x00000251`5b94ac68 "CREATE_REPLICATION_SLOT
"pg_16400_sync_16392_7163433409941550636" LOGICAL pgoutput (SNAPSHOT
'use')")+0x4a6 [c:\cirrus\src\backend\replication\walsender.c @ 1804]


As per my investigation based on the above logs, the failed test is
due to the following command in 100_bugs.pl:
$node_twoways->safe_psql('d2',
    "CREATE SUBSCRIPTION testsub CONNECTION \$\$"
  . $node_twoways->connstr('d1')
  . "\$\$ PUBLICATION testpub WITH (create_slot=false, "
  . "slot_name='testslot')");

It failed while creating the table sync slot.

The failure happens because the xmin computed by the snap builder is
lesser than what is computed by GetOldestSafeDecodingTransactionId()
during initial snapshot creation for the tablesync slot by
SnapBuildInitialSnapshot.

To investigate, I tried to study how the values of "safeXid" and
"snap->xmin" are computed in SnapBuildInitialSnapshot(). There appear
to be four places in the code where we assign value to xmin
(builder-xmin) during the snapshot building process and then we assign
the same to snap->xmin. Those places are: (a) Two places in
SnapBuildFindSnapshot(), (b) One place in SnapBuildRestore(), and (c)
One place in SnapBuildProcessRunningXacts()

Seeing the LOGS, it appears to me that we find a consistent point from
the below code in SnapBuildFindSnapshot() and the following line
assigns builder->xmin.

...
if (running->oldestRunningXid == running->nextXid)
{
...
builder->xmin = running->nextXid;

The reason is we only see "logical decoding found consistent point at
..." in LOGs. If SnapBuildFindSnapshot() has to go through the entire
machinery of snapshot building then, we should have seen "logical
decoding found initial starting point at ..." and similar other LOGs.
The builder->xmin can't be assigned from any other place in (b) or
(c). From (c), it can't be assigned because we are building a full
snapshot here which won't restore any serialized snapshot. It can't be
assigned from (b) because while creating a slot we stop as soon as we
find the consistent point, see
DecodingContextFindStartpoint()->DecodingContextReady()

In the above code snippet, the running->nextXid in the above code
snippet has been assigned from ShmemVariableCache->nextXid in
GetRunningTransactionData().

The safeXid computed from GetOldestSafeDecodingTransactionId() uses
ShmemVariableCache->nextXid as the starting point and keeps the slot's
xmin as the safe Xid limit.

It seems to me due to SnapBuilder's initial_xmin_horizon, we won't set
(SnapBuilder's) xmin lesser than slot's effective_xmin computed in
CreateInitDecodingContext(). See SnapBuildFindSnapshot(). So, ideally,
SnapBuildInitialSnapshot() should never compute safeXid which is based
on the minimum of all slot's effective_xmin to be greater than
SnapBuilder's xmin (or snapshot's xmin). In short, the code as written
seems correct to me.

I have also tried to analyze if any recent commit (7f13ac8123) could
cause this problem but can't think of any reason because the changes
are related to the restart of decoding and the failed test is related
to creating a slot the very first time.

I don't have any good ideas on how to proceed with this. Any thoughts
on this would be helpful?

Note: I have briefly discussed this issue with Sawada-San and
Kuroda-San, so kept them in Cc.

[1] -
https://www.postgresql.org/message-id/CA%2BhUKG%2BA_LyW%3DFAOi2ebA9Vr-1%3Desu%2BeBSm0dsVhU%3DEgfpipkg%40mail.gmail.com

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: Printing backtrace of postgres processes
Next
From: sirisha chamarthi
Date:
Subject: Re: Reviving lost replication slots