Re: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 1173, ... - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 1173, ...
Date
Msg-id 67b46634-6916-4e6a-15c4-fad43f2fa0c1@enterprisedb.com
Whole thread Raw
In response to Re: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 1173, ...  (Andres Freund <andres@anarazel.de>)
Responses Re: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 1173, ...
Re: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 1173, ...
List pgsql-hackers

On 2/13/22 01:28, Andres Freund wrote:
> Hi, 
> 
> On February 12, 2022 3:57:28 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote:
>> Hi,
>>
>> I don't know what happened here, but I my little script that scrapes
>> for build farm assertion failures hasn't seen this one before now.
>> It's on a 15 year old macOS release and compiled with 15 year old GCC
>> version, for some added mystery.
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2022-02-12%2022%3A07%3A21
>>
>> 2022-02-13 00:09:52.859 CET [92423:92] pg_regress/replorigin
>> STATEMENT:  SELECT data FROM
>> pg_logical_slot_get_changes('regression_slot', NULL, NULL,
>> 'include-xids', '0', 'skip-empty-xacts', '1', 'include-sequences',
>> '0');
>> TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
>> "reorderbuffer.c", Line: 1173, PID: 92423)
>> 0   postgres                            0x00592dd0 ExceptionalCondition + 160\\0
>> 1   postgres                            0x00391c40
>> ReorderBufferGetRelids + 304\\0
>> 2   postgres                            0x00394794
>> ReorderBufferAllocate + 1044\\0
>> 3   postgres                            0x003834dc heap_decode + 76\\0
>> 4   postgres                            0x003829c8
>> LogicalDecodingProcessRecord + 168\\0
>> 5   postgres                            0x0038a1d4
>> pg_logical_emit_message_text + 1876\\0
>> 6   postgres                            0x00241810
>> ExecMakeTableFunctionResult + 576\\0
>> 7   postgres                            0x002576c4
>> ExecInitFunctionScan + 1444\\0
>> 8   postgres                            0x00242140 ExecScan + 896\\0
>> 9   postgres                            0x002397cc standard_ExecutorRun + 540\\0
>> 10  postgres                            0x00427940
>> EnsurePortalSnapshotExists + 832\\0
>> 11  postgres                            0x00428e84 PortalRun + 644\\0
>> 12  postgres                            0x004252e0 pg_plan_queries + 3232\\0
>> 13  postgres                            0x00426270 PostgresMain + 3424\\0
>> 14  postgres                            0x0036b9ec PostmasterMain + 8060\\0
>> 15  postgres                            0x0029f24c main + 1356\\0
>> 16  postgres                            0x00002524 start + 68\\0
> 
> I think that may be the bug that Tomas fixed a short while ago. He
> said the skip empty xacts stuff didn't yet work for sequences. That's
> only likely to be hit in slower machines...
> 

I'm not sure how could this be related to the issue fixed by b779d7d8fd.
That's merely about handling of empty xacts in the output plugin, it has
little (nothing) to do with reorderbuffer - the failures was simply
about (not) printing BEGIN/COMMIT for empty xacts.

So why would it be triggering an Assert in reorderbuffer? Also, there
was a successful run with the test_decoding changes 80901b3291 (and also
with sequence decoding infrastructure committed a couple days ago) ...

Do we have access to the machine? If yes, it'd be interesting to run the
tests before the fix, and see how often it fails. Also, printing the LSN
would be interesting, so that we can correlate it to WAL.

BTW it's probably just a coincidence, but the cfbot does show a strange
failure on the macOS machine [1], as if we did not wait for all changes
to be applied by replica. I thought it's because of the issue with
tracking LSN for sequences [2], but this is with a reworked test that
does not do just nextval(). So I'm wondering if these two things may be
related ... both things happened on macOS only.


[1] https://cirrus-ci.com/task/6299472241098752

[2]
https://www.postgresql.org/message-id/712cad46-a9c8-1389-aef8-faf0203c9be9%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: John Naylor
Date:
Subject: Re: Consistently use "startup process"/"WAL sender" instead of "Startup process"/"WAL Sender" in comments and docs.
Next
From: Tomas Vondra
Date:
Subject: Re: logical decoding and replication of sequences