Thread: BUG #17284: Assert failed in SerialAdd() when the summarize_serial mode is engaged

BUG #17284: Assert failed in SerialAdd() when the summarize_serial mode is engaged

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17284
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 14.1
Operating system:   Ubuntu 20.04
Description:

When running concurrent installchecks (x100) for src/test/isolation, I've
observed the following server crash:
Core was generated by `postgres: postgres regress30 127.0.0.1(55840) SEL'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fd1b69d5535 in __GI_abort () at abort.c:79
#2  0x000055d441cc3a02 in ExceptionalCondition (
    conditionName=conditionName@entry=0x55d441e34cd2
"TransactionIdIsValid(tailXid)", 
    errorType=errorType@entry=0x55d441d1c01d "FailedAssertion",
fileName=fileName@entry=0x55d441e34bc9 "predicate.c", 
    lineNumber=lineNumber@entry=928) at assert.c:67
#3  0x000055d441b8e0aa in SerialAdd
(minConflictCommitSeqNo=18446744073709551615, xid=95552353) at
predicate.c:928
#4  SummarizeOldestCommittedSxact () at predicate.c:1540
#5  GetSerializableTransactionSnapshotInt (snapshot=0x55d441fad980
<CurrentSnapshotData>, 
    sourcevxid=sourcevxid@entry=0x0, sourcepid=sourcepid@entry=-1) at
predicate.c:1816
#6  0x000055d441b90bc6 in GetSerializableTransactionSnapshot
(snapshot=<optimized out>) at predicate.c:1710
#7  0x000055d441d036f5 in GetTransactionSnapshot () at snapmgr.c:347
#8  GetTransactionSnapshot () at snapmgr.c:306
#9  0x000055d441b9da4d in exec_simple_query (query_string=0x55d443545990
"select id from D2;") at postgres.c:1128
#10 0x000055d441b9f0a9 in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x55d4435793d8, dbname=<optimized out>, 
    username=<optimized out>) at postgres.c:4339
#11 0x000055d441b1236f in BackendRun (port=0x55d443574c30,
port=0x55d443574c30) at postmaster.c:4526
#12 BackendStartup (port=0x55d443574c30) at postmaster.c:4210
#13 ServerLoop () at postmaster.c:1739
#14 0x000055d441b1336c in PostmasterMain (argc=3, argv=0x55d443540330) at
postmaster.c:1412
#15 0x000055d441824f5b in main (argc=3, argv=0x55d443540330) at main.c:210

test two-ids                      ... FAILED (test process exited with exit
code 1)     4193 ms

It can be easily reproduced on the build made with 'CPPFLAGS="-O0
-DTEST_SUMMARIZE_SERIAL"' by the following script (based on the simplified
script by Thomas Munro from bug#17116):
concurrency=2
output=/tmp/junk
rm -rf $output
printf "$(printf "test: two-ids\n%.0s" `seq 100`)"
>/tmp/isolation_schedule
for c in `seq $concurrency`; do
mkdir -p $output/results_${c}
(EXTRA_REGRESS_OPTS="--dbname=regress_${c} --outputdir=$output/results_${c}
--schedule=/tmp/isolation_schedule" make -C src/test/isolation installcheck)
\
        > $output/out.${c}.log 2>&1 &
done
wait
coredumpctl --no-pager

("fsync=off" facilitates the bug reproduction for me.)

Reproduced on REL_13_STABLE..master.


On Sun, Nov 14, 2021 at 1:35 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
> It can be easily reproduced on the build made with 'CPPFLAGS="-O0
> -DTEST_SUMMARIZE_SERIAL"' by the following script (based on the simplified
> script by Thomas Munro from bug#17116):

I assume from the separate bug report that this fails even if you
apply the patches from that other thread?

https://www.postgresql.org/message-id/flat/17116-d6ca217acc180e30%40postgresql.org



Re: BUG #17284: Assert failed in SerialAdd() when the summarize_serial mode is engaged

From
Alexander Lakhin
Date:
Hello Thomas,
13.11.2021 21:23, Thomas Munro wrote:
> On Sun, Nov 14, 2021 at 1:35 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> It can be easily reproduced on the build made with 'CPPFLAGS="-O0
>> -DTEST_SUMMARIZE_SERIAL"' by the following script (based on the simplified
>> script by Thomas Munro from bug#17116):
> I assume from the separate bug report that this fails even if you
> apply the patches from that other thread?
>
> https://www.postgresql.org/message-id/flat/17116-d6ca217acc180e30%40postgresql.org
Yes, applying patches
v4-0001-Fix-assert-failures-in-parallel-SERIALIZABLE-READ and
v4-0002-Fix-race-in-SERIALIZABLE-READ-ONLY from that thread doesn't
affect this assert (tested on the master branch).

Best regards,
Alexander




Re: BUG #17284: Assert failed in SerialAdd() when the summarize_serial mode is engaged

From
Alexander Lakhin
Date:
Hello hackers,
13.11.2021 15:00, PG Bug reporting form wrote:
> When running concurrent installchecks (x100) for src/test/isolation, I've
> observed the following server crash:
> Core was generated by `postgres: postgres regress30 127.0.0.1(55840) SEL'.
> Program terminated with signal SIGABRT, Aborted.
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> 50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x00007fd1b69d5535 in __GI_abort () at abort.c:79
> #2  0x000055d441cc3a02 in ExceptionalCondition (
>     conditionName=conditionName@entry=0x55d441e34cd2
> "TransactionIdIsValid(tailXid)", 
>     errorType=errorType@entry=0x55d441d1c01d "FailedAssertion",
> fileName=fileName@entry=0x55d441e34bc9 "predicate.c", 
>     lineNumber=lineNumber@entry=928) at assert.c:67
> #3  0x000055d441b8e0aa in SerialAdd
> (minConflictCommitSeqNo=18446744073709551615, xid=95552353) at
> predicate.c:928

The issue is reproduced (on REL_14_STABLE) with the following command
sequence (can be replayed with pgreplay):

2021-12-29 18:00:15.000 MSK|postgres|regression|61cc77ff.18af00|LOG: 
statement:
     create table D1 (id int not null);
     create table D2 (id int not null);
     insert into D1 values (1);
     insert into D2 values (1);
2021-12-29 18:00:15.100 MSK|postgres|regression|61cc77ff.18af00|LOG: 
disconnection: session time: 0:00:01.129 user=postgres
database=regression host=127.0.0.1 port=46730
2021-12-29 18:00:15.200 MSK|postgres|regression|61cc77ff.18af01|LOG: 
statement: begin isolation level serializable;
2021-12-29 18:00:15.300 MSK|postgres|regression|61cc77ff.18af02|LOG: 
statement: begin isolation level serializable;
2021-12-29 18:00:15.400 MSK|postgres|regression|61cc77ff.18af02|LOG: 
statement: update D2 set id = id + 1;
2021-12-29 18:00:15.500 MSK|postgres|regression|61cc77ff.18af02|LOG: 
statement: commit;
2021-12-29 18:00:15.600 MSK|postgres|regression|61cc77ff.18af03|LOG: 
statement: begin isolation level serializable;
2021-12-29 18:00:15.700 MSK|postgres|regression|61cc77ff.18af01|LOG: 
statement: update D1 set id = id + 1;
2021-12-29 18:00:15.800 MSK|postgres|regression|61cc77ff.18af03|LOG: 
statement: select id from D1;
2021-12-29 18:00:15.800 MSK|postgres|regression|61cc77ff.18af01|LOG: 
disconnection: session time: 0:00:01.000 user=postgres
database=regression host=127.0.0.1 port=46731
2021-12-29 18:00:15.900 MSK|postgres|regression|61cc77ff.18af02|LOG: 
statement: select pg_sleep(1);

To ease a reproduction, the following delays can be added to predicate.c:
@@ -1502,6 +1507,7 @@ SummarizeOldestCommittedSxact(void)
 {
        SERIALIZABLEXACT *sxact;
 
+       pg_usleep(1000000);
        LWLockAcquire(SerializableFinishedListLock, LW_EXCLUSIVE);
 
        /*

@@ -3745,6 +3760,7 @@ ClearOldPredicateLocks(void)
        SERIALIZABLEXACT *finishedSxact;
        PREDICATELOCK *predlock;
 
+       pg_usleep(2000000);
        /*
         * Loop through finished transactions. They are in commit order,
so we can
         * stop as soon as we find one that's still interesting.

With these delays and CPPFLAGS="-DTEST_SUMMARIZE_SERIAL" the assertion
failure can be observed just on `make check`.
As I can see, on disconnect of session 18af01 the
ReleasePredicateLocks() function calls SetNewSxactGlobalXmin() where
SerialSetActiveSerXmin(called with PredXact->SxactGlobalXmin ==
InvalidTransactionId) sets serialControl->tailXid to InvalidTransactionId.
Then session 18af03 calls SummarizeOldestCommittedSxact(), that in case
!SHMQueueEmpty(FinishedSerializableTransactions) calls SerialAdd() where
the check TransactionIdIsValid(serialControl->tailXid) fails.
But if ReleasePredicateLocks() in session 18af01 happens to reach
ReleasePredicateLocks(), the FinishedSerializableTransactions queue gets
empty, and it would prevent calling SerialAdd() in session 18af03.
So there is a small window (that probably can be widened by
LWLockAcquire()) between
            SetNewSxactGlobalXmin();
            needToClear = true;
and
    if (needToClear)
        ClearOldPredicateLocks();
where a concurrent transaction can fail in SummarizeOldestCommittedSxact().

Best regards,
Alexander



This fell through the cracks at the time, but just to close the loop on 
this thread:

This was reported again at 
https://www.postgresql.org/message-id/18658-7dab125ec688c70b%40postgresql.org, 
and fixed in commit 1a43de5e0a.

-- 
Heikki Linnakangas
Neon (https://neon.tech)