Thread: TRAP: FailedAssertion("tabstat->trans == trans", File: "pgstat_relation.c", Line: 508

Hi,

I get this crash running the attached test program. On my slow-disked 
and old desktop it occurs once in 20 or so runs (it is yet another 
installment of an old test that runs pgbench with logical replication).

15devel compiled from d3609dd25.

(The bash deletes stuff, and without my environment it will need some 
tweaking)

Thanks,

Erik Rijkers




Attachment
On Sat, Apr 16, 2022 at 09:37:55AM +0200, Erik Rijkers wrote:
> I get this crash running the attached test program. On my slow-disked and
> old desktop it occurs once in 20 or so runs (it is yet another installment
> of an old test that runs pgbench with logical replication).
>
> 15devel compiled from d3609dd25.
>
> (The bash deletes stuff, and without my environment it will need some
> tweaking)

Thanks for the report, Erik.  This one is new, likely related to the
move of the stats to shared memory.  I have added an open item.
--
Michael

Attachment
Hi,

On 2022-04-16 09:37:55 +0200, Erik Rijkers wrote:
> I get this crash running the attached test program. On my slow-disked and
> old desktop it occurs once in 20 or so runs (it is yet another installment
> of an old test that runs pgbench with logical replication).
>
> 15devel compiled from d3609dd25.
> 
> (The bash deletes stuff, and without my environment it will need some
> tweaking)

Any chance for a backtrace? I'll otherwise try to adjust the script, but ...

Greetings,

Andres Freund



At Sat, 16 Apr 2022 11:23:23 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2022-04-16 09:37:55 +0200, Erik Rijkers wrote:
> > I get this crash running the attached test program. On my slow-disked and
> > old desktop it occurs once in 20 or so runs (it is yet another installment
> > of an old test that runs pgbench with logical replication).
> >
> > 15devel compiled from d3609dd25.
> > 
> > (The bash deletes stuff, and without my environment it will need some
> > tweaking)
> 
> Any chance for a backtrace? I'll otherwise try to adjust the script, but ...

FWIW, the script keep succussfully running more than 140 times for me.
(on master, Cent8) And I haven't find a hypothesis for the cause of
the symptom.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Op 18-04-2022 om 09:13 schreef Kyotaro Horiguchi:
> At Sat, 16 Apr 2022 11:23:23 -0700, Andres Freund <andres@anarazel.de> wrote in
>> Hi,
>>
>> On 2022-04-16 09:37:55 +0200, Erik Rijkers wrote:
>>> I get this crash running the attached test program. On my slow-disked and
>>> old desktop it occurs once in 20 or so runs (it is yet another installment
>>> of an old test that runs pgbench with logical replication).
>>>
>>> 15devel compiled from d3609dd25.
>>>
>>> (The bash deletes stuff, and without my environment it will need some
>>> tweaking)
>>
>> Any chance for a backtrace? I'll otherwise try to adjust the script, but ...
> 
> FWIW, the script keep succussfully running more than 140 times for me.
> (on master, Cent8) And I haven't find a hypothesis for the cause of
> the symptom.

Hm.  Just now I've recompiled and retried and after 5 runs got the same 
crash.  Then tried on another machine (also old, I'm afraid),
and built 1a8b11053 and ran the same thing.  That failed on the first 
try, and made core dump from which I extracted:


gdb ~/pg_stuff/pg_installations/pgsql.HEAD/bin/postgres 
core-postgres-6-500-500-8289-1650269886  -ex bt -ex q


GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from 
/home/aardvark/pg_stuff/pg_installations/pgsql.HEAD/bin/postgres...done.
[New LWP 8289]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: logical replication worker for 
subscription 16411                   '.
Program terminated with signal 6, Aborted.
#0  0x000000357d6324f5 in raise () from /lib64/libc.so.6
#0  0x000000357d6324f5 in raise () from /lib64/libc.so.6
#1  0x000000357d633cd5 in abort () from /lib64/libc.so.6
#2  0x0000000000973fcb in ExceptionalCondition 
(conditionName=conditionName@entry=0xb20d76 "tabstat->trans == trans", 
errorType=errorType@entry=0x9c7c2b "FailedAssertion",
     fileName=fileName@entry=0xb20d0b "pgstat_relation.c", 
lineNumber=lineNumber@entry=508) at assert.c:69
#3  0x000000000086b77f in AtEOXact_PgStat_Relations 
(xact_state=xact_state@entry=0x26f0b50, isCommit=isCommit@entry=true) at 
pgstat_relation.c:508
#4  0x000000000086ec0f in AtEOXact_PgStat (isCommit=isCommit@entry=true, 
parallel=parallel@entry=false) at pgstat_xact.c:54
#5  0x00000000005bd2a3 in CommitTransaction () at xact.c:2360
#6  0x00000000005be5d5 in CommitTransactionCommand () at xact.c:3048
#7  0x00000000007ee72b in apply_handle_commit_internal 
(commit_data=commit_data@entry=0x7ffe4606a7a0) at worker.c:1532
#8  0x00000000007efac9 in apply_handle_commit (s=0x7ffe4606a940) at 
worker.c:845
#9  apply_dispatch () at worker.c:2473
#10 0x00000000007f11a7 in LogicalRepApplyLoop (last_received=74454600) 
at worker.c:2757
#11 start_apply () at worker.c:3526
#12 0x00000000007f175f in ApplyWorkerMain () at worker.c:3782
#13 0x00000000007bdba3 in StartBackgroundWorker () at bgworker.c:858
#14 0x00000000007c3241 in do_start_bgworker (rw=<optimized out>) at 
postmaster.c:5802
#15 maybe_start_bgworkers () at postmaster.c:6026
#16 0x00000000007c3b65 in sigusr1_handler 
(postgres_signal_arg=<optimized out>) at postmaster.c:5191
#17 <signal handler called>
#18 0x000000357d6e1683 in __select_nocancel () from /lib64/libc.so.6
#19 0x00000000007c41d6 in ServerLoop () at postmaster.c:1757
#20 0x00000000007c5c3b in PostmasterMain () at postmaster.c:1465
#21 0x0000000000720cfe in main (argc=11, argv=0x2615590) at main.c:202


I'm not sure that helps.


> 
> regards.
> 



At Mon, 18 Apr 2022 10:57:02 +0200, Erikjan Rijkers <er@xs4all.nl> wrote in 
> Hm.  Just now I've recompiled and retried and after 5 runs got the
> same crash.  Then tried on another machine (also old, I'm afraid),
> and built 1a8b11053 and ran the same thing.  That failed on the first
> try, and made core dump from which I extracted:

Thanks!

> gdb ~/pg_stuff/pg_installations/pgsql.HEAD/bin/postgres
> core-postgres-6-500-500-8289-1650269886 -ex bt -ex q
> 
> #2 0x0000000000973fcb in ExceptionalCondition
> #(conditionName=conditionName@entry=0xb20d76 "tabstat->trans == trans",
> #errorType=errorType@entry=0x9c7c2b "FailedAssertion",
>     fileName=fileName@entry=0xb20d0b "pgstat_relation.c",
>     lineNumber=lineNumber@entry=508) at assert.c:69
> #3 0x000000000086b77f in AtEOXact_PgStat_Relations
> #(xact_state=xact_state@entry=0x26f0b50, isCommit=isCommit@entry=true)
> #at pgstat_relation.c:508

Could you read tabstat, *tabstat, trans, *trans here?

> #4 0x000000000086ec0f in AtEOXact_PgStat (isCommit=isCommit@entry=true,
> #parallel=parallel@entry=false) at pgstat_xact.c:54
> #5  0x00000000005bd2a3 in CommitTransaction () at xact.c:2360
> #6  0x00000000005be5d5 in CommitTransactionCommand () at xact.c:3048
> #7 0x00000000007ee72b in apply_handle_commit_internal
> #(commit_data=commit_data@entry=0x7ffe4606a7a0) at worker.c:1532
> #8 0x00000000007efac9 in apply_handle_commit (s=0x7ffe4606a940) at
> #worker.c:845
> #9  apply_dispatch () at worker.c:2473
> #10 0x00000000007f11a7 in LogicalRepApplyLoop (last_received=74454600)
> #at worker.c:2757
> #11 start_apply () at worker.c:3526
> #12 0x00000000007f175f in ApplyWorkerMain () at worker.c:3782

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Op 19-04-2022 om 02:15 schreef Kyotaro Horiguchi:
> At Mon, 18 Apr 2022 10:57:02 +0200, Erikjan Rijkers <er@xs4all.nl> wrote in
>> Hm.  Just now I've recompiled and retried and after 5 runs got the
>> same crash.  Then tried on another machine (also old, I'm afraid),
>> and built 1a8b11053 and ran the same thing.  That failed on the first
>> try, and made core dump from which I extracted:
> 
> Thanks!
> 
>> gdb ~/pg_stuff/pg_installations/pgsql.HEAD/bin/postgres
>> core-postgres-6-500-500-8289-1650269886 -ex bt -ex q
>>
>> #2 0x0000000000973fcb in ExceptionalCondition
>> #(conditionName=conditionName@entry=0xb20d76 "tabstat->trans == trans",
>> #errorType=errorType@entry=0x9c7c2b "FailedAssertion",
>>      fileName=fileName@entry=0xb20d0b "pgstat_relation.c",
>>      lineNumber=lineNumber@entry=508) at assert.c:69
>> #3 0x000000000086b77f in AtEOXact_PgStat_Relations
>> #(xact_state=xact_state@entry=0x26f0b50, isCommit=isCommit@entry=true)
>> #at pgstat_relation.c:508
> 
> Could you read tabstat, *tabstat, trans, *trans here?

To be honest I'm not sure how to, but I gave it a try:

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from 
/home/aardvark/pg_stuff/pg_installations/pgsql.HEAD/bin/postgres...done.
[New LWP 21839]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: logical replication worker for 
subscription 16411                   '.
Program terminated with signal 6, Aborted.
#0  0x000000357d6324f5 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x000000357d6324f5 in raise () from /lib64/libc.so.6
#1  0x000000357d633cd5 in abort () from /lib64/libc.so.6
#2  0x000000000097400b in ExceptionalCondition 
(conditionName=conditionName@entry=0xb20df6 "tabstat->trans == trans", 
errorType=errorType@entry=0x9c7cab "FailedAssertion",
     fileName=fileName@entry=0xb20d8b "pgstat_relation.c", 
lineNumber=lineNumber@entry=508) at assert.c:69
#3  0x000000000086b7bf in AtEOXact_PgStat_Relations 
(xact_state=xact_state@entry=0x2d9ab50, isCommit=isCommit@entry=true) at 
pgstat_relation.c:508
#4  0x000000000086ec4f in AtEOXact_PgStat (isCommit=isCommit@entry=true, 
parallel=parallel@entry=false) at pgstat_xact.c:54
#5  0x00000000005bd2a3 in CommitTransaction () at xact.c:2360
#6  0x00000000005be5d5 in CommitTransactionCommand () at xact.c:3048
#7  0x00000000007ee76b in apply_handle_commit_internal 
(commit_data=commit_data@entry=0x7fffb90aa8e0) at worker.c:1532
#8  0x00000000007efb09 in apply_handle_commit (s=0x7fffb90aaa80) at 
worker.c:845
#9  apply_dispatch () at worker.c:2473
#10 0x00000000007f11e7 in LogicalRepApplyLoop (last_received=74695984) 
at worker.c:2757
#11 start_apply () at worker.c:3526
#12 0x00000000007f179f in ApplyWorkerMain () at worker.c:3782
#13 0x00000000007bdbb3 in StartBackgroundWorker () at bgworker.c:858
#14 0x00000000007c3251 in do_start_bgworker (rw=<optimized out>) at 
postmaster.c:5802
#15 maybe_start_bgworkers () at postmaster.c:6026
#16 0x00000000007c3b75 in sigusr1_handler 
(postgres_signal_arg=<optimized out>) at postmaster.c:5191
#17 <signal handler called>
#18 0x000000357d6e1683 in __select_nocancel () from /lib64/libc.so.6
#19 0x00000000007c41e6 in ServerLoop () at postmaster.c:1757
#20 0x00000000007c5c4b in PostmasterMain () at postmaster.c:1465
#21 0x0000000000720d0e in main (argc=11, argv=0x2cbf590) at main.c:202
(gdb) f 3
#3  0x000000000086b7bf in AtEOXact_PgStat_Relations 
(xact_state=xact_state@entry=0x2d9ab50, isCommit=isCommit@entry=true) at 
pgstat_relation.c:508
508                     Assert(tabstat->trans == trans);
(gdb) p tabstat
$1 = <optimized out>
(gdb) p *tabstat
value has been optimized out
(gdb) p trans
$2 = <optimized out>
(gdb) p *trans
value has been optimized out
(gdb)


> 
>> #4 0x000000000086ec0f in AtEOXact_PgStat (isCommit=isCommit@entry=true,
>> #parallel=parallel@entry=false) at pgstat_xact.c:54
>> #5  0x00000000005bd2a3 in CommitTransaction () at xact.c:2360
>> #6  0x00000000005be5d5 in CommitTransactionCommand () at xact.c:3048
>> #7 0x00000000007ee72b in apply_handle_commit_internal
>> #(commit_data=commit_data@entry=0x7ffe4606a7a0) at worker.c:1532
>> #8 0x00000000007efac9 in apply_handle_commit (s=0x7ffe4606a940) at
>> #worker.c:845
>> #9  apply_dispatch () at worker.c:2473
>> #10 0x00000000007f11a7 in LogicalRepApplyLoop (last_received=74454600)
>> #at worker.c:2757
>> #11 start_apply () at worker.c:3526
>> #12 0x00000000007f175f in ApplyWorkerMain () at worker.c:3782
> 
> regards.
> 



Thaks Erik.

At Tue, 19 Apr 2022 07:00:30 +0200, Erik Rijkers <er@xs4all.nl> wrote in 
> Op 19-04-2022 om 02:15 schreef Kyotaro Horiguchi:
> > Could you read tabstat, *tabstat, trans, *trans here?
> 
> To be honest I'm not sure how to, but I gave it a try:
>
> (gdb) p tabstat
> $1 = <optimized out>

Great! It is that.  But unfortunately they are optimized out..  Could
you cause the crash with -O0 binary?  You will see the variable with
it.  You can rebuild with the option as follows.

$ make clean; make install CUSTOM_COPT="-O0 -g"

You can dump only the whole xact_state chain from the current core
file but the result will give a bit obscure hint for diagnosis.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Op 19-04-2022 om 11:25 schreef Kyotaro Horiguchi:
> Thaks Erik.
> 
> At Tue, 19 Apr 2022 07:00:30 +0200, Erik Rijkers <er@xs4all.nl> wrote in
>> Op 19-04-2022 om 02:15 schreef Kyotaro Horiguchi:
>>> Could you read tabstat, *tabstat, trans, *trans here?
>>
>> To be honest I'm not sure how to, but I gave it a try:
>>


I rebuilt newest master (a62bff74b135)  with

export CUSTOM_COPT="-O0 -g"

The 12th run of statbug.sh crashed and gave a corefile.


GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from 
/home/aardvark/pg_stuff/pg_installations/pgsql.HEAD/bin/postgres...done.
[New LWP 25058]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: logical replication worker for 
subscription 16411                   '.
Program terminated with signal 6, Aborted.
#0  0x000000357d6324f5 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x000000357d6324f5 in raise () from /lib64/libc.so.6
#1  0x000000357d633cd5 in abort () from /lib64/libc.so.6
#2  0x0000000000b3bada in ExceptionalCondition (conditionName=0xd389a1 
"tabstat->trans == trans", errorType=0xd388b2 "FailedAssertion", 
fileName=0xd388a0 "pgstat_relation.c", lineNumber=508) at assert.c:69
#3  0x00000000009bf5dc in AtEOXact_PgStat_Relations 
(xact_state=0x31b1b50, isCommit=true) at pgstat_relation.c:508
#4  0x00000000009c4107 in AtEOXact_PgStat (isCommit=true, 
parallel=false) at pgstat_xact.c:54
#5  0x0000000000583764 in CommitTransaction () at xact.c:2360
#6  0x0000000000584354 in CommitTransactionCommand () at xact.c:3048
#7  0x000000000090b34e in apply_handle_commit_internal 
(commit_data=0x7ffd024b5940) at worker.c:1532
#8  0x000000000090a287 in apply_handle_commit (s=0x7ffd024b59b0) at 
worker.c:845
#9  0x000000000090ce3a in apply_dispatch (s=0x7ffd024b59b0) at worker.c:2473
#10 0x000000000090d41c in LogicalRepApplyLoop (last_received=74680880) 
at worker.c:2757
#11 0x000000000090e974 in start_apply (origin_startpos=0) at worker.c:3526
#12 0x000000000090f156 in ApplyWorkerMain (main_arg=0) at worker.c:3782
#13 0x00000000008c7623 in StartBackgroundWorker () at bgworker.c:858
#14 0x00000000008d1557 in do_start_bgworker (rw=0x30ff0a0) at 
postmaster.c:5802
#15 0x00000000008d1903 in maybe_start_bgworkers () at postmaster.c:6026
#16 0x00000000008d09ba in sigusr1_handler (postgres_signal_arg=10) at 
postmaster.c:5191
#17 <signal handler called>
#18 0x000000357d6e1683 in __select_nocancel () from /lib64/libc.so.6
#19 0x00000000008cc6c1 in ServerLoop () at postmaster.c:1757
#20 0x00000000008cc0aa in PostmasterMain (argc=11, argv=0x30d6590) at 
postmaster.c:1465
#21 0x00000000007c9256 in main (argc=11, argv=0x30d6590) at main.c:202
(gdb) f 3
#3  0x00000000009bf5dc in AtEOXact_PgStat_Relations 
(xact_state=0x31b1b50, isCommit=true) at pgstat_relation.c:508
508                     Assert(tabstat->trans == trans);
(gdb) p tabstat
$1 = (PgStat_TableStatus *) 0x319e630
(gdb) p *tabstat
$2 = {t_id = 2139062143, t_shared = 127, trans = 0x7f7f7f7f7f7f7f7f, 
t_counts = {t_numscans = 9187201950435737471, t_tuples_returned = 
9187201950435737471, t_tuples_fetched = 9187201950435737471,
     t_tuples_inserted = 9187201950435737471, t_tuples_updated = 
9187201950435737471, t_tuples_deleted = 9187201950435737471, 
t_tuples_hot_updated = 9187201950435737471, t_truncdropped = 127,
     t_delta_live_tuples = 9187201950435737471, t_delta_dead_tuples = 
9187201950435737471, t_changed_tuples = 9187201950435737471, 
t_blocks_fetched = 9187201950435737471, t_blocks_hit = 9187201950435737471},
   relation = 0x7f7f7f7f7f7f7f7f}
(gdb) p trans
$3 = (PgStat_TableXactStatus *) 0x31b1ba8
(gdb) p *trans
$4 = {tuples_inserted = 1, tuples_updated = 0, tuples_deleted = 0, 
truncdropped = false, inserted_pre_truncdrop = 0, updated_pre_truncdrop 
= 0, deleted_pre_truncdrop = 0, nest_level = 1, upper = 0x0,
   parent = 0x319e630, next = 0x31b1ab8}
(gdb)



Looks like we're one step further, no?


Erik


>> (gdb) p tabstat
>> $1 = <optimized out>
> 
> Great! It is that.  But unfortunately they are optimized out..  Could
> you cause the crash with -O0 binary?  You will see the variable with
> it.  You can rebuild with the option as follows.
> 
> $ make clean; make install CUSTOM_COPT="-O0 -g"
> 
> You can dump only the whole xact_state chain from the current core
> file but the result will give a bit obscure hint for diagnosis.
> 
> regards.
> 



On 2022-Apr-19, Erik Rijkers wrote:

> (gdb) p tabstat
> $1 = (PgStat_TableStatus *) 0x319e630
> (gdb) p *tabstat
> $2 = {t_id = 2139062143, t_shared = 127, trans = 0x7f7f7f7f7f7f7f7f,
> t_counts = {t_numscans = 9187201950435737471, t_tuples_returned =
> 9187201950435737471, t_tuples_fetched = 9187201950435737471,
>     t_tuples_inserted = 9187201950435737471, t_tuples_updated =
> 9187201950435737471, t_tuples_deleted = 9187201950435737471,
> t_tuples_hot_updated = 9187201950435737471, t_truncdropped = 127,
>     t_delta_live_tuples = 9187201950435737471, t_delta_dead_tuples =
> 9187201950435737471, t_changed_tuples = 9187201950435737471,
> t_blocks_fetched = 9187201950435737471, t_blocks_hit = 9187201950435737471},
>   relation = 0x7f7f7f7f7f7f7f7f}

It looks like this struct is freed or is in a memory context that was
reset.  Perhaps its lifetime wasn't carefully considered in the logical
replication code, which takes some shortcuts.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El que vive para el futuro es un iluso, y el que vive para el pasado,
un imbécil" (Luis Adler, "Los tripulantes de la noche")



Hi,

On 2022-04-19 13:50:25 +0200, Erik Rijkers wrote:
> The 12th run of statbug.sh crashed and gave a corefile.

I ran through quite a few iterations by now, without reproducing :(

I guess there's some timing issue and you're hitting on your system
due to the slower disks.


> Program terminated with signal 6, Aborted.
> #0  0x000000357d6324f5 in raise () from /lib64/libc.so.6
> (gdb) bt
> #0  0x000000357d6324f5 in raise () from /lib64/libc.so.6
> #1  0x000000357d633cd5 in abort () from /lib64/libc.so.6
> #2  0x0000000000b3bada in ExceptionalCondition (conditionName=0xd389a1
> "tabstat->trans == trans", errorType=0xd388b2 "FailedAssertion",
> fileName=0xd388a0 "pgstat_relation.c", lineNumber=508) at assert.c:69
> #3  0x00000000009bf5dc in AtEOXact_PgStat_Relations (xact_state=0x31b1b50,
> isCommit=true) at pgstat_relation.c:508
> #4  0x00000000009c4107 in AtEOXact_PgStat (isCommit=true, parallel=false) at
> pgstat_xact.c:54
> #5  0x0000000000583764 in CommitTransaction () at xact.c:2360
> #6  0x0000000000584354 in CommitTransactionCommand () at xact.c:3048
> #7  0x000000000090b34e in apply_handle_commit_internal
> (commit_data=0x7ffd024b5940) at worker.c:1532
> #8  0x000000000090a287 in apply_handle_commit (s=0x7ffd024b59b0) at
> worker.c:845
> #9  0x000000000090ce3a in apply_dispatch (s=0x7ffd024b59b0) at worker.c:2473
> #10 0x000000000090d41c in LogicalRepApplyLoop (last_received=74680880) at
> worker.c:2757
> #11 0x000000000090e974 in start_apply (origin_startpos=0) at worker.c:3526
> #12 0x000000000090f156 in ApplyWorkerMain (main_arg=0) at worker.c:3782
> #13 0x00000000008c7623 in StartBackgroundWorker () at bgworker.c:858
> #14 0x00000000008d1557 in do_start_bgworker (rw=0x30ff0a0) at
> postmaster.c:5802
> #15 0x00000000008d1903 in maybe_start_bgworkers () at postmaster.c:6026
> #16 0x00000000008d09ba in sigusr1_handler (postgres_signal_arg=10) at
> postmaster.c:5191
> #17 <signal handler called>
> #18 0x000000357d6e1683 in __select_nocancel () from /lib64/libc.so.6
> #19 0x00000000008cc6c1 in ServerLoop () at postmaster.c:1757
> #20 0x00000000008cc0aa in PostmasterMain (argc=11, argv=0x30d6590) at
> postmaster.c:1465
> #21 0x00000000007c9256 in main (argc=11, argv=0x30d6590) at main.c:202
> (gdb) f 3
> #3  0x00000000009bf5dc in AtEOXact_PgStat_Relations (xact_state=0x31b1b50,
> isCommit=true) at pgstat_relation.c:508
> 508                     Assert(tabstat->trans == trans);
> (gdb) p tabstat
> $1 = (PgStat_TableStatus *) 0x319e630
> (gdb) p *tabstat
> $2 = {t_id = 2139062143, t_shared = 127, trans = 0x7f7f7f7f7f7f7f7f,
> t_counts = {t_numscans = 9187201950435737471, t_tuples_returned =
> 9187201950435737471, t_tuples_fetched = 9187201950435737471,
>     t_tuples_inserted = 9187201950435737471, t_tuples_updated =
> 9187201950435737471, t_tuples_deleted = 9187201950435737471,
> t_tuples_hot_updated = 9187201950435737471, t_truncdropped = 127,
>     t_delta_live_tuples = 9187201950435737471, t_delta_dead_tuples =
> 9187201950435737471, t_changed_tuples = 9187201950435737471,
> t_blocks_fetched = 9187201950435737471, t_blocks_hit = 9187201950435737471},
>   relation = 0x7f7f7f7f7f7f7f7f}
> (gdb) p trans
> $3 = (PgStat_TableXactStatus *) 0x31b1ba8
> (gdb) p *trans
> $4 = {tuples_inserted = 1, tuples_updated = 0, tuples_deleted = 0,
> truncdropped = false, inserted_pre_truncdrop = 0, updated_pre_truncdrop = 0,
> deleted_pre_truncdrop = 0, nest_level = 1, upper = 0x0,
>   parent = 0x319e630, next = 0x31b1ab8}
> (gdb)

Could you print out
p xact_state
p *xact_state
p xact_state->first
p *xact_state->first

Do you have the server log file for the failed run / instance?

Greetings,

Andres Freund



Hi,

On 2022-04-19 10:36:24 -0700, Andres Freund wrote:
> On 2022-04-19 13:50:25 +0200, Erik Rijkers wrote:
> > The 12th run of statbug.sh crashed and gave a corefile.
> 
> I ran through quite a few iterations by now, without reproducing :(
> 
> I guess there's some timing issue and you're hitting on your system
> due to the slower disks.

Ah. I found the issue. The new pgstat_report_stat(true) call in
LogicalRepApplyLoop()'s "timeout" section doesn't check if we're in a
transaction. And the transactional stats code doesn't handle that (never
has).

I think all that's needed is a if (IsTransactionState()) around that
pgstat_report_stat().

It might be possible to put an assertion into pgstat_report_stat(), but
I need to look at the process exit code to see if it is.

Greetings,

Andres Freund



Op 19-04-2022 om 19:36 schreef Andres Freund:
> Hi,
> 
> On 2022-04-19 13:50:25 +0200, Erik Rijkers wrote:
>> The 12th run of statbug.sh crashed and gave a corefile.
> 
> I ran through quite a few iterations by now, without reproducing :(
> 
> I guess there's some timing issue and you're hitting on your system
> due to the slower disks.
> 
> 
>> Program terminated with signal 6, Aborted.
>> #0  0x000000357d6324f5 in raise () from /lib64/libc.so.6
>> (gdb) bt
>> #0  0x000000357d6324f5 in raise () from /lib64/libc.so.6
>> #1  0x000000357d633cd5 in abort () from /lib64/libc.so.6
>> #2  0x0000000000b3bada in ExceptionalCondition (conditionName=0xd389a1
>> "tabstat->trans == trans", errorType=0xd388b2 "FailedAssertion",
>> fileName=0xd388a0 "pgstat_relation.c", lineNumber=508) at assert.c:69
>> #3  0x00000000009bf5dc in AtEOXact_PgStat_Relations (xact_state=0x31b1b50,
>> isCommit=true) at pgstat_relation.c:508
>> #4  0x00000000009c4107 in AtEOXact_PgStat (isCommit=true, parallel=false) at
>> pgstat_xact.c:54
>> #5  0x0000000000583764 in CommitTransaction () at xact.c:2360
>> #6  0x0000000000584354 in CommitTransactionCommand () at xact.c:3048
>> #7  0x000000000090b34e in apply_handle_commit_internal
>> (commit_data=0x7ffd024b5940) at worker.c:1532
>> #8  0x000000000090a287 in apply_handle_commit (s=0x7ffd024b59b0) at
>> worker.c:845
>> #9  0x000000000090ce3a in apply_dispatch (s=0x7ffd024b59b0) at worker.c:2473
>> #10 0x000000000090d41c in LogicalRepApplyLoop (last_received=74680880) at
>> worker.c:2757
>> #11 0x000000000090e974 in start_apply (origin_startpos=0) at worker.c:3526
>> #12 0x000000000090f156 in ApplyWorkerMain (main_arg=0) at worker.c:3782
>> #13 0x00000000008c7623 in StartBackgroundWorker () at bgworker.c:858
>> #14 0x00000000008d1557 in do_start_bgworker (rw=0x30ff0a0) at
>> postmaster.c:5802
>> #15 0x00000000008d1903 in maybe_start_bgworkers () at postmaster.c:6026
>> #16 0x00000000008d09ba in sigusr1_handler (postgres_signal_arg=10) at
>> postmaster.c:5191
>> #17 <signal handler called>
>> #18 0x000000357d6e1683 in __select_nocancel () from /lib64/libc.so.6
>> #19 0x00000000008cc6c1 in ServerLoop () at postmaster.c:1757
>> #20 0x00000000008cc0aa in PostmasterMain (argc=11, argv=0x30d6590) at
>> postmaster.c:1465
>> #21 0x00000000007c9256 in main (argc=11, argv=0x30d6590) at main.c:202
>> (gdb) f 3
>> #3  0x00000000009bf5dc in AtEOXact_PgStat_Relations (xact_state=0x31b1b50,
>> isCommit=true) at pgstat_relation.c:508
>> 508                     Assert(tabstat->trans == trans);
>> (gdb) p tabstat
>> $1 = (PgStat_TableStatus *) 0x319e630
>> (gdb) p *tabstat
>> $2 = {t_id = 2139062143, t_shared = 127, trans = 0x7f7f7f7f7f7f7f7f,
>> t_counts = {t_numscans = 9187201950435737471, t_tuples_returned =
>> 9187201950435737471, t_tuples_fetched = 9187201950435737471,
>>      t_tuples_inserted = 9187201950435737471, t_tuples_updated =
>> 9187201950435737471, t_tuples_deleted = 9187201950435737471,
>> t_tuples_hot_updated = 9187201950435737471, t_truncdropped = 127,
>>      t_delta_live_tuples = 9187201950435737471, t_delta_dead_tuples =
>> 9187201950435737471, t_changed_tuples = 9187201950435737471,
>> t_blocks_fetched = 9187201950435737471, t_blocks_hit = 9187201950435737471},
>>    relation = 0x7f7f7f7f7f7f7f7f}
>> (gdb) p trans
>> $3 = (PgStat_TableXactStatus *) 0x31b1ba8
>> (gdb) p *trans
>> $4 = {tuples_inserted = 1, tuples_updated = 0, tuples_deleted = 0,
>> truncdropped = false, inserted_pre_truncdrop = 0, updated_pre_truncdrop = 0,
>> deleted_pre_truncdrop = 0, nest_level = 1, upper = 0x0,
>>    parent = 0x319e630, next = 0x31b1ab8}
>> (gdb)
> 
> Could you print out
> p xact_state
> p *xact_state
> p xact_state->first
> p *xact_state->first
> 
> Do you have the server log file for the failed run / instance?


(gdb) p xact_state
$5 = (PgStat_SubXactStatus *) 0x31b1b50

(gdb) p *xact_state
$6 = {nest_level = 1, prev = 0x0, pending_drops = {head = {prev = 
0x31b1b60, next = 0x31b1b60}}, pending_drops_count = 0, first = 0x31b1ba8}

(gdb) p xact_state->first
$7 = (PgStat_TableXactStatus *) 0x31b1ba8

(gdb) p *xact_state->first
$8 = {tuples_inserted = 1, tuples_updated = 0, tuples_deleted = 0, 
truncdropped = false, inserted_pre_truncdrop = 0, updated_pre_truncdrop 
= 0, deleted_pre_truncdrop = 0, nest_level = 1, upper = 0x0,
   parent = 0x319e630, next = 0x31b1ab8}
(gdb)


The logfile is attached.


Erik


> Greetings,
> 
> Andres Freund
Attachment
At Tue, 19 Apr 2022 10:55:26 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2022-04-19 10:36:24 -0700, Andres Freund wrote:
> > On 2022-04-19 13:50:25 +0200, Erik Rijkers wrote:
> > > The 12th run of statbug.sh crashed and gave a corefile.
> > 
> > I ran through quite a few iterations by now, without reproducing :(
> > 
> > I guess there's some timing issue and you're hitting on your system
> > due to the slower disks.
> 
> Ah. I found the issue. The new pgstat_report_stat(true) call in
> LogicalRepApplyLoop()'s "timeout" section doesn't check if we're in a
> transaction. And the transactional stats code doesn't handle that (never
> has).
> 
> I think all that's needed is a if (IsTransactionState()) around that
> pgstat_report_stat().

if (!IsTransactinoState()) ?

> It might be possible to put an assertion into pgstat_report_stat(), but
> I need to look at the process exit code to see if it is.

Inserting a sleep in pgoutput_commit_txn reproduced this. Crashes with
the same stack trace with the similar variable state.

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index b197bfd565..def4d751d3 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -568,6 +568,7 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
         return;
     }
 
+    sleep(2);
     OutputPluginPrepareWrite(ctx, true);
     logicalrep_write_commit(ctx->out, txn, commit_lsn);
     OutputPluginWrite(ctx, true);

The following  actuall works for this.

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4171371296..f4e5359513 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2882,10 +2882,11 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
             send_feedback(last_received, requestReply, requestReply);
 
             /*
-             * Force reporting to ensure long idle periods don't lead to
-             * arbitrarily delayed stats.
+             * Force reporting to ensure long out-of-transaction idle periods
+             * don't lead to arbitrarily delayed stats.
              */
-            pgstat_report_stat(true);
+            if (!IsTransactionState())
+                pgstat_report_stat(true);
         }
     }
 
regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Op 20-04-2022 om 06:54 schreef Kyotaro Horiguchi:
> At Tue, 19 Apr 2022 10:55:26 -0700, Andres Freund <andres@anarazel.de> wrote in
>> Hi,
>>
>> On 2022-04-19 10:36:24 -0700, Andres Freund wrote:
>>> On 2022-04-19 13:50:25 +0200, Erik Rijkers wrote:
>>>> The 12th run of statbug.sh crashed and gave a corefile.
>>>
>>> I ran through quite a few iterations by now, without reproducing :(
>>>
>>> I guess there's some timing issue and you're hitting on your system
>>> due to the slower disks.
>>
>> Ah. I found the issue. The new pgstat_report_stat(true) call in
>> LogicalRepApplyLoop()'s "timeout" section doesn't check if we're in a
>> transaction. And the transactional stats code doesn't handle that (never
>> has).
>>
>> I think all that's needed is a if (IsTransactionState()) around that
>> pgstat_report_stat().
> 
> if (!IsTransactinoState()) ?
> 
>> It might be possible to put an assertion into pgstat_report_stat(), but
>> I need to look at the process exit code to see if it is.
> 
> Inserting a sleep in pgoutput_commit_txn reproduced this. Crashes with
> the same stack trace with the similar variable state.
> 
> diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
> index b197bfd565..def4d751d3 100644
> --- a/src/backend/replication/pgoutput/pgoutput.c
> +++ b/src/backend/replication/pgoutput/pgoutput.c
> @@ -568,6 +568,7 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>           return;
>       }
>   
> +    sleep(2);
>       OutputPluginPrepareWrite(ctx, true);
>       logicalrep_write_commit(ctx->out, txn, commit_lsn);
>       OutputPluginWrite(ctx, true);
> 
> The following  actuall works for this.
> 
> diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
> index 4171371296..f4e5359513 100644
> --- a/src/backend/replication/logical/worker.c
> +++ b/src/backend/replication/logical/worker.c
> @@ -2882,10 +2882,11 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
>               send_feedback(last_received, requestReply, requestReply);
>   
>               /*
> -             * Force reporting to ensure long idle periods don't lead to
> -             * arbitrarily delayed stats.
> +             * Force reporting to ensure long out-of-transaction idle periods
> +             * don't lead to arbitrarily delayed stats.
>                */
> -            pgstat_report_stat(true);
> +            if (!IsTransactionState())
> +                pgstat_report_stat(true);
>           }
>       }
>   

Yes, that seems to fix it: I applied that latter patch, and ran my 
program 250x without errors. Then I removed it again an it gave the 
error within 15x.

thanks!

Erik


> regards.
> 



On Wed, Apr 20, 2022 at 01:03:20PM +0200, Erik Rijkers wrote:
> Yes, that seems to fix it: I applied that latter patch, and ran my program
> 250x without errors. Then I removed it again an it gave the error within
> 15x.

That looks simple enough, indeed.  Andres, are you planning to address
this issue?
--
Michael

Attachment
On Mon, Apr 25, 2022 at 03:18:52PM +0900, Michael Paquier wrote:
> On Wed, Apr 20, 2022 at 01:03:20PM +0200, Erik Rijkers wrote:
>> Yes, that seems to fix it: I applied that latter patch, and ran my program
>> 250x without errors. Then I removed it again an it gave the error within
>> 15x.
>
> That looks simple enough, indeed.  Andres, are you planning to address
> this issue?

Ping.  It looks annoying to release beta1 with that, as assertions are
likely going to be enabled in a lot of test builds.
--
Michael

Attachment
Hi,

On 2022-05-11 15:48:40 +0900, Michael Paquier wrote:
> On Mon, Apr 25, 2022 at 03:18:52PM +0900, Michael Paquier wrote:
> > On Wed, Apr 20, 2022 at 01:03:20PM +0200, Erik Rijkers wrote:
> >> Yes, that seems to fix it: I applied that latter patch, and ran my program
> >> 250x without errors. Then I removed it again an it gave the error within
> >> 15x.
> > 
> > That looks simple enough, indeed.  Andres, are you planning to address
> > this issue?
> 
> Ping.  It looks annoying to release beta1 with that, as assertions are
> likely going to be enabled in a lot of test builds.

I'll try to fix it tomorrow... Sorry for the delay.

Greetings,

Andres Freund



On Wed, May 11, 2022 at 08:32:14PM -0700, Andres Freund wrote:
> I'll try to fix it tomorrow... Sorry for the delay.

Thanks, Andres.
--
Michael

Attachment
Hi,

I finally pushed the fix for this. Erik, thanks for the report! And thanks
Michael for the ping...

On 2022-05-11 20:32:14 -0700, Andres Freund wrote:
> On 2022-05-11 15:48:40 +0900, Michael Paquier wrote:

> > Ping.  It looks annoying to release beta1 with that, as assertions are
> > likely going to be enabled in a lot of test builds.

FWIW, it's somewhat hard to hit (basically the sender needs to stall while
sending out a transaction / network being really slow), so it'd not have been
likely to be hit by all that many people.

Greetings,

Andres Freund