Thread: BF mamba failure

BF mamba failure

From
Amit Kapila
Date:
Hi,

Peter Smith has recently reported a BF failure [1]. AFAICS, the call
stack of failure [2] is as follows:

0x1e66644 <ExceptionalCondition+0x8c> at postgres
0x1d0143c <pgstat_release_entry_ref+0x4c0> at postgres
0x1d02534 <pgstat_get_entry_ref+0x780> at postgres
0x1cfb120 <pgstat_prep_pending_entry+0x8c> at postgres
0x1cfd590 <pgstat_report_disconnect+0x34> at postgres
0x1cfbfc0 <pgstat_shutdown_hook+0xd4> at postgres
0x1ca7b08 <shmem_exit+0x7c> at postgres
0x1ca7c74 <proc_exit_prepare+0x70> at postgres
0x1ca7d2c <proc_exit+0x18> at postgres
0x1cdf060 <PostgresMain+0x584> at postgres
0x1c203f4 <ServerLoop.isra.0+0x1e88> at postgres
0x1c2161c <PostmasterMain+0xfa4> at postgres
0x1edcf94 <main+0x254> at postgres

I couldn't correlate it to the recent commits. Any thoughts?

[1] - https://www.postgresql.org/message-id/CAHut%2BPsHdWFjU43VEX%2BR-8de6dFQ-_JWrsqs%3DvWek1hULexP4Q%40mail.gmail.com
[2] -
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2023-03-17%2005%3A36%3A10

-- 
With Regards,
Amit Kapila.



Re: BF mamba failure

From
Tom Lane
Date:
Amit Kapila <amit.kapila16@gmail.com> writes:
> Peter Smith has recently reported a BF failure [1]. AFAICS, the call
> stack of failure [2] is as follows:

Note the assertion report a few lines further up:

TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 560,
PID:25004 

            regards, tom lane



Re: BF mamba failure

From
Alexander Lakhin
Date:
Hi,

18.03.2023 07:26, Tom Lane wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
Peter Smith has recently reported a BF failure [1]. AFAICS, the call
stack of failure [2] is as follows:
Note the assertion report a few lines further up:

TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 560, PID: 25004

This assertion failure can be reproduced easily with the attached patch:
============== running regression test queries        ==============
test oldest_xmin                  ... ok           55 ms
test oldest_xmin                  ... FAILED (test process exited with exit code 1)      107 ms
test oldest_xmin                  ... FAILED (test process exited with exit code 1)        8 ms
============== shutting down postmaster               ==============

contrib/test_decoding/output_iso/log/postmaster.log contains:
TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line: 561, PID: 456844

With the sleep placed above Assert(entry_ref->shared_entry->dropped) this Assert fails too.

Best regards,
Alexander
Attachment

Re: BF mamba failure

From
Peter Smith
Date:
On Sun, Mar 19, 2023 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hi,
>
> 18.03.2023 07:26, Tom Lane wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
>
> Peter Smith has recently reported a BF failure [1]. AFAICS, the call
> stack of failure [2] is as follows:
>
> Note the assertion report a few lines further up:
>
> TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line:
560,PID: 25004 
>
>
> This assertion failure can be reproduced easily with the attached patch:
> ============== running regression test queries        ==============
> test oldest_xmin                  ... ok           55 ms
> test oldest_xmin                  ... FAILED (test process exited with exit code 1)      107 ms
> test oldest_xmin                  ... FAILED (test process exited with exit code 1)        8 ms
> ============== shutting down postmaster               ==============
>
> contrib/test_decoding/output_iso/log/postmaster.log contains:
> TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount) == 0"), File: "pgstat_shmem.c", Line:
561,PID: 456844 
>
> With the sleep placed above Assert(entry_ref->shared_entry->dropped) this Assert fails too.
>
> Best regards,
> Alexander

I used a slightly modified* patch of Alexander's [1] applied to the
latest HEAD code (but with my "toptxn" patch reverted).
--- the patch was modified in that I injected 'sleep' both above and
below the Assert(entry_ref->shared_entry->dropped).

Using this I was also able to reproduce the problem. But test failures
were rare. The make check-world seemed OK, and indeed the
test_decoding tests would also appear to PASS around 14 out of 15
times.

============== running regression test queries        ==============
test oldest_xmin                  ... ok          342 ms
test oldest_xmin                  ... ok          121 ms
test oldest_xmin                  ... ok          283 ms
============== shutting down postmaster               ==============
============== removing temporary instance            ==============

=====================
 All 3 tests passed.
=====================

~~

Often (but not always) depite the test_decoding reported PASS all 3
tests as "ok", I still observed there was a TRAP in the logfile
(contrib/test_decoding/output_iso/log/postmaster.log).
TRAP: failed Assert("entry_ref->shared_entry->dropped")

~~

Occasionally (about 1 in 15 test runs) the test would fail the same
way as described by Alexander [1], with the accompanying TRAP.
TRAP: failed Assert("pg_atomic_read_u32(&entry_ref->shared_entry->refcount)
== 0"), File: "pgstat_shmem.c", Line: 562, PID: 32013

============== running regression test queries        ==============
test oldest_xmin                  ... ok          331 ms
test oldest_xmin                  ... ok           91 ms
test oldest_xmin                  ... FAILED      702 ms
============== shutting down postmaster               ==============

======================
 1 of 3 tests failed.
======================


~~


FWIW, the "toptxn" patch. whose push coincided with the build-farm
error I first reported [2], turns out to be an innocent party in this
TRAP. We know this because all of the above results were running using
HEAD code but with that "toptxn" patch reverted.

------
[1] https://www.postgresql.org/message-id/1941b7e2-be7c-9c4c-8505-c0fd05910e9a%40gmail.com
[2] https://www.postgresql.org/message-id/CAHut%2BPsHdWFjU43VEX%2BR-8de6dFQ-_JWrsqs%3DvWek1hULexP4Q%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia