Thread: BUG #18893: Segfault during analyze pg_database

BUG #18893: Segfault during analyze pg_database

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18893
Logged by:          Robins Tharakan
Email address:      tharakan@gmail.com
PostgreSQL version: Unsupported/Unknown
Operating system:   Ubuntu
Description:

Creating a few Databases followed by CHECKPOINT causes a segfault.

Tested on a recent commit: 847bbb21f8c4eb0e2b47417684ad2ba9255c9e80.

Backtrace below. To add: every time I hit this, postgres was analyzing
pg_database.


Repro (a few runs may be required)
=====
-- seq 1 100 | xargs -i psql -Atq -c "DROP   DATABASE t{};" postgres
seq 1 100 | xargs -i psql -Atq -c "CREATE DATABASE t{};" postgres
psql -Atq -c "CHECKPOINT" postgres
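
Since the repro above may need a few runs, the retries can be scripted. This is a minimal sketch, assuming psql can reach a local scratch server with default connection settings. The function name repro_round and the round limit in the usage note are illustrative and not part of the report. IF EXISTS is added to the DROP so the first round does not fail, and a plain loop replaces xargs. The sketch relies on psql exiting non-zero when the server drops the connection during CHECKPOINT.

```shell
#!/bin/sh
# One round of the reported repro: recreate N databases, then CHECKPOINT.
# Returns the exit status of the final psql call, which is non-zero when
# the connection is lost (for example after a worker crash forces the
# postmaster to reinitialize).
repro_round() {
    n=${1:-100}

    # Drop leftovers from a previous round.
    i=1
    while [ "$i" -le "$n" ]; do
        psql -Atq -c "DROP DATABASE IF EXISTS t$i;" postgres
        i=$((i + 1))
    done

    # Create the databases again.
    i=1
    while [ "$i" -le "$n" ]; do
        psql -Atq -c "CREATE DATABASE t$i;" postgres
        i=$((i + 1))
    done

    psql -Atq -c "CHECKPOINT" postgres
}

# Usage (the limit of 20 rounds is an arbitrary assumption):
#   round=1
#   while [ "$round" -le 20 ] && repro_round 100; do round=$((round + 1)); done
#   echo "stopped at round $round (21 means no failure was seen)"
```

A failing round only shows that the CHECKPOINT session ended abnormally; the server log still has to be checked to confirm which process crashed.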



Error Log (for multiple crashes)
=========
$ tail -10000 logfile | grep "Failed process was running"
2025-04-12 07:23:55.096 ACST [2833183] DETAIL:  Failed process was running:
autovacuum: VACUUM ANALYZE pg_catalog.pg_database
2025-04-12 07:24:55.634 ACST [2833183] DETAIL:  Failed process was running:
autovacuum: ANALYZE pg_catalog.pg_database
2025-04-12 07:31:02.634 ACST [2833183] DETAIL:  Failed process was running:
autovacuum: VACUUM ANALYZE pg_catalog.pg_database
2025-04-12 11:59:31.411 ACST [2845956] DETAIL:  Failed process was running:
autovacuum: ANALYZE pg_catalog.pg_database
2025-04-12 12:13:09.974 ACST [2846810] DETAIL:  Failed process was running:
autovacuum: VACUUM ANALYZE pg_catalog.pg_database
2025-04-12 12:38:07.432 ACST [2846810] DETAIL:  Failed process was running:
autovacuum: VACUUM ANALYZE pg_catalog.pg_database
2025-04-12 12:41:42.729 ACST [2846810] DETAIL:  Failed process was running:
autovacuum: VACUUM ANALYZE pg_catalog.pg_database
2025-04-12 12:43:13.276 ACST [2846810] DETAIL:  Failed process was running:
autovacuum: VACUUM ANALYZE pg_catalog.pg_database
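
The grep above lists every crash. To see at a glance which statement was running, the entries can be tallied. This is a minimal sketch, assuming the DETAIL text and the statement sit on a single line in the real log file (they are wrapped in this mail). The name summarize_crashes is illustrative.

```shell
#!/bin/sh
# Count crash reports per failed statement in a PostgreSQL server log.
# Assumption: "Failed process was running:" and the statement share one line.
summarize_crashes() {
    # $1: path to the server log file
    sed -n 's/.*Failed process was running: *//p' "$1" |
        sort | uniq -c | sort -rn
}

# Usage:
#   summarize_crashes logfile
```

Applied to the eight entries above (unwrapped), this would count six VACUUM ANALYZE and two ANALYZE runs, all against pg_catalog.pg_database.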


Error Log (for 1 crash)
=========
2025-04-12 12:43:03.279 ACST [2849996] LOG:  checkpoint starting: immediate
force wait
2025-04-12 12:43:13.276 ACST [2846810] LOG:  autovacuum worker (PID 2851288)
was terminated by signal 11: Segmentation fault
2025-04-12 12:43:13.276 ACST [2846810] DETAIL:  Failed process was running:
autovacuum: VACUUM ANALYZE pg_catalog.pg_database
2025-04-12 12:43:13.276 ACST [2846810] LOG:  terminating any other active
server processes
2025-04-12 12:43:13.280 ACST [2846810] LOG:  all server processes
terminated; reinitializing
2025-04-12 12:43:13.346 ACST [2851293] LOG:  database system was
interrupted; last known up at 2025-04-12 12:42:59 ACST
2025-04-12 12:43:23.175 ACST [2851293] LOG:  database system was not
properly shut down; automatic recovery in progress
2025-04-12 12:43:23.196 ACST [2851293] LOG:  redo starts at 0/BB5A2BE0
2025-04-12 12:43:23.197 ACST [2851293] WARNING:  could not open directory
"base/49251": No such file or directory
2025-04-12 12:43:23.197 ACST [2851293] CONTEXT:  WAL redo at 0/BB5A2CB0 for
Database/DROP: dir 1663/49251
2025-04-12 12:43:23.197 ACST [2851293] WARNING:  some useless files may be
left behind in old database directory "base/49251"
2025-04-12 12:43:23.197 ACST [2851293] CONTEXT:  WAL redo at 0/BB5A2CB0 for
Database/DROP: dir 1663/49251
2025-04-12 12:43:24.620 ACST [2851293] LOG:  unexpected pageaddr 0/A6D3A000
in WAL segment 0000000100000000000000D5, LSN 0/D5D3A000, offset 13869056
2025-04-12 12:43:24.620 ACST [2851293] LOG:  redo done at 0/D5D39198 system
usage: CPU: user: 0.88 s, system: 0.07 s, elapsed: 1.42 s
2025-04-12 12:43:24.633 ACST [2851294] LOG:  checkpoint starting:
end-of-recovery immediate wait
2025-04-12 12:43:44.451 ACST [2851294] LOG:  checkpoint complete: wrote
16284 buffers (99.4%), wrote 3 SLRU buffers; 0 WAL file(s) added, 0 removed,
26 recycled; write=0.173 s, sync=19.592 s, total=19.820 s; sync files=29806,
longest=0.019 s, average=0.001 s; distance=433757 kB, estimate=433757 kB;
lsn=0/D5D3A048, redo lsn=0/D5D3A048
2025-04-12 12:43:44.467 ACST [2846810] LOG:  database system is ready to
accept connections


SQL Output
==========
postgres=# checkpoint;
WARNING:  terminating connection because of crash of another server
process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
Time: 3485.895 ms (00:03.486)
!?> 


Backtrace
=========
(gdb) bt
#0  PopActiveSnapshot () at snapmgr.c:766
#1  0x0000559978e4aff5 in vacuum (relations=0x55999bb4f510,
params=0x55999bb48120, bstrategy=0x55999bb42880, vac_context=0x55999bb4f3c0,
isTopLevel=true) at vacuum.c:611
#2  0x000055997905242c in autovacuum_do_vac_analyze (tab=0x55999bb48118,
bstrategy=0x55999bb42880) at autovacuum.c:3160
#3  0x0000559979051164 in do_autovacuum () at autovacuum.c:2439
#4  0x000055997904fd05 in AutoVacWorkerMain (startup_data=0x0,
startup_data_len=0) at autovacuum.c:1594
#5  0x0000559979056ab7 in postmaster_child_launch
(child_type=B_AUTOVAC_WORKER, child_slot=2022, startup_data=0x0,
startup_data_len=0, client_sock=0x0) at launch_backend.c:290
#6  0x000055997905da7e in StartChildProcess (type=B_AUTOVAC_WORKER) at
postmaster.c:3973
#7  0x000055997905dc0d in StartAutovacuumWorker () at postmaster.c:4037
#8  0x000055997905d6ce in process_pm_pmsignal () at postmaster.c:3794
#9  0x000055997905a803 in ServerLoop () at postmaster.c:1695
#10 0x000055997905a1d2 in PostmasterMain (argc=3, argv=0x55999ba24f80) at
postmaster.c:1400
#11 0x0000559978f021f3 in main (argc=3, argv=0x55999ba24f80) at main.c:227


Backtrace Full
==============
#0  PopActiveSnapshot () at snapmgr.c:766
        newstack = 0x55999bb4f3c0
#1  0x0000559978e4aff5 in vacuum (relations=0x55999bb4f510,
params=0x55999bb48120, bstrategy=0x55999bb42880, vac_context=0x55999bb4f3c0,
isTopLevel=true) at vacuum.c:611
        in_vacuum = false
        stmttype = 0x55997951e3d0 "VACUUM"
        in_outer_xact = false
        use_own_xacts = true
        __func__ = "vacuum"
#2  0x000055997905242c in autovacuum_do_vac_analyze (tab=0x55999bb48118,
bstrategy=0x55999bb42880) at autovacuum.c:3160
        rangevar = 0x55999bb4d4b0
        rel = 0x55999bb4d500
        rel_list = 0x55999bb4d530
        vac_context = 0x55999bb4f3c0
#3  0x0000559979051164 in do_autovacuum () at autovacuum.c:2439
        _save_exception_stack = 0x7ffef3180850
        _save_context_stack = 0x0
        _local_sigjmp_buf = {{__jmpbuf = {140732976861944,
2786496174943352778, 0, 140732976861976, 94117656164696, 139965642006560,
2786496174997878730, 8242866857011034058},
            __mask_was_saved = 0, __saved_mask = {__val = {5460319232,
94118230406032, 6656, 94117652561175, 94118230399168, 16, 94117649680895,
26, 6240, 94118230406064,
                94117652046186, 6656, 94118230399408, 140732976858672,
94117652048505, 0}}}}
        _do_rethrow = false
        tab = 0x55999bb48118
        skipit = false
        iter = {cur = 0x7f4c4571d828, end = 0x7f4c4571d828}
        relid = 1262
        classTup = 0x7f4c47393e18
        isshared = true
        cell__state = {l = 0x55999bb47b38, i = 0}
        classRel = 0x7f4c494eaa88
        tuple = 0x0
        relScan = 0x55999bb42470
        dbForm = 0x7f4c47392d80
        table_oids = 0x55999bb47b38
        orphan_oids = 0x0
        ctl = {num_partitions = 0, ssize = 0, dsize = 140732976858832,
max_dsize = 94117644661494, keysize = 4, entrysize = 104, hash =
0x5599797a6b00 <TopTransactionStateData>,
          match = 0x79361810, keycopy = 0x3f, alloc = 0x7ffef31812f8, hcxt =
0x7ffef3180710, hctl = 0x559978c6ff8e
<CommitTransactionCommandInternal+177>}
        table_toast_map = 0x55999bb43470
        cell = 0x55999bb47b50
        bstrategy = 0x55999bb42880
        key = {sk_flags = 0, sk_attno = 18, sk_strategy = 3, sk_subtype = 0,
sk_collation = 950, sk_func = {fn_addr = 0x5599791aecf4 <chareq>, fn_oid =
61, fn_nargs = 2,
            fn_strict = true, fn_retset = false, fn_stats = 2 '\002',
fn_extra = 0x0, fn_mcxt = 0x55999bb41360, fn_expr = 0x0}, sk_argument =
116}
        pg_class_desc = 0x55999bb41460
        effective_multixact_freeze_max_age = 400000000
        did_vacuum = false
        found_concurrent_worker = false
        i = 21913
        __func__ = "do_autovacuum"
#4  0x000055997904fd05 in AutoVacWorkerMain (startup_data=0x0,
startup_data_len=0) at autovacuum.c:1594
        dbname = "template1\000\000\000\000\000\000\000p\030\000\000\000\000\000\000\0002os\276C\025C\200\000\000\000\000\000\000\000m\271<y\231U\000\000O\267<y\231U\000\000\0002os\036\000\000"
        local_sigjmp_buf = {{__jmpbuf = {140732976861944,
2786496174865758154, 0, 140732976861976, 94117656164696, 139965642006560,
2786496174828009418, 8242866840620218314},
            __mask_was_saved = 1, __saved_mask = {__val =
{18446744066192964099, 11214622847848677400, 139965631788948,
140732976859328, 4833844260311609856, 16, 140732976859424,
                140732976859360, 4833844260311609856, 0, 139965642011360, 1,
94117649519579, 140732976861944, 94118229463072, 140732976859424}}}}
        dbid = 1
        __func__ = "AutoVacWorkerMain"
#5  0x0000559979056ab7 in postmaster_child_launch
(child_type=B_AUTOVAC_WORKER, child_slot=2022, startup_data=0x0,
startup_data_len=0, client_sock=0x0) at launch_backend.c:290
        pid = 0
#6  0x000055997905da7e in StartChildProcess (type=B_AUTOVAC_WORKER) at
postmaster.c:3973
        pmchild = 0x55999bab4528
        pid = 32766
        __func__ = "StartChildProcess"
#7  0x000055997905dc0d in StartAutovacuumWorker () at postmaster.c:4037
        bn = 0x5000097e7
#8  0x000055997905d6ce in process_pm_pmsignal () at postmaster.c:3794
        request_state_update = false
        __func__ = "process_pm_pmsignal"


-
robins
https://robins.in


Re: BUG #18893: Segfault during analyze pg_database

From
Robins Tharakan
Date:

On Sat, 12 Apr 2025 at 13:05, PG Bug reporting form <noreply@postgresql.org> wrote:
>
>
> Creating a few Databases followed by CHECKPOINT causes a segfault.
>

Simply running pgbench (or any workload that triggers AV) causes a segfault.

(gdb) bt
#0  PopActiveSnapshot () at snapmgr.c:766
#1  0x000055a93d8bb02c in vacuum (relations=0x55a9715e7530, params=0x55a9715d4b10, bstrategy=0x55a9715cb050, vac_context=0x55a9715e73e0, isTopLevel=true) at vacuum.c:611
#2  0x000055a93dac2463 in autovacuum_do_vac_analyze (tab=0x55a9715d4b08, bstrategy=0x55a9715cb050) at autovacuum.c:3160
#3  0x000055a93dac119b in do_autovacuum () at autovacuum.c:2439
#4  0x000055a93dabfd3c in AutoVacWorkerMain (startup_data=0x0, startup_data_len=0) at autovacuum.c:1594
#5  0x000055a93dac6aee in postmaster_child_launch (child_type=B_AUTOVAC_WORKER, child_slot=2022, startup_data=0x0, startup_data_len=0, client_sock=0x0) at launch_backend.c:290
#6  0x000055a93dacdab5 in StartChildProcess (type=B_AUTOVAC_WORKER) at postmaster.c:3973
#7  0x000055a93dacdc44 in StartAutovacuumWorker () at postmaster.c:4037
#8  0x000055a93dacd705 in process_pm_pmsignal () at postmaster.c:3794
#9  0x000055a93daca83a in ServerLoop () at postmaster.c:1695
#10 0x000055a93daca209 in PostmasterMain (argc=3, argv=0x55a9714acf90) at postmaster.c:1400
#11 0x000055a93d97222a in main (argc=3, argv=0x55a9714acf90) at main.c:227



(gdb) bt
#0  PopActiveSnapshot () at snapmgr.c:766
#1  0x000055a93d8bb02c in vacuum (relations=0x55a9715dd790, params=0x55a9715cb608, bstrategy=0x55a9715cabe0, vac_context=0x55a9715dd640, isTopLevel=true) at vacuum.c:611
#2  0x000055a93dac2463 in autovacuum_do_vac_analyze (tab=0x55a9715cb600, bstrategy=0x55a9715cabe0) at autovacuum.c:3160
#3  0x000055a93dac119b in do_autovacuum () at autovacuum.c:2439
#4  0x000055a93dabfd3c in AutoVacWorkerMain (startup_data=0x0, startup_data_len=0) at autovacuum.c:1594
#5  0x000055a93dac6aee in postmaster_child_launch (child_type=B_AUTOVAC_WORKER, child_slot=2022, startup_data=0x0, startup_data_len=0, client_sock=0x0) at launch_backend.c:290
#6  0x000055a93dacdab5 in StartChildProcess (type=B_AUTOVAC_WORKER) at postmaster.c:3973
#7  0x000055a93dacdc44 in StartAutovacuumWorker () at postmaster.c:4037
#8  0x000055a93dacd705 in process_pm_pmsignal () at postmaster.c:3794
#9  0x000055a93daca83a in ServerLoop () at postmaster.c:1695
#10 0x000055a93daca209 in PostmasterMain (argc=3, argv=0x55a9714acf90) at postmaster.c:1400
#11 0x000055a93d97222a in main (argc=3, argv=0x55a9714acf90) at main.c:227


-
robins
https://robins.in

Re: BUG #18893: Segfault during analyze pg_database

From
Tom Lane
Date:
Robins Tharakan <tharakan@gmail.com> writes:
> On Sat, 12 Apr 2025 at 13:05, PG Bug reporting form <noreply@postgresql.org>
> wrote:
>> Creating a few Databases followed by CHECKPOINT causes a segfault.

> Simply running pgbench (or any workload that triggers AV) causes a segfault.

At this point I'm starting to suspect a compiler bug (or hardware fault?)
on your machine.  I spent awhile last night trying to replicate your
previous report and failed, both on x86_64/RHEL8 and Apple M4/Sequoia.
Moreover, we're not seeing the sort of instability in the buildfarm
that would inevitably appear if AV were as broken as it seems to be
for you.

            regards, tom lane



Re: BUG #18893: Segfault during analyze pg_database

From
Robins Tharakan
Date:


On Sun, 13 Apr 2025 at 23:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robins Tharakan <tharakan@gmail.com> writes:
> > On Sat, 12 Apr 2025 at 13:05, PG Bug reporting form <noreply@postgresql.org>
> > wrote:
> >> Creating a few Databases followed by CHECKPOINT causes a segfault.
>
> At this point I'm starting to suspect a compiler bug (or hardware fault?)
> on your machine.  I spent awhile last night trying to replicate your
> previous report and failed, both on x86_64/RHEL8 and Apple M4/Sequoia.
> Moreover, we're not seeing the sort of instability in the buildfarm
> that would inevitably appear if AV were as broken as it seems to be
> for you.

Thanks for confirming (non-reproducibility).

I didn't see any disk errors / memtest returned nothing / machine uses stock gcc (v12.2.0-14 - Ubuntu bookworm).

I could reproduce this at will until this morning but not any more. A fresh git clone and compile in another folder on the same machine proved that the machine is okay, so my current guess is that I fell victim to not doing a 'git clean -xdf' when I was double-checking (before reporting). In most of my reports I try to triage the commit (which ensures this doesn't happen) - but I had to bypass that step this time since this was not a simple repro.

Apologies for the noise, and once again thanks for taking a look.
-
robins
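
Stale build products of the kind suspected above can be spotted before testing, without deleting anything. This is a minimal sketch using the dry-run form of the same git clean command. The function names are illustrative, and the sketch assumes an in-tree build, so leftovers live inside the checkout.

```shell
#!/bin/sh
# List what 'git clean -xdf' would delete, without deleting anything.
#   -x  include ignored files (object files, generated headers)
#   -d  include untracked directories
#   -n  dry run
stale_build_files() {
    # $1: path to the source tree (defaults to the current directory)
    git -C "${1:-.}" clean -xdn
}

# Succeeds only when the tree has no untracked or ignored files.
tree_is_pristine() {
    [ -z "$(stale_build_files "$1")" ]
}

# Usage:
#   tree_is_pristine ~/src/postgres || stale_build_files ~/src/postgres
```

The path in the usage line is a placeholder. Running the check before configure and make gives a quick signal that a rebuild may be mixing old and new objects.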

Re: BUG #18893: Segfault during analyze pg_database

From
Michael Paquier
Date:
On Mon, Apr 14, 2025 at 06:48:32PM +0930, Robins Tharakan wrote:
> I didn't see any disk errors / memtest returned nothing / machine uses
> stock gcc (v12.2.0-14 - Ubuntu bookworm).
>
> I could reproduce this at will until this morning but not any more. A fresh
> git clone and compile in another folder on the same machine proved that the
> machine is okay, so my current guess is that I fell victim to not doing a
> 'git clean -xdf' when I was double-checking (before reporting). In most of
> my reports I try to triage the commit (which ensures this doesn't happen) -
> but I had to bypass that step this time since this was not a simple repro.
>
> Apologies for the noise, and once again thanks for taking a look.

FWIW, I've been puzzled by your report here, and also run similar
workloads on HEAD with an aggressive autovacuum setup to stress more
the code paths you have mentioned in your backtraces.  That could be a
lack of luck because of a lack of friction with several concurrent
actions required, of course, but I've not been able to see your
problem.  :/
--
Michael
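
The thread does not say which settings were used for the aggressive autovacuum setup. One possible configuration is sketched below. Every value is an assumption chosen to make autovacuum fire as often as possible on a scratch cluster, and the function name is illustrative.

```shell
#!/bin/sh
# Print postgresql.conf settings that make autovacuum run very frequently.
# All values are assumptions for a throwaway test cluster, not the settings
# used in the thread.
aggressive_autovacuum_conf() {
    cat <<'EOF'
autovacuum = on
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_analyze_threshold = 1
autovacuum_vacuum_scale_factor = 0
autovacuum_analyze_scale_factor = 0
autovacuum_vacuum_cost_delay = 0
log_autovacuum_min_duration = 0
EOF
}

# Usage (PGDATA is assumed to point at a scratch cluster):
#   aggressive_autovacuum_conf >> "$PGDATA/postgresql.conf"
#   pg_ctl -D "$PGDATA" reload
```

With log_autovacuum_min_duration set to 0, every autovacuum action is logged, which makes it easier to correlate a crash with the table being processed.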
