Thread: pg15b3: crash in paralell vacuum

pg15b3: crash in paralell vacuum

From
Justin Pryzby
Date:
Immediately after upgrading an internal instance, a loop around "vacuum" did
this:

TRAP: FailedAssertion("indstats->status == PARALLEL_INDVAC_STATUS_INITIAL", File: "vacuumparallel.c", Line: 611, PID:
27635)
postgres: postgres pryzbyj [local] VACUUM(ExceptionalCondition+0x8d)[0x99d9fd]
postgres: postgres pryzbyj [local] VACUUM[0x6915db]
postgres: postgres pryzbyj [local] VACUUM(heap_vacuum_rel+0x12b6)[0x5083e6]
postgres: postgres pryzbyj [local] VACUUM[0x68e97a]
postgres: postgres pryzbyj [local] VACUUM(vacuum+0x48e)[0x68fe9e]
postgres: postgres pryzbyj [local] VACUUM(ExecVacuum+0x2ae)[0x69065e]
postgres: postgres pryzbyj [local] VACUUM(standard_ProcessUtility+0x530)[0x8567b0]
/usr/pgsql-15/lib/pg_stat_statements.so(+0x5450)[0x7f52b891c450]
postgres: postgres pryzbyj [local] VACUUM[0x85490a]
postgres: postgres pryzbyj [local] VACUUM[0x854a53]
postgres: postgres pryzbyj [local] VACUUM(PortalRun+0x179)[0x855029]
postgres: postgres pryzbyj [local] VACUUM[0x85099b]
postgres: postgres pryzbyj [local] VACUUM(PostgresMain+0x199a)[0x85268a]
postgres: postgres pryzbyj [local] VACUUM[0x496a21]
postgres: postgres pryzbyj [local] VACUUM(PostmasterMain+0x11c0)[0x7b3980]
postgres: postgres pryzbyj [local] VACUUM(main+0x1c6)[0x4986a6]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f52c4b893d5]
postgres: postgres pryzbyj [local] VACUUM[0x498c59]
< 2022-08-18 07:56:51.963 CDT  >LOG:  server process (PID 27635) was terminated by signal 6: Aborted
< 2022-08-18 07:56:51.963 CDT  >DETAIL:  Failed process was running: VACUUM ANALYZE alarms

Unfortunately, it looks like the RPM packages are compiled with -O2, so this is
of limited use.

Core was generated by `postgres: postgres pryzbyj [local] VACUUM              '.
Program terminated with signal 6, Aborted.
#0  0x00007f52c4b9d207 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.4-4.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64
cyrus-sasl-lib-2.1.26-23.el7.x86_64elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64
glibc-2.17-260.el7_6.3.x86_64keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-51.el7_9.x86_64
libattr-2.4.46-13.el7.x86_64libcap-2.22-9.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64
libgcc-4.8.5-39.el7.x86_64libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64 libicu-50.1.2-17.el7.x86_64
libselinux-2.5-15.el7.x86_64libstdc++-4.8.5-39.el7.x86_64 libxml2-2.9.1-6.el7_9.6.x86_64 libzstd-1.5.2-1.el7.x86_64
lz4-1.7.5-2.el7.x86_64nspr-4.19.0-1.el7_5.x86_64 nss-3.36.0-7.1.el7_6.x86_64 nss-softokn-freebl-3.36.0-5.el7_5.x86_64
nss-util-3.36.0-1.1.el7_6.x86_64openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-22.el7_9.x86_64
pam-1.1.8-22.el7.x86_64pcre-8.32-17.el7.x86_64 systemd-libs-219-62.el7_6.5.x86_64 xz-libs-5.2.2-1.el7.x86_64
zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00007f52c4b9d207 in raise () from /lib64/libc.so.6
#1  0x00007f52c4b9e8f8 in abort () from /lib64/libc.so.6
#2  0x000000000099da1e in ExceptionalCondition (conditionName=conditionName@entry=0xafae40 "indstats->status ==
PARALLEL_INDVAC_STATUS_INITIAL",errorType=errorType@entry=0x9fb4b7 "FailedAssertion",  
    fileName=fileName@entry=0xafb0c0 "vacuumparallel.c", lineNumber=lineNumber@entry=611) at assert.c:69
#3  0x00000000006915db in parallel_vacuum_process_all_indexes (pvs=0x2e85f80, num_index_scans=<optimized out>,
vacuum=<optimizedout>) at vacuumparallel.c:611 
#4  0x00000000005083e6 in heap_vacuum_rel (rel=<optimized out>, params=<optimized out>, bstrategy=<optimized out>) at
vacuumlazy.c:2679
#5  0x000000000068e97a in table_relation_vacuum (bstrategy=<optimized out>, params=0x7fff46de9a80, rel=0x7f52c7bc2c10)
at../../../src/include/access/tableam.h:1680 
#6  vacuum_rel (relid=52187497, relation=<optimized out>, params=0x7fff46de9a80) at vacuum.c:2092
#7  0x000000000068fe9e in vacuum (relations=0x2dbeee8, params=params@entry=0x7fff46de9a80, bstrategy=<optimized out>,
bstrategy@entry=0x0,isTopLevel=isTopLevel@entry=true) at vacuum.c:475 
#8  0x000000000069065e in ExecVacuum (pstate=pstate@entry=0x2dc38d0, vacstmt=vacstmt@entry=0x2d9f3a0,
isTopLevel=isTopLevel@entry=true)at vacuum.c:275 
#9  0x00000000008567b0 in standard_ProcessUtility (pstmt=pstmt@entry=0x2d9f7a0, queryString=queryString@entry=0x2d9e8a0
"VACUUMANALYZE alarms", readOnlyTree=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL,  
    params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0x2d9f890, qc=qc@entry=0x7fff46dea0c0) at
utility.c:866
#10 0x00007f52b891c450 in pgss_ProcessUtility (pstmt=0x2d9f7a0, queryString=0x2d9e8a0 "VACUUM ANALYZE alarms",
readOnlyTree=<optimizedout>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2d9f890,  
    qc=0x7fff46dea0c0) at pg_stat_statements.c:1143
#11 0x000000000085490a in PortalRunUtility (portal=portal@entry=0x2e20fc0, pstmt=0x2d9f7a0,
isTopLevel=isTopLevel@entry=true,setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x2d9f890, qc=0x7fff46dea0c0) at
pquery.c:1158
#12 0x0000000000854a53 in PortalRunMulti (portal=portal@entry=0x2e20fc0, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false,dest=dest@entry=0x2d9f890, altdest=altdest@entry=0x2d9f890,  
    qc=qc@entry=0x7fff46dea0c0) at pquery.c:1322
#13 0x0000000000855029 in PortalRun (portal=0x2e20fc0, count=9223372036854775807, isTopLevel=<optimized out>,
run_once=<optimizedout>, dest=0x2d9f890, altdest=0x2d9f890, qc=0x7fff46dea0c0) at pquery.c:791 
#14 0x000000000085099b in exec_simple_query (query_string=0x2d9e8a0 "VACUUM ANALYZE alarms") at postgres.c:1250
#15 0x000000000085268a in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4581
#16 0x0000000000496a21 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4504
#17 BackendStartup (port=0x2dbe9c0) at postmaster.c:4232
#18 ServerLoop () at postmaster.c:1806
#19 0x00000000007b3980 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x2d99280) at postmaster.c:1478
#20 0x00000000004986a6 in main (argc=3, argv=0x2d99280) at main.c:202

(gdb) p *pvs
$2 = {pcxt = 0x2e84490, indrels = 0x2e84220, nindexes = 8, shared = 0x2aaaaf142380, indstats = 0x2aaaaf1423c0,
dead_items= 0x2aaaab142380, buffer_usage = 0x2aaaab142260, wal_usage = 0x2aaaab142220,  
  will_parallel_vacuum = 0x2e8f750, nindexes_parallel_bulkdel = 5, nindexes_parallel_cleanup = 0,
nindexes_parallel_condcleanup= 5, bstrategy = 0x2dbed40, relnamespace = 0x0, relname = 0x0, indname = 0x0,  
  status = PARALLEL_INDVAC_STATUS_INITIAL}

(gdb) info locals
indstats = <optimized out>
i = <optimized out>
nworkers = 2

(gdb) p *pvs
$4 = {pcxt = 0x2e84490, indrels = 0x2e84220, nindexes = 8, shared = 0x2aaaaf142380, indstats = 0x2aaaaf1423c0,
dead_items= 0x2aaaab142380, buffer_usage = 0x2aaaab142260, wal_usage = 0x2aaaab142220,  
  will_parallel_vacuum = 0x2e8f750, nindexes_parallel_bulkdel = 5, nindexes_parallel_cleanup = 0,
nindexes_parallel_condcleanup= 5, bstrategy = 0x2dbed40, relnamespace = 0x0, relname = 0x0, indname = 0x0,  
  status = PARALLEL_INDVAC_STATUS_INITIAL}

I reproduced it like this:

pryzbyj=# VACUUM (PARALLEL 2,VERBOSE,INDEX_CLEANUP on) alarms; -- DISABLE_PAGE_SKIPPING true
INFO:  vacuuming "pryzbyj.public.alarms"
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.

So I'll be back shortly with more...

--
Justin



Re: pg15b3: crash in paralell vacuum

From
Justin Pryzby
Date:
On Thu, Aug 18, 2022 at 08:34:06AM -0500, Justin Pryzby wrote:
> Unfortunately, it looks like the RPM packages are compiled with -O2, so this is
> of limited use.  So I'll be back shortly with more...

#3  0x00000000006874f1 in parallel_vacuum_process_all_indexes (pvs=0x25bdce0, num_index_scans=0,
vacuum=vacuum@entry=false)at vacuumparallel.c:611
 
611                     Assert(indstats->status == PARALLEL_INDVAC_STATUS_INITIAL);

(gdb) p *pvs
$1 = {pcxt = 0x25bc1e0, indrels = 0x25bbf70, nindexes = 8, shared = 0x7fc5184393a0, indstats = 0x7fc5184393e0,
dead_items= 0x7fc5144393a0, buffer_usage = 0x7fc514439280, wal_usage = 0x7fc514439240, 
 
  will_parallel_vacuum = 0x266d818, nindexes_parallel_bulkdel = 5, nindexes_parallel_cleanup = 0,
nindexes_parallel_condcleanup= 5, bstrategy = 0x264f120, relnamespace = 0x0, relname = 0x0, indname = 0x0, 
 
  status = PARALLEL_INDVAC_STATUS_INITIAL}

(gdb) p *indstats
$2 = {status = 11, parallel_workers_can_process = false, istat_updated = false, istat = {num_pages = 0, estimated_count
=false, num_index_tuples = 0, tuples_removed = 0, pages_newly_deleted = 0, pages_deleted = 1, 
 
    pages_free = 0}}

(gdb) bt f
...
#3  0x00000000006874f1 in parallel_vacuum_process_all_indexes (pvs=0x25bdce0, num_index_scans=0,
vacuum=vacuum@entry=false)at vacuumparallel.c:611
 
        indstats = 0x7fc5184393e0
        i = 0
        nworkers = 2
        new_status = PARALLEL_INDVAC_STATUS_NEED_CLEANUP
        __func__ = "parallel_vacuum_process_all_indexes"
#4  0x0000000000687ef0 in parallel_vacuum_cleanup_all_indexes (pvs=<optimized out>,
num_table_tuples=num_table_tuples@entry=409149,num_index_scans=<optimized out>,
estimated_count=estimated_count@entry=true)
    at vacuumparallel.c:486
No locals.
#5  0x00000000004f80b8 in lazy_cleanup_all_indexes (vacrel=vacrel@entry=0x25bc510) at vacuumlazy.c:2679
        reltuples = 409149
        estimated_count = true
#6  0x00000000004f884a in lazy_scan_heap (vacrel=vacrel@entry=0x25bc510) at vacuumlazy.c:1278
        rel_pages = 67334
        blkno = 67334
        next_unskippable_block = 67334
        next_failsafe_block = 0
        next_fsm_block_to_vacuum = 0
        dead_items = 0x7fc5144393a0
        vmbuffer = 1300
        next_unskippable_allvis = true
        skipping_current_range = false
        initprog_index = {0, 1, 5}
        initprog_val = {1, 67334, 11184809}
        __func__ = "lazy_scan_heap"
#7  0x00000000004f925f in heap_vacuum_rel (rel=0x7fc52df6b820, params=0x7ffd74f74620, bstrategy=0x264f120) at
vacuumlazy.c:534
        vacrel = 0x25bc510
        verbose = true
        instrument = <optimized out>
        aggressive = false
        skipwithvm = true
        frozenxid_updated = false
        minmulti_updated = false
        OldestXmin = 32759288
        FreezeLimit = 4277726584
        OldestMxact = 157411
        MultiXactCutoff = 4290124707
        orig_rel_pages = 67334
        new_rel_pages = <optimized out>
        new_rel_allvisible = 4
        ru0 = {tv = {tv_sec = 1660830451, tv_usec = 473980}, ru = {ru_utime = {tv_sec = 0, tv_usec = 317891}, ru_stime
={tv_sec = 1, tv_usec = 212372}, {ru_maxrss = 74524, __ru_maxrss_word = 74524}, {ru_ixrss = 0, 
 
              __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0},
{ru_minflt= 18870, __ru_minflt_word = 18870}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, 
 
              __ru_nswap_word = 0}, {ru_inblock = 1124750, __ru_inblock_word = 1124750}, {ru_oublock = 0,
__ru_oublock_word= 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0,

              __ru_nsignals_word = 0}, {ru_nvcsw = 42, __ru_nvcsw_word = 42}, {ru_nivcsw = 35, __ru_nivcsw_word =
35}}}
        starttime = 714145651473980
        startreadtime = 0
        startwritetime = 0
        startwalusage = {wal_records = 2, wal_fpi = 0, wal_bytes = 421}
        StartPageHit = 50
        StartPageMiss = 0
        StartPageDirty = 0
        errcallback = {previous = 0x0, callback = 0x4f5f41 <vacuum_error_callback>, arg = 0x25bc510}
        indnames = 0x266d838
        __func__ = "heap_vacuum_rel"

This is a qemu VM which (full disclosure) has crashed a few times recently due
to OOM.  This is probably a postgres bug, but conceivably it's being tickled by
bad data (although the vm crashing shouldn't cause that, either, following
recovery).  This is also an instance that was pg_upgraded from v14 (and earlier
versions) to v15b1 and then b2, so it's conceivably possible there's weird data
pages that wouldn't be written by beta3.  But that doesn't seem to be the issue
here anyway.

-- 
Justin



Re: pg15b3: crash in paralell vacuum

From
Masahiko Sawada
Date:
Hi,

On Thu, Aug 18, 2022 at 10:34 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> Immediately after upgrading an internal instance, a loop around "vacuum" did
> this:

Thank you for the report!

>
> TRAP: FailedAssertion("indstats->status == PARALLEL_INDVAC_STATUS_INITIAL", File: "vacuumparallel.c", Line: 611, PID:
27635)
> postgres: postgres pryzbyj [local] VACUUM(ExceptionalCondition+0x8d)[0x99d9fd]
> postgres: postgres pryzbyj [local] VACUUM[0x6915db]
> postgres: postgres pryzbyj [local] VACUUM(heap_vacuum_rel+0x12b6)[0x5083e6]
> postgres: postgres pryzbyj [local] VACUUM[0x68e97a]
> postgres: postgres pryzbyj [local] VACUUM(vacuum+0x48e)[0x68fe9e]
> postgres: postgres pryzbyj [local] VACUUM(ExecVacuum+0x2ae)[0x69065e]
> postgres: postgres pryzbyj [local] VACUUM(standard_ProcessUtility+0x530)[0x8567b0]
> /usr/pgsql-15/lib/pg_stat_statements.so(+0x5450)[0x7f52b891c450]
> postgres: postgres pryzbyj [local] VACUUM[0x85490a]
> postgres: postgres pryzbyj [local] VACUUM[0x854a53]
> postgres: postgres pryzbyj [local] VACUUM(PortalRun+0x179)[0x855029]
> postgres: postgres pryzbyj [local] VACUUM[0x85099b]
> postgres: postgres pryzbyj [local] VACUUM(PostgresMain+0x199a)[0x85268a]
> postgres: postgres pryzbyj [local] VACUUM[0x496a21]
> postgres: postgres pryzbyj [local] VACUUM(PostmasterMain+0x11c0)[0x7b3980]
> postgres: postgres pryzbyj [local] VACUUM(main+0x1c6)[0x4986a6]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f52c4b893d5]
> postgres: postgres pryzbyj [local] VACUUM[0x498c59]
> < 2022-08-18 07:56:51.963 CDT  >LOG:  server process (PID 27635) was terminated by signal 6: Aborted
> < 2022-08-18 07:56:51.963 CDT  >DETAIL:  Failed process was running: VACUUM ANALYZE alarms
>
> Unfortunately, it looks like the RPM packages are compiled with -O2, so this is
> of limited use.
>
> Core was generated by `postgres: postgres pryzbyj [local] VACUUM              '.
> Program terminated with signal 6, Aborted.
> #0  0x00007f52c4b9d207 in raise () from /lib64/libc.so.6
> Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.4-4.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64
cyrus-sasl-lib-2.1.26-23.el7.x86_64elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64
glibc-2.17-260.el7_6.3.x86_64keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-51.el7_9.x86_64
libattr-2.4.46-13.el7.x86_64libcap-2.22-9.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64
libgcc-4.8.5-39.el7.x86_64libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64 libicu-50.1.2-17.el7.x86_64
libselinux-2.5-15.el7.x86_64libstdc++-4.8.5-39.el7.x86_64 libxml2-2.9.1-6.el7_9.6.x86_64 libzstd-1.5.2-1.el7.x86_64
lz4-1.7.5-2.el7.x86_64nspr-4.19.0-1.el7_5.x86_64 nss-3.36.0-7.1.el7_6.x86_64 nss-softokn-freebl-3.36.0-5.el7_5.x86_64
nss-util-3.36.0-1.1.el7_6.x86_64openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-22.el7_9.x86_64
pam-1.1.8-22.el7.x86_64pcre-8.32-17.el7.x86_64 systemd-libs-219-62.el7_6.5.x86_64 xz-libs-5.2.2-1.el7.x86_64
zlib-1.2.7-18.el7.x86_64
> (gdb) bt
> #0  0x00007f52c4b9d207 in raise () from /lib64/libc.so.6
> #1  0x00007f52c4b9e8f8 in abort () from /lib64/libc.so.6
> #2  0x000000000099da1e in ExceptionalCondition (conditionName=conditionName@entry=0xafae40 "indstats->status ==
PARALLEL_INDVAC_STATUS_INITIAL",errorType=errorType@entry=0x9fb4b7 "FailedAssertion", 
>     fileName=fileName@entry=0xafb0c0 "vacuumparallel.c", lineNumber=lineNumber@entry=611) at assert.c:69
> #3  0x00000000006915db in parallel_vacuum_process_all_indexes (pvs=0x2e85f80, num_index_scans=<optimized out>,
vacuum=<optimizedout>) at vacuumparallel.c:611 
> #4  0x00000000005083e6 in heap_vacuum_rel (rel=<optimized out>, params=<optimized out>, bstrategy=<optimized out>) at
vacuumlazy.c:2679

It seems that parallel_vacuum_cleanup_all_indexes() got called[1],
which means this was the first time to perform parallel vacuum (i.e.,
index cleanup).

I'm not convinced yet but it could be a culprit that we missed doing
memset(0) for the shared array of PVIndStats in
parallel_vacuum_init(). This shared array was introduced in PG15.

[1] https://github.com/postgres/postgres/blob/REL_15_STABLE/src/backend/access/heap/vacuumlazy.c#L2679

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: pg15b3: crash in paralell vacuum

From
Masahiko Sawada
Date:
On Thu, Aug 18, 2022 at 11:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi,
>
> On Thu, Aug 18, 2022 at 10:34 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> >
> > Immediately after upgrading an internal instance, a loop around "vacuum" did
> > this:
>
> Thank you for the report!
>
> >
> > TRAP: FailedAssertion("indstats->status == PARALLEL_INDVAC_STATUS_INITIAL", File: "vacuumparallel.c", Line: 611,
PID:27635) 
> > postgres: postgres pryzbyj [local] VACUUM(ExceptionalCondition+0x8d)[0x99d9fd]
> > postgres: postgres pryzbyj [local] VACUUM[0x6915db]
> > postgres: postgres pryzbyj [local] VACUUM(heap_vacuum_rel+0x12b6)[0x5083e6]
> > postgres: postgres pryzbyj [local] VACUUM[0x68e97a]
> > postgres: postgres pryzbyj [local] VACUUM(vacuum+0x48e)[0x68fe9e]
> > postgres: postgres pryzbyj [local] VACUUM(ExecVacuum+0x2ae)[0x69065e]
> > postgres: postgres pryzbyj [local] VACUUM(standard_ProcessUtility+0x530)[0x8567b0]
> > /usr/pgsql-15/lib/pg_stat_statements.so(+0x5450)[0x7f52b891c450]
> > postgres: postgres pryzbyj [local] VACUUM[0x85490a]
> > postgres: postgres pryzbyj [local] VACUUM[0x854a53]
> > postgres: postgres pryzbyj [local] VACUUM(PortalRun+0x179)[0x855029]
> > postgres: postgres pryzbyj [local] VACUUM[0x85099b]
> > postgres: postgres pryzbyj [local] VACUUM(PostgresMain+0x199a)[0x85268a]
> > postgres: postgres pryzbyj [local] VACUUM[0x496a21]
> > postgres: postgres pryzbyj [local] VACUUM(PostmasterMain+0x11c0)[0x7b3980]
> > postgres: postgres pryzbyj [local] VACUUM(main+0x1c6)[0x4986a6]
> > /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f52c4b893d5]
> > postgres: postgres pryzbyj [local] VACUUM[0x498c59]
> > < 2022-08-18 07:56:51.963 CDT  >LOG:  server process (PID 27635) was terminated by signal 6: Aborted
> > < 2022-08-18 07:56:51.963 CDT  >DETAIL:  Failed process was running: VACUUM ANALYZE alarms
> >
> > Unfortunately, it looks like the RPM packages are compiled with -O2, so this is
> > of limited use.
> >
> > Core was generated by `postgres: postgres pryzbyj [local] VACUUM              '.
> > Program terminated with signal 6, Aborted.
> > #0  0x00007f52c4b9d207 in raise () from /lib64/libc.so.6
> > Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.4-4.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64
cyrus-sasl-lib-2.1.26-23.el7.x86_64elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64
glibc-2.17-260.el7_6.3.x86_64keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-51.el7_9.x86_64
libattr-2.4.46-13.el7.x86_64libcap-2.22-9.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64
libgcc-4.8.5-39.el7.x86_64libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64 libicu-50.1.2-17.el7.x86_64
libselinux-2.5-15.el7.x86_64libstdc++-4.8.5-39.el7.x86_64 libxml2-2.9.1-6.el7_9.6.x86_64 libzstd-1.5.2-1.el7.x86_64
lz4-1.7.5-2.el7.x86_64nspr-4.19.0-1.el7_5.x86_64 nss-3.36.0-7.1.el7_6.x86_64 nss-softokn-freebl-3.36.0-5.el7_5.x86_64
nss-util-3.36.0-1.1.el7_6.x86_64openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-22.el7_9.x86_64
pam-1.1.8-22.el7.x86_64pcre-8.32-17.el7.x86_64 systemd-libs-219-62.el7_6.5.x86_64 xz-libs-5.2.2-1.el7.x86_64
zlib-1.2.7-18.el7.x86_64
> > (gdb) bt
> > #0  0x00007f52c4b9d207 in raise () from /lib64/libc.so.6
> > #1  0x00007f52c4b9e8f8 in abort () from /lib64/libc.so.6
> > #2  0x000000000099da1e in ExceptionalCondition (conditionName=conditionName@entry=0xafae40 "indstats->status ==
PARALLEL_INDVAC_STATUS_INITIAL",errorType=errorType@entry=0x9fb4b7 "FailedAssertion", 
> >     fileName=fileName@entry=0xafb0c0 "vacuumparallel.c", lineNumber=lineNumber@entry=611) at assert.c:69
> > #3  0x00000000006915db in parallel_vacuum_process_all_indexes (pvs=0x2e85f80, num_index_scans=<optimized out>,
vacuum=<optimizedout>) at vacuumparallel.c:611 
> > #4  0x00000000005083e6 in heap_vacuum_rel (rel=<optimized out>, params=<optimized out>, bstrategy=<optimized out>)
atvacuumlazy.c:2679 
>
> It seems that parallel_vacuum_cleanup_all_indexes() got called[1],
> which means this was the first time to perform parallel vacuum (i.e.,
> index cleanup).

Sorry, this explanation is wrong. But according to the recent
information from Justin it was the first time to perform parallel
vacuum:

#3  0x00000000006874f1 in parallel_vacuum_process_all_indexes
(pvs=0x25bdce0, num_index_scans=0, vacuum=vacuum@entry=false) at
vacuumparallel.c:611
611                     Assert(indstats->status ==
PARALLEL_INDVAC_STATUS_INITIAL);

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: pg15b3: crash in paralell vacuum

From
Masahiko Sawada
Date:
On Thu, Aug 18, 2022 at 11:04 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Thu, Aug 18, 2022 at 08:34:06AM -0500, Justin Pryzby wrote:
> > Unfortunately, it looks like the RPM packages are compiled with -O2, so this is
> > of limited use.  So I'll be back shortly with more...
>
> #3  0x00000000006874f1 in parallel_vacuum_process_all_indexes (pvs=0x25bdce0, num_index_scans=0,
vacuum=vacuum@entry=false)at vacuumparallel.c:611
 
> 611                     Assert(indstats->status == PARALLEL_INDVAC_STATUS_INITIAL);
>
> (gdb) p *pvs
> $1 = {pcxt = 0x25bc1e0, indrels = 0x25bbf70, nindexes = 8, shared = 0x7fc5184393a0, indstats = 0x7fc5184393e0,
dead_items= 0x7fc5144393a0, buffer_usage = 0x7fc514439280, wal_usage = 0x7fc514439240,
 
>   will_parallel_vacuum = 0x266d818, nindexes_parallel_bulkdel = 5, nindexes_parallel_cleanup = 0,
nindexes_parallel_condcleanup= 5, bstrategy = 0x264f120, relnamespace = 0x0, relname = 0x0, indname = 0x0,
 
>   status = PARALLEL_INDVAC_STATUS_INITIAL}
>
> (gdb) p *indstats
> $2 = {status = 11, parallel_workers_can_process = false, istat_updated = false, istat = {num_pages = 0,
estimated_count= false, num_index_tuples = 0, tuples_removed = 0, pages_newly_deleted = 0, pages_deleted = 1,
 
>     pages_free = 0}}

The status = 11 is invalid value. Probably because indstats was not
initialized to 0 as I mentioned.

Justin, if it's reproducible in your environment, could you please try
it again with the attached patch?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

Re: pg15b3: crash in paralell vacuum

From
Justin Pryzby
Date:
On Thu, Aug 18, 2022 at 11:24:22PM +0900, Masahiko Sawada wrote:
> The status = 11 is invalid value. Probably because indstats was not
> initialized to 0 as I mentioned.
> 
> Justin, if it's reproducible in your environment, could you please try
> it again with the attached patch?

Yes, this seems to resolve the problem.

Thanks,
-- 
Justin



Re: pg15b3: crash in parallel vacuum

From
Justin Pryzby
Date:
On Thu, Aug 18, 2022 at 09:52:36AM -0500, Justin Pryzby wrote:
> On Thu, Aug 18, 2022 at 11:24:22PM +0900, Masahiko Sawada wrote:
> > The status = 11 is invalid value. Probably because indstats was not
> > initialized to 0 as I mentioned.
> > 
> > Justin, if it's reproducible in your environment, could you please try
> > it again with the attached patch?
> 
> Yes, this seems to resolve the problem.

It seems a bit crazy that this escaped detection until now.
Are these allocations especially vulnerable to uninitialized data ?

-- 
Justin



Re: pg15b3: crash in paralell vacuum

From
Peter Geoghegan
Date:
On Thu, Aug 18, 2022 at 7:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Justin, if it's reproducible in your environment, could you please try
> it again with the attached patch?

Pushed, thanks.

I wonder how this issue could have been caught earlier, or even
avoided in the first place. Would the bug have been caught if Valgrind
had known to mark dynamic shared memory VALGRIND_MAKE_MEM_UNDEFINED()
when it is first allocated? ISTM that we should do something that is
analogous to aset.c's Valgrind handling for palloc() requests.

Similar work on buffers in shared memory led to us catching a tricky
bug involving unsafe access to a buffer, a little while ago -- see
bugfix commit 7b7ed046. The bug in question would probably have taken
much longer to catch without the instrumentation. In fact, it seems
like a good idea to use Valgrind for *anything* where it *might* catch
bugs, just in case.

Valgrind can work well for shared memory without any extra work. The
backend's own idea of the memory (the memory mapping used by the
process) is all that Valgrind cares about. You don't have to worry
about Valgrind instrumentation in one backend causing confusion in
another backend. It's very practical, and very general purpose. I
think that most of the protection comes from a basic understanding of
"this memory is unsafe to access, this memory contains uninitialized
data that cannot be assumed to have any particular value, this memory
is initialized and safe".

--
Peter Geoghegan