Thread: pg15b3: crash in paralell vacuum
Immediately after upgrading an internal instance, a loop around "vacuum" did this: TRAP: FailedAssertion("indstats->status == PARALLEL_INDVAC_STATUS_INITIAL", File: "vacuumparallel.c", Line: 611, PID: 27635) postgres: postgres pryzbyj [local] VACUUM(ExceptionalCondition+0x8d)[0x99d9fd] postgres: postgres pryzbyj [local] VACUUM[0x6915db] postgres: postgres pryzbyj [local] VACUUM(heap_vacuum_rel+0x12b6)[0x5083e6] postgres: postgres pryzbyj [local] VACUUM[0x68e97a] postgres: postgres pryzbyj [local] VACUUM(vacuum+0x48e)[0x68fe9e] postgres: postgres pryzbyj [local] VACUUM(ExecVacuum+0x2ae)[0x69065e] postgres: postgres pryzbyj [local] VACUUM(standard_ProcessUtility+0x530)[0x8567b0] /usr/pgsql-15/lib/pg_stat_statements.so(+0x5450)[0x7f52b891c450] postgres: postgres pryzbyj [local] VACUUM[0x85490a] postgres: postgres pryzbyj [local] VACUUM[0x854a53] postgres: postgres pryzbyj [local] VACUUM(PortalRun+0x179)[0x855029] postgres: postgres pryzbyj [local] VACUUM[0x85099b] postgres: postgres pryzbyj [local] VACUUM(PostgresMain+0x199a)[0x85268a] postgres: postgres pryzbyj [local] VACUUM[0x496a21] postgres: postgres pryzbyj [local] VACUUM(PostmasterMain+0x11c0)[0x7b3980] postgres: postgres pryzbyj [local] VACUUM(main+0x1c6)[0x4986a6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f52c4b893d5] postgres: postgres pryzbyj [local] VACUUM[0x498c59] < 2022-08-18 07:56:51.963 CDT >LOG: server process (PID 27635) was terminated by signal 6: Aborted < 2022-08-18 07:56:51.963 CDT >DETAIL: Failed process was running: VACUUM ANALYZE alarms Unfortunately, it looks like the RPM packages are compiled with -O2, so this is of limited use. Core was generated by `postgres: postgres pryzbyj [local] VACUUM '. Program terminated with signal 6, Aborted. #0 0x00007f52c4b9d207 in raise () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.4-4.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-lib-2.1.26-23.el7.x86_64elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-260.el7_6.3.x86_64keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-51.el7_9.x86_64 libattr-2.4.46-13.el7.x86_64libcap-2.22-9.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libgcc-4.8.5-39.el7.x86_64libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64 libicu-50.1.2-17.el7.x86_64 libselinux-2.5-15.el7.x86_64libstdc++-4.8.5-39.el7.x86_64 libxml2-2.9.1-6.el7_9.6.x86_64 libzstd-1.5.2-1.el7.x86_64 lz4-1.7.5-2.el7.x86_64nspr-4.19.0-1.el7_5.x86_64 nss-3.36.0-7.1.el7_6.x86_64 nss-softokn-freebl-3.36.0-5.el7_5.x86_64 nss-util-3.36.0-1.1.el7_6.x86_64openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-22.el7_9.x86_64 pam-1.1.8-22.el7.x86_64pcre-8.32-17.el7.x86_64 systemd-libs-219-62.el7_6.5.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64 (gdb) bt #0 0x00007f52c4b9d207 in raise () from /lib64/libc.so.6 #1 0x00007f52c4b9e8f8 in abort () from /lib64/libc.so.6 #2 0x000000000099da1e in ExceptionalCondition (conditionName=conditionName@entry=0xafae40 "indstats->status == PARALLEL_INDVAC_STATUS_INITIAL",errorType=errorType@entry=0x9fb4b7 "FailedAssertion", fileName=fileName@entry=0xafb0c0 "vacuumparallel.c", lineNumber=lineNumber@entry=611) at assert.c:69 #3 0x00000000006915db in parallel_vacuum_process_all_indexes (pvs=0x2e85f80, num_index_scans=<optimized out>, vacuum=<optimizedout>) at vacuumparallel.c:611 #4 0x00000000005083e6 in heap_vacuum_rel (rel=<optimized out>, params=<optimized out>, bstrategy=<optimized out>) at vacuumlazy.c:2679 #5 0x000000000068e97a in table_relation_vacuum (bstrategy=<optimized out>, params=0x7fff46de9a80, rel=0x7f52c7bc2c10) at../../../src/include/access/tableam.h:1680 #6 vacuum_rel (relid=52187497, relation=<optimized out>, params=0x7fff46de9a80) at vacuum.c:2092 #7 0x000000000068fe9e in vacuum (relations=0x2dbeee8, params=params@entry=0x7fff46de9a80, bstrategy=<optimized out>, bstrategy@entry=0x0,isTopLevel=isTopLevel@entry=true) at vacuum.c:475 #8 0x000000000069065e in ExecVacuum (pstate=pstate@entry=0x2dc38d0, vacstmt=vacstmt@entry=0x2d9f3a0, isTopLevel=isTopLevel@entry=true)at vacuum.c:275 #9 0x00000000008567b0 in standard_ProcessUtility (pstmt=pstmt@entry=0x2d9f7a0, queryString=queryString@entry=0x2d9e8a0 "VACUUMANALYZE alarms", readOnlyTree=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0x2d9f890, qc=qc@entry=0x7fff46dea0c0) at utility.c:866 #10 0x00007f52b891c450 in pgss_ProcessUtility (pstmt=0x2d9f7a0, queryString=0x2d9e8a0 "VACUUM ANALYZE alarms", readOnlyTree=<optimizedout>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2d9f890, qc=0x7fff46dea0c0) at pg_stat_statements.c:1143 #11 0x000000000085490a in PortalRunUtility (portal=portal@entry=0x2e20fc0, pstmt=0x2d9f7a0, isTopLevel=isTopLevel@entry=true,setHoldSnapshot=setHoldSnapshot@entry=false, dest=0x2d9f890, qc=0x7fff46dea0c0) at pquery.c:1158 #12 0x0000000000854a53 in PortalRunMulti (portal=portal@entry=0x2e20fc0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,dest=dest@entry=0x2d9f890, altdest=altdest@entry=0x2d9f890, qc=qc@entry=0x7fff46dea0c0) at pquery.c:1322 #13 0x0000000000855029 in PortalRun (portal=0x2e20fc0, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimizedout>, dest=0x2d9f890, altdest=0x2d9f890, qc=0x7fff46dea0c0) at pquery.c:791 #14 0x000000000085099b in exec_simple_query (query_string=0x2d9e8a0 "VACUUM ANALYZE alarms") at postgres.c:1250 #15 0x000000000085268a in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4581 #16 0x0000000000496a21 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4504 #17 BackendStartup (port=0x2dbe9c0) at postmaster.c:4232 #18 ServerLoop () at postmaster.c:1806 #19 0x00000000007b3980 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x2d99280) at postmaster.c:1478 #20 0x00000000004986a6 in main (argc=3, argv=0x2d99280) at main.c:202 (gdb) p *pvs $2 = {pcxt = 0x2e84490, indrels = 0x2e84220, nindexes = 8, shared = 0x2aaaaf142380, indstats = 0x2aaaaf1423c0, dead_items= 0x2aaaab142380, buffer_usage = 0x2aaaab142260, wal_usage = 0x2aaaab142220, will_parallel_vacuum = 0x2e8f750, nindexes_parallel_bulkdel = 5, nindexes_parallel_cleanup = 0, nindexes_parallel_condcleanup= 5, bstrategy = 0x2dbed40, relnamespace = 0x0, relname = 0x0, indname = 0x0, status = PARALLEL_INDVAC_STATUS_INITIAL} (gdb) info locals indstats = <optimized out> i = <optimized out> nworkers = 2 (gdb) p *pvs $4 = {pcxt = 0x2e84490, indrels = 0x2e84220, nindexes = 8, shared = 0x2aaaaf142380, indstats = 0x2aaaaf1423c0, dead_items= 0x2aaaab142380, buffer_usage = 0x2aaaab142260, wal_usage = 0x2aaaab142220, will_parallel_vacuum = 0x2e8f750, nindexes_parallel_bulkdel = 5, nindexes_parallel_cleanup = 0, nindexes_parallel_condcleanup= 5, bstrategy = 0x2dbed40, relnamespace = 0x0, relname = 0x0, indname = 0x0, status = PARALLEL_INDVAC_STATUS_INITIAL} I reproduced it like this: pryzbyj=# VACUUM (PARALLEL 2,VERBOSE,INDEX_CLEANUP on) alarms; -- DISABLE_PAGE_SKIPPING true INFO: vacuuming "pryzbyj.public.alarms" server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. The connection to the server was lost. Attempting reset: Failed. So I'll be back shortly with more... -- Justin
On Thu, Aug 18, 2022 at 08:34:06AM -0500, Justin Pryzby wrote: > Unfortunately, it looks like the RPM packages are compiled with -O2, so this is > of limited use. So I'll be back shortly with more... #3 0x00000000006874f1 in parallel_vacuum_process_all_indexes (pvs=0x25bdce0, num_index_scans=0, vacuum=vacuum@entry=false)at vacuumparallel.c:611 611 Assert(indstats->status == PARALLEL_INDVAC_STATUS_INITIAL); (gdb) p *pvs $1 = {pcxt = 0x25bc1e0, indrels = 0x25bbf70, nindexes = 8, shared = 0x7fc5184393a0, indstats = 0x7fc5184393e0, dead_items= 0x7fc5144393a0, buffer_usage = 0x7fc514439280, wal_usage = 0x7fc514439240, will_parallel_vacuum = 0x266d818, nindexes_parallel_bulkdel = 5, nindexes_parallel_cleanup = 0, nindexes_parallel_condcleanup= 5, bstrategy = 0x264f120, relnamespace = 0x0, relname = 0x0, indname = 0x0, status = PARALLEL_INDVAC_STATUS_INITIAL} (gdb) p *indstats $2 = {status = 11, parallel_workers_can_process = false, istat_updated = false, istat = {num_pages = 0, estimated_count =false, num_index_tuples = 0, tuples_removed = 0, pages_newly_deleted = 0, pages_deleted = 1, pages_free = 0}} (gdb) bt f ... #3 0x00000000006874f1 in parallel_vacuum_process_all_indexes (pvs=0x25bdce0, num_index_scans=0, vacuum=vacuum@entry=false)at vacuumparallel.c:611 indstats = 0x7fc5184393e0 i = 0 nworkers = 2 new_status = PARALLEL_INDVAC_STATUS_NEED_CLEANUP __func__ = "parallel_vacuum_process_all_indexes" #4 0x0000000000687ef0 in parallel_vacuum_cleanup_all_indexes (pvs=<optimized out>, num_table_tuples=num_table_tuples@entry=409149,num_index_scans=<optimized out>, estimated_count=estimated_count@entry=true) at vacuumparallel.c:486 No locals. #5 0x00000000004f80b8 in lazy_cleanup_all_indexes (vacrel=vacrel@entry=0x25bc510) at vacuumlazy.c:2679 reltuples = 409149 estimated_count = true #6 0x00000000004f884a in lazy_scan_heap (vacrel=vacrel@entry=0x25bc510) at vacuumlazy.c:1278 rel_pages = 67334 blkno = 67334 next_unskippable_block = 67334 next_failsafe_block = 0 next_fsm_block_to_vacuum = 0 dead_items = 0x7fc5144393a0 vmbuffer = 1300 next_unskippable_allvis = true skipping_current_range = false initprog_index = {0, 1, 5} initprog_val = {1, 67334, 11184809} __func__ = "lazy_scan_heap" #7 0x00000000004f925f in heap_vacuum_rel (rel=0x7fc52df6b820, params=0x7ffd74f74620, bstrategy=0x264f120) at vacuumlazy.c:534 vacrel = 0x25bc510 verbose = true instrument = <optimized out> aggressive = false skipwithvm = true frozenxid_updated = false minmulti_updated = false OldestXmin = 32759288 FreezeLimit = 4277726584 OldestMxact = 157411 MultiXactCutoff = 4290124707 orig_rel_pages = 67334 new_rel_pages = <optimized out> new_rel_allvisible = 4 ru0 = {tv = {tv_sec = 1660830451, tv_usec = 473980}, ru = {ru_utime = {tv_sec = 0, tv_usec = 317891}, ru_stime ={tv_sec = 1, tv_usec = 212372}, {ru_maxrss = 74524, __ru_maxrss_word = 74524}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt= 18870, __ru_minflt_word = 18870}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 1124750, __ru_inblock_word = 1124750}, {ru_oublock = 0, __ru_oublock_word= 0}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 42, __ru_nvcsw_word = 42}, {ru_nivcsw = 35, __ru_nivcsw_word = 35}}} starttime = 714145651473980 startreadtime = 0 startwritetime = 0 startwalusage = {wal_records = 2, wal_fpi = 0, wal_bytes = 421} StartPageHit = 50 StartPageMiss = 0 StartPageDirty = 0 errcallback = {previous = 0x0, callback = 0x4f5f41 <vacuum_error_callback>, arg = 0x25bc510} indnames = 0x266d838 __func__ = "heap_vacuum_rel" This is a qemu VM which (full disclosure) has crashed a few times recently due to OOM. This is probably a postgres bug, but conceivably it's being tickled by bad data (although the vm crashing shouldn't cause that, either, following recovery). This is also an instance that was pg_upgraded from v14 (and earlier versions) to v15b1 and then b2, so it's conceivably possible there's weird data pages that wouldn't be written by beta3. But that doesn't seem to be the issue here anyway. -- Justin
Hi, On Thu, Aug 18, 2022 at 10:34 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > Immediately after upgrading an internal instance, a loop around "vacuum" did > this: Thank you for the report! > > TRAP: FailedAssertion("indstats->status == PARALLEL_INDVAC_STATUS_INITIAL", File: "vacuumparallel.c", Line: 611, PID: 27635) > postgres: postgres pryzbyj [local] VACUUM(ExceptionalCondition+0x8d)[0x99d9fd] > postgres: postgres pryzbyj [local] VACUUM[0x6915db] > postgres: postgres pryzbyj [local] VACUUM(heap_vacuum_rel+0x12b6)[0x5083e6] > postgres: postgres pryzbyj [local] VACUUM[0x68e97a] > postgres: postgres pryzbyj [local] VACUUM(vacuum+0x48e)[0x68fe9e] > postgres: postgres pryzbyj [local] VACUUM(ExecVacuum+0x2ae)[0x69065e] > postgres: postgres pryzbyj [local] VACUUM(standard_ProcessUtility+0x530)[0x8567b0] > /usr/pgsql-15/lib/pg_stat_statements.so(+0x5450)[0x7f52b891c450] > postgres: postgres pryzbyj [local] VACUUM[0x85490a] > postgres: postgres pryzbyj [local] VACUUM[0x854a53] > postgres: postgres pryzbyj [local] VACUUM(PortalRun+0x179)[0x855029] > postgres: postgres pryzbyj [local] VACUUM[0x85099b] > postgres: postgres pryzbyj [local] VACUUM(PostgresMain+0x199a)[0x85268a] > postgres: postgres pryzbyj [local] VACUUM[0x496a21] > postgres: postgres pryzbyj [local] VACUUM(PostmasterMain+0x11c0)[0x7b3980] > postgres: postgres pryzbyj [local] VACUUM(main+0x1c6)[0x4986a6] > /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f52c4b893d5] > postgres: postgres pryzbyj [local] VACUUM[0x498c59] > < 2022-08-18 07:56:51.963 CDT >LOG: server process (PID 27635) was terminated by signal 6: Aborted > < 2022-08-18 07:56:51.963 CDT >DETAIL: Failed process was running: VACUUM ANALYZE alarms > > Unfortunately, it looks like the RPM packages are compiled with -O2, so this is > of limited use. > > Core was generated by `postgres: postgres pryzbyj [local] VACUUM '. > Program terminated with signal 6, Aborted. > #0 0x00007f52c4b9d207 in raise () from /lib64/libc.so.6 > Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.4-4.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-lib-2.1.26-23.el7.x86_64elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-260.el7_6.3.x86_64keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-51.el7_9.x86_64 libattr-2.4.46-13.el7.x86_64libcap-2.22-9.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libgcc-4.8.5-39.el7.x86_64libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64 libicu-50.1.2-17.el7.x86_64 libselinux-2.5-15.el7.x86_64libstdc++-4.8.5-39.el7.x86_64 libxml2-2.9.1-6.el7_9.6.x86_64 libzstd-1.5.2-1.el7.x86_64 lz4-1.7.5-2.el7.x86_64nspr-4.19.0-1.el7_5.x86_64 nss-3.36.0-7.1.el7_6.x86_64 nss-softokn-freebl-3.36.0-5.el7_5.x86_64 nss-util-3.36.0-1.1.el7_6.x86_64openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-22.el7_9.x86_64 pam-1.1.8-22.el7.x86_64pcre-8.32-17.el7.x86_64 systemd-libs-219-62.el7_6.5.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64 > (gdb) bt > #0 0x00007f52c4b9d207 in raise () from /lib64/libc.so.6 > #1 0x00007f52c4b9e8f8 in abort () from /lib64/libc.so.6 > #2 0x000000000099da1e in ExceptionalCondition (conditionName=conditionName@entry=0xafae40 "indstats->status == PARALLEL_INDVAC_STATUS_INITIAL",errorType=errorType@entry=0x9fb4b7 "FailedAssertion", > fileName=fileName@entry=0xafb0c0 "vacuumparallel.c", lineNumber=lineNumber@entry=611) at assert.c:69 > #3 0x00000000006915db in parallel_vacuum_process_all_indexes (pvs=0x2e85f80, num_index_scans=<optimized out>, vacuum=<optimizedout>) at vacuumparallel.c:611 > #4 0x00000000005083e6 in heap_vacuum_rel (rel=<optimized out>, params=<optimized out>, bstrategy=<optimized out>) at vacuumlazy.c:2679 It seems that parallel_vacuum_cleanup_all_indexes() got called[1], which means this was the first time to perform parallel vacuum (i.e., index cleanup). I'm not convinced yet but it could be a culprit that we missed doing memset(0) for the shared array of PVIndStats in parallel_vacuum_init(). This shared array was introduced in PG15. [1] https://github.com/postgres/postgres/blob/REL_15_STABLE/src/backend/access/heap/vacuumlazy.c#L2679 Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, Aug 18, 2022 at 11:06 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Hi, > > On Thu, Aug 18, 2022 at 10:34 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > > Immediately after upgrading an internal instance, a loop around "vacuum" did > > this: > > Thank you for the report! > > > > > TRAP: FailedAssertion("indstats->status == PARALLEL_INDVAC_STATUS_INITIAL", File: "vacuumparallel.c", Line: 611, PID:27635) > > postgres: postgres pryzbyj [local] VACUUM(ExceptionalCondition+0x8d)[0x99d9fd] > > postgres: postgres pryzbyj [local] VACUUM[0x6915db] > > postgres: postgres pryzbyj [local] VACUUM(heap_vacuum_rel+0x12b6)[0x5083e6] > > postgres: postgres pryzbyj [local] VACUUM[0x68e97a] > > postgres: postgres pryzbyj [local] VACUUM(vacuum+0x48e)[0x68fe9e] > > postgres: postgres pryzbyj [local] VACUUM(ExecVacuum+0x2ae)[0x69065e] > > postgres: postgres pryzbyj [local] VACUUM(standard_ProcessUtility+0x530)[0x8567b0] > > /usr/pgsql-15/lib/pg_stat_statements.so(+0x5450)[0x7f52b891c450] > > postgres: postgres pryzbyj [local] VACUUM[0x85490a] > > postgres: postgres pryzbyj [local] VACUUM[0x854a53] > > postgres: postgres pryzbyj [local] VACUUM(PortalRun+0x179)[0x855029] > > postgres: postgres pryzbyj [local] VACUUM[0x85099b] > > postgres: postgres pryzbyj [local] VACUUM(PostgresMain+0x199a)[0x85268a] > > postgres: postgres pryzbyj [local] VACUUM[0x496a21] > > postgres: postgres pryzbyj [local] VACUUM(PostmasterMain+0x11c0)[0x7b3980] > > postgres: postgres pryzbyj [local] VACUUM(main+0x1c6)[0x4986a6] > > /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f52c4b893d5] > > postgres: postgres pryzbyj [local] VACUUM[0x498c59] > > < 2022-08-18 07:56:51.963 CDT >LOG: server process (PID 27635) was terminated by signal 6: Aborted > > < 2022-08-18 07:56:51.963 CDT >DETAIL: Failed process was running: VACUUM ANALYZE alarms > > > > Unfortunately, it looks like the RPM packages are compiled with -O2, so this is > > of limited use. > > > > Core was generated by `postgres: postgres pryzbyj [local] VACUUM '. > > Program terminated with signal 6, Aborted. > > #0 0x00007f52c4b9d207 in raise () from /lib64/libc.so.6 > > Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.4-4.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-lib-2.1.26-23.el7.x86_64elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-260.el7_6.3.x86_64keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-51.el7_9.x86_64 libattr-2.4.46-13.el7.x86_64libcap-2.22-9.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libgcc-4.8.5-39.el7.x86_64libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64 libicu-50.1.2-17.el7.x86_64 libselinux-2.5-15.el7.x86_64libstdc++-4.8.5-39.el7.x86_64 libxml2-2.9.1-6.el7_9.6.x86_64 libzstd-1.5.2-1.el7.x86_64 lz4-1.7.5-2.el7.x86_64nspr-4.19.0-1.el7_5.x86_64 nss-3.36.0-7.1.el7_6.x86_64 nss-softokn-freebl-3.36.0-5.el7_5.x86_64 nss-util-3.36.0-1.1.el7_6.x86_64openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-22.el7_9.x86_64 pam-1.1.8-22.el7.x86_64pcre-8.32-17.el7.x86_64 systemd-libs-219-62.el7_6.5.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64 > > (gdb) bt > > #0 0x00007f52c4b9d207 in raise () from /lib64/libc.so.6 > > #1 0x00007f52c4b9e8f8 in abort () from /lib64/libc.so.6 > > #2 0x000000000099da1e in ExceptionalCondition (conditionName=conditionName@entry=0xafae40 "indstats->status == PARALLEL_INDVAC_STATUS_INITIAL",errorType=errorType@entry=0x9fb4b7 "FailedAssertion", > > fileName=fileName@entry=0xafb0c0 "vacuumparallel.c", lineNumber=lineNumber@entry=611) at assert.c:69 > > #3 0x00000000006915db in parallel_vacuum_process_all_indexes (pvs=0x2e85f80, num_index_scans=<optimized out>, vacuum=<optimizedout>) at vacuumparallel.c:611 > > #4 0x00000000005083e6 in heap_vacuum_rel (rel=<optimized out>, params=<optimized out>, bstrategy=<optimized out>) atvacuumlazy.c:2679 > > It seems that parallel_vacuum_cleanup_all_indexes() got called[1], > which means this was the first time to perform parallel vacuum (i.e., > index cleanup). Sorry, this explanation is wrong. But according to the recent information from Justin it was the first time to perform parallel vacuum: #3 0x00000000006874f1 in parallel_vacuum_process_all_indexes (pvs=0x25bdce0, num_index_scans=0, vacuum=vacuum@entry=false) at vacuumparallel.c:611 611 Assert(indstats->status == PARALLEL_INDVAC_STATUS_INITIAL); Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, Aug 18, 2022 at 11:04 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Thu, Aug 18, 2022 at 08:34:06AM -0500, Justin Pryzby wrote: > > Unfortunately, it looks like the RPM packages are compiled with -O2, so this is > > of limited use. So I'll be back shortly with more... > > #3 0x00000000006874f1 in parallel_vacuum_process_all_indexes (pvs=0x25bdce0, num_index_scans=0, vacuum=vacuum@entry=false)at vacuumparallel.c:611 > 611 Assert(indstats->status == PARALLEL_INDVAC_STATUS_INITIAL); > > (gdb) p *pvs > $1 = {pcxt = 0x25bc1e0, indrels = 0x25bbf70, nindexes = 8, shared = 0x7fc5184393a0, indstats = 0x7fc5184393e0, dead_items= 0x7fc5144393a0, buffer_usage = 0x7fc514439280, wal_usage = 0x7fc514439240, > will_parallel_vacuum = 0x266d818, nindexes_parallel_bulkdel = 5, nindexes_parallel_cleanup = 0, nindexes_parallel_condcleanup= 5, bstrategy = 0x264f120, relnamespace = 0x0, relname = 0x0, indname = 0x0, > status = PARALLEL_INDVAC_STATUS_INITIAL} > > (gdb) p *indstats > $2 = {status = 11, parallel_workers_can_process = false, istat_updated = false, istat = {num_pages = 0, estimated_count= false, num_index_tuples = 0, tuples_removed = 0, pages_newly_deleted = 0, pages_deleted = 1, > pages_free = 0}} The status = 11 is invalid value. Probably because indstats was not initialized to 0 as I mentioned. Justin, if it's reproducible in your environment, could you please try it again with the attached patch? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachment
On Thu, Aug 18, 2022 at 11:24:22PM +0900, Masahiko Sawada wrote: > The status = 11 is invalid value. Probably because indstats was not > initialized to 0 as I mentioned. > > Justin, if it's reproducible in your environment, could you please try > it again with the attached patch? Yes, this seems to resolve the problem. Thanks, -- Justin
On Thu, Aug 18, 2022 at 09:52:36AM -0500, Justin Pryzby wrote: > On Thu, Aug 18, 2022 at 11:24:22PM +0900, Masahiko Sawada wrote: > > The status = 11 is invalid value. Probably because indstats was not > > initialized to 0 as I mentioned. > > > > Justin, if it's reproducible in your environment, could you please try > > it again with the attached patch? > > Yes, this seems to resolve the problem. It seems a bit crazy that this escaped detection until now. Are these allocations especially vulnerable to uninitialized data ? -- Justin
On Thu, Aug 18, 2022 at 7:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Justin, if it's reproducible in your environment, could you please try > it again with the attached patch? Pushed, thanks. I wonder how this issue could have been caught earlier, or even avoided in the first place. Would the bug have been caught if Valgrind had known to mark dynamic shared memory VALGRIND_MAKE_MEM_UNDEFINED() when it is first allocated? ISTM that we should do something that is analogous to aset.c's Valgrind handling for palloc() requests. Similar work on buffers in shared memory led to us catching a tricky bug involving unsafe access to a buffer, a little while ago -- see bugfix commit 7b7ed046. The bug in question would probably have taken much longer to catch without the instrumentation. In fact, it seems like a good idea to use Valgrind for *anything* where it *might* catch bugs, just in case. Valgrind can work well for shared memory without any extra work. The backend's own idea of the memory (the memory mapping used by the process) is all that Valgrind cares about. You don't have to worry about Valgrind instrumentation in one backend causing confusion in another backend. It's very practical, and very general purpose. I think that most of the protection comes from a basic understanding of "this memory is unsafe to access, this memory contains uninitialized data that cannot be assumed to have any particular value, this memory is initialized and safe". -- Peter Geoghegan