Thread: pg16b2: REINDEX segv on null pointer in RemoveFromWaitQueue

pg16b2: REINDEX segv on null pointer in RemoveFromWaitQueue

From: Justin Pryzby

An instance compiled locally, without assertions, failed like this:

< 2023-07-09 22:04:46.470 UTC  >LOG:  process 30002 detected deadlock while waiting for ShareLock on transaction 813219577 after 333.228 ms
< 2023-07-09 22:04:46.470 UTC  >DETAIL:  Process holding the lock: 2103. Wait queue: 30001.
< 2023-07-09 22:04:46.470 UTC  >CONTEXT:  while checking uniqueness of tuple (549,4) in relation "pg_statistic_ext_data"
< 2023-07-09 22:04:46.470 UTC  >STATEMENT:  REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index
< 2023-07-09 22:04:46.474 UTC  >ERROR:  deadlock detected
< 2023-07-09 22:04:46.474 UTC  >DETAIL:  Process 30001 waits for ShareLock on transaction 813219577; blocked by process 2103.
        Process 2103 waits for RowExclusiveLock on relation 3429 of database 16588; blocked by process 30001.
        Process 30001: REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index
        Process 2103: autovacuum: ANALYZE child.ericsson_sgsn_rac_202307
< 2023-07-09 22:04:46.474 UTC  >HINT:  See server log for query details.
< 2023-07-09 22:04:46.474 UTC  >CONTEXT:  while checking uniqueness of tuple (549,4) in relation "pg_statistic_ext_data"
< 2023-07-09 22:04:46.474 UTC  >STATEMENT:  REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index
< 2023-07-09 22:04:46.483 UTC  >LOG:  background worker "parallel worker" (PID 30002) exited with exit code 1
< 2023-07-09 22:04:46.487 UTC postgres >ERROR:  deadlock detected
< 2023-07-09 22:04:46.487 UTC postgres >DETAIL:  Process 30001 waits for ShareLock on transaction 813219577; blocked by process 2103.
        Process 2103 waits for RowExclusiveLock on relation 3429 of database 16588; blocked by process 30001.
< 2023-07-09 22:04:46.487 UTC postgres >HINT:  See server log for query details.
< 2023-07-09 22:04:46.487 UTC postgres >CONTEXT:  while checking uniqueness of tuple (549,4) in relation "pg_statistic_ext_data"
        parallel worker
< 2023-07-09 22:04:46.487 UTC postgres >STATEMENT:  REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index
< 2023-07-09 22:04:46.848 UTC  >LOG:  server process (PID 30001) was terminated by signal 11: Segmentation fault
< 2023-07-09 22:04:46.848 UTC  >DETAIL:  Failed process was running: REINDEX INDEX pg_statistic_ext_data_stxoid_inh_index

=> REINDEX was running, with parallel workers, but deadlocked with
ANALYZE, and then crashed.

It looks like parallel workers are needed to hit this issue.
I'm not sure if the issue is specific to extended stats - probably not.

I reproduced the crash with manual REINDEX+ANALYZE, and with assertions (which
were not hit), and on a more recent commit (1124cb2cf).  The crash is hit about
30% of the time when running a loop around REINDEX and then also running
ANALYZE.

I hope someone has a hunch where to look; so far, I wasn't able to create a
minimal reproducer.  

Core was generated by `postgres: pryzbyj ts [local] REINDEX                 '.
Program terminated with signal 11, Segmentation fault.
#0  RemoveFromWaitQueue (proc=0x2aaabc1289e0, hashcode=2627626119) at ../src/backend/storage/lmgr/lock.c:1898
1898            LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);
(gdb) bt
#0  RemoveFromWaitQueue (proc=0x2aaabc1289e0, hashcode=2627626119) at ../src/backend/storage/lmgr/lock.c:1898
#1  0x00000000007ab56b in LockErrorCleanup () at ../src/backend/storage/lmgr/proc.c:735
#2  0x0000000000548a7e in AbortTransaction () at ../src/backend/access/transam/xact.c:2735
#3  0x0000000000549405 in AbortCurrentTransaction () at ../src/backend/access/transam/xact.c:3414
#4  0x00000000007b6414 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at ../src/backend/tcop/postgres.c:4352
#5  0x0000000000730e9a in BackendRun (port=<optimized out>, port=<optimized out>) at ../src/backend/postmaster/postmaster.c:4461
#6  BackendStartup (port=0x12a8a50) at ../src/backend/postmaster/postmaster.c:4189
#7  ServerLoop () at ../src/backend/postmaster/postmaster.c:1779
#8  0x000000000073207d in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x127a230) at ../src/backend/postmaster/postmaster.c:1463
#9  0x00000000004b5535 in main (argc=3, argv=0x127a230) at ../src/backend/main/main.c:198

(gdb) l
1893    RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode)
1894    {
1895            LOCK       *waitLock = proc->waitLock;
1896            PROCLOCK   *proclock = proc->waitProcLock;
1897            LOCKMODE        lockmode = proc->waitLockMode;
1898            LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);
1899
1900            /* Make sure proc is waiting */
1901            Assert(proc->waitStatus == PROC_WAIT_STATUS_WAITING);
1902            Assert(proc->links.next != NULL);

(gdb) p waitLock
$1 = (LOCK *) 0x0
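
For context, LOCK_LOCKMETHOD() just reads a field out of the LOCK it is handed,
so with proc->waitLock == NULL the initializer at lock.c:1898 dereferences a
null pointer before the "Make sure proc is waiting" assertions even get a
chance to run.  Below is a minimal standalone sketch of the shape of the
failure; the types are heavily simplified stand-ins for the real structs, and
the NULL guard at the top of the function is purely illustrative, not the
actual fix:

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for the real lock structs (most fields omitted). */
typedef uint16_t LOCKMETHODID;

typedef struct LOCKTAG
{
	uint32_t	locktag_field1;
	uint8_t		locktag_type;
	uint8_t		locktag_lockmethodid;
} LOCKTAG;

typedef struct LOCK
{
	LOCKTAG		tag;
	/* ... wait/grant bookkeeping omitted ... */
} LOCK;

typedef struct PGPROC
{
	LOCK	   *waitLock;		/* NULL at the time of the crash */
	int			waitLockMode;
} PGPROC;

/* Same shape as the real macro: it reads a field from the LOCK it is given. */
#define LOCK_LOCKMETHOD(lock) ((LOCKMETHODID) (lock).tag.locktag_lockmethodid)

static void
remove_from_wait_queue_sketch(PGPROC *proc)
{
	LOCK	   *waitLock = proc->waitLock;

	/*
	 * Hypothetical guard, for illustration only: if the proc has already
	 * been taken off the wait queue, waitLock is NULL and the macro below
	 * would read through a null pointer, as in the backtrace above.
	 */
	if (waitLock == NULL)
	{
		fprintf(stderr, "proc is not actually waiting; nothing to remove\n");
		return;
	}

	LOCKMETHODID lockmethodid = LOCK_LOCKMETHOD(*waitLock);

	printf("would remove waiter from lock of method %u\n", lockmethodid);
}

int
main(void)
{
	PGPROC		proc = {.waitLock = NULL, .waitLockMode = 0};

	remove_from_wait_queue_sketch(&proc);	/* prints the message, no segfault */
	return 0;
}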

Another variant on this crash:

Jul 11 00:55:19 telsa kernel: postgres[25415]: segfault at f ip 000000000081111a sp 00007ffdbc01ea90 error 4 in postgres[400000+8df000]

Core was generated by `postgres: parallel worker for PID 27096  waiting     '.

(gdb) bt
#0  RemoveFromWaitQueue (proc=0x2aaabc154040, hashcode=2029421528) at ../src/backend/storage/lmgr/lock.c:1874
#1  0x000000000081de2f in LockErrorCleanup () at ../src/backend/storage/lmgr/proc.c:735
#2  0x0000000000826990 in ProcessInterrupts () at ../src/backend/tcop/postgres.c:3207
#3  0x000000000081e355 in ProcSleep (locallock=locallock@entry=0x252a9d0, lockMethodTable=lockMethodTable@entry=0xee1260 <default_lockmethod>) at ../src/backend/storage/lmgr/proc.c:1295
#4  0x000000000080eff1 in WaitOnLock (locallock=locallock@entry=0x252a9d0, owner=owner@entry=0x253b548) at ../src/backend/storage/lmgr/lock.c:1818
#5  0x00000000008107ce in LockAcquireExtended (locktag=locktag@entry=0x7ffdbc01ee10, lockmode=lockmode@entry=5, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false, reportMemoryError=reportMemoryError@entry=true, locallockp=locallockp@entry=0x0) at ../src/backend/storage/lmgr/lock.c:1082
#6  0x00000000008110a4 in LockAcquire (locktag=locktag@entry=0x7ffdbc01ee10, lockmode=lockmode@entry=5, sessionLock=sessionLock@entry=false, dontWait=dontWait@entry=false) at ../src/backend/storage/lmgr/lock.c:740
#7  0x000000000080e316 in XactLockTableWait (xid=xid@entry=816478533, rel=rel@entry=0x7f7332090468, ctid=ctid@entry=0x2596374, oper=oper@entry=XLTW_InsertIndexUnique) at ../src/backend/storage/lmgr/lmgr.c:702
#8  0x00000000005190bb in heapam_index_build_range_scan (heapRelation=0x7f7332090468, indexRelation=0x7f7332099008, indexInfo=0x2596888, allow_sync=<optimized out>, anyvisible=false, progress=false, start_blockno=0, numblocks=4294967295, callback=0x53c8c0 <_bt_build_callback>, callback_state=0x7ffdbc01f310, scan=0x2596318) at ../src/backend/access/heap/heapam_handler.c:1496
#9  0x000000000053ca77 in table_index_build_scan (scan=<optimized out>, callback_state=0x7ffdbc01f310, callback=0x53c8c0 <_bt_build_callback>, progress=false, allow_sync=true, index_info=0x2596888, index_rel=<optimized out>, table_rel=<optimized out>) at ../src/include/access/tableam.h:1781
#10 _bt_parallel_scan_and_sort (btspool=btspool@entry=0x2596d08, btspool2=btspool2@entry=0x2596d38, btshared=btshared@entry=0x2aaaaad423a0, sharedsort=sharedsort@entry=0x2aaaaad42340, sharedsort2=sharedsort2@entry=0x2aaaaad422e0, sortmem=<optimized out>, progress=progress@entry=false) at ../src/backend/access/nbtree/nbtsort.c:1985
#11 0x000000000053ef6f in _bt_parallel_build_main (seg=<optimized out>, toc=0x2aaaaad42080) at ../src/backend/access/nbtree/nbtsort.c:1888
#12 0x0000000000564ec8 in ParallelWorkerMain (main_arg=<optimized out>) at ../src/backend/access/transam/parallel.c:1520
#13 0x00000000007892f8 in StartBackgroundWorker () at ../src/backend/postmaster/bgworker.c:861
#14 0x0000000000493269 in do_start_bgworker (rw=0x2531fc0) at ../src/backend/postmaster/postmaster.c:5762
#15 maybe_start_bgworkers () at ../src/backend/postmaster/postmaster.c:5986
#16 0x000000000078dc5a in process_pm_pmsignal () at ../src/backend/postmaster/postmaster.c:5149
#17 ServerLoop () at ../src/backend/postmaster/postmaster.c:1770
#18 0x0000000000790635 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x24fe230) at ../src/backend/postmaster/postmaster.c:1463
#19 0x00000000004b80c5 in main (argc=3, argv=0x24fe230) at ../src/backend/main/main.c:198

-- 
Justin



Re: pg16b2: REINDEX segv on null pointer in RemoveFromWaitQueue

From: Justin Pryzby

On Mon, Jul 10, 2023 at 09:01:37PM -0500, Justin Pryzby wrote:
> An instance compiled locally, without assertions, failed like this:
> 
...
> 
> => REINDEX was running, with parallel workers, but deadlocked with
> ANALYZE, and then crashed.
> 
> It looks like parallel workers are needed to hit this issue.
> I'm not sure if the issue is specific to extended stats - probably not.
> 
> I reproduced the crash with manual REINDEX+ANALYZE, and with assertions (which
> were not hit), and on a more recent commit (1124cb2cf).  The crash is hit about
> 30% of the time when running a loop around REINDEX and then also running
> ANALYZE.
> 
> I hope someone has a hunch where to look; so far, I wasn't able to create a
> minimal reproducer.  

I was able to reproduce this in isolation by reloading data into a test
instance, ANALYZEing the DB to populate pg_statistic_ext_data (so it's
over 3MB in size), and then REINDEXing the stats_ext index in a loop
while ANALYZEing a table with extended stats.

I still don't have a minimal reproducer, but on a hunch I found that
this fails at 5764f611e but not its parent.

commit 5764f611e10b126e09e37fdffbe884c44643a6ce
Author: Andres Freund <andres@anarazel.de>
Date:   Wed Jan 18 11:41:14 2023 -0800

    Use dlist/dclist instead of PROC_QUEUE / SHM_QUEUE for heavyweight locks

I tried compiling with -DILIST_DEBUG, but that revealed nothing beyond the
segfault, which suggests the list structures themselves are intact.
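
For what it's worth, 5764f611e replaced the hand-rolled PROC_QUEUE/SHM_QUEUE
wait-queue code with dlist/dclist.  Purely as a generic illustration of one
class of hazard such a swap can introduce (not a claim about the actual root
cause here): if deleting a node only relinks its neighbors and never resets the
node's own pointers, a later "am I still on the wait queue?" test keyed off
those pointers can misfire and trigger a second removal after the rest of the
wait state has already been cleared.  A self-contained sketch, with made-up
names that are not PostgreSQL's ilist API:

#include <stdbool.h>
#include <stdio.h>

/* Generic intrusive doubly-linked list node; names here are illustrative. */
typedef struct node
{
	struct node *prev;
	struct node *next;
} node;

typedef struct list
{
	node		head;			/* circular: head.prev/next point to head */
} list;

static void
list_init(list *l)
{
	l->head.prev = l->head.next = &l->head;
}

static void
list_push(list *l, node *n)
{
	n->prev = l->head.prev;
	n->next = &l->head;
	l->head.prev->next = n;
	l->head.prev = n;
}

/* Unlinks the node but leaves n->prev/n->next pointing at stale neighbors. */
static void
list_delete(node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	/* note: n->prev and n->next are NOT reset here */
}

/* A "still enqueued?" test that only works if deletion also nulls the node. */
static bool
node_looks_enqueued(const node *n)
{
	return n->next != NULL;
}

int
main(void)
{
	list		q;
	node		waiter = {0};

	list_init(&q);
	list_push(&q, &waiter);
	list_delete(&waiter);		/* e.g. the waiter was granted the lock */

	/*
	 * Error-cleanup code that keys off the node's own pointers now gets the
	 * wrong answer and may try to "remove" the waiter a second time.
	 */
	if (node_looks_enqueued(&waiter))
		printf("stale pointers: node still looks enqueued after delete\n");

	return 0;
}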

-- 
Justin



Re: pg16b2: REINDEX segv on null pointer in RemoveFromWaitQueue

From: Masahiko Sawada

On Wed, Jul 12, 2023 at 8:52 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Mon, Jul 10, 2023 at 09:01:37PM -0500, Justin Pryzby wrote:
> > An instance compiled locally, without assertions, failed like this:
> >
> ...
> >
> > => REINDEX was running, with parallel workers, but deadlocked with
> > ANALYZE, and then crashed.
> >
> > It looks like parallel workers are needed to hit this issue.
> > I'm not sure if the issue is specific to extended stats - probably not.
> >
> > I reproduced the crash with manual REINDEX+ANALYZE, and with assertions (which
> > were not hit), and on a more recent commit (1124cb2cf).  The crash is hit about
> > 30% of the time when running a loop around REINDEX and then also running
> > ANALYZE.
> >
> > I hope someone has a hunch where to look; so far, I wasn't able to create a
> > minimal reproducer.
>
> I was able to reproduce this in isolation by reloading data into a test
> instance, ANALYZEing the DB to populate pg_statistic_ext_data (so it's
> over 3MB in size), and then REINDEXing the stats_ext index in a loop
> while ANALYZEing a table with extended stats.
>
> I still don't have a minimal reproducer, but on a hunch I found that
> this fails at 5764f611e but not its parent.
>
> commit 5764f611e10b126e09e37fdffbe884c44643a6ce
> Author: Andres Freund <andres@anarazel.de>
> Date:   Wed Jan 18 11:41:14 2023 -0800
>
>     Use dlist/dclist instead of PROC_QUEUE / SHM_QUEUE for heavyweight locks
>

Good catch. I hadn't noticed this email, but while investigating the
same issue that was reported recently[1], I reached the same commit.
I've sent my analysis and a patch to fix this issue there.
Andres, since this issue seems to be related to your commit
5764f611e, could you please look at it and at my patch?

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoDs7vzK7NErse7xTruqY-FXmM%2B3K00SdBtMcQhiRNkoeQ%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com