Thread: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

Hi, all

In our PostgreSQL 10.2 database, two sessions (an INSERT and an autovacuum of a GIN index) blocked on LWLock:buffer_content.
This blocked checkpoints and dead-tuple recycling, and we had to restart the database.
According to the information collected from gcore, a deadlock occurred when acquiring the buffer lock.

This system is a Citus 7.2.1 cluster that has been running for more than a year; this problem had never happened before.
On 10/19 the cluster was expanded from 8 workers to 16 workers, and I don't know whether that expansion is related.
The problem occurred on the Citus coordinator node, but the table involved is a normal table (not a partitioned table),
and the load on this coordinator node has always been heavy.

Since 10/25 this issue has occurred three times in four days, but we cannot reproduce it in a test environment.
The three incidents were stuck on three different tables, but the stacks were the same.

## stack of insert: acquiring lock 0x2aaab4009564 while holding 0x2aaab670dfe4 and 0x2aaac587ae64
----------------------------------------------------------------------------
    (gdb) bt
    #0  0x00007fe11552379b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
    #1  0x00007fe11552382f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
    #2  0x00007fe1155238cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
    #3  0x000000000069d362 in PGSemaphoreLock (sema=0x2aaaaac02958) at pg_sema.c:310
    #4  0x00000000007095ac in LWLockAcquire (lock=0x2aaab4009564, mode=LW_EXCLUSIVE) at lwlock.c:1233
    #5  0x0000000000490a32 in ginStepRight (buffer=6713826, index=<optimized out>, lockmode=lockmode@entry=2) at ginbtree.c:174
    #6  0x0000000000490c1c in ginFinishSplit (btree=btree@entry=0x7ffd81e4f950, stack=0x28eebf8, freestack=freestack@entry=1 '\001', buildStats=buildStats@entry=0x0) at ginbtree.c:701
    #7  0x0000000000491396 in ginInsertValue (btree=btree@entry=0x7ffd81e4f950, stack=<optimized out>, insertdata=insertdata@entry=0x7ffd81e4f940, buildStats=buildStats@entry=0x0)
        at ginbtree.c:773
    #8  0x000000000048fb42 in ginInsertItemPointers (index=<optimized out>, rootBlkno=rootBlkno@entry=644, items=items@entry=0x2916598, nitem=nitem@entry=353, buildStats=buildStats@entry=0x0)
        at gindatapage.c:1907
    #9  0x000000000048a9ea in ginEntryInsert (ginstate=ginstate@entry=0x28e7ef8, attnum=<optimized out>, key=42920440, category=<optimized out>, items=0x2916598, nitem=353, 
        buildStats=buildStats@entry=0x0) at gininsert.c:214
    #10 0x0000000000496d94 in ginInsertCleanup (ginstate=ginstate@entry=0x28e7ef8, full_clean=full_clean@entry=0 '\000', fill_fsm=fill_fsm@entry=1 '\001', 
        forceCleanup=forceCleanup@entry=0 '\000', stats=stats@entry=0x0) at ginfast.c:883
    #11 0x0000000000497727 in ginHeapTupleFastInsert (ginstate=ginstate@entry=0x28e7ef8, collector=collector@entry=0x7ffd81e4fe60) at ginfast.c:448
    #12 0x000000000048b209 in gininsert (index=<optimized out>, values=0x7ffd81e4ff40, isnull=0x7ffd81e50040 "", ht_ctid=0x280d98c, heapRel=<optimized out>, checkUnique=<optimized out>, 
        indexInfo=0x28b5aa8) at gininsert.c:522
    #13 0x00000000005ee8dd in ExecInsertIndexTuples (slot=slot@entry=0x28b5d58, tupleid=tupleid@entry=0x280d98c, estate=estate@entry=0x28b5288, noDupErr=noDupErr@entry=1 '\001', 
        specConflict=specConflict@entry=0x7ffd81e5013b "", arbiterIndexes=arbiterIndexes@entry=0x28c6dd8) at execIndexing.c:386
    #14 0x000000000060ccf5 in ExecInsert (canSetTag=1 '\001', estate=0x28b5288, onconflict=ONCONFLICT_UPDATE, arbiterIndexes=0x28c6dd8, planSlot=0x28b5d58, slot=0x28b5d58, mtstate=0x28b5628)
        at nodeModifyTable.c:564
    #15 ExecModifyTable (pstate=0x28b5628) at nodeModifyTable.c:1766
    #16 0x00000000005ef612 in standard_ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>)
        at ../../../src/include/executor/executor.h:250
    #17 0x00007fe10607d15d in pgss_ExecutorRun (queryDesc=0x28cc1d8, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:889
    #18 0x000000000071a7ba in ProcessQuery (plan=<optimized out>, 
        sourceText=0x2849058 "INSERT INTO bi_dm.tdm_sncity_workorder_analytic_statistics (CITYCODE,CITYNAME,DISTRICT,DISTRICT_NAME,ZL_WERKS,ZL_WERKS_NAME,AREA_CD,AREA_NM,ZSIZE,DEPT_CD,TMALL_FLG,ORDER_DATE,DISTRIBUTE,WORKORDER,FNSH"..., params=0x28b2260, queryEnv=0x0, dest=0xc9e2a0 <donothingDR>, completionTag=0x7ffd81e507b0 "") at pquery.c:161
    #19 0x000000000071a9f7 in PortalRunMulti (portal=portal@entry=0x289fdd8, isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0 '\000', dest=0xc9e2a0 <donothingDR>, 
        dest@entry=0x284d778, altdest=0xc9e2a0 <donothingDR>, altdest@entry=0x284d778, completionTag=completionTag@entry=0x7ffd81e507b0 "") at pquery.c:1286
    #20 0x000000000071b535 in PortalRun (portal=<optimized out>, count=1, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x284d778, altdest=0x284d778, 
        completionTag=0x7ffd81e507b0 "") at pquery.c:799
    #21 0x0000000000718f84 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:1984
    #22 0x000000000047ad3c in BackendRun (port=0x27cbc40) at postmaster.c:4405
    #23 BackendStartup (port=0x27cbc40) at postmaster.c:4077
    #24 ServerLoop () at postmaster.c:1755
    #25 0x00000000006afacf in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x277ba20) at postmaster.c:1363
    #26 0x000000000047bb6f in main (argc=3, argv=0x277ba20) at main.c:228
    (gdb) f 4
    #4  0x00000000007095ac in LWLockAcquire (lock=0x2aaab4009564, mode=LW_EXCLUSIVE) at lwlock.c:1233
    1233 PGSemaphoreLock(proc->sem);
    (gdb) p num_held_lwlocks
    $16 = 2
    (gdb) p InterruptHoldoffCount
    $17 = 3
    (gdb) p held_lwlocks
    $18 = {{lock = 0x2aaab670dfe4, mode = LW_EXCLUSIVE}, {lock = 0x2aaac587ae64, mode = LW_EXCLUSIVE}, {lock = 0x2aaaaac0a380, mode = LW_SHARED}, {lock = 0x2aaab670a2e4, mode = LW_EXCLUSIVE}, {
        lock = 0x2aaab670a2e4, mode = LW_EXCLUSIVE}, {lock = 0x2aaab670a2e4, mode = LW_EXCLUSIVE}, {lock = 0x2aaab670a2e4, mode = LW_EXCLUSIVE}, {lock = 0x2aaab670a2e4, mode = LW_EXCLUSIVE}, {
        lock = 0x2aaab670a2e4, mode = LW_EXCLUSIVE}, {lock = 0x2aaab670a2e4, mode = LW_EXCLUSIVE}, {lock = 0x2aaaaac15a80, mode = LW_EXCLUSIVE}, {lock = 0x0, 
        mode = LW_EXCLUSIVE} <repeats 189 times>}


## stack of autovacuum: acquiring lock 0x2aaab670dfe4 while holding 0x2aaab4009564
--------------------------------------
    (gdb) bt
    #0  0x00007fe11552379b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
    #1  0x00007fe11552382f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
    #2  0x00007fe1155238cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
    #3  0x000000000069d362 in PGSemaphoreLock (sema=0x2aaaaac07eb8) at pg_sema.c:310
    #4  0x00000000007095ac in LWLockAcquire (lock=0x2aaab670dfe4, mode=LW_EXCLUSIVE) at lwlock.c:1233
    #5  0x00000000004947ef in ginScanToDelete (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=701, isRoot=isRoot@entry=0 '\000', parent=parent@entry=0x7ffd81e4c790, myoff=myoff@entry=37) at ginvacuum.c:262
    #6  0x0000000000494874 in ginScanToDelete (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=blkno@entry=9954, isRoot=isRoot@entry=1 '\001', parent=parent@entry=0x7ffd81e4c790, myoff=myoff@entry=0)
        at ginvacuum.c:277
    #7  0x0000000000494ed1 in ginVacuumPostingTreeLeaves (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=9954, isRoot=isRoot@entry=0 '\000') at ginvacuum.c:404
    #8  0x0000000000494e21 in ginVacuumPostingTreeLeaves (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=644, isRoot=isRoot@entry=1 '\001') at ginvacuum.c:372
    #9  0x0000000000495540 in ginVacuumPostingTree (rootBlkno=<optimized out>, gvs=0x7ffd81e4d0f0) at ginvacuum.c:426
    #10 ginbulkdelete (info=0x7ffd81e4f720, stats=<optimized out>, callback=<optimized out>, callback_state=<optimized out>) at ginvacuum.c:649
    #11 0x00000000005e1194 in lazy_vacuum_index (indrel=0x3146e28, stats=stats@entry=0x28ec200, vacrelstats=vacrelstats@entry=0x28ebc28) at vacuumlazy.c:1621
    #12 0x00000000005e214d in lazy_scan_heap (aggressive=<optimized out>, nindexes=<optimized out>, Irel=<optimized out>, vacrelstats=<optimized out>, options=16, onerel=0x28ec1f8)
        at vacuumlazy.c:1311
    #13 lazy_vacuum_rel (onerel=onerel@entry=0x3144fa8, options=options@entry=99, params=params@entry=0x289f270, bstrategy=<optimized out>) at vacuumlazy.c:258
    #14 0x00000000005dfc35 in vacuum_rel (relid=relid@entry=2060396948, relation=relation@entry=0x7ffd81e4fe50, options=options@entry=99, params=params@entry=0x289f270) at vacuum.c:1463
    #15 0x00000000005e0ad2 in vacuum (options=99, relation=relation@entry=0x7ffd81e4fe50, relid=<optimized out>, params=params@entry=0x289f270, va_cols=va_cols@entry=0x0, 
        bstrategy=<optimized out>, bstrategy@entry=0x289e988, isTopLevel=isTopLevel@entry=1 '\001') at vacuum.c:306
    #16 0x00000000006a01ff in autovacuum_do_vac_analyze (bstrategy=0x289e988, tab=0x289f268) at autovacuum.c:3095
    #17 do_autovacuum () at autovacuum.c:2450
    #18 0x00000000006a06a9 in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1693
    #19 0x00000000006a0f49 in StartAutoVacWorker () at autovacuum.c:1498
    #20 0x00000000006aeaea in StartAutovacuumWorker () at postmaster.c:5462
    #21 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5159
    #22 <signal handler called>
    #23 0x00007fe11334bb83 in __select_nocancel () from /lib64/libc.so.6
    #24 0x000000000047a3a3 in ServerLoop () at postmaster.c:1719
    #25 0x00000000006afacf in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x277ba20) at postmaster.c:1363
    #26 0x000000000047bb6f in main (argc=3, argv=0x277ba20) at main.c:228
    (gdb) f 4
    #4  0x00000000007095ac in LWLockAcquire (lock=0x2aaab670dfe4, mode=LW_EXCLUSIVE) at lwlock.c:1233
    1233 PGSemaphoreLock(proc->sem);
    (gdb) 
    (gdb)  p num_held_lwlocks
    $1 = 1
    (gdb) p InterruptHoldoffCount
    $2 = 2
    (gdb)  p held_lwlocks
    $3 = {{lock = 0x2aaab4009564, mode = LW_EXCLUSIVE}, {lock = 0x2aaaaac0aa00, mode = LW_SHARED}, {lock = 0x2aaaba8da364, mode = LW_EXCLUSIVE}, {lock = 0x2aaaaac15900, mode = LW_EXCLUSIVE}, {
        lock = 0x2aaab66eb064, mode = LW_EXCLUSIVE}, {lock = 0x2aaaaac15900, mode = LW_EXCLUSIVE}, {lock = 0x2aaaaac15c80, mode = LW_EXCLUSIVE}, {lock = 0x0, 
        mode = LW_EXCLUSIVE} <repeats 193 times>}
    
# env
- CentOS 7.3 64bit
- PostgreSQL 10.2 (installed via RPM downloaded from postgresql.org)

# application SQL

begin;
delete from bi_dm.tdm_sncity_workorder_analytic_statistics where order_date between '2018-10-09' and '2018-10-09';

INSERT INTO bi_dm.tdm_sncity_workorder_analytic_statistics (CITYCODE,CITYNAME,DISTRICT,DISTRICT_NAME,ZL_WERKS,ZL_WERKS_NAME,AREA_CD,AREA_NM,ZSIZE,DEPT_CD,TMALL_FLG,ORDER_DATE,DISTRIBUTE,WORKORDER,FNSH
WORKORDER,STATISFACTION_CHILD,STATISFACTION_MOTHER,ONLYONCE_FNSH,REPLY_1H,FNSH_24H,FNSH_48H,ETL_TIME,SALE_PTY,SALE_PTY_FLAG,VBELN_TP,DEALCOUNT_3,NOT_FNSH_NUM) VALUES($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11
,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26,$27) ON CONFLICT(order_date,citycode,cityname,district,district_name,zl_werks,zl_werks_name,area_cd,area_nm,zsize,dept_cd,tmall_flg,sale_pty
,sale_pty_flag,vbeln_tp) DO UPDATE SET CITYCODE = excluded.CITYCODE,CITYNAME = excluded.CITYNAME,DISTRICT = excluded.DISTRICT,DISTRICT_NAME = excluded.DISTRICT_NAME,ZL_WERKS = excluded.ZL_WERKS,
ZL_WERKS_NAME = excluded.ZL_WERKS_NAME,AREA_CD = excluded.AREA_CD,AREA_NM = excluded.AREA_NM,ZSIZE = excluded.ZSIZE,DEPT_CD = excluded.DEPT_CD,TMALL_FLG = excluded.TMALL_FLG,ORDER_DATE = 
excluded.ORDER_DATE,DISTRIBUTE = excluded.DISTRIBUTE,WORKORDER = excluded.WORKORDER,FNSHWORKORDER = excluded.FNSHWORKORDER,STATISFACTION_CHILD = excluded.STATISFACTION_CHILD,STATISFACTION_MOTHER = excluded.STATISFACTIOH_24H
,FNSH_48H = excluded.FNSH_48H,ETL_TIME = excluded.ETL_TIME,SALE_PTY = excluded.SALE_PTY,SALE_PTY_FLAG = excluded.SALE_PTY_FLAG,VBELN_TP = excluded.VBELN_TP,DEALCOUNT_3 = excluded.DEALCOUNT_3,
NOT_N_MOTHER,ONLYONCE_FNSH = excluded.ONLYONCE_FNSH,REPLY_1H = excluded.REPLY_1H,FNSH_24H = excluded.FNSH_24H,FNSH_48H = excluded.FNSH_48H,ETL_TIME = excluded.ETL_TIME,SALE_PTY = excluded.SALE_PTY,
SALE_PTY_FLAG = excluded.SALE_PTY_FLAG,VBELN_TP = excluded.VBELN_TP,DEALCOUNT_3 = excluded.DEALCOUNT_3,NOT_FNSH_NUM = excluded.NOT_FNSH_NUM 

-- use jdbc batch update to execute insert

commit;

# table def
db1=# \d bi_dm.tdm_sncity_workorder_analytic_statistics
             Table "bi_dm.tdm_sncity_workorder_analytic_statistics"
        Column        |          Type          | Collation | Nullable | Default
----------------------+------------------------+-----------+----------+---------
citycode             | character varying(100) |           |          |
cityname             | character varying(100) |           |          |
district             | character varying(100) |           |          |
district_name        | character varying(100) |           |          |
zl_werks             | character varying(100) |           |          |
zl_werks_name        | character varying(100) |           |          |
area_cd              | character varying(100) |           |          |
area_nm              | character varying(100) |           |          |
zsize                | character varying(100) |           |          |
dept_cd              | character varying(100) |           |          |
tmall_flg            | character varying(100) |           |          |
order_date           | character varying(100) |           |          |
distribute           | integer                |           |          |
workorder            | integer                |           |          |
fnshworkorder        | integer                |           |          |
statisfaction_child  | integer                |           |          |
statisfaction_mother | integer                |           |          |
onlyonce_fnsh        | integer                |           |          |
reply_1h             | integer                |           |          |
fnsh_24h             | integer                |           |          |
fnsh_48h             | integer                |           |          |
etl_time             | character varying(100) |           |          |
sale_pty             | character varying(100) |           |          |
sale_pty_flag        | character varying(100) |           |          |
vbeln_tp             | character varying(100) |           |          |
dealcount_3          | integer                |           |          |
not_fnsh_num         | integer                |           |          |
Indexes:
    "idx1_tdm_sncity_workorder_analytic_statistics" UNIQUE, btree (order_date, citycode, cityname, district, district_name, zl_werks, zl_werks_name, area_cd, area_nm, zsize, dept_cd, tmall_flg, sale_pty, sale_pty_flag, vbeln_tp)
    "gin_tdm_sncity_workorder_analytic_statistics" gin (order_date)

# another phenomenon

The above phenomenon happened only on the primary node. The standby node hung on a different stack four times.

## stack of recovery process
--------------------------------------
#0  0x00007f7aaf72a79b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f7aaf72a82f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f7aaf72a8cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x000000000069d362 in PGSemaphoreLock (sema=0x2aaaaac07ff8) at pg_sema.c:310
#4  0x00000000007095ac in LWLockAcquire (lock=0x2aaaad50e2e4, mode=LW_EXCLUSIVE) at lwlock.c:1233
#5  0x0000000000503c1a in XLogReadBufferForRedoExtended (record=record@entry=0x105ce58, block_id=block_id@entry=2 '\002', mode=mode@entry=RBM_NORMAL, get_cleanup_lock=get_cleanup_lock@entry=0 '\000', buf=buf@entry=0x7fff550d93c0) at xlogutils.c:399
#6  0x0000000000503d00 in XLogReadBufferForRedo (record=record@entry=0x105ce58, block_id=block_id@entry=2 '\002', buf=buf@entry=0x7fff550d93c0) at xlogutils.c:293
#7  0x000000000048bcd5 in ginRedoDeletePage (record=0x105ce58) at ginxlog.c:482
#8  gin_redo (record=0x105ce58) at ginxlog.c:731
#9  0x00000000004f8b83 in StartupXLOG () at xlog.c:7183
#10 0x00000000006b0091 in StartupProcessMain () at startup.c:217
#11 0x000000000050689a in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fff550dcbe0) at bootstrap.c:426
#12 0x00000000006ad160 in StartChildProcess (type=StartupProcess) at postmaster.c:5361
#13 0x00000000006afaab in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x101ba20) at postmaster.c:1355
#14 0x000000000047bb6f in main (argc=1, argv=0x101ba20) at main.c:228

## stack of backend process (tens of such processes)
--------------------------------------
#0  0x00007f7aaf72a79b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f7aaf72a82f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f7aaf72a8cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x000000000069d362 in PGSemaphoreLock (sema=0x2aaaaac074b8) at pg_sema.c:310
#4  0x00000000007095ac in LWLockAcquire (lock=0x2aaac6c081e4, mode=LW_SHARED) at lwlock.c:1233
#5  0x0000000000490943 in ginTraverseLock (buffer=7034160, searchMode=searchMode@entry=1 '\001') at ginbtree.c:40
#6  0x00000000004910e0 in ginFindLeafPage (btree=btree@entry=0x1fd6318, searchMode=searchMode@entry=1 '\001', snapshot=snapshot@entry=0x200a958) at ginbtree.c:97
#7  0x000000000048fe5b in ginScanBeginPostingTree (btree=btree@entry=0x1fd6318, index=<optimized out>, rootBlkno=rootBlkno@entry=2246, snapshot=snapshot@entry=0x200a958) at gindatapage.c:1924
#8  0x000000000049319c in startScanEntry (ginstate=ginstate@entry=0x1eade90, entry=<optimized out>, snapshot=0x200a958) at ginget.c:390
#9  0x0000000000493a92 in startScan (scan=0x1e6c348, scan=0x1e6c348) at ginget.c:543
#10 gingetbitmap (scan=0x1e6c348, tbm=0x20df978) at ginget.c:1829
#11 0x00000000004c245a in index_getbitmap (scan=scan@entry=0x1e6c348, bitmap=bitmap@entry=0x20df978) at indexam.c:726
#12 0x0000000000600c52 in MultiExecBitmapIndexScan (node=0x1ebfcd0) at nodeBitmapIndexscan.c:104
#13 0x00000000005f40b9 in MultiExecProcNode (node=<optimized out>) at execProcnode.c:490
#14 0x00000000006004c7 in BitmapHeapNext (node=node@entry=0x1a0c618) at nodeBitmapHeapscan.c:117
#15 0x00000000005f5094 in ExecScanFetch (recheckMtd=0x5ff990 <BitmapHeapRecheck>, accessMtd=0x5ffa20 <BitmapHeapNext>, node=0x1a0c618) at execScan.c:97
#16 ExecScan (node=0x1a0c618, accessMtd=0x5ffa20 <BitmapHeapNext>, recheckMtd=0x5ff990 <BitmapHeapRecheck>) at execScan.c:164
#17 0x0000000000604a3f in ExecProcNode (node=0x1a0c618) at ../../../src/include/executor/executor.h:250
#18 ExecHashJoin (pstate=0x1ebf670) at nodeHashjoin.c:145
#19 0x00000000005fad89 in ExecProcNode (node=0x1ebf670) at ../../../src/include/executor/executor.h:250
#20 fetch_input_tuple (aggstate=aggstate@entry=0x1ebef78) at nodeAgg.c:695
#21 0x00000000005fcfd3 in agg_fill_hash_table (aggstate=0x1ebef78) at nodeAgg.c:2539
#22 ExecAgg (pstate=0x1ebef78) at nodeAgg.c:2151
#23 0x0000000000610636 in ExecProcNode (node=0x1ebef78) at ../../../src/include/executor/executor.h:250
#24 ExecSort (pstate=0x1ebed08) at nodeSort.c:106
#25 0x0000000000607cfd in ExecProcNode (node=0x1ebed08) at ../../../src/include/executor/executor.h:250
#26 ExecLimit (pstate=0x1ebe0f8) at nodeLimit.c:95
#27 0x00000000005ef612 in ExecProcNode (node=0x1ebe0f8) at ../../../src/include/executor/executor.h:250
#28 ExecutePlan (execute_once=<optimized out>, dest=0x10e34f8, direction=<optimized out>, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1ebe0f8, estate=0x1ebde98) at execMain.c:1722
#29 standard_ExecutorRun (queryDesc=0x18caf88, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
#30 0x00007f7aa028415d in pgss_ExecutorRun (queryDesc=0x18caf88, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:889
#31 0x000000000071a28b in PortalRunSelect (portal=portal@entry=0x1111d88, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x10e34f8) at pquery.c:932
#32 0x000000000071b63f in PortalRun (portal=<optimized out>, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x10e34f8, altdest=0x10e34f8, completionTag=0x7fff550dc220 "") at pquery.c:773
#33 0x0000000000718f84 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:1984
#34 0x000000000047ad3c in BackendRun (port=0x106a590) at postmaster.c:4405
#35 BackendStartup (port=0x106a590) at postmaster.c:4077
#36 ServerLoop () at postmaster.c:1755
#37 0x00000000006afacf in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x101ba20) at postmaster.c:1363
#38 0x000000000047bb6f in main (argc=1, argv=0x101ba20) at main.c:228

# Similar problems

I noticed that there were some similar bug reports before. 

- https://www.postgresql.org/message-id/CAK7KUdBjF4TN5Z2Rf0fv1MVUa%3DuqQssKMhoTXUa2y88R4PPjsg%40mail.gmail.com
- https://www.postgresql.org/message-id/1509467408126-0.post@n3.nabble.com
- https://www.postgresql-archive.org/Stalls-on-PGSemaphoreLock-td5797385.html

According to https://www.postgresql-archive.org/Stalls-on-PGSemaphoreLock-td5797385.html, this issue may be related to huge pages.
So we turned off huge pages; I don't know whether the problem will recur, but it has not happened again so far.

BTW:
The relevant configuration of our machines is as follows:

cat /sys/kernel/mm/transparent_hugepage/defrag
    [always] madvise never 
        
cat /proc/sys/kernel/sem 
250 32000 32 128

cat /proc/meminfo
AnonHugePages:   4446208 kB
HugePages_Total:   38912
HugePages_Free:     5907
HugePages_Rsvd:       40
HugePages_Surp:        0

Maybe we should only have turned off transparent hugepages instead of huge pages.

I am not sure whether this problem is a kernel configuration issue or a GIN index implementation bug. Any tips?

Regards,
Chen Huajun
Hi, all
 
I analyzed the btree blocks involved in the LWLock deadlock, as follows:

## insert process(ginInsertValue())

                     644(root blkno)
                      |
     7054(2. held LWLock:0x2aaac587ae64)     ----rightlink---->    xxxx(3. Acquire LWLock:0x2aaab4009564,buffer = 2119038,blkno should be 9954)
       |
701(1. held LWLock:0x2aaab670dfe4)


The ginInsertValue() path above acquires the LWLocks in the order described in the README.

src/backend/access/gin/README
---------------------------------------------------------------
To avoid deadlocks, B-tree pages must always be locked in the same order:
left to right, and bottom to top.
...
-----------------------------------------------------------------
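
For reference, ginFinishSplit() (frame #6 in the insert stack) moves right via ginStepRight(), which keeps the current page locked until the right sibling's lock has been obtained. A simplified sketch of that coupling (paraphrased from memory, omitting the function's sanity checks; see ginbtree.c for the real code):

/*
 * Step right along the rightlink, keeping the current page locked until
 * the right sibling is locked: locks therefore always move left to right,
 * as the README requires.
 */
Buffer
ginStepRight(Buffer buffer, Relation index, int lockmode)
{
    Page        page = BufferGetPage(buffer);
    BlockNumber rightlink = GinPageGetOpaque(page)->rightlink;
    Buffer      nextbuffer;

    nextbuffer = ReadBuffer(index, rightlink);
    LockBuffer(nextbuffer, lockmode);   /* take the right sibling's lock first */
    UnlockReleaseBuffer(buffer);        /* only then release the current page */

    return nextbuffer;
}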


## autovacuum process(ginScanToDelete())

                     644(root blkno)
                      |
...  9954(1. held LWLock:0x2aaab4009564) ...
       |
701(2. Acquire LWLock:0x2aaab670dfe4)

Note: according to autovacuum's core, 701's parent is 9954, while the insert core shows 701's parent is 7054, and the rightlink of 7054 is 9954!

However, ginScanToDelete() scans the btree depth-first and takes the EXCLUSIVE locks top-down, which creates a deadlock.
Is the above analysis correct? If so, such deadlocks should happen easily.


static bool
ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
                DataPageDeleteStack *parent, OffsetNumber myoff)
{
...
    if (!isRoot)
        LockBuffer(buffer, GIN_EXCLUSIVE);
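    /*
     * Note: the caller still holds an exclusive (or cleanup) lock on the
     * parent page at this point -- either the recursion level above, or
     * ginVacuumPostingTreeLeaves() for the root of the scanned subtree --
     * so this descent couples locks top-down, the opposite order of
     * ginFinishSplit()/ginStepRight().
     */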

...
        for (i = FirstOffsetNumber; i <= GinPageGetOpaque(page)->maxoff; i++)
        {
            PostingItem *pitem = GinDataPageGetPostingItem(page, i);

            if (ginScanToDelete(gvs, PostingItemGetBlockNumber(pitem), FALSE, me, i))
                i--;
        }
...
}

src/backend/access/gin/README
-----------------------------------------------------------------
The previous paragraph's reasoning only applies to searches, and only to
posting trees. To protect from inserters following a downlink to a deleted
page, vacuum simply locks out all concurrent insertions to the posting tree,
by holding a super-exclusive lock on the parent page of subtree with deletable
pages. 
...
posting tree. To exclude interference with readers vacuum takes exclusive
locks in a depth-first scan in left-to-right order of page tuples. 
...
-----------------------------------------------------------------

# stacks
## stack of insert: acquiring lock 0x2aaab4009564 (blkno 9954) while holding 0x2aaab670dfe4 (blkno 701) and 0x2aaac587ae64 (blkno 7054)
-----------------------------------------------------------------
(gdb) bt
#0  0x00007fe11552379b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007fe11552382f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fe1155238cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x000000000069d362 in PGSemaphoreLock (sema=0x2aaaaac02958) at pg_sema.c:310
#4  0x00000000007095ac in LWLockAcquire (lock=0x2aaab4009564, mode=LW_EXCLUSIVE) at lwlock.c:1233
#5  0x0000000000490a32 in ginStepRight (buffer=6713826, index=<optimized out>, lockmode=lockmode@entry=2) at ginbtree.c:174
#6  0x0000000000490c1c in ginFinishSplit (btree=btree@entry=0x7ffd81e4f950, stack=0x28eebf8, freestack=freestack@entry=1 '\001', buildStats=buildStats@entry=0x0) at ginbtree.c:701
#7  0x0000000000491396 in ginInsertValue (btree=btree@entry=0x7ffd81e4f950, stack=<optimized out>, insertdata=insertdata@entry=0x7ffd81e4f940, buildStats=buildStats@entry=0x0)
    at ginbtree.c:773
#8  0x000000000048fb42 in ginInsertItemPointers (index=<optimized out>, rootBlkno=rootBlkno@entry=644, items=items@entry=0x2916598, nitem=nitem@entry=353, buildStats=buildStats@entry=0x0)
    at gindatapage.c:1907
#9  0x000000000048a9ea in ginEntryInsert (ginstate=ginstate@entry=0x28e7ef8, attnum=<optimized out>, key=42920440, category=<optimized out>, items=0x2916598, nitem=353, 
    buildStats=buildStats@entry=0x0) at gininsert.c:214
#10 0x0000000000496d94 in ginInsertCleanup (ginstate=ginstate@entry=0x28e7ef8, full_clean=full_clean@entry=0 '\000', fill_fsm=fill_fsm@entry=1 '\001', 
    forceCleanup=forceCleanup@entry=0 '\000', stats=stats@entry=0x0) at ginfast.c:883
#11 0x0000000000497727 in ginHeapTupleFastInsert (ginstate=ginstate@entry=0x28e7ef8, collector=collector@entry=0x7ffd81e4fe60) at ginfast.c:448
#12 0x000000000048b209 in gininsert (index=<optimized out>, values=0x7ffd81e4ff40, isnull=0x7ffd81e50040 "", ht_ctid=0x280d98c, heapRel=<optimized out>, checkUnique=<optimized out>, 
    indexInfo=0x28b5aa8) at gininsert.c:522
#13 0x00000000005ee8dd in ExecInsertIndexTuples (slot=slot@entry=0x28b5d58, tupleid=tupleid@entry=0x280d98c, estate=estate@entry=0x28b5288, noDupErr=noDupErr@entry=1 '\001', 
    specConflict=specConflict@entry=0x7ffd81e5013b "", arbiterIndexes=arbiterIndexes@entry=0x28c6dd8) at execIndexing.c:386
#14 0x000000000060ccf5 in ExecInsert (canSetTag=1 '\001', estate=0x28b5288, onconflict=ONCONFLICT_UPDATE, arbiterIndexes=0x28c6dd8, planSlot=0x28b5d58, slot=0x28b5d58, mtstate=0x28b5628)
    at nodeModifyTable.c:564
#15 ExecModifyTable (pstate=0x28b5628) at nodeModifyTable.c:1766
#16 0x00000000005ef612 in standard_ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>)
    at ../../../src/include/executor/executor.h:250
#17 0x00007fe10607d15d in pgss_ExecutorRun (queryDesc=0x28cc1d8, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:889
#18 0x000000000071a7ba in ProcessQuery (plan=<optimized out>, 
    sourceText=0x2849058 "INSERT INTO bi_dm.tdm_sncity_workorder_analytic_statistics (CITYCODE,CITYNAME,DISTRICT,DISTRICT_NAME,ZL_WERKS,ZL_WERKS_NAME,AREA_CD,AREA_NM,ZSIZE,DEPT_CD,TMALL_FLG,ORDER_DATE,DISTRIBUTE,WORKORDER,FNSH"..., params=0x28b2260, queryEnv=0x0, dest=0xc9e2a0 <donothingDR>, completionTag=0x7ffd81e507b0 "") at pquery.c:161
#19 0x000000000071a9f7 in PortalRunMulti (portal=portal@entry=0x289fdd8, isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0 '\000', dest=0xc9e2a0 <donothingDR>, 
    dest@entry=0x284d778, altdest=0xc9e2a0 <donothingDR>, altdest@entry=0x284d778, completionTag=completionTag@entry=0x7ffd81e507b0 "") at pquery.c:1286
#20 0x000000000071b535 in PortalRun (portal=<optimized out>, count=1, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x284d778, altdest=0x284d778, 
    completionTag=0x7ffd81e507b0 "") at pquery.c:799
#21 0x0000000000718f84 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:1984
#22 0x000000000047ad3c in BackendRun (port=0x27cbc40) at postmaster.c:4405
#23 BackendStartup (port=0x27cbc40) at postmaster.c:4077
#24 ServerLoop () at postmaster.c:1755
#25 0x00000000006afacf in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x277ba20) at postmaster.c:1363
#26 0x000000000047bb6f in main (argc=3, argv=0x277ba20) at main.c:228
(gdb) p num_held_lwlocks 
$1 = 2
(gdb) p lwlocks
$2 = {{lock = 0x2aaab670dfe4, mode = LW_EXCLUSIVE}, {lock = 0x2aaac587ae64, mode = LW_EXCLUSIVE}, {lock = 0x2aaaaac0a380, mode = LW_SHARED}, {lock = 0x2aaab670a2e4, mode = LW_EXCLUSIVE}, {
    lock = 0x2aaab670a2e4, mode = LW_EXCLUSIVE}, {lock = 0x2aaab670a2e4, mode = LW_EXCLUSIVE}, {lock = 0x2aaab670a2e4, mode = LW_EXCLUSIVE}, {lock = 0x2aaab670a2e4, mode = LW_EXCLUSIVE}, {
    lock = 0x2aaab670a2e4, mode = LW_EXCLUSIVE}, {lock = 0x2aaab670a2e4, mode = LW_EXCLUSIVE}, {lock = 0x2aaaaac15a80, mode = LW_EXCLUSIVE}, {lock = 0x0, 
    mode = LW_EXCLUSIVE} <repeats 189 times>}

(gdb) f 6
#6  0x0000000000490c1c in ginFinishSplit (btree=btree@entry=0x7ffd81e4f950, stack=0x28eebf8, freestack=freestack@entry=1 '\001', buildStats=buildStats@entry=0x0) at ginbtree.c:701
701                parent->buffer = ginStepRight(parent->buffer, btree->index, GIN_EXCLUSIVE);

(gdb) p *stack
$27 = {blkno = 701, buffer = 2758312, off = 0, iptr = {ip_blkid = {bi_hi = 5, bi_lo = 0}, ip_posid = 40884}, predictNumber = 1, parent = 0x28eea68}
(gdb) p ((LWLock*) (&((&BufferDescriptors[(2758312 - 1)].bufferdesc))->content_lock))
$28 = (LWLock *) 0x2aaab670dfe4

(gdb) p *parent
$29 = {blkno = 7054, buffer = 6713826, off = 0, iptr = {ip_blkid = {bi_hi = 8, bi_lo = 0}, ip_posid = 40886}, predictNumber = 1, parent = 0x28eeab8}
(gdb) p ((LWLock*) (&((&BufferDescriptors[(6713826 - 1)].bufferdesc))->content_lock))
$32 = (LWLock *) 0x2aab4fa5cfe4

(gdb) p *(GinBtreeStack *)0x28eeab8
$31 = {blkno = 644, buffer = 6713825, off = 1, iptr = {ip_blkid = {bi_hi = 35, bi_lo = 0}, ip_posid = 40884}, predictNumber = 1, parent = 0x0}
(gdb) p ((LWLock*) (&((&BufferDescriptors[(6713825 - 1)].bufferdesc))->content_lock))
$33 = (LWLock *) 0x2aaac587ae24
-----------------------------------------------------------------


## stack of autovacuum: acquiring lock 0x2aaab670dfe4 (blkno 701) while holding 0x2aaab4009564 (blkno 9954)
-----------------------------------------------------------------
#0  0x00007fe11552379b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007fe11552382f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007fe1155238cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x000000000069d362 in PGSemaphoreLock (sema=0x2aaaaac07eb8) at pg_sema.c:310
#4  0x00000000007095ac in LWLockAcquire (lock=0x2aaab670dfe4, mode=LW_EXCLUSIVE) at lwlock.c:1233
#5  0x00000000004947ef in ginScanToDelete (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=701, isRoot=isRoot@entry=0 '\000', 
    parent=parent@entry=0x7ffd81e4c790, myoff=myoff@entry=37) at ginvacuum.c:262
#6  0x0000000000494874 in ginScanToDelete (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=blkno@entry=9954, isRoot=isRoot@entry=1 '\001', 
    parent=parent@entry=0x7ffd81e4c790, myoff=myoff@entry=0) at ginvacuum.c:277
#7  0x0000000000494ed1 in ginVacuumPostingTreeLeaves (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=9954, isRoot=isRoot@entry=0 '\000') at ginvacuum.c:404
#8  0x0000000000494e21 in ginVacuumPostingTreeLeaves (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=644, isRoot=isRoot@entry=1 '\001') at ginvacuum.c:372
#9  0x0000000000495540 in ginVacuumPostingTree (rootBlkno=<optimized out>, gvs=0x7ffd81e4d0f0) at ginvacuum.c:426
#10 ginbulkdelete (info=0x7ffd81e4f720, stats=<optimized out>, callback=<optimized out>, callback_state=<optimized out>) at ginvacuum.c:649
#11 0x00000000005e1194 in lazy_vacuum_index (indrel=0x3146e28, stats=stats@entry=0x28ec200, vacrelstats=vacrelstats@entry=0x28ebc28)
    at vacuumlazy.c:1621
#12 0x00000000005e214d in lazy_scan_heap (aggressive=<optimized out>, nindexes=<optimized out>, Irel=<optimized out>, 
    vacrelstats=<optimized out>, options=16, onerel=0x28ec1f8) at vacuumlazy.c:1311
#13 lazy_vacuum_rel (onerel=onerel@entry=0x3144fa8, options=options@entry=99, params=params@entry=0x289f270, bstrategy=<optimized out>)
    at vacuumlazy.c:258
#14 0x00000000005dfc35 in vacuum_rel (relid=relid@entry=2060396948, relation=relation@entry=0x7ffd81e4fe50, options=options@entry=99, 
---Type <return> to continue, or q <return> to quit---
    params=params@entry=0x289f270) at vacuum.c:1463
#15 0x00000000005e0ad2 in vacuum (options=99, relation=relation@entry=0x7ffd81e4fe50, relid=<optimized out>, params=params@entry=0x289f270, 
    va_cols=va_cols@entry=0x0, bstrategy=<optimized out>, bstrategy@entry=0x289e988, isTopLevel=isTopLevel@entry=1 '\001') at vacuum.c:306
#16 0x00000000006a01ff in autovacuum_do_vac_analyze (bstrategy=0x289e988, tab=0x289f268) at autovacuum.c:3095
#17 do_autovacuum () at autovacuum.c:2450
#18 0x00000000006a06a9 in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1693
#19 0x00000000006a0f49 in StartAutoVacWorker () at autovacuum.c:1498
#20 0x00000000006aeaea in StartAutovacuumWorker () at postmaster.c:5462
#21 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5159
#22 <signal handler called>
#23 0x00007fe11334bb83 in __select_nocancel () from /lib64/libc.so.6
#24 0x000000000047a3a3 in ServerLoop () at postmaster.c:1719
#25 0x00000000006afacf in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x277ba20) at postmaster.c:1363
#26 0x000000000047bb6f in main (argc=3, argv=0x277ba20) at main.c:228
(gdb) f 4
#4  0x00000000007095ac in LWLockAcquire (lock=0x2aaab670dfe4, mode=LW_EXCLUSIVE) at lwlock.c:1233
1233 PGSemaphoreLock(proc->sem);
(gdb)  p num_held_lwlocks
$1 = 1
(gdb) p InterruptHoldoffCount
$2 = 2
(gdb)  p held_lwlocks
$3 = {{lock = 0x2aaab4009564, mode = LW_EXCLUSIVE}, {lock = 0x2aaaaac0aa00, mode = LW_SHARED}, {lock = 0x2aaaba8da364, mode = LW_EXCLUSIVE}, {lock = 0x2aaaaac15900, mode = LW_EXCLUSIVE}, {
    lock = 0x2aaab66eb064, mode = LW_EXCLUSIVE}, {lock = 0x2aaaaac15900, mode = LW_EXCLUSIVE}, {lock = 0x2aaaaac15c80, mode = LW_EXCLUSIVE}, {lock = 0x0, 
    mode = LW_EXCLUSIVE} <repeats 193 times>}
(gdb) f 8
#8  0x0000000000494e21 in ginVacuumPostingTreeLeaves (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=644, isRoot=isRoot@entry=1 '\001') at ginvacuum.c:372
372                if (ginVacuumPostingTreeLeaves(gvs, children[i], FALSE))

(gdb) p i
$13 = 2
(gdb) p children[1]
$14 = 7054
(gdb) p children[2]
$15 = 9954

#7  0x0000000000494ed1 in ginVacuumPostingTreeLeaves (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=9954, isRoot=isRoot@entry=0 '\000') at ginvacuum.c:404
404                ginScanToDelete(gvs, blkno, TRUE, &root, InvalidOffsetNumber);
(gdb) p buffer
$16 = 2119038
(gdb) p ((LWLock*) (&((&BufferDescriptors[(2119038 - 1)].bufferdesc))->content_lock))
$17 = (LWLock *) 0x2aaab4009564

(gdb) f 5
#5  0x00000000004947ef in ginScanToDelete (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=701, isRoot=isRoot@entry=0 '\000', 
    parent=parent@entry=0x7ffd81e4c790, myoff=myoff@entry=37) at ginvacuum.c:262
262            LockBuffer(buffer, GIN_EXCLUSIVE);

(gdb) p buffer
$29 = 2758312
(gdb) p ((LWLock*) (&((&BufferDescriptors[(2758312 - 1)].bufferdesc))->content_lock))
$32 = (LWLock *) 0x2aaab670dfe4

-----------------------------------------------------------------
On Tue, Nov 6, 2018 at 10:05 AM chenhj <chjischj@163.com> wrote:
> I  analyzed the btree block where lwlock deadlock occurred, as follows:

Thank you for doing this important work.

You're using Postgres 10.2. While that version isn't current with all
GIN bug fixes, it does have this important one:

"Ensure that vacuum will always clean up the pending-insertions list
of a GIN index (Masahiko Sawada)"

Later GIN fixes seem unlikely to be relevant to your issue. I think
that this is probably a genuine, new bug.

> The ginInsertValue() function above gets the lwlock in the order described in the README.

> However, ginScanToDelete() depth-first scans the btree and gets the EXCLUSIVE lock, which creates a deadlock.
> Is the above statement correct? If so, deadlocks should easily happen.

I have been suspicious of deadlock hazards in the code for some time
-- particularly around pending list cleanup. I go into a lot of detail
on my suspicions here:


https://www.postgresql.org/message-id/flat/CAH2-WzmfUpRjWcUq3%2B9ijyum4AJ%2Bk-meGT8_HnxBMpKz1aNS-g%40mail.gmail.com#ea5af1088adfacb3d0ba88313be1a5e3

I note that your first deadlock involve the following kinds of backends:

* ginInsertCleanup() calls from a regular backend, which will have a
backend do things that VACUUM generally only gets to do, like call
RecordFreeIndexPage().

* (auto)VACUUM processes.

Your second/recovery deadlock involves:

* Regular read-only queries.

* Recovery code.

Quite a lot of stuff is involved here!

The code in this area is way too complicated, and I haven't thought
about it in about a year, so it's hard for me to be sure of anything
at the moment. My guess is that there is confusion about the type of
page expected within one or more blocks (e.g. entry tree vs. pending
list), due to a race condition in block deletion and/or recycling --
again, I've suspected something like this could happen for some time.
The fact that you get a distinct deadlock during recovery is
consistent with that theory.

It's safe to say that promoting the asserts on gin page type into
"can't happen" elog errors in places like scanPendingInsert() and
ginInsertCleanup() would be a good start. Note that we already did
similar Assert-to-elog(ERROR) promotion for the posting tree code, when
similar bugs appeared way back in 2013.

-- 
Peter Geoghegan


On Mon, Oct 29, 2018 at 12:04 PM chenhj <chjischj@163.com> wrote:
>     #4  0x00000000007095ac in LWLockAcquire (lock=0x2aaab4009564, mode=LW_EXCLUSIVE) at lwlock.c:1233
>     #5  0x0000000000490a32 in ginStepRight (buffer=6713826, index=<optimized out>, lockmode=lockmode@entry=2) at ginbtree.c:174
>     #6  0x0000000000490c1c in ginFinishSplit (btree=btree@entry=0x7ffd81e4f950, stack=0x28eebf8, freestack=freestack@entry=1 '\001', buildStats=buildStats@entry=0x0) at ginbtree.c:701
>     #7  0x0000000000491396 in ginInsertValue (btree=btree@entry=0x7ffd81e4f950, stack=<optimized out>, insertdata=insertdata@entry=0x7ffd81e4f940, buildStats=buildStats@entry=0x0)
>         at ginbtree.c:773
>     #8  0x000000000048fb42 in ginInsertItemPointers (index=<optimized out>, rootBlkno=rootBlkno@entry=644, items=items@entry=0x2916598, nitem=nitem@entry=353, buildStats=buildStats@entry=0x0)
>         at gindatapage.c:1907
>     #9  0x000000000048a9ea in ginEntryInsert (ginstate=ginstate@entry=0x28e7ef8, attnum=<optimized out>, key=42920440, category=<optimized out>, items=0x2916598, nitem=353,
>         buildStats=buildStats@entry=0x0) at gininsert.c:214
>     #10 0x0000000000496d94 in ginInsertCleanup (ginstate=ginstate@entry=0x28e7ef8, full_clean=full_clean@entry=0 '\000', fill_fsm=fill_fsm@entry=1 '\001',
>         forceCleanup=forceCleanup@entry=0 '\000', stats=stats@entry=0x0) at ginfast.c:883
>     #11 0x0000000000497727 in ginHeapTupleFastInsert (ginstate=ginstate@entry=0x28e7ef8, collector=collector@entry=0x7ffd81e4fe60) at ginfast.c:448
>     #12 0x000000000048b209 in gininsert (index=<optimized out>, values=0x7ffd81e4ff40, isnull=0x7ffd81e50040 "", ht_ctid=0x280d98c, heapRel=<optimized out>, checkUnique=<optimized out>,
>         indexInfo=0x28b5aa8) at gininsert.c:522

I want to point something out about the above inserter represented by
this stack trace. It deadlocks against VACUUM from within the
following ginEntryInsert() call:

            /*
             * While we left the page unlocked, more stuff might have gotten
             * added to it.  If so, process those entries immediately.  There
             * shouldn't be very many, so we don't worry about the fact that
             * we're doing this with exclusive lock. Insertion algorithm
             * guarantees that inserted row(s) will not continue on next page.
             * NOTE: intentionally no vacuum_delay_point in this loop.
             */
            if (PageGetMaxOffsetNumber(page) != maxoff)
            {
                ginInitBA(&accum);
                processPendingPage(&accum, &datums, page, maxoff + 1);

                ginBeginBAScan(&accum);
                while ((list = ginGetBAEntry(&accum,
                                             &attnum, &key, &category,
                                             &nlist)) != NULL)
                    ginEntryInsert(ginstate, attnum, key, category, // <--- deadlocks
                                   list, nlist, NULL);
            }

Clearly this particular call to ginEntryInsert() from within
ginInsertCleanup() is generally supposed to be a rare occurrence. It's
not surprising that it's hard for you to hit this issue.

BTW, I strongly doubt that disabling hugepages has fixed anything,
though it may have accidentally masked the problem. This is an issue
around the basic correctness of the locking protocols used by GIN.

-- 
Peter Geoghegan


On Mon, Oct 29, 2018 at 12:04 PM chenhj <chjischj@163.com> wrote:
> ## stack of autovacuum:Acquire lock 0x2aaab670dfe4 and hold 0x2aaab4009564
> --------------------------------------
>     (gdb) bt
>     #0  0x00007fe11552379b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
>     #1  0x00007fe11552382f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
>     #2  0x00007fe1155238cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
>     #3  0x000000000069d362 in PGSemaphoreLock (sema=0x2aaaaac07eb8) at pg_sema.c:310
>     #4  0x00000000007095ac in LWLockAcquire (lock=0x2aaab670dfe4, mode=LW_EXCLUSIVE) at lwlock.c:1233
>     #5  0x00000000004947ef in ginScanToDelete (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=701, isRoot=isRoot@entry=0 '\000', parent=parent@entry=0x7ffd81e4c790, myoff=myoff@entry=37) at ginvacuum.c:262
>     #6  0x0000000000494874 in ginScanToDelete (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=blkno@entry=9954, isRoot=isRoot@entry=1 '\001', parent=parent@entry=0x7ffd81e4c790, myoff=myoff@entry=0)
>         at ginvacuum.c:277
>     #7  0x0000000000494ed1 in ginVacuumPostingTreeLeaves (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=9954, isRoot=isRoot@entry=0 '\000') at ginvacuum.c:404
>     #8  0x0000000000494e21 in ginVacuumPostingTreeLeaves (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=644, isRoot=isRoot@entry=1 '\001') at ginvacuum.c:372
>     #9  0x0000000000495540 in ginVacuumPostingTree (rootBlkno=<optimized out>, gvs=0x7ffd81e4d0f0) at ginvacuum.c:426
>     #10 ginbulkdelete (info=0x7ffd81e4f720, stats=<optimized out>, callback=<optimized out>, callback_state=<optimized out>) at ginvacuum.c:649
>     #11 0x00000000005e1194 in lazy_vacuum_index (indrel=0x3146e28, stats=stats@entry=0x28ec200, vacrelstats=vacrelstats@entry=0x28ebc28) at vacuumlazy.c:1621
>     #12 0x00000000005e214d in lazy_scan_heap (aggressive=<optimized out>, nindexes=<optimized out>, Irel=<optimized out>, vacrelstats=<optimized out>, options=16, onerel=0x28ec1f8)
>         at vacuumlazy.c:1311
>     #13 lazy_vacuum_rel (onerel=onerel@entry=0x3144fa8, options=options@entry=99, params=params@entry=0x289f270, bstrategy=<optimized out>) at vacuumlazy.c:258

Actually, the bigger problem is on this side of the deadlock, within
VACUUM. ginInsertCleanup() (the first/other side of the deadlock) may
have problems, but this seems worse. Commit 218f51584d5 appears to be
at fault here.

First things first: ginScanToDelete() *maintains* buffer locks on
multiple levels of a posting tree, meaning that there may be cases
where quite a few exclusive buffer locks may be held all at once (one
per level). MAX_SIMUL_LWLOCKS is 200 these days, and in practice a
B-Tree can never get that tall, but having the number of buffer locks
acquired be determined by the height of the tree is not okay, on
general principle. The nbtree code's page deletion goes to a lot of
effort to keep the number of buffer locks fixed, but nothing like that
is attempted for GIN posting trees.

Chen's original analysis of the problem seems to be more or less
accurate: the order that ginScanToDelete() acquires buffer locks as it
descends the tree (following commit 218f51584d5) is not compatible
with the order within ginFinishSplit(). The faulty code within
ginScanToDelete() crabs/couples buffer locks while descending the
tree, whereas the code within ginFinishSplit() crabs them as it
ascends the same tree.

Teodor: Do you think that the issue is fixable? It looks like there
are serious issues with the design of 218f51584d5 to me. I don't think
the general "there can't be any inserters at this subtree" thing works
given that we have to couple buffer locks when moving right for other
reasons. We call ginStepRight() within ginFinishSplit(), for reasons
that date back to bug fix commit ac4ab97e from 2013 -- that detail is
probably important, because it seems to be what breaks the subtree
design (we don't delete in two phases or anything in GIN).

I think that you have to be doing a multi-level delete for a deadlock
to take place, which isn't particularly likely to coincide with a
concurrent insertion in general. That may explain why it's taken a
year to get a high-quality bug report.

-- 
Peter Geoghegan


On Wed, Nov 7, 2018 at 5:46 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I think that you have to be doing a multi-level delete for a deadlock
> to take place, which isn't particularly likely to coincide with a
> concurrent insertion in general. That may explain why it's taken a
> year to get a high-quality bug report.

BTW, it's notable that Chen's query uses ON CONFLICT DO UPDATE.
Speculative insertion might cause just the wrong kind of churn,
involving continual recycling of heap TIDs.

-- 
Peter Geoghegan


On Wed, Nov 7, 2018 at 5:46 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Teodor: Do you think that the issue is fixable? It looks like there
> are serious issues with the design of 218f51584d5 to me. I don't think
> the general "there can't be any inserters at this subtree" thing works
> given that we have to couple buffer locks when moving right for other
> reasons. We call ginStepRight() within ginFinishSplit(), for reasons
> that date back to bug fix commit ac4ab97e from 2013 -- that detail is
> probably important, because it seems to be what breaks the subtree
> design (we don't delete in two phases or anything in GIN).

Ping?

This is a thorny problem, and I'd like to get your input soon. I
suspect that reverting 218f51584d5 may be the ultimate outcome, even
though it's a v10 feature.

-- 
Peter Geoghegan



> BTW, I strongly doubt that disabling hugepages has fixed anything,
> though it may have accidentally masked the problem. This is an issue
> around the basic correctness of the locking protocols used by GIN.
We had disabled hugepages for some days, but the bug still occurred.

The attachment contains the gcore files that were collected when the issue occurred on the primary; I hope they help with analyzing the problem.



Regards,
Chen Huajun
Hi

So far we have only discussed the connection hang on the primary; the connection hang on the standby appears to be a separate problem.
When the recovery process replays a GIN page-delete WAL record, it acquires the LWLocks in a different order than the SELECT process does,
resulting in a deadlock. According to the information from gcore, the details are as follows:

## select process

                   2246(rootBlkno=2246)
                      |
3254(1. Held buffer=366260,LWLock=0x2aaaad50e2e4)  --rightlink--> 483(2. Acquire buffer=2201739,LWLock=0x2aaab45158a4)

The ginStepRight() function in the SELECT process acquires the LWLocks from left to right.

## recovery process

                   2246(2. Held buffer=7034160,LWLock=0x2aaac6c081e4,rootBlkno)
                      |
3254(3. Acquire buffer=366260,LWLock=0x2aaaad50e2e4)  --rightlink--> 483(1. Held buffer=2201739,LWLock=0x2aaab45158a4)

But the ginRedoDeletePage() function in the recovery process acquires the LWLocks in the order deleted page, then parent, then left sibling.
So I think the inconsistent lock order in ginRedoDeletePage() is the reason for the hang on the standby.

static void
ginRedoDeletePage(XLogReaderState *record)
{
    XLogRecPtr  lsn = record->EndRecPtr;
    ginxlogDeletePage *data = (ginxlogDeletePage *) XLogRecGetData(record);
    Buffer      dbuffer;
    Buffer      pbuffer;
    Buffer      lbuffer;
    Page        page;

    if (XLogReadBufferForRedo(record, 0, &dbuffer) == BLK_NEEDS_REDO)
    {
        page = BufferGetPage(dbuffer);
        Assert(GinPageIsData(page));
        GinPageGetOpaque(page)->flags = GIN_DELETED;
        PageSetLSN(page, lsn);
        MarkBufferDirty(dbuffer);
    }

    if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
    {
        page = BufferGetPage(pbuffer);
        Assert(GinPageIsData(page));
        Assert(!GinPageIsLeaf(page));
        GinPageDeletePostingItem(page, data->parentOffset);
        PageSetLSN(page, lsn);
        MarkBufferDirty(pbuffer);
    }

    if (XLogReadBufferForRedo(record, 2, &lbuffer) == BLK_NEEDS_REDO)
    {
        page = BufferGetPage(lbuffer);
        Assert(GinPageIsData(page));
        GinPageGetOpaque(page)->rightlink = data->rightLink;
        PageSetLSN(page, lsn);
        MarkBufferDirty(lbuffer);
    }

    if (BufferIsValid(lbuffer))
        UnlockReleaseBuffer(lbuffer);
    if (BufferIsValid(pbuffer))
        UnlockReleaseBuffer(pbuffer);
    if (BufferIsValid(dbuffer))
        UnlockReleaseBuffer(dbuffer);
}


Perhaps the order of acquiring the LWLocks in ginRedoDeletePage() should be changed from "dbuffer->pbuffer->lbuffer" to "lbuffer->dbuffer->pbuffer". Is this right?
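
For illustration, a minimal sketch of that proposed reordering (this is only the order change suggested above, not a committed fix; the per-page redo actions are meant to stay exactly as in ginRedoDeletePage() quoted earlier):

/*
 * Proposed lock order: left sibling -> deleted page -> parent, i.e.
 * left-to-right and bottom-to-top, matching normal operation.
 * The bodies of the three branches are unchanged from the original.
 */
if (XLogReadBufferForRedo(record, 2, &lbuffer) == BLK_NEEDS_REDO)
{
    /* update the left sibling's rightlink, as in the original */
}

if (XLogReadBufferForRedo(record, 0, &dbuffer) == BLK_NEEDS_REDO)
{
    /* mark the deleted page GIN_DELETED, as in the original */
}

if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
{
    /* remove the downlink from the parent page, as in the original */
}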


## How to reproduce this issue

1. Modify ginxlog.c and rebuild (add a "sleep(60)" call):

    if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
    {
        page = BufferGetPage(pbuffer);
        Assert(GinPageIsData(page));
        Assert(!GinPageIsLeaf(page));
        GinPageDeletePostingItem(page, data->parentOffset);
        PageSetLSN(page, lsn);
        MarkBufferDirty(pbuffer);
    }
==>
    if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
    {
        page = BufferGetPage(pbuffer);
        Assert(GinPageIsData(page));
        Assert(!GinPageIsLeaf(page));
        GinPageDeletePostingItem(page, data->parentOffset);
        PageSetLSN(page, lsn);
        MarkBufferDirty(pbuffer);
        sleep(60);  // added for debugging
    }


2. Run the test SQL on the primary:

    create table tb1(id int);
    create index ON tb1 using gin(id);
    insert into tb1 select 1 from generate_series(1,1000000)id;
    delete from tb1;

3. Check that the recovery process on the standby has entered the "sleep()" branch:

    $ ps -ef|grep reco
    postgres 13418 13417  0 22:23 ?        00:00:00 postgres: startup process   recovering 00000001000000000000005E
    postgres 13425 31505  0 22:23 pts/8    00:00:00 grep --color=auto reco
    $ pstack 13418
    #0  0x00007f2166d39650 in __nanosleep_nocancel () from /lib64/libc.so.6
    #1  0x00007f2166d39504 in sleep () from /lib64/libc.so.6
    #2  0x000000000048614f in ginRedoDeletePage (record=0x127cbe8) at ginxlog.c:480
    #3  gin_redo (record=0x127cbe8) at ginxlog.c:732
    #4  0x00000000004efec3 in StartupXLOG ()
    #5  0x0000000000697c51 in StartupProcessMain ()
    #6  0x00000000004fd22a in AuxiliaryProcessMain ()
    #7  0x0000000000694e49 in StartChildProcess ()
    #8  0x0000000000697665 in PostmasterMain ()
    #9  0x00000000004766e1 in main ()

4. Execute the SELECT SQL on the standby:

    set enable_seqscan = false;
    select count(*) from tb1 where id =2;

5. Check the result:

The recovery process is blocked on an LWLock:

    $ pstack 13418
    #0  0x00007f216775779b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
    #1  0x00007f216775782f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
    #2  0x00007f21677578cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
    #3  0x0000000000685df2 in PGSemaphoreLock ()
    #4  0x00000000006edd64 in LWLockAcquire ()
    #5  0x00000000004fa66a in XLogReadBufferForRedoExtended ()
    #6  0x0000000000486161 in ginRedoDeletePage (record=0x127cbe8) at ginxlog.c:483
    #7  gin_redo (record=0x127cbe8) at ginxlog.c:732
    #8  0x00000000004efec3 in StartupXLOG ()
    #9  0x0000000000697c51 in StartupProcessMain ()
    #10 0x00000000004fd22a in AuxiliaryProcessMain ()
    #11 0x0000000000694e49 in StartChildProcess ()
    #12 0x0000000000697665 in PostmasterMain ()
    #13 0x00000000004766e1 in main ()

and the SELECT is also blocked on an LWLock (a deadlock has occurred):
    
    postgres=# select pid,wait_event_type,wait_event,state,query from pg_stat_activity where state<>'idle';
  pid  | wait_event_type |   wait_event   | state  |                                            query                                             
-------+-----------------+----------------+--------+----------------------------------------------------------------------------------------------
 13472 | LWLock          | buffer_content | active | select count(*) from tb1 where id =1;
 13526 |                 |                | active | select pid,wait_event_type,wait_event,state,query from pg_stat_activity where state<>'idle';
(2 rows)


## appendix
### select process gcore analysis
---------------------------------------------------------------
[New LWP 14468]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: lma lma 10.246.5.80(54684) SELECT'.
#0  0x00007f7aaf72a79b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f7aaf72a79b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f7aaf72a82f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f7aaf72a8cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x000000000069d362 in PGSemaphoreLock (sema=0x2aaaaac072d8) at pg_sema.c:310
#4  0x00000000007095ac in LWLockAcquire (lock=0x2aaab45158a4, mode=LW_SHARED) at lwlock.c:1233
#5  0x0000000000490a32 in ginStepRight (buffer=366260, index=<optimized out>, lockmode=lockmode@entry=1) at ginbtree.c:174
#6  0x000000000049253c in entryLoadMoreItems (ginstate=0x1da3ad0, snapshot=0x167c798, advancePast=..., entry=0x159e218) at ginget.c:677
#7  entryGetItem (ginstate=ginstate@entry=0x1da3ad0, entry=entry@entry=0x159e218, advancePast=..., snapshot=snapshot@entry=0x167c798) at ginget.c:867
#8  0x0000000000493f21 in keyGetItem (snapshot=0x167c798, advancePast=..., key=<optimized out>, tempCtx=0x107ea68, ginstate=<optimized out>) at ginget.c:953
#9  scanGetItem (scan=<optimized out>, scan=<optimized out>, recheck=<synthetic pointer>, item=0x7fff550db820, advancePast=...) at ginget.c:1222
#10 gingetbitmap (scan=<optimized out>, tbm=<optimized out>) at ginget.c:1837
#11 0x00000000004c245a in index_getbitmap (scan=scan@entry=0x2236168, bitmap=bitmap@entry=0x2240f08) at indexam.c:726
#12 0x0000000000600c52 in MultiExecBitmapIndexScan (node=0x22358e8) at nodeBitmapIndexscan.c:104
#13 0x00000000005f40b9 in MultiExecProcNode (node=<optimized out>) at execProcnode.c:490
#14 0x00000000006004c7 in BitmapHeapNext (node=node@entry=0x1ec9788) at nodeBitmapHeapscan.c:117
#15 0x00000000005f5094 in ExecScanFetch (recheckMtd=0x5ff990 <BitmapHeapRecheck>, accessMtd=0x5ffa20 <BitmapHeapNext>, node=0x1ec9788) at execScan.c:97
#16 ExecScan (node=0x1ec9788, accessMtd=0x5ffa20 <BitmapHeapNext>, recheckMtd=0x5ff990 <BitmapHeapRecheck>) at execScan.c:164
#17 0x000000000060e10c in ExecProcNode (node=0x1ec9788) at ../../../src/include/executor/executor.h:250
#18 ExecNestLoop (pstate=0x1ec73c0) at nodeNestloop.c:160
#19 0x000000000060e05f in ExecProcNode (node=0x1ec73c0) at ../../../src/include/executor/executor.h:250
#20 ExecNestLoop (pstate=0x1ec7200) at nodeNestloop.c:109
#21 0x0000000000610636 in ExecProcNode (node=0x1ec7200) at ../../../src/include/executor/executor.h:250
#22 ExecSort (pstate=0x1ec6f90) at nodeSort.c:106
#23 0x00000000005fad89 in ExecProcNode (node=0x1ec6f90) at ../../../src/include/executor/executor.h:250
#24 fetch_input_tuple (aggstate=aggstate@entry=0x1ec6948) at nodeAgg.c:695
#25 0x00000000005fcf27 in agg_retrieve_direct (aggstate=0x1ec6948) at nodeAgg.c:2347
#26 ExecAgg (pstate=0x1ec6948) at nodeAgg.c:2158
#27 0x00000000005faa42 in ExecProcNode (node=0x1ec6948) at ../../../src/include/executor/executor.h:250
#28 ExecAppend (pstate=0x20843e8) at nodeAppend.c:222
#29 0x000000000060eee9 in ExecProcNode (node=0x20843e8) at ../../../src/include/executor/executor.h:250
#30 ExecResult (pstate=0x2084178) at nodeResult.c:115
#31 0x0000000000610636 in ExecProcNode (node=0x2084178) at ../../../src/include/executor/executor.h:250
#32 ExecSort (pstate=0x2083f08) at nodeSort.c:106
#33 0x0000000000607cfd in ExecProcNode (node=0x2083f08) at ../../../src/include/executor/executor.h:250
#34 ExecLimit (pstate=0x2083798) at nodeLimit.c:95
#35 0x00000000005ef612 in ExecProcNode (node=0x2083798) at ../../../src/include/executor/executor.h:250
#36 ExecutePlan (execute_once=<optimized out>, dest=0x10e34f8, direction=<optimized out>, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT, use_parallel_mode=<optimized out>, 
    planstate=0x2083798, estate=0x20834e8) at execMain.c:1722
#37 standard_ExecutorRun (queryDesc=0x197b6e0, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
---Type <return> to continue, or q <return> to quit---
#38 0x00007f7aa028415d in pgss_ExecutorRun (queryDesc=0x197b6e0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:889
#39 0x000000000071a28b in PortalRunSelect (portal=portal@entry=0x1112198, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x10e34f8) at pquery.c:932
#40 0x000000000071b63f in PortalRun (portal=<optimized out>, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x10e34f8, altdest=0x10e34f8, 
    completionTag=0x7fff550dc220 "") at pquery.c:773
#41 0x0000000000718f84 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:1984
#42 0x000000000047ad3c in BackendRun (port=0x106a590) at postmaster.c:4405
#43 BackendStartup (port=0x106a590) at postmaster.c:4077
#44 ServerLoop () at postmaster.c:1755
#45 0x00000000006afacf in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x101ba20) at postmaster.c:1363
#46 0x000000000047bb6f in main (argc=1, argv=0x101ba20) at main.c:228
(gdb) p num_held_lwlocks 
$1 = 1
(gdb) p held_lwlocks
$2 = {{lock = 0x2aaaad50e2e4, mode = LW_SHARED}, {lock = 0x2aaaaac09e80, mode = LW_SHARED}, {lock = 0x2aaaaac09280, mode = LW_EXCLUSIVE}, {lock = 0x2aaaaac09280, mode = LW_EXCLUSIVE}, {
    lock = 0x0, mode = LW_EXCLUSIVE} <repeats 196 times>}
(gdb) f 5
#5  0x0000000000490a32 in ginStepRight (buffer=366260, index=<optimized out>, lockmode=lockmode@entry=1) at ginbtree.c:174
174 LockBuffer(nextbuffer, lockmode);
(gdb) p ((LWLock*) (&((&BufferDescriptors[(366260 - 1)].bufferdesc))->content_lock))
$3 = (LWLock *) 0x2aaaad50e2e4
(gdb) p nextbuffer
$4 = 2201739
(gdb) p ((LWLock*) (&((&BufferDescriptors[(2201739 - 1)].bufferdesc))->content_lock))
$5 = (LWLock *) 0x2aaab45158a4
#6  0x000000000049253c in entryLoadMoreItems (ginstate=0x1da3ad0, snapshot=0x167c798, advancePast=..., entry=0x159e218) at ginget.c:677
677 entry->buffer = ginStepRight(entry->buffer,
(gdb) p *entry
$2 = {queryKey = 20181026, queryCategory = 0 '\000', isPartialMatch = 0 '\000', extra_data = 0x159e170 "\003", strategy = 3, searchMode = 0, attnum = 1, 
  buffer = 366260, curItem = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 0}, matchBitmap = 0x0, matchIterator = 0x0, matchResult = 0x0, list = 0x0, nlist = 0, 
  offset = 0, isFinished = 0 '\000', reduceResult = 0 '\000', predictNumberResult = 0, btree = {findChildPage = 0x48d890 <dataLocateItem>, 
    getLeftMostChild = 0x48db40 <dataGetLeftMostPage>, isMoveRight = 0x48d830 <dataIsMoveRight>, findItem = 0x0, findChildPtr = 0x48da10 <dataFindChildPtr>, 
    beginPlaceToPage = 0x48e320 <dataBeginPlaceToPage>, execPlaceToPage = 0x48f4c0 <dataExecPlaceToPage>, prepareDownlink = 0x48dfd0 <dataPrepareDownlink>, 
    fillRoot = 0x48f420 <ginDataFillRoot>, isData = 1 '\001', index = 0x11f1b08, rootBlkno = 2246, ginstate = 0x0, fullScan = 1 '\001', isBuild = 0 '\000', 
    entryAttnum = 0, entryKey = 0, entryCategory = 0 '\000', itemptr = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 0}}}
---------------------------------------------------------------    

### recovery process gcore analysis
---------------------------------------------------------------
[New LWP 37811]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: startup process   recovering 000000040001042900000050'.
#0  0x00007f7aaf72a79b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install audit-libs-2.6.5-3.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-lib-2.1.26-20.el7_2.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-26.el7.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libcurl-7.29.0-35.el7.centos.x86_64 libgcc-4.8.5-11.el7.x86_64 libgcrypt-1.5.3-12.el7_1.1.x86_64 libgpg-error-1.12-3.el7.x86_64 libicu-50.1.2-15.el7.x86_64 libidn-1.28-4.el7.x86_64 libselinux-2.5-6.el7.x86_64 libssh2-1.4.3-10.el7_2.1.x86_64 libstdc++-4.8.5-11.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 nspr-4.11.0-1.el7_2.x86_64 nss-3.21.0-17.el7.x86_64 nss-softokn-freebl-3.16.2.3-14.4.el7.x86_64 nss-util-3.21.0-2.2.el7_2.x86_64 openldap-2.4.40-13.el7.x86_64 openssl-libs-1.0.1e-60.el7.x86_64 pam-1.1.8-18.el7.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007f7aaf72a79b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f7aaf72a82f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f7aaf72a8cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x000000000069d362 in PGSemaphoreLock (sema=0x2aaaaac07ff8) at pg_sema.c:310
#4  0x00000000007095ac in LWLockAcquire (lock=0x2aaaad50e2e4, mode=LW_EXCLUSIVE) at lwlock.c:1233
#5  0x0000000000503c1a in XLogReadBufferForRedoExtended (record=record@entry=0x105ce58, block_id=block_id@entry=2 '\002', mode=mode@entry=RBM_NORMAL, 
    get_cleanup_lock=get_cleanup_lock@entry=0 '\000', buf=buf@entry=0x7fff550d93c0) at xlogutils.c:399
#6  0x0000000000503d00 in XLogReadBufferForRedo (record=record@entry=0x105ce58, block_id=block_id@entry=2 '\002', buf=buf@entry=0x7fff550d93c0) at xlogutils.c:293
#7  0x000000000048bcd5 in ginRedoDeletePage (record=0x105ce58) at ginxlog.c:482
#8  gin_redo (record=0x105ce58) at ginxlog.c:731
#9  0x00000000004f8b83 in StartupXLOG () at xlog.c:7183
#10 0x00000000006b0091 in StartupProcessMain () at startup.c:217
#11 0x000000000050689a in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fff550dcbe0) at bootstrap.c:426
#12 0x00000000006ad160 in StartChildProcess (type=StartupProcess) at postmaster.c:5361
#13 0x00000000006afaab in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x101ba20) at postmaster.c:1355
#14 0x000000000047bb6f in main (argc=1, argv=0x101ba20) at main.c:228
(gdb) p num_held_lwlocks 
$1 = 2
(gdb) p held_lwlocks
$2 = {{lock = 0x2aaab45158a4, mode = LW_EXCLUSIVE}, {lock = 0x2aaac6c081e4, mode = LW_EXCLUSIVE}, {lock = 0x2aaaaac0a780, mode = LW_SHARED}, {lock = 0x2aaabf16e7e4, mode = LW_EXCLUSIVE}, {
    lock = 0x2aaaaac08a00, mode = LW_EXCLUSIVE}, {lock = 0x0, mode = LW_EXCLUSIVE} <repeats 195 times>}
(gdb) f 7
#7  0x000000000048bcd5 in ginRedoDeletePage (record=0x105ce58) at ginxlog.c:482
482 if (XLogReadBufferForRedo(record, 2, &lbuffer) == BLK_NEEDS_REDO)
(gdb) p *record
$6 = {read_page = 0x4f40f0 <XLogPageRead>, system_identifier = 6605939872368807293, private_data = 0x7fff550d94a0, ReadRecPtr = 286050462744944, EndRecPtr = 286050462745016, 
  decoded_record = 0x105dfd8, main_data = 0x11026f8 "\002", main_data_len = 8, main_data_bufsz = 53792, record_origin = 0, blocks = {{in_use = 1 '\001', rnode = {spcNode = 1663, 
        dbNode = 16400, relNode = 2079769372}, forknum = MAIN_FORKNUM, blkno = 483, flags = 0 '\000', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x105e7d5 "\232\204", 
      hole_offset = 256, hole_length = 7680, bimg_len = 222, bimg_info = 7 '\a', has_data = 0 '\000', data = 0x1084ea8 "", data_len = 0, data_bufsz = 8100}, {in_use = 1 '\001', rnode = {
        spcNode = 1663, dbNode = 16400, relNode = 2079769372}, forknum = MAIN_FORKNUM, blkno = 2246, flags = 128 '\200', has_image = 0 '\000', apply_image = 0 '\000', 
      bkp_image = 0x10b9168 "", hole_offset = 4084, hole_length = 4100, bimg_len = 4092, bimg_info = 5 '\005', has_data = 0 '\000', data = 0x1081a58 "\001", data_len = 0, 
      data_bufsz = 4496}, {in_use = 1 '\001', rnode = {spcNode = 1663, dbNode = 16400, relNode = 2079769372}, forknum = MAIN_FORKNUM, blkno = 3253, flags = 128 '\200', has_image = 0 '\000', 
      apply_image = 0 '\000', 
      bkp_image = 0x10b943f "\001\351\017\021\203\020\361\017\201\020\377\017\374\017\n\003\003\002\210\020\345\037\025\003\352\017\227\020\364\037\361\037\233\020\362\017\022\200\020\354\037\t\360/\027\351\017\002\016\203\020\370\017\001\001\367\017\021\001\361\017\r\364\017\020\365\017\003\200\020\003\205\020\357\017\374\017", hole_offset = 500, hole_length = 60, 
      bimg_len = 1421, bimg_info = 7 '\a', has_data = 0 '\000', data = 0x1083da8 "\234\001", data_len = 0, data_bufsz = 16}, {in_use = 0 '\000', rnode = {spcNode = 1663, dbNode = 16400, 
        relNode = 2080872819}, forknum = MAIN_FORKNUM, blkno = 6758, flags = 192 '\300', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, 
      bimg_len = 0, bimg_info = 0 '\000', has_data = 0 '\000', data = 0x0, data_len = 0, data_bufsz = 0}, {in_use = 0 '\000', rnode = {spcNode = 1663, dbNode = 16400, relNode = 2080872819}, 
      forknum = MAIN_FORKNUM, blkno = 6780, flags = 192 '\300', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, 
      bimg_info = 0 '\000', has_data = 0 '\000', data = 0x10e7038 "\017\001", data_len = 0, data_bufsz = 16}, {in_use = 0 '\000', rnode = {spcNode = 1663, dbNode = 16400, 
        relNode = 2080872819}, forknum = MAIN_FORKNUM, blkno = 6781, flags = 192 '\300', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, 
      bimg_len = 0, bimg_info = 0 '\000', has_data = 0 '\000', data = 0x0, data_len = 0, data_bufsz = 0}, {in_use = 0 '\000', rnode = {spcNode = 1663, dbNode = 16400, relNode = 2080872819}, 
      forknum = MAIN_FORKNUM, blkno = 6792, flags = 192 '\300', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, 
      bimg_info = 0 '\000', has_data = 0 '\000', data = 0x0, data_len = 0, data_bufsz = 0}, {in_use = 0 '\000', rnode = {spcNode = 1663, dbNode = 16400, relNode = 2080872819}, 
      forknum = MAIN_FORKNUM, blkno = 6824, flags = 192 '\300', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, 
      bimg_info = 0 '\000', has_data = 0 '\000', data = 0x0, data_len = 0, data_bufsz = 0}, {in_use = 0 '\000', rnode = {spcNode = 1663, dbNode = 16400, relNode = 2080872819}, 
      forknum = MAIN_FORKNUM, blkno = 6832, flags = 192 '\300', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, 
      bimg_info = 0 '\000', has_data = 0 '\000', data = 0x0, data_len = 0, data_bufsz = 0}, {in_use = 0 '\000', rnode = {spcNode = 1663, dbNode = 16400, relNode = 2080872819}, 
      forknum = MAIN_FORKNUM, blkno = 6833, flags = 192 '\300', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, 
      bimg_info = 0 '\000', has_data = 0 '\000', data = 0x0, data_len = 0, data_bufsz = 0}, {in_use = 0 '\000', rnode = {spcNode = 1663, dbNode = 16400, relNode = 2080872819}, 
      forknum = MAIN_FORKNUM, blkno = 6903, flags = 192 '\300', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, 
      bimg_info = 0 '\000', has_data = 0 '\000', data = 0x0, data_len = 0, data_bufsz = 0}, {in_use = 0 '\000', rnode = {spcNode = 1663, dbNode = 16400, relNode = 2080872819}, 
      forknum = MAIN_FORKNUM, blkno = 262, flags = 192 '\300', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, 
      bimg_info = 0 '\000', has_data = 0 '\000', data = 0x0, data_len = 0, data_bufsz = 0}, {in_use = 0 '\000', rnode = {spcNode = 1663, dbNode = 16400, relNode = 2080872819}, 
      forknum = MAIN_FORKNUM, blkno = 8759, flags = 192 '\300', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, 
      bimg_info = 0 '\000', has_data = 0 '\000', data = 0x0, data_len = 0, data_bufsz = 0}, {in_use = 0 '\000', rnode = {spcNode = 1663, dbNode = 16400, relNode = 2080872819}, 
      forknum = MAIN_FORKNUM, blkno = 8766, flags = 192 '\300', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, 
      bimg_info = 0 '\000', has_data = 0 '\000', data = 0x0, data_len = 0, data_bufsz = 0}, {in_use = 0 '\000', rnode = {spcNode = 1663, dbNode = 16400, relNode = 2080872819}, 
      forknum = MAIN_FORKNUM, blkno = 8927, flags = 192 '\300', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, 
      bimg_info = 0 '\000', has_data = 0 '\000', data = 0x0, data_len = 0, data_bufsz = 0}, {in_use = 0 '\000', rnode = {spcNode = 1663, dbNode = 16400, relNode = 2080872819}, 
      forknum = MAIN_FORKNUM, blkno = 8941, flags = 192 '\300', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, 
      bimg_info = 0 '\000', has_data = 0 '\000', data = 0x0, data_len = 0, data_bufsz = 0}, {in_use = 0 '\000', rnode = {spcNode = 1663, dbNode = 16400, relNode = 2080872819}, 
      forknum = MAIN_FORKNUM, blkno = 9207, flags = 192 '\300', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, 
      bimg_info = 0 '\000', has_data = 0 '\000', data = 0x0, data_len = 0, data_bufsz = 0}, {in_use = 0 '\000', rnode = {spcNode = 0, dbNode = 0, relNode = 0}, forknum = MAIN_FORKNUM, 
---Type <return> to continue, or q <return> to quit---
      blkno = 0, flags = 0 '\000', has_image = 0 '\000', apply_image = 0 '\000', bkp_image = 0x0, hole_offset = 0, hole_length = 0, bimg_len = 0, bimg_info = 0 '\000', has_data = 0 '\000', 
      data = 0x0, data_len = 0, data_bufsz = 0} <repeats 16 times>}, max_block_id = 2, readBuf = 0x105de68 "\227\320\005", readLen = 8192, readSegNo = 17049936, readOff = 3686400, 
  readPageTLI = 4, latestPagePtr = 286050462744576, latestPageTLI = 4, currRecPtr = 286050462744944, currTLI = 0, currTLIValidUntil = 0, nextTLI = 0, readRecordBuf = 0x10b80f8 "F", 
  readRecordBufSize = 57344, errormsg_buf = 0x105fe78 ""}
#7  0x000000000048bcd5 in ginRedoDeletePage (record=0x105ce58) at ginxlog.c:482
482 if (XLogReadBufferForRedo(record, 2, &lbuffer) == BLK_NEEDS_REDO)
(gdb) p dbuffer
$1 = 2201739
(gdb) p pbuffer
$2 = 7034160
(gdb) p lbuffer
$3 = 366260
(gdb) p ((LWLock*) (&((&BufferDescriptors[(2201739 - 1)].bufferdesc))->content_lock))
$4 = (LWLock *) 0x2aaab45158a4
(gdb) p ((LWLock*) (&((&BufferDescriptors[(7034160 - 1)].bufferdesc))->content_lock))
$5 = (LWLock *) 0x2aaac6c081e4
(gdb) p ((LWLock*) (&((&BufferDescriptors[(366260 - 1)].bufferdesc))->content_lock))
$6 = (LWLock *) 0x2aaaad50e2e4
---------------------------------------------------------------


Regards,
Chen Huajun
 


Hi!

Mail.App somehow broke my response with <div class="ApplePlainTextBody">, which is far beyond my understanding of
what plain text is, so I'll quote my whole previous message here. Hope this mail client is OK.
 

14.11.2018, 23:13, "Andrey Borodin" <x4mmm@yandex-team.ru>:
> Hi everyone!
>
> I didn't noticed this thread for too long somehow, sorry.
>
>> On Nov 8, 2018, at 6:46, Peter Geoghegan <pg@bowt.ie> wrote:
>>
>> I don't think
>> the general "there can't be any inserters at this subtree" thing works
>> given that we have to couple buffer locks when moving right for other
>> reasons. We call ginStepRight() within ginFinishSplit(), for reasons
>> that date back to bug fix commit ac4ab97e from 2013 -- that detail is
>> probably important, because it seems to be what breaks the subtree
>> design (we don't delete in two phases or anything in GIN).
>
> ginVacuumPostingTreeLeaves() holds LockBufferForCleanup() on subtree root b.
> Thus there may be no GinBtreeStack's holding pin on b at the moment.
> When you ginStepRight(b) to the parent in ginFinishSplit(), you always get to the buffer from your stack.
> Hence you can never have ginFinishSplit() deadlock with cleanup of subtree whose root is LockBufferForCleanup()'d.
>
> Is this correct or did I miss something?
>
> But we have a deadlock at hand, I'll think more about it. Something with locking protocol is clearly wrong.
>
>> On Nov 11, 2018, at 22:33, chenhj <chjischj@163.com> wrote:
>>
>> The order of acquiring lwlocks in ginRedoDeletePage() may need to be changed from "dbuffer->pbuffer->lbuffer" to
>> "lbuffer->dbuffer->pbuffer". Is this right?
 
>
> This looks correct to me.
>
> Best regards, Andrey Borodin.

I've reread Chen's reports and understood his findings.

> When you ginStepRight(b) to the parent in ginFinishSplit(), you always get to the buffer from your stack.

This statement does not hold.
When you have a GinBtreeStack S, one of its internal pages may split into a new page P. If at this moment you start a
multi-level delete from P and a cascade split from S, P is not in S and we may deadlock.
 

The correct solution seems to be to replace the lock-coupling descent with ordinary B-tree-style searches for parent pages.
I think I can compose a patch for consideration.

Best regards, Andrey Borodin.


Hi!

> On Nov 15, 2018, at 17:30, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I think I can compose patch for consideration.

I found no practical way to fix the concept of "subtree-lock". Pre-locking all left siblings for cleanup would suffice, but
does not look very practical.
The GIN posting tree lacks the useful B-tree inventory, such as left links and high keys, needed for concurrent deletes
without a "subtree-lock".

Let's consider concurrent deletion.
As far as I understand, the absence of a high key makes concurrent deletion of the highest keys from internal pages unsafe.
Thus, internal pages cannot be deleted at all.
PFA a patch with the simplest deletion algorithm:
1. Descend to the leftmost leaf
2. Scan by rightlinks until the rightmost page
3. Delete empty leaf pages, provided they do not hold the highest key in the corresponding parent page

The parent page is found by a rightlink scan from the leftmost parent, with the previous result cached.
A deleted page is stamped with ReadNewTransactionId()'s xid; the page cannot be reused until that deletion xid is beyond
RecentGlobalDataXmin.
The DeleteXid is WAL-logged.
I've refactored btree->getLeftMostChild() and btree->findChildPtr() to reuse their code (removed an unnecessary
parameter).
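
To illustrate the reuse guard described above, here is a minimal sketch (not the patch itself; GinPageSetDeleteXid() and
GinPageGetDeleteXid() are the helpers the patch introduces, the rest is existing core API):

    /* at deletion time: remember the next xid as the page's deletion xid */
    GinPageSetDeleteXid(page, ReadNewTransactionId());

    /* at reuse time: skip the page while some scan might still reach it */
    if (GinPageIsDeleted(page) &&
        !TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalDataXmin))
        continue;       /* deleted too recently, leave it for a later vacuum */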

Please note that the attached patch does not fix the 2nd bug found by Chen in page delete redo.

I understand that reverting commit 218f51584d5 and returning to long but finite root locks is the safest solution. In this
description of the proposed patch I've tried to put the concepts in a minimum of words. Please ping me if some concepts
sound cryptic and require more clarification.

Best regards, Andrey Borodin.

Attachment
On Fri, Nov 23, 2018 at 2:18 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I found no practical way to fix the concept of "subtree-lock". Pre-locking all left siblings for cleanup would suffice,
> but does not look very practical.
> The GIN posting tree lacks the useful B-tree inventory, such as left links and high keys, needed for concurrent deletes
> without a "subtree-lock".

Significantly changing the design of GIN vacuuming in back branches
for a release that's more than a year old is absolutely out of the
question. I think that your patch should be evaluated as a new v12
feature, following the revert of 218f51584d5.

> Please note, that attached patch do not fix 2nd bug found by Chen in page delete redo.
>
> I understand that reverting commit 218f51584d5 and returning to long but finite root locks is safest solution

Whatever happens with the master branch, I think that reverting commit
218f51584d5 is the only option on the table for the back branches. Its
design is based on ideas on locking protocols that are fundamentally
incorrect and unworkable. I don't have a lot of faith in our ability
to retrofit a design that fixes the issue without causing problems
elsewhere.

-- 
Peter Geoghegan


Hi,

On 2018-11-10 17:42:16 -0800, Peter Geoghegan wrote:
> On Wed, Nov 7, 2018 at 5:46 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Teodor: Do you think that the issue is fixable? It looks like there
> > are serious issues with the design of 218f51584d5 to me. I don't think
> > the general "there can't be any inserters at this subtree" thing works
> > given that we have to couple buffer locks when moving right for other
> > reasons. We call ginStepRight() within ginFinishSplit(), for reasons
> > that date back to bug fix commit ac4ab97e from 2013 -- that detail is
> > probably important, because it seems to be what breaks the subtree
> > design (we don't delete in two phases or anything in GIN).
> 
> Ping?
> 
> This is a thorny problem, and I'd like to get your input soon. I
> suspect that reverting 218f51584d5 may be the ultimate outcome, even
> though it's a v10 feature.

Teodor, any chance for a response here?  It's not OK to commit something
and then just not respond to bug reports, especially when they're well
diagnosed and clearly point towards issues in a commit of yours.

Alexander, CCing you, perhaps you could point the thread out to Teodor?

Greetings,

Andres Freund


From: Alexander Korotkov
On Tue, Dec 4, 2018 at 8:01 PM Andres Freund <andres@anarazel.de> wrote:
> On 2018-11-10 17:42:16 -0800, Peter Geoghegan wrote:
> > On Wed, Nov 7, 2018 at 5:46 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > Teodor: Do you think that the issue is fixable? It looks like there
> > > are serious issues with the design of 218f51584d5 to me. I don't think
> > > the general "there can't be any inserters at this subtree" thing works
> > > given that we have to couple buffer locks when moving right for other
> > > reasons. We call ginStepRight() within ginFinishSplit(), for reasons
> > > that date back to bug fix commit ac4ab97e from 2013 -- that detail is
> > > probably important, because it seems to be what breaks the subtree
> > > design (we don't delete in two phases or anything in GIN).
> >
> > Ping?
> >
> > This is a thorny problem, and I'd like to get your input soon. I
> > suspect that reverting 218f51584d5 may be the ultimate outcome, even
> > though it's a v10 feature.
>
> Teodor, any chance for a response here?  It's not OK to commit something
> and then just not respond to bug-reports, especially when they're well
> diagnose and clearly point towards issues in a commit of yours.

Unfortunately, Teodor is unreachable at least till next week.

> Alexander, CCing you, perhaps you could point the thread out to Teodor?

I'll take a look at this issue.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: Alexander Korotkov
On Thu, Dec 6, 2018 at 10:56 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Tue, Dec 4, 2018 at 8:01 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2018-11-10 17:42:16 -0800, Peter Geoghegan wrote:
> > > On Wed, Nov 7, 2018 at 5:46 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > > Teodor: Do you think that the issue is fixable? It looks like there
> > > > are serious issues with the design of 218f51584d5 to me. I don't think
> > > > the general "there can't be any inserters at this subtree" thing works
> > > > given that we have to couple buffer locks when moving right for other
> > > > reasons. We call ginStepRight() within ginFinishSplit(), for reasons
> > > > that date back to bug fix commit ac4ab97e from 2013 -- that detail is
> > > > probably important, because it seems to be what breaks the subtree
> > > > design (we don't delete in two phases or anything in GIN).
> > >
> > > Ping?
> > >
> > > This is a thorny problem, and I'd like to get your input soon. I
> > > suspect that reverting 218f51584d5 may be the ultimate outcome, even
> > > though it's a v10 feature.
> >
> > Teodor, any chance for a response here?  It's not OK to commit something
> > and then just not respond to bug-reports, especially when they're well
> > diagnose and clearly point towards issues in a commit of yours.
>
> Unfortunately, Teodor is unreachable at least till next week.
>
> > Alexander, CCing you, perhaps you could point the thread out to Teodor?
>
> I'll take a look at this issue.

I've run through the thread.

So, the algorithm introduced by 218f51584d5 is broken.  It tries to
guarantee that there are no inserters in the subtree by placing a
cleanup lock on the subtree root, assuming that an inserter holds pins on
the path from root to leaf.  But due to concurrent splits of internal
pages, the pins held may no longer correspond to the actual path.  I don't
see a way to fix this easily.  So, I think we should revert it from the back
branches and try to reimplement it in master.

However, I'd like to note that 218f51584d5 introduces two changes:
1) Cleanup locking only if there are pages to delete
2) Cleanup locking of only the subtree root
The 2nd one is broken.  But the 1st one still seems good and useful to me,
because in the vast majority of cases vacuum doesn't delete any
index pages.  So, I propose to revert 218f51584d5, but leave the
logic which locks the root for cleanup only once there are pages to
delete.  Any thoughts?
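
Roughly, in code form (a sketch only; the helper names here are placeholders, not the actual ginvacuum.c functions):

    static void
    gin_vacuum_posting_tree_sketch(Relation index, BlockNumber rootBlkno)
    {
        /* pass 1: ordinary leaf vacuum, noting whether any leaf page became empty */
        bool        have_empty = scan_leaves_collect_empty(index, rootBlkno);

        if (have_empty)
        {
            /* pass 2: only now take the cleanup lock, on the root as before 218f51584d5 */
            Buffer      rootBuffer = ReadBufferExtended(index, MAIN_FORKNUM, rootBlkno,
                                                        RBM_NORMAL, NULL);

            LockBufferForCleanup(rootBuffer);       /* waits until no other pins remain */
            delete_empty_pages(index, rootBlkno);   /* placeholder for the deletion pass */
            UnlockReleaseBuffer(rootBuffer);
        }
    }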

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


On Thu, Dec 6, 2018 at 12:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> So, algorithm introduced by 218f51584d5 is broken.  It tries to
> guarantee that there are no inserters in the subtree by placing
> cleanup lock to subtree root, assuming that inserter holds pins on the
> path from root to leaf.  But due to concurrent splits of internal
> pages the pins held can be not relevant to actual path.  I don't see
> the way to fix this easily.  So, I think we should revert it from back
> branches and try to reimplement that in master.
>
> However, I'd like to note that 218f51584d5 introduces two changes:
> 1) Cleanup locking only if there pages to delete
> 2) Cleanup locking only subtree root
> The 2nd one is broken.  But the 1st one seems still good for me and
> useful, because in vast majority of cases vacuum doesn't delete any
> index pages.  So, I propose to revert 218f51584d5, but leave there
> logic, which locks root for cleanup only once there are pages to
> delete.  Any thoughts?

Can you post a patch that just removes the 2nd part, leaving the
still-correct first part?

Your proposal sounds reasonable, but I'd like to see what you have in
mind in detail before commenting.

Thanks
-- 
Peter Geoghegan


Hi!

> On Dec 7, 2018, at 2:50, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Dec 6, 2018 at 12:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>>
>> However, I'd like to note that 218f51584d5 introduces two changes:
>> 1) Cleanup locking only if there pages to delete
>> 2) Cleanup locking only subtree root
>> The 2nd one is broken.  But the 1st one seems still good for me and
>> useful, because in vast majority of cases vacuum doesn't delete any
>> index pages.  So, I propose to revert 218f51584d5, but leave there
>> logic, which locks root for cleanup only once there are pages to
>> delete.  Any thoughts?
>
> Can you post a patch that just removes the 2nd part, leaving the
> still-correct first part?

I like the idea of keeping the cleanup lock only if there are pages to delete. It will still solve the original problem
that caused the proposals for GIN VACUUM enhancements.

But I must note that there is one more problem: ginVacuumPostingTreeLeaves() does not ensure that all split pages are
visited. It copies downlink block numbers to a temp array and then unlocks the parent. That is not a correct way to scan
the posting tree for cleanup.

PFA a diff with the following changes:
1. Always take the root cleanup lock before deleting pages
2. Check for concurrent splits after scanning a page

Please note that neither applying this diff nor reverting 218f51584d5 will solve the bug of the page delete redo lock on
standby.

Best regards, Andrey Borodin.

Attachment

From: Alexander Korotkov
On Fri, Dec 7, 2018 at 12:14 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On Dec 7, 2018, at 2:50, Peter Geoghegan <pg@bowt.ie> wrote:
> > On Thu, Dec 6, 2018 at 12:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >> However, I'd like to note that 218f51584d5 introduces two changes:
> >> 1) Cleanup locking only if there pages to delete
> >> 2) Cleanup locking only subtree root
> >> The 2nd one is broken.  But the 1st one seems still good for me and
> >> useful, because in vast majority of cases vacuum doesn't delete any
> >> index pages.  So, I propose to revert 218f51584d5, but leave there
> >> logic, which locks root for cleanup only once there are pages to
> >> delete.  Any thoughts?
> >
> > Can you post a patch that just removes the 2nd part, leaving the
> > still-correct first part?
>
> I like the idea of keeping the cleanup lock only if there are pages to delete. It will still solve the original problem
> that caused the proposals for GIN VACUUM enhancements.
>
> But I must note that there is one more problem: ginVacuumPostingTreeLeaves() does not ensure that all split pages are
> visited. It copies downlink block numbers to a temp array and then unlocks the parent. That is not a correct way to scan
> the posting tree for cleanup.

Hmm... In ginVacuumPostingTreeLeaves() we visit pages from parent to
children.  This is why split pages might be left unvisited.  Should we
visit leaf pages by rightlinks instead?  Then, if we have candidates
for deletion, ginScanToDelete() can work the same as before 218f51584d5.

> PFA diff with following changes:
> 1. Always take root cleanup lock before deleting pages
> 2. Check for concurrent splits after scanning page
>
> Please note that neither applying this diff nor reverting 218f51584d5 will solve the bug of the page delete redo lock on
> standby.

ginRedoDeletePage() visits pages in the following order: right (the page to be
deleted), parent, left.  I didn't find any description of why it has
been done this way.  Any guesses why we don't visit the parent page first here?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: Alexander Korotkov
On Fri, Dec 7, 2018 at 12:50 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Dec 6, 2018 at 12:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > So, algorithm introduced by 218f51584d5 is broken.  It tries to
> > guarantee that there are no inserters in the subtree by placing
> > cleanup lock to subtree root, assuming that inserter holds pins on the
> > path from root to leaf.  But due to concurrent splits of internal
> > pages the pins held can be not relevant to actual path.  I don't see
> > the way to fix this easily.  So, I think we should revert it from back
> > branches and try to reimplement that in master.
> >
> > However, I'd like to note that 218f51584d5 introduces two changes:
> > 1) Cleanup locking only if there pages to delete
> > 2) Cleanup locking only subtree root
> > The 2nd one is broken.  But the 1st one seems still good for me and
> > useful, because in vast majority of cases vacuum doesn't delete any
> > index pages.  So, I propose to revert 218f51584d5, but leave there
> > logic, which locks root for cleanup only once there are pages to
> > delete.  Any thoughts?
>
> Can you post a patch that just removes the 2nd part, leaving the
> still-correct first part?
>
> Your proposal sounds reasonable, but I'd like to see what you have in
> mind in detail before commenting.

Yep, please find attached draft patch.

BTW, it seems that I've another bug in GIN.  README says that

"However, posting trees are only
fully searched from left to right, starting from the leftmost leaf. (The
tree-structure is only needed by insertions, to quickly find the correct
insert location). So as long as we don't delete the leftmost page on each
level, a search can never follow a downlink to page that's about to be
deleted."

But that's not really true once we taught GIN to skip parts of posting
trees in PostgreSQL 9.4.  So, there might be a risk of visiting a page to
be deleted via a downlink.  But in order to get a real problem, vacuum
should pass the cleanup stage and someone else should reuse this page,
before we traverse the downlink.  Thus, the risk of a real problem is very
low.  But it still should be addressed.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment
Hi!

Thanks, Alexander!
> On Dec 8, 2018, at 6:54, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Yep, please find attached draft patch.

The patch seems good to me, I'll check it in more detail.
The patch gets the posting item at FirstOffsetNumber instead of btree->getLeftMostChild(). This seems OK, since
dataGetLeftMostPage() is doing just the same, but with a few Assert()s.

> BTW, it seems that I've another bug in GIN.  README says that
>
> "However, posting trees are only
> fully searched from left to right, starting from the leftmost leaf. (The
> tree-structure is only needed by insertions, to quickly find the correct
> insert location). So as long as we don't delete the leftmost page on each
> level, a search can never follow a downlink to page that's about to be
> deleted."
>
> But that's not really true once we teach GIN to skip parts of posting
> trees in PostgreSQL 9.4.  So, there might be a risk to visit page to
> be deleted using downlink.  But in order to get real problem, vacuum
> should past cleanup stage and someone else should reuse this page,
> before we traverse downlink.  Thus, the risk of real problem is very
> low.  But it still should be addressed.

There's a patch above in this thread, 0001-Use-correct-locking-protocol-during-GIN-posting-tree.patch, where I propose
stamping every deleted page with GinPageSetDeleteXid(page, ReadNewTransactionId()) and avoiding reuse of the page before
TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalDataXmin).
Should we leave this bug alone for future fixes, to keep the current fix noninvasive?

Best regards, Andrey Borodin.

From: Alexander Korotkov
On Sat, Dec 8, 2018 at 12:48 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On Dec 8, 2018, at 6:54, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > Yep, please find attached draft patch.
>
> Patch seems good to me, I'll check it in more detail.
> The patch gets the posting item at FirstOffsetNumber instead of btree->getLeftMostChild(). This seems OK, since
> dataGetLeftMostPage() is doing just the same, but with a few Assert()s.

I'd like to avoid creating a GinBtree just for calling
getLeftMostChild().  Also, a few more places in ginvacuum.c do the same.
We have the same amount of Assert()s in ginVacuumPostingTreeLeaves().
So, let's keep it uniform.

I would also like Peter Geoghegan to take a look at this patch before
committing it.

> > BTW, it seems that I've another bug in GIN.  README says that
> >
> > "However, posting trees are only
> > fully searched from left to right, starting from the leftmost leaf. (The
> > tree-structure is only needed by insertions, to quickly find the correct
> > insert location). So as long as we don't delete the leftmost page on each
> > level, a search can never follow a downlink to page that's about to be
> > deleted."
> >
> > But that's not really true once we teach GIN to skip parts of posting
> > trees in PostgreSQL 9.4.  So, there might be a risk to visit page to
> > be deleted using downlink.  But in order to get real problem, vacuum
> > should past cleanup stage and someone else should reuse this page,
> > before we traverse downlink.  Thus, the risk of real problem is very
> > low.  But it still should be addressed.
>
> There's a patch above in this thread, 0001-Use-correct-locking-protocol-during-GIN-posting-tree.patch, where I propose
> stamping every deleted page with GinPageSetDeleteXid(page, ReadNewTransactionId()) and avoiding reuse of the page before
> TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalDataXmin).
> Should we leave this bug alone for future fixes, to keep the current fix noninvasive?

I think this is a separate bug introduced in PostgreSQL 9.4, which
should be backpatched with a separate commit.  Could you please extract
the patch dealing with GinPageSetDeleteXid() and GinPageGetDeleteXid()?
The rest of the work made in your patch should be considered for master.

BTW, what do you think about locking order in ginRedoDeletePage()?  Do
you have any explanation of current behavior?  I'll try to reach
Teodor tomorrow with this question.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Hi!

> On Dec 10, 2018, at 0:25, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>>
>> There's a patch above in this thread, 0001-Use-correct-locking-protocol-during-GIN-posting-tree.patch, where I propose
>> stamping every deleted page with GinPageSetDeleteXid(page, ReadNewTransactionId()) and avoiding reuse of the page before
>> TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalDataXmin).
>> Should we leave this bug alone for future fixes, to keep the current fix noninvasive?
>
> I think since is separate bug introduced in PostgreSQL 9.4, which
> should be backpatched with separate commit.  Could you please extract
> patch dealing with GinPageSetDeleteXid() and GinPageGetDeleteXid().
> The rest of work made in your patch should be considered for master.

PFA this part. In the thread about GiST vacuum Heikki was not very happy with reusing PageHeader->pd_prune_xid for this,
but we somewhat concluded that this might be OK.
Anyway, there's a whole page; we can store those bits anywhere.

> BTW, what do you think about locking order in ginRedoDeletePage()?  Do
> you have any explanation of current behavior?
It seems to me that the locking order must be left->deleted->parent.
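
Sketched against ginRedoDeletePage(), with block 2 being the left sibling as in the gcore dump above (treating block 0 as
the deleted page and block 1 as the parent is my assumption about the record layout, so take the numbering as illustrative):

    /* left sibling first, matching ginStepRight()'s left-to-right lock coupling */
    if (XLogReadBufferForRedo(record, 2, &lbuffer) == BLK_NEEDS_REDO)
    {
        /* repoint its rightlink past the deleted page */
    }

    /* then the page being deleted */
    if (XLogReadBufferForRedo(record, 0, &dbuffer) == BLK_NEEDS_REDO)
    {
        /* mark it deleted */
    }

    /* and the parent last */
    if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
    {
        /* remove the downlink to the deleted page */
    }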

Best regards, Andrey Borodin.

> On Dec 10, 2018, at 18:56, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> PFA this part. In the thread about GiST vacuum Heikki was not very happy with reusing PageHeader->pd_prune_xid for this,
> but we somewhat concluded that this might be OK.
> Anyway, there's a whole page; we can store those bits anywhere.

Errrmm... I'm very sorry for the noise. I forgot to attach the actual patch.

Best regards, Andrey Borodin.

Attachment

From: Alexander Korotkov
On Mon, Dec 10, 2018 at 4:58 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On Dec 10, 2018, at 18:56, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > PFA this part. In the thread about GiST vacuum Heikki was not very happy with reusing PageHeader->pd_prune_xid for
> > this, but we somewhat concluded that this might be OK.
> > Anyway, there's a whole page; we can store those bits anywhere.

We might allocate a special field in the opaque area to store the xid, like
B-tree or SP-GiST.  But we have to be binary compatible.  And this
change shouldn't require an upgrade procedure, since we're going to
backpatch it.  We could allocate space elsewhere on the page, but
that would be a bit cumbersome.  So, let's use pd_prune_xid, since it's
available.

> Errrmm... I'm very sorry for the noise. I've forgot the actual patch.

The attached patch appears to be incomplete.  GinPageSetDeleteXid() is
called only in ginRedoDeletePage(), so only in recovery, while it
should be set during normal work too.  The deleteXid field of
ginxlogDeletePage is never set.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: Alexander Korotkov
On Sun, Dec 9, 2018 at 10:25 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Sat, Dec 8, 2018 at 12:48 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > > On Dec 8, 2018, at 6:54, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > >
> > > Yep, please find attached draft patch.
> >
> > Patch seems good to me, I'll check it in more detail.
> > The patch gets the posting item at FirstOffsetNumber instead of btree->getLeftMostChild(). This seems OK, since
> > dataGetLeftMostPage() is doing just the same, but with a few Assert()s.
>
> I'd like to evade creating GinBtree for just calling
> getLeftMostChild().  Also, few more places in ginvacuum.c do the same.
> We have the same amount of Assert()s in ginVacuumPostingTreeLeaves().
> So, let's keep it uniform.
>
> I would also like Peter Geoghegan to take a look at this patch before
> committing it.

I've slightly adjusted the commit message.  I'm going to commit this fix
if there are no objections.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment
Hi!

> On Dec 11, 2018, at 3:43, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Attached patch appears to be incomplete.  GinPageSetDeleteXid() is
> called only in ginRedoDeletePage(), so only in recovery, while it
> should be set during normal work too.  deleteXid field of
> ginxlogDeletePage is never set.

Sorry, I've messed it up again. I forgot to include the ginvacuum.c changes. Here it is.

Best regards, Andrey Borodin.

Attachment

From: Alexander Korotkov
On Tue, Dec 11, 2018 at 10:14 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On Dec 11, 2018, at 3:43, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > Attached patch appears to be incomplete.  GinPageSetDeleteXid() is
> > called only in ginRedoDeletePage(), so only in recovery, while it
> > should be set during normal work too.  deleteXid field of
> > ginxlogDeletePage is never set.
>
> Sorry, I've messed it again. Forgot to include ginvacuum.c changes. Here it is.

BTW, I still can't see where you're setting the deleteXid field of the
ginxlogDeletePage struct.

Also, I note that protection against re-use of recently deleted
pages is implemented differently than it is in B-tree.
1) You check TransactionIdPrecedes(GinPageGetDeleteXid(page),
RecentGlobalDataXmin), while B-tree checks
TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin).  Could you
clarify why we use RecentGlobalDataXmin instead of RecentGlobalXmin
here?
2) B-tree checks this condition both before putting a page into the FSM and
after getting a page from the FSM.  You're checking only after getting a page
from the FSM.  The B-tree approach looks better to me.  It seems more
consistent when FSM pages are really free for use.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: Alexander Korotkov
On Tue, Dec 11, 2018 at 1:50 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Sun, Dec 9, 2018 at 10:25 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > On Sat, Dec 8, 2018 at 12:48 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > > > On Dec 8, 2018, at 6:54, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > > >
> > > > Yep, please find attached draft patch.
> > >
> > > Patch seems good to me, I'll check it in more detail.
> > > The patch gets the posting item at FirstOffsetNumber instead of btree->getLeftMostChild(). This seems OK, since
> > > dataGetLeftMostPage() is doing just the same, but with a few Assert()s.
> >
> > I'd like to evade creating GinBtree for just calling
> > getLeftMostChild().  Also, few more places in ginvacuum.c do the same.
> > We have the same amount of Assert()s in ginVacuumPostingTreeLeaves().
> > So, let's keep it uniform.
> >
> > I would also like Peter Geoghegan to take a look at this patch before
> > committing it.
>
> I've slightly adjusted commit message.  I'm going to commit this fix
> if no objections.

Please also find a patch changing the lock order in ginRedoDeletePage(),
preventing the deadlock on standby.  I'm going to commit it too.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment
Hi!

> On Dec 12, 2018, at 3:26, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> BTW, I still can't see you're setting deleteXid field of
> ginxlogDeletePage struct anywhere.
Oops. Fixed.
>
> Also, I note that protection against re-usage of recently deleted
> pages is implemented differently than it is in B-tree.
> 1) You check TransactionIdPrecedes(GinPageGetDeleteXid(page),
> RecentGlobalDataXmin)), while B-tree checks
> TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin).  Could you
> clarify why do we use RecentGlobalDataXmin instead of RecentGlobalXmin
> here?
As far as I understand, RecentGlobalDataXmin may be slightly farther than RecentGlobalXmin when
replication_slot_catalog_xmin is holding back RecentGlobalXmin, and GIN is never used by catalog tables. But let's use
RecentGlobalXmin like in B-tree.

> 2) B-tree checks this condition both before putting page into FSM and
> after getting page from FSM.  You're checking only after getting page
> from FSM.  Approach of B-tree looks better for me.  It's seems more
> consistent when FSM pages are really free for usage.
Fixed.
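
For clarity, the two check sites then look roughly like this (a sketch, not the exact patch; RecordFreeIndexPage() and
GetFreeIndexPage() are the existing indexfsm.c API, while the GinPageGetDeleteXid() guard is the one from this patch):

    /* vacuum: only advertise the page in the FSM once it is safe to recycle */
    if (GinPageIsDeleted(page) &&
        TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalXmin))
        RecordFreeIndexPage(index, blkno);

    /* allocation: re-check after pulling a block from the FSM, since the FSM
     * is not crash-safe and may hand back a page that is not yet reusable */
    blkno = GetFreeIndexPage(index);
    if (blkno != InvalidBlockNumber)
    {
        /* read the page and apply the same GinPageGetDeleteXid() test before reusing it */
    }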


Best regards, Andrey Borodin.

Attachment

From: Alexander Korotkov
On Wed, Dec 12, 2018 at 4:08 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On Dec 12, 2018, at 3:26, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > BTW, I still can't see you're setting deleteXid field of
> > ginxlogDeletePage struct anywhere.
> Oops. Fixed.
> >
> > Also, I note that protection against re-usage of recently deleted
> > pages is implemented differently than it is in B-tree.
> > 1) You check TransactionIdPrecedes(GinPageGetDeleteXid(page),
> > RecentGlobalDataXmin)), while B-tree checks
> > TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin).  Could you
> > clarify why do we use RecentGlobalDataXmin instead of RecentGlobalXmin
> > here?
> As far as I understand, RecentGlobalDataXmin may be slightly farther than RecentGlobalXmin when
> replication_slot_catalog_xmin is holding back RecentGlobalXmin, and GIN is never used by catalog tables. But let's use
> RecentGlobalXmin like in B-tree.
>
> > 2) B-tree checks this condition both before putting page into FSM and
> > after getting page from FSM.  You're checking only after getting page
> > from FSM.  Approach of B-tree looks better for me.  It's seems more
> > consistent when FSM pages are really free for usage.
> Fixed.

Thank you.  I've revised your patch and pushed it, along with two
other patches in this thread.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


On Thu, Dec 13, 2018 at 03:03:54PM +0300, Alexander Korotkov wrote:
> On Wed, Dec 12, 2018 at 4:08 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > > On Dec 12, 2018, at 3:26, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > >
> > > BTW, I still can't see you're setting deleteXid field of
> > > ginxlogDeletePage struct anywhere.
> > Oops. Fixed.
> > >
> > > Also, I note that protection against re-usage of recently deleted
> > > pages is implemented differently than it is in B-tree.
> > > 1) You check TransactionIdPrecedes(GinPageGetDeleteXid(page),
> > > RecentGlobalDataXmin)), while B-tree checks
> > > TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin).  Could you
> > > clarify why do we use RecentGlobalDataXmin instead of RecentGlobalXmin
> > > here?
> > As far as I understand, RecentGlobalDataXmin may be slightly farther than RecentGlobalXmin when
> > replication_slot_catalog_xmin is holding back RecentGlobalXmin, and GIN is never used by catalog tables. But let's use
> > RecentGlobalXmin like in B-tree.
 
> >
> > > 2) B-tree checks this condition both before putting page into FSM and
> > > after getting page from FSM.  You're checking only after getting page
> > > from FSM.  Approach of B-tree looks better for me.  It's seems more
> > > consistent when FSM pages are really free for usage.
> > Fixed.
> 
> Thank you.  I've revised your patch and pushed it.  As long as two
> other patches in this thread.

I am seeing this warning in the 9.4 branch:

    ginxlog.c:756:5: warning: ‘lbuffer’ may be used uninitialized
    in this function [-Wmaybe-uninitialized]

from this commit:

    commit 19cf52e6cc33b9e126775f28269ccccb6ddf7e30
    Author: Alexander Korotkov <akorotkov@postgresql.org>
    Date:   Thu Dec 13 06:12:25 2018 +0300

    Prevent deadlock in ginRedoDeletePage()

    On standby ginRedoDeletePage() can work concurrently with read-only queries.
    Those queries can traverse posting tree in two ways.
    1) Using rightlinks by ginStepRight(), which locks the next page before
       unlocking its left sibling.
    2) Using downlinks by ginFindLeafPage(), which locks at most one page at time.

    Original lock order was: page, parent, left sibling.  That lock order can
    deadlock with ginStepRight().  In order to prevent deadlock this commit changes
    lock order to: left sibling, page, parent.  Note, that position of parent in
    locking order seems insignificant, because we only lock one page at time while
    traversing downlinks.

    Reported-by: Chen Huajun
    Diagnosed-by: Chen Huajun, Peter Geoghegan, Andrey Borodin
    Discussion: https://postgr.es/m/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com
    Author: Alexander Korotkov
    Backpatch-through: 9.4

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Bruce Momjian <bruce@momjian.us> writes:
> I am seeing this warning in the 9.4 branch:
>     ginxlog.c:756:5: warning: ‘lbuffer’ may be used uninitialized
>     in this function [-Wmaybe-uninitialized]

I'm also getting that, just in 9.4, but at a different line number:

ginxlog.c: In function 'ginRedoDeletePage':
ginxlog.c:693: warning: 'lbuffer' may be used uninitialized in this function

That's with gcc version 4.4.7 20120313 (Red Hat 4.4.7-23)

            regards, tom lane


Hi!


> On Dec 13, 2018, at 17:03, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Thank you.  I've revised your patch and pushed it.  As long as two
> other patches in this thread.

That's great! Thanks!

> On Dec 13, 2018, at 20:12, chjischj@163.com wrote:
>
>
> hi
> I have a question: why is the order of unlocking not adjusted in this patch? Like this:
>
> if (BufferIsValid(lbuffer))
> UnlockReleaseBuffer(lbuffer);
> if (BufferIsValid(pbuffer))
> UnlockReleaseBuffer(pbuffer);
> if (BufferIsValid(dbuffer))
> UnlockReleaseBuffer(dbuffer);
> ==>
> if (BufferIsValid(pbuffer))
> UnlockReleaseBuffer(pbuffer);
> if (BufferIsValid(dbuffer))
> UnlockReleaseBuffer(dbuffer);
> if (BufferIsValid(lbuffer))
> UnlockReleaseBuffer(lbuffer);

I think that the unlock order does not matter. But I may be wrong. Maybe just for uniformity?


> On Dec 13, 2018, at 21:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Bruce Momjian <bruce@momjian.us> writes:
>> I am seeing this warning in the 9.4 branch:
>>     ginxlog.c:756:5: warning: ‘lbuffer’ may be used uninitialized
>>     in this function [-Wmaybe-uninitialized]
>
> I'm also getting that, just in 9.4, but at a different line number:
>
> ginxlog.c: In function 'ginRedoDeletePage':
> ginxlog.c:693: warning: 'lbuffer' may be used uninitialized in this function
>
> That's with gcc version 4.4.7 20120313 (Red Hat 4.4.7-23)


That's the same variable; one place is the definition while the other is the potential misuse.
It seems these 2 lines [0]

+    if (BufferIsValid(lbuffer))
+        UnlockReleaseBuffer(lbuffer);

are superfluous: lbuffer is UnlockReleased earlier.



Best regards, Andrey Borodin.

[0]
https://github.com/postgres/postgres/commit/19cf52e6cc33b9e126775f28269ccccb6ddf7e30#diff-ed6446a8993c76d2884ec6413e49a6b6R757

From: Alexander Korotkov
On Thu, Dec 13, 2018 at 8:06 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > if (BufferIsValid(lbuffer))
> > UnlockReleaseBuffer(lbuffer);
> > if (BufferIsValid(pbuffer))
> > UnlockReleaseBuffer(pbuffer);
> > if (BufferIsValid(dbuffer))
> > UnlockReleaseBuffer(dbuffer);
> > ==>
> > if (BufferIsValid(pbuffer))
> > UnlockReleaseBuffer(pbuffer);
> > if (BufferIsValid(dbuffer))
> > UnlockReleaseBuffer(dbuffer);
> > if (BufferIsValid(lbuffer))
> > UnlockReleaseBuffer(lbuffer);
>
> I think that unlock order does not matter. But I may be wrong. May be just for uniformity?

It doesn't matter, because we release all locks on every buffer at one
time.  The unlock order can have an effect on which waiter will acquire
the lock next after ginRedoDeletePage().  However, I don't see why one
unlock order is better than another.  Thus, I just used the rule of
thumb to not change code when it's not necessary for the bug fix.

> > On Dec 13, 2018, at 21:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Bruce Momjian <bruce@momjian.us> writes:
> >> I am seeing this warning in the 9.4 branch:
> >>      ginxlog.c:756:5: warning: ‘lbuffer’ may be used uninitialized
> >>      in this function [-Wmaybe-uninitialized]
> >
> > I'm also getting that, just in 9.4, but at a different line number:
> >
> > ginxlog.c: In function 'ginRedoDeletePage':
> > ginxlog.c:693: warning: 'lbuffer' may be used uninitialized in this function
> >
> > That's with gcc version 4.4.7 20120313 (Red Hat 4.4.7-23)
>
>
> That's the same variable, one place is definition while other is potential misuse.
> Seems like these 2 lines [0]
>
> +       if (BufferIsValid(lbuffer))
> +               UnlockReleaseBuffer(lbuffer);
>
> are superfluous: lbuffer is UnlockReleased earlier.

Thanks to everybody for noticing!  Speaking more generally, the backpatch
to 9.4 was completely wrong: there were multiple errors.  It's my
oversight; I forgot how much better our xlog system has become since 9.4
:)

I've pushed bf0e5a73be fixing that.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Hi,

On 2018-12-13 22:40:59 +0300, Alexander Korotkov wrote:
> It doesn't mater, because we release all locks on every buffer at one
> time.  The unlock order can have effect on what waiter will acquire
> the lock next after ginRedoDeletePage().  However, I don't see why one
> unlock order is better than another.  Thus, I just used the rule of
> thumb to not change code when it's not necessary for bug fix.

I think it's right not to change the unlock order at the same time as a
bugfix here.  More generally, I think it can often be useful to default
to releasing locks in the inverse order they've been acquired - if there's
any likelihood that somebody will acquire them in the same order, that
ensures that such a party would only need to wait for a lock once,
instead of being woken up for one lock and then immediately having to
wait for the next one.
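
To make that ordering concrete, here is a minimal sketch using ordinary buffer content locks; the relation, block numbers, and function name are placeholders, not anything from the patch under discussion.

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "utils/relcache.h"

    /*
     * Sketch only: take two buffer locks "left to right" and drop them
     * "right to left".  A concurrent backend that locks the same pages in
     * the same left-to-right order then sleeps at most once, instead of
     * being woken for the first lock only to block again on the second.
     */
    static void
    lock_in_order_release_in_reverse(Relation rel, BlockNumber blk1,
                                     BlockNumber blk2)
    {
        Buffer      buf1 = ReadBuffer(rel, blk1);
        Buffer      buf2 = ReadBuffer(rel, blk2);

        LockBuffer(buf1, BUFFER_LOCK_EXCLUSIVE);    /* acquired first */
        LockBuffer(buf2, BUFFER_LOCK_EXCLUSIVE);    /* acquired second */

        /* ... modify both pages ... */

        UnlockReleaseBuffer(buf2);                  /* released first */
        UnlockReleaseBuffer(buf1);                  /* released last */
    }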

Greetings,

Andres Freund


Re: Connections hang indefinitely while taking a gin index's LWLockbuffer_content lock

From: Alexander Korotkov
On Thu, Dec 13, 2018 at 10:46 PM Andres Freund <andres@anarazel.de> wrote:
> On 2018-12-13 22:40:59 +0300, Alexander Korotkov wrote:
> > It doesn't matter, because we release all locks on every buffer at one
> > time.  The unlock order can have effect on what waiter will acquire
> > the lock next after ginRedoDeletePage().  However, I don't see why one
> > unlock order is better than another.  Thus, I just used the rule of
> > thumb to not change code when it's not necessary for bug fix.
>
> I think it's right to not change unlock order at the same time as a
> bugfix here.  More generally I think it can often be useful to default
> to release locks in the inverse order they've been acquired - if there's
> any likelihood that somebody will acquire them in the same order, that
> ensures that such a party would only need to wait for a lock once,
> instead of being woken up for one lock, and then immediately having to
> wait for the next one.

Good point, thank you!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


On Thu, Dec 13, 2018 at 10:40:59PM +0300, Alexander Korotkov wrote:
> On Thu, Dec 13, 2018 at 8:06 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > That's the same variable, one place is definition while other is potential misuse.
> > Seems like these 2 lines [0]
> >
> > +       if (BufferIsValid(lbuffer))
> > +               UnlockReleaseBuffer(lbuffer);
> >
> > are superfluous: lbuffer is UnlockReleased earlier.
> 
> Thanks to everybody for noticing!  Speaking more generally backpatch
> to 9.4 was completely wrong: there were multiple errors.  It's my
> oversight, I forget how much better our xlog system became since 9.4
> :)
> 
> I've pushed bf0e5a73be fixing that.

I can confirm the compiler warning is now gone.  Thanks.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


On Thu, 13 Dec 2018 at 14:48, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Thu, Dec 13, 2018 at 10:46 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2018-12-13 22:40:59 +0300, Alexander Korotkov wrote:
> > > It doesn't matter, because we release all locks on every buffer at one
> > > time.  The unlock order can have effect on what waiter will acquire
> > > the lock next after ginRedoDeletePage().  However, I don't see why one
> > > unlock order is better than another.  Thus, I just used the rule of
> > > thumb to not change code when it's not necessary for bug fix.
> >
> > I think it's right to not change unlock order at the same time as a
> > bugfix here.  More generally I think it can often be useful to default
> > to release locks in the inverse order they've been acquired - if there's
> > any likelihood that somebody will acquire them in the same order, that
> > ensures that such a party would only need to wait for a lock once,
> > instead of being woken up for one lock, and then immediately having to
> > wait for the next one.
>
> Good point, thank you!

It's been pointed out to me that 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478
introduced a WAL incompatibility that has not been flagged.

In ginRedoDeletePage() we use the struct directly to read the WAL record, so if a WAL record was written prior to 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478, yet read by code at 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478 or later then we will have problems, since deleteXid will not be set correctly.

It seems this should not have been backpatched.

Please give your assessment.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Mar 21, 2019 at 9:26 PM Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 13 Dec 2018 at 14:48, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> On Thu, Dec 13, 2018 at 10:46 PM Andres Freund <andres@anarazel.de> wrote:
>> > On 2018-12-13 22:40:59 +0300, Alexander Korotkov wrote:
>> > > It doesn't matter, because we release all locks on every buffer at one
>> > > time.  The unlock order can have effect on what waiter will acquire
>> > > the lock next after ginRedoDeletePage().  However, I don't see why one
>> > > unlock order is better than another.  Thus, I just used the rule of
>> > > thumb to not change code when it's not necessary for bug fix.
>> >
>> > I think it's right to not change unlock order at the same time as a
>> > bugfix here.  More generally I think it can often be useful to default
>> > to release locks in the inverse order they've been acquired - if there's
>> > any likelihood that somebody will acquire them in the same order, that
>> > ensures that such a party would only need to wait for a lock once,
>> > instead of being woken up for one lock, and then immediately having to
>> > wait for the next one.
>>
>> Good point, thank you!
>
>
> It's been pointed out to me that 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478
> introduced a WAL incompatibility that has not been flagged.
>
> In ginRedoDeletePage() we use the struct directly to read the WAL record, so if a WAL record was written prior to
> 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478, yet read by code at 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478 or later then we
> will have problems, since deleteXid will not be set correctly.
>
> It seems this should not have been backpatched.
>
> Please give your assessment.

Oh, right.  This is my fault.

However, I think this still can be backpatched correctly.  We can
determine whether xlog record data contains deleteXid by its size.
See the attached patch.  I haven't tested this yet.  I'm going to test
it.  If OK, then push.
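
To make the idea concrete, here is a rough sketch of such a size check. The exact field handling and the fallback value are assumptions for illustration only, not the committed patch.

    #include "postgres.h"
    #include "access/ginxlog.h"
    #include "access/transam.h"
    #include "access/xlogreader.h"

    /*
     * Sketch only.  A record written before 52ac6cd2 lacks the trailing
     * deleteXid field, so its main data is shorter than the current
     * ginxlogDeletePage struct.  Reading the struct directly from such a
     * record would pick up garbage for deleteXid; branching on the data
     * length avoids that.
     */
    static TransactionId
    deletexid_from_record_sketch(XLogReaderState *record)
    {
        ginxlogDeletePage *data = (ginxlogDeletePage *) XLogRecGetData(record);

        if (XLogRecGetDataLen(record) >= sizeof(ginxlogDeletePage))
            return data->deleteXid;         /* new-format record */

        /*
         * Old-format record: no deleteXid was logged.  What to substitute
         * here is the interesting part of the real fix; InvalidTransactionId
         * is only a stand-in for this sketch.
         */
        return InvalidTransactionId;
    }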

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment
On Thu, 21 Mar 2019 at 15:18, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Thu, Mar 21, 2019 at 9:26 PM Simon Riggs <simon@2ndquadrant.com> wrote:
> > It's been pointed out to me that 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478
> > introduced a WAL incompatibility that has not been flagged.
> >
> > In ginRedoDeletePage() we use the struct directly to read the WAL record, so if a WAL record was written prior to 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478, yet read by code at 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478 or later then we will have problems, since deleteXid will not be set correctly.
> >
> > It seems this should not have been backpatched.
> >
> > Please give your assessment.
>
> Oh, right.  This is my fault.
>
> However, I think this still can be backpatched correctly.  We can
> determine whether xlog record data contains deleteXid by its size.
> See the attached patch.  I haven't tested this yet.  I'm going to test
> it.  If OK, then push.

Patch looks like it will work.

I'd prefer to see a tighter test, since the length should be either the old or the new length, nothing else.

Thanks

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Mar-21, Alexander Korotkov wrote:

> However, I think this still can be backpatched correctly.  We can
> determine whether xlog record data contains deleteXid by its size.
> See the attached patch.  I haven't tested this yet.  I'm going to test
> it.  If OK, then push.

Wow, this is so magical that I think it merits a long comment explaining
what is going on.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On Fri, Mar 22, 2019 at 12:06 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> On 2019-Mar-21, Alexander Korotkov wrote:
>
> > However, I think this still can be backpatched correctly.  We can
> > determine whether xlog record data contains deleteXid by its size.
> > See the attached patch.  I haven't tested this yet.  I'm going to test
> > it.  If OK, then push.
>
> Wow, this is so magical that I think it merits a long comment explaining
> what is going on.

Yeah, have to do magic to fix such weird things :)
Please find the next revision of the patch attached.  It uses a static
assertion to check that the data structure size has changed.  Also, an
assertion checks that the record data length matches one of the
structures' lengths.
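
For illustration, a sketch of what those checks might look like; the old-layout struct here is reconstructed purely for the example, and the committed patch may express this differently.

    #include "postgres.h"
    #include "access/ginxlog.h"
    #include "storage/block.h"
    #include "storage/off.h"

    /* Hypothetical pre-52ac6cd2 layout, without deleteXid, for size comparison. */
    typedef struct ginxlogDeletePageOld
    {
        OffsetNumber parentOffset;
        BlockNumber  rightLink;
    } ginxlogDeletePageOld;

    /*
     * Sketch only: decide which layout a record carries from its data length.
     * The static assertion guarantees the two layouts really differ in size,
     * so the length is a usable discriminator; the runtime assertion is the
     * "tighter test" Simon asked for - the length must be exactly one of the
     * two known sizes.
     */
    static bool
    record_has_deletexid_sketch(Size datalen)
    {
        StaticAssertStmt(sizeof(ginxlogDeletePageOld) != sizeof(ginxlogDeletePage),
                         "old and new ginxlogDeletePage layouts must differ in size");

        Assert(datalen == sizeof(ginxlogDeletePageOld) ||
               datalen == sizeof(ginxlogDeletePage));

        return datalen == sizeof(ginxlogDeletePage);
    }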

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment
On Fri, Mar 22, 2019 at 11:05 AM Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Fri, Mar 22, 2019 at 12:06 AM Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > On 2019-Mar-21, Alexander Korotkov wrote:
> >
> > > However, I think this still can be backpatched correctly.  We can
> > > determine whether xlog record data contains deleteXid by its size.
> > > See the attached patch.  I haven't tested this yet.  I'm going to test
> > > it.  If OK, then push.
> >
> > Wow, this is so magical that I think it merits a long comment explaining
> > what is going on.
>
> Yeah, have to do magic to fix such weird things :)
> Please, find next revision of patch attached.  It uses static
> assertion to check that data structure size has changed.  Also,
> assertion checks that record data length matches one of the structures'
> lengths.

I'm going to backpatch this if there are no objections.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


On Sat, Mar 23, 2019 at 11:43 AM Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Fri, Mar 22, 2019 at 11:05 AM Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
> > On Fri, Mar 22, 2019 at 12:06 AM Alvaro Herrera
> > <alvherre@2ndquadrant.com> wrote:
> > > On 2019-Mar-21, Alexander Korotkov wrote:
> > >
> > > > However, I think this still can be backpatched correctly.  We can
> > > > determine whether xlog record data contains deleteXid by its size.
> > > > See the attached patch.  I haven't tested this yet.  I'm going to test
> > > > it.  If OK, then push.
> > >
> > > Wow, this is so magical that I think it merits a long comment explaining
> > > what is going on.
> >
> > Yeah, have to do magic to fix such weird things :)
> > Please, find next revision of patch attached.  It uses static
> > assertion to check that data structure size has changed.  Also,
> > assertion checks that record data length matches one of the structures'
> > lengths.
>
> I'm going to backpatch this if there are no objections.

So, pushed.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company