pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic - Mailing list pgsql-hackers

An internal instance was rejecting connections with "too many clients".
I found a bunch of processes waiting on a futex and I was going to upgrade the
kernel (3.10.0-514) and dismiss the issue.

However, I also found an autovacuum chewing 100% CPU, and it appears the
problem is actually because autovacuum has locked a page of pg-statistic, and
every other process then gets stuck waiting in the planner.  I checked a few
and found these:

#13 0x0000000000961908 in SearchSysCache3 (cacheId=cacheId@entry=59, key1=key1@entry=2610, key2=key2@entry=2,
key3=key3@entry=0)at syscache.c:1156
 

As for the autovacuum:

$ ps -wwf 18950
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
postgres 18950  7179 93 Jun04 ?        ts   2049:20 postgres: autovacuum worker ts

(gdb)
#0  0x00000000004f995c in heap_prune_satisfies_vacuum (prstate=prstate@entry=0x7ffe7a0cd0c0,
tup=tup@entry=0x7ffe7a0cce50,buffer=buffer@entry=14138) at pruneheap.c:423
 
#1  0x00000000004fa154 in heap_prune_chain (prstate=0x7ffe7a0cd0c0, rootoffnum=11, buffer=14138) at pruneheap.c:644
#2  heap_page_prune (relation=relation@entry=0x7f0349466d28, buffer=buffer@entry=14138, vistest=vistest@entry=0xe7bcc0
<GlobalVisCatalogRels>,old_snap_xmin=old_snap_xmin@entry=0,
 
    old_snap_ts=old_snap_ts@entry=0, report_stats=report_stats@entry=false, off_loc=<optimized out>,
off_loc@entry=0x1d1b3fc)at pruneheap.c:278
 
#3  0x00000000004fd9bf in lazy_scan_prune (vacrel=vacrel@entry=0x1d1b390, buf=buf@entry=14138, blkno=blkno@entry=75,
page=page@entry=0x2aaab2089e00"G\f",
 
    vistest=vistest@entry=0xe7bcc0 <GlobalVisCatalogRels>, prunestate=prunestate@entry=0x7ffe7a0ced80) at
vacuumlazy.c:1712
#4  0x0000000000500263 in lazy_scan_heap (aggressive=<optimized out>, params=0x1c77b7c, vacrel=<optimized out>) at
vacuumlazy.c:1347
#5  heap_vacuum_rel (rel=0x7f0349466d28, params=0x1c77b7c, bstrategy=<optimized out>) at vacuumlazy.c:612
#6  0x000000000067418a in table_relation_vacuum (bstrategy=<optimized out>, params=0x1c77b7c, rel=0x7f0349466d28) at
../../../src/include/access/tableam.h:1678
#7  vacuum_rel (relid=2619, relation=<optimized out>, params=params@entry=0x1c77b7c) at vacuum.c:2001
#8  0x000000000067556e in vacuum (relations=0x1cc5008, params=params@entry=0x1c77b7c, bstrategy=<optimized out>,
bstrategy@entry=0x1c77400,isTopLevel=isTopLevel@entry=true) at vacuum.c:461
 
#9  0x0000000000783c13 in autovacuum_do_vac_analyze (bstrategy=0x1c77400, tab=0x1c77b78) at autovacuum.c:3284
#10 do_autovacuum () at autovacuum.c:2537
#11 0x0000000000784073 in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1715
#12 0x00000000007841c9 in StartAutoVacWorker () at autovacuum.c:1500
#13 0x0000000000792b9c in StartAutovacuumWorker () at postmaster.c:5547
#14 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5251
#15 <signal handler called>
#16 0x00007f0346c56783 in __select_nocancel () from /lib64/libc.so.6
#17 0x000000000048ee7d in ServerLoop () at postmaster.c:1709
#18 0x0000000000793e98 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1bed280) at postmaster.c:1417
#19 0x0000000000491272 in main (argc=3, argv=0x1bed280) at main.c:209

heap_page_prune() is being called repeatedly, with (I think) the same arguments.

(gdb) c
Continuing.

Breakpoint 3, heap_page_prune (relation=relation@entry=0x7f0349466d28, buffer=buffer@entry=14138,
vistest=vistest@entry=0xe7bcc0<GlobalVisCatalogRels>, old_snap_xmin=old_snap_xmin@entry=0, 
 
    old_snap_ts=old_snap_ts@entry=0, report_stats=report_stats@entry=false, off_loc=off_loc@entry=0x1d1b3fc) at
pruneheap.c:225
225     in pruneheap.c
(gdb) 
Continuing.

Breakpoint 3, heap_page_prune (relation=relation@entry=0x7f0349466d28, buffer=buffer@entry=14138,
vistest=vistest@entry=0xe7bcc0<GlobalVisCatalogRels>, old_snap_xmin=old_snap_xmin@entry=0, 
 
    old_snap_ts=old_snap_ts@entry=0, report_stats=report_stats@entry=false, off_loc=off_loc@entry=0x1d1b3fc) at
pruneheap.c:225
225     in pruneheap.c
(gdb) 
Continuing.

Breakpoint 3, heap_page_prune (relation=relation@entry=0x7f0349466d28, buffer=buffer@entry=14138,
vistest=vistest@entry=0xe7bcc0<GlobalVisCatalogRels>, old_snap_xmin=old_snap_xmin@entry=0, 
 
    old_snap_ts=old_snap_ts@entry=0, report_stats=report_stats@entry=false, off_loc=off_loc@entry=0x1d1b3fc) at
pruneheap.c:225
225     in pruneheap.c
(gdb) 
Continuing.

Breakpoint 3, heap_page_prune (relation=relation@entry=0x7f0349466d28, buffer=buffer@entry=14138,
vistest=vistest@entry=0xe7bcc0<GlobalVisCatalogRels>, old_snap_xmin=old_snap_xmin@entry=0, 
 
    old_snap_ts=old_snap_ts@entry=0, report_stats=report_stats@entry=false, off_loc=off_loc@entry=0x1d1b3fc) at
pruneheap.c:225
225     in pruneheap.c

(gdb) p *vacrel
$3 = {rel = 0x7f0349466d28, indrels = 0x1d1b500, nindexes = 1, do_index_vacuuming = true, do_index_cleanup = true,
do_failsafe= false, bstrategy = 0x1c77400, lps = 0x0, old_rel_pages = 80, 
 
  old_live_tuples = 1101, relfrozenxid = 909081649, relminmxid = 53341561, OldestXmin = 913730329, FreezeLimit =
863730329,MultiXactCutoff = 48553302, relnamespace = 0x1d1b520 "pg_catalog", 
 
  relname = 0x1d1b548 "pg_statistic", indname = 0x0, blkno = 75, offnum = 15, phase = VACUUM_ERRCB_PHASE_SCAN_HEAP,
dead_tuples= 0x1ccef10, rel_pages = 85, scanned_pages = 76, 
 
  pinskipped_pages = 0, frozenskipped_pages = 0, tupcount_pages = 76, pages_removed = 0, lpdead_item_pages = 65,
nonempty_pages= 75, lock_waiter_detected = false, new_rel_tuples = 0, 
 
  new_live_tuples = 0, indstats = 0x1d1b590, num_index_scans = 0, tuples_deleted = 757, lpdead_items = 1103,
new_dead_tuples= 0, num_tuples = 973, live_tuples = 973}
 

(gdb) p *rel
$2 = {rd_node = {spcNode = 1663, dbNode = 16886, relNode = 107230415}, rd_smgr = 0x1d0a670, rd_refcnt = 1, rd_backend =
-1,rd_islocaltemp = false, rd_isnailed = false, rd_isvalid = true, 
 
  rd_indexvalid = true, rd_statvalid = false, rd_createSubid = 0, rd_newRelfilenodeSubid = 0, rd_firstRelfilenodeSubid
=0, rd_droppedSubid = 0, rd_rel = 0x7f0349466f40, rd_att = 0x7f0349467058, 
 
  rd_id = 2619, rd_lockInfo = {lockRelId = {relId = 2619, dbId = 16886}}, rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc =
0x0,rd_rsdesc = 0x0, rd_fkeylist = 0x0, rd_fkeyvalid = false, 
 
  rd_partkey = 0x0, rd_partkeycxt = 0x0, rd_partdesc = 0x0, rd_pdcxt = 0x0, rd_partdesc_nodetached = 0x0, rd_pddcxt =
0x0,rd_partdesc_nodetached_xmin = 0, rd_partcheck = 0x0, 
 
  rd_partcheckvalid = false, rd_partcheckcxt = 0x0, rd_indexlist = 0x7f0349498e40, rd_pkindex = 2696, rd_replidindex =
0,rd_statlist = 0x0, rd_indexattr = 0x7f0349498ee8, 
 
  rd_keyattr = 0x7f0349498e98, rd_pkattr = 0x7f0349498ec0, rd_idattr = 0x0, rd_pubactions = 0x0, rd_options = 0x0,
rd_amhandler= 3, rd_tableam = 0x9ccfc0 <heapam_methods>, rd_index = 0x0, 
 
  rd_indextuple = 0x0, rd_indexcxt = 0x0, rd_indam = 0x0, rd_opfamily = 0x0, rd_opcintype = 0x0, rd_support = 0x0,
rd_supportinfo= 0x0, rd_indoption = 0x0, rd_indexprs = 0x0, rd_indpred = 0x0, 
 
  rd_exclops = 0x0, rd_exclprocs = 0x0, rd_exclstrats = 0x0, rd_indcollation = 0x0, rd_opcoptions = 0x0, rd_amcache =
0x0,rd_fdwroutine = 0x0, rd_toastoid = 0, pgstat_info = 0x1c88940}
 

(gdb) p *relation->rd_rel
$8 = {oid = 2619, relname = {data = "pg_statistic", '\000' <repeats 51 times>}, relnamespace = 11, reltype = 13029,
reloftype= 0, relowner = 10, relam = 2, relfilenode = 107230415, 
 
  reltablespace = 0, relpages = 80, reltuples = 1101, relallvisible = 11, reltoastrelid = 2840, relhasindex = true,
relisshared= false, relpersistence = 112 'p', relkind = 114 'r', relnatts = 31, 
 
  relchecks = 0, relhasrules = false, relhastriggers = false, relhassubclass = false, relrowsecurity = false,
relforcerowsecurity= false, relispopulated = true, relreplident = 110 'n', 
 
  relispartition = false, relrewrite = 0, relfrozenxid = 909081649, relminmxid = 53341561}

(gdb) info locals
rel = 0x7f0349466d28
offnum = 4
maxoff = 21
itemid = 0x2aaab2089e24
tuple = {t_len = 564, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 75}, ip_posid = 4}, t_tableOid = 2619, t_data =
0x2aaab208bbc8}

(gdb) p *itemid
$2 = {lp_off = 7624, lp_flags = 1, lp_len = 564}

I'll leave the instance running for a little bit before restarting (or kill-9)
in case someone requests more info.

See also:
https://www.postgresql.org/message-id/2591376.1621196582@sss.pgh.pa.us

These commits may be relevant.

commit 3c3b8a4b26891892bccf3d220580a7f413c0b9ca
Author: Peter Geoghegan <pg@bowt.ie>
Date:   Wed Apr 7 08:47:15 2021 -0700

    Truncate line pointer array during VACUUM.

commit 7ab96cf6b312cfcd79cdc1a69c6bdb75de0ed30f
Author: Peter Geoghegan <pg@bowt.ie>
Date:   Tue Apr 6 07:49:39 2021 -0700

    Refactor lazy_scan_heap() loop.

commit 8523492d4e349c4714aa2ab0291be175a88cb4fc
Author: Peter Geoghegan <pg@bowt.ie>
Date:   Tue Apr 6 08:49:22 2021 -0700

    Remove tupgone special case from vacuumlazy.c.

commit dc7420c2c9274a283779ec19718d2d16323640c0
Author: Andres Freund <andres@anarazel.de>
Date:   Wed Aug 12 16:03:49 2020 -0700

    snapshot scalability: Don't compute global horizons while building snapshots.



pgsql-hackers by date:

Previous
From: Omar Kilani
Date:
Subject: Re: Strangeness with UNIQUE indexes and UTF-8
Next
From: Chapman Flack
Date:
Subject: Re: Strangeness with UNIQUE indexes and UTF-8