pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic - Mailing list pgsql-hackers
From | Justin Pryzby |
---|---|
Subject | pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic |
Date | |
Msg-id | 20210606163531.GT14099@telsasoft.com Whole thread Raw |
Responses |
Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic |
List | pgsql-hackers |
An internal instance was rejecting connections with "too many clients". I found a bunch of processes waiting on a futex and I was going to upgrade the kernel (3.10.0-514) and dismiss the issue. However, I also found an autovacuum chewing 100% CPU, and it appears the problem is actually because autovacuum has locked a page of pg-statistic, and every other process then gets stuck waiting in the planner. I checked a few and found these: #13 0x0000000000961908 in SearchSysCache3 (cacheId=cacheId@entry=59, key1=key1@entry=2610, key2=key2@entry=2, key3=key3@entry=0)at syscache.c:1156 As for the autovacuum: $ ps -wwf 18950 UID PID PPID C STIME TTY STAT TIME CMD postgres 18950 7179 93 Jun04 ? ts 2049:20 postgres: autovacuum worker ts (gdb) #0 0x00000000004f995c in heap_prune_satisfies_vacuum (prstate=prstate@entry=0x7ffe7a0cd0c0, tup=tup@entry=0x7ffe7a0cce50,buffer=buffer@entry=14138) at pruneheap.c:423 #1 0x00000000004fa154 in heap_prune_chain (prstate=0x7ffe7a0cd0c0, rootoffnum=11, buffer=14138) at pruneheap.c:644 #2 heap_page_prune (relation=relation@entry=0x7f0349466d28, buffer=buffer@entry=14138, vistest=vistest@entry=0xe7bcc0 <GlobalVisCatalogRels>,old_snap_xmin=old_snap_xmin@entry=0, old_snap_ts=old_snap_ts@entry=0, report_stats=report_stats@entry=false, off_loc=<optimized out>, off_loc@entry=0x1d1b3fc)at pruneheap.c:278 #3 0x00000000004fd9bf in lazy_scan_prune (vacrel=vacrel@entry=0x1d1b390, buf=buf@entry=14138, blkno=blkno@entry=75, page=page@entry=0x2aaab2089e00"G\f", vistest=vistest@entry=0xe7bcc0 <GlobalVisCatalogRels>, prunestate=prunestate@entry=0x7ffe7a0ced80) at vacuumlazy.c:1712 #4 0x0000000000500263 in lazy_scan_heap (aggressive=<optimized out>, params=0x1c77b7c, vacrel=<optimized out>) at vacuumlazy.c:1347 #5 heap_vacuum_rel (rel=0x7f0349466d28, params=0x1c77b7c, bstrategy=<optimized out>) at vacuumlazy.c:612 #6 0x000000000067418a in table_relation_vacuum (bstrategy=<optimized out>, params=0x1c77b7c, rel=0x7f0349466d28) at ../../../src/include/access/tableam.h:1678 #7 vacuum_rel (relid=2619, relation=<optimized out>, params=params@entry=0x1c77b7c) at vacuum.c:2001 #8 0x000000000067556e in vacuum (relations=0x1cc5008, params=params@entry=0x1c77b7c, bstrategy=<optimized out>, bstrategy@entry=0x1c77400,isTopLevel=isTopLevel@entry=true) at vacuum.c:461 #9 0x0000000000783c13 in autovacuum_do_vac_analyze (bstrategy=0x1c77400, tab=0x1c77b78) at autovacuum.c:3284 #10 do_autovacuum () at autovacuum.c:2537 #11 0x0000000000784073 in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1715 #12 0x00000000007841c9 in StartAutoVacWorker () at autovacuum.c:1500 #13 0x0000000000792b9c in StartAutovacuumWorker () at postmaster.c:5547 #14 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5251 #15 <signal handler called> #16 0x00007f0346c56783 in __select_nocancel () from /lib64/libc.so.6 #17 0x000000000048ee7d in ServerLoop () at postmaster.c:1709 #18 0x0000000000793e98 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1bed280) at postmaster.c:1417 #19 0x0000000000491272 in main (argc=3, argv=0x1bed280) at main.c:209 heap_page_prune() is being called repeatedly, with (I think) the same arguments. (gdb) c Continuing. Breakpoint 3, heap_page_prune (relation=relation@entry=0x7f0349466d28, buffer=buffer@entry=14138, vistest=vistest@entry=0xe7bcc0<GlobalVisCatalogRels>, old_snap_xmin=old_snap_xmin@entry=0, old_snap_ts=old_snap_ts@entry=0, report_stats=report_stats@entry=false, off_loc=off_loc@entry=0x1d1b3fc) at pruneheap.c:225 225 in pruneheap.c (gdb) Continuing. Breakpoint 3, heap_page_prune (relation=relation@entry=0x7f0349466d28, buffer=buffer@entry=14138, vistest=vistest@entry=0xe7bcc0<GlobalVisCatalogRels>, old_snap_xmin=old_snap_xmin@entry=0, old_snap_ts=old_snap_ts@entry=0, report_stats=report_stats@entry=false, off_loc=off_loc@entry=0x1d1b3fc) at pruneheap.c:225 225 in pruneheap.c (gdb) Continuing. Breakpoint 3, heap_page_prune (relation=relation@entry=0x7f0349466d28, buffer=buffer@entry=14138, vistest=vistest@entry=0xe7bcc0<GlobalVisCatalogRels>, old_snap_xmin=old_snap_xmin@entry=0, old_snap_ts=old_snap_ts@entry=0, report_stats=report_stats@entry=false, off_loc=off_loc@entry=0x1d1b3fc) at pruneheap.c:225 225 in pruneheap.c (gdb) Continuing. Breakpoint 3, heap_page_prune (relation=relation@entry=0x7f0349466d28, buffer=buffer@entry=14138, vistest=vistest@entry=0xe7bcc0<GlobalVisCatalogRels>, old_snap_xmin=old_snap_xmin@entry=0, old_snap_ts=old_snap_ts@entry=0, report_stats=report_stats@entry=false, off_loc=off_loc@entry=0x1d1b3fc) at pruneheap.c:225 225 in pruneheap.c (gdb) p *vacrel $3 = {rel = 0x7f0349466d28, indrels = 0x1d1b500, nindexes = 1, do_index_vacuuming = true, do_index_cleanup = true, do_failsafe= false, bstrategy = 0x1c77400, lps = 0x0, old_rel_pages = 80, old_live_tuples = 1101, relfrozenxid = 909081649, relminmxid = 53341561, OldestXmin = 913730329, FreezeLimit = 863730329,MultiXactCutoff = 48553302, relnamespace = 0x1d1b520 "pg_catalog", relname = 0x1d1b548 "pg_statistic", indname = 0x0, blkno = 75, offnum = 15, phase = VACUUM_ERRCB_PHASE_SCAN_HEAP, dead_tuples= 0x1ccef10, rel_pages = 85, scanned_pages = 76, pinskipped_pages = 0, frozenskipped_pages = 0, tupcount_pages = 76, pages_removed = 0, lpdead_item_pages = 65, nonempty_pages= 75, lock_waiter_detected = false, new_rel_tuples = 0, new_live_tuples = 0, indstats = 0x1d1b590, num_index_scans = 0, tuples_deleted = 757, lpdead_items = 1103, new_dead_tuples= 0, num_tuples = 973, live_tuples = 973} (gdb) p *rel $2 = {rd_node = {spcNode = 1663, dbNode = 16886, relNode = 107230415}, rd_smgr = 0x1d0a670, rd_refcnt = 1, rd_backend = -1,rd_islocaltemp = false, rd_isnailed = false, rd_isvalid = true, rd_indexvalid = true, rd_statvalid = false, rd_createSubid = 0, rd_newRelfilenodeSubid = 0, rd_firstRelfilenodeSubid =0, rd_droppedSubid = 0, rd_rel = 0x7f0349466f40, rd_att = 0x7f0349467058, rd_id = 2619, rd_lockInfo = {lockRelId = {relId = 2619, dbId = 16886}}, rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0,rd_rsdesc = 0x0, rd_fkeylist = 0x0, rd_fkeyvalid = false, rd_partkey = 0x0, rd_partkeycxt = 0x0, rd_partdesc = 0x0, rd_pdcxt = 0x0, rd_partdesc_nodetached = 0x0, rd_pddcxt = 0x0,rd_partdesc_nodetached_xmin = 0, rd_partcheck = 0x0, rd_partcheckvalid = false, rd_partcheckcxt = 0x0, rd_indexlist = 0x7f0349498e40, rd_pkindex = 2696, rd_replidindex = 0,rd_statlist = 0x0, rd_indexattr = 0x7f0349498ee8, rd_keyattr = 0x7f0349498e98, rd_pkattr = 0x7f0349498ec0, rd_idattr = 0x0, rd_pubactions = 0x0, rd_options = 0x0, rd_amhandler= 3, rd_tableam = 0x9ccfc0 <heapam_methods>, rd_index = 0x0, rd_indextuple = 0x0, rd_indexcxt = 0x0, rd_indam = 0x0, rd_opfamily = 0x0, rd_opcintype = 0x0, rd_support = 0x0, rd_supportinfo= 0x0, rd_indoption = 0x0, rd_indexprs = 0x0, rd_indpred = 0x0, rd_exclops = 0x0, rd_exclprocs = 0x0, rd_exclstrats = 0x0, rd_indcollation = 0x0, rd_opcoptions = 0x0, rd_amcache = 0x0,rd_fdwroutine = 0x0, rd_toastoid = 0, pgstat_info = 0x1c88940} (gdb) p *relation->rd_rel $8 = {oid = 2619, relname = {data = "pg_statistic", '\000' <repeats 51 times>}, relnamespace = 11, reltype = 13029, reloftype= 0, relowner = 10, relam = 2, relfilenode = 107230415, reltablespace = 0, relpages = 80, reltuples = 1101, relallvisible = 11, reltoastrelid = 2840, relhasindex = true, relisshared= false, relpersistence = 112 'p', relkind = 114 'r', relnatts = 31, relchecks = 0, relhasrules = false, relhastriggers = false, relhassubclass = false, relrowsecurity = false, relforcerowsecurity= false, relispopulated = true, relreplident = 110 'n', relispartition = false, relrewrite = 0, relfrozenxid = 909081649, relminmxid = 53341561} (gdb) info locals rel = 0x7f0349466d28 offnum = 4 maxoff = 21 itemid = 0x2aaab2089e24 tuple = {t_len = 564, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 75}, ip_posid = 4}, t_tableOid = 2619, t_data = 0x2aaab208bbc8} (gdb) p *itemid $2 = {lp_off = 7624, lp_flags = 1, lp_len = 564} I'll leave the instance running for a little bit before restarting (or kill-9) in case someone requests more info. See also: https://www.postgresql.org/message-id/2591376.1621196582@sss.pgh.pa.us These commits may be relevant. commit 3c3b8a4b26891892bccf3d220580a7f413c0b9ca Author: Peter Geoghegan <pg@bowt.ie> Date: Wed Apr 7 08:47:15 2021 -0700 Truncate line pointer array during VACUUM. commit 7ab96cf6b312cfcd79cdc1a69c6bdb75de0ed30f Author: Peter Geoghegan <pg@bowt.ie> Date: Tue Apr 6 07:49:39 2021 -0700 Refactor lazy_scan_heap() loop. commit 8523492d4e349c4714aa2ab0291be175a88cb4fc Author: Peter Geoghegan <pg@bowt.ie> Date: Tue Apr 6 08:49:22 2021 -0700 Remove tupgone special case from vacuumlazy.c. commit dc7420c2c9274a283779ec19718d2d16323640c0 Author: Andres Freund <andres@anarazel.de> Date: Wed Aug 12 16:03:49 2020 -0700 snapshot scalability: Don't compute global horizons while building snapshots.
pgsql-hackers by date: