Re: PANIC: wrong buffer passed to visibilitymap_clear

From: Tom Lane
Subject: Re: PANIC: wrong buffer passed to visibilitymap_clear
Msg-id: 2644575.1618117467@sss.pgh.pa.us
In response to: Re: PANIC: wrong buffer passed to visibilitymap_clear  (Andres Freund <andres@anarazel.de>)
Responses: Re: PANIC: wrong buffer passed to visibilitymap_clear
List: pgsql-hackers
I've managed to reproduce this locally, by dint of running the
src/bin/scripts tests over and over and tweaking the timing by
trying different "taskset" parameters to vary the number of CPUs
available.  I found that I could duplicate the report from spurfowl,
specifically:

(gdb) bt
#0  0x00007f67bb6807d5 in raise () from /lib64/libc.so.6
#1  0x00007f67bb669895 in abort () from /lib64/libc.so.6
#2  0x000000000094ce37 in errfinish (filename=<optimized out>,
    lineno=<optimized out>,
    funcname=0x9ac120 <__func__.1> "visibilitymap_clear") at elog.c:680
#3  0x0000000000488b8c in visibilitymap_clear (rel=rel@entry=0x7f67b2837330,
    heapBlk=<optimized out>, buf=buf@entry=0, flags=flags@entry=3 '\003')
                             ^^^^^^^^^^^^^^^
    at visibilitymap.c:155
#4  0x000000000055cd87 in heap_update (relation=0x7f67b2837330,
    otid=0x7f67b274744c, newtup=0x7f67b2747448, cid=<optimized out>,
    crosscheck=<optimized out>, wait=<optimized out>, tmfd=0x7ffecf4d5700,
    lockmode=0x7ffecf4d56fc) at heapam.c:3993
#5  0x000000000055dd61 in simple_heap_update (
    relation=relation@entry=0x7f67b2837330, otid=otid@entry=0x7f67b274744c,
    tup=tup@entry=0x7f67b2747448) at heapam.c:4211
#6  0x00000000005e531c in CatalogTupleUpdate (heapRel=0x7f67b2837330,
    otid=0x7f67b274744c, tup=0x7f67b2747448) at indexing.c:309
#7  0x00000000006420f9 in update_attstats (relid=1255, inh=false,
    natts=natts@entry=30, vacattrstats=vacattrstats@entry=0x19c9fc0)
    at analyze.c:1758
#8  0x00000000006430dd in update_attstats (vacattrstats=0x19c9fc0, natts=30,
    inh=false, relid=<optimized out>) at analyze.c:1646
#9  do_analyze_rel (onerel=<optimized out>, params=0x7ffecf4d5e50,
    va_cols=0x0, acquirefunc=<optimized out>, relpages=86,
    inh=<optimized out>, in_outer_xact=false, elevel=13) at analyze.c:589
#10 0x00000000006447a1 in analyze_rel (relid=<optimized out>,
    relation=<optimized out>, params=params@entry=0x7ffecf4d5e50, va_cols=0x0,
    in_outer_xact=<optimized out>, bstrategy=<optimized out>) at analyze.c:261
#11 0x00000000006a5718 in vacuum (relations=0x19c8158, params=0x7ffecf4d5e50,
    bstrategy=<optimized out>, isTopLevel=<optimized out>) at vacuum.c:478
#12 0x00000000006a5c94 in ExecVacuum (pstate=pstate@entry=0x1915970,
    vacstmt=vacstmt@entry=0x18ed5c8, isTopLevel=isTopLevel@entry=true)
    at vacuum.c:254
#13 0x000000000083c32c in standard_ProcessUtility (pstmt=0x18ed918,
    queryString=0x18eca20 "ANALYZE pg_catalog.pg_proc;",
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
    dest=0x18eda08, qc=0x7ffecf4d61c0) at utility.c:826
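
For context, frame 3 is the sanity check near the top of
visibilitymap_clear.  Paraphrasing from visibilitymap.c (the comment
is mine), the guard at line 155 is approximately

    BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);

    /*
     * The caller must pass in a pin on the VM page covering heapBlk;
     * buf = 0 is InvalidBuffer, so the first test fails outright.
     */
    if (!BufferIsValid(buf) || BufferGetBlockNumber(buf) != mapBlock)
        elog(ERROR, "wrong buffer passed to visibilitymap_clear");

and since heap_update makes this call inside a critical section, the
elog(ERROR) is promoted to PANIC, whence the abort() at the top of
the trace.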

I'd not paid much attention to that point before, but now it
seems there is no question that heap_update is reaching line 3993

        visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
                            vmbuffer, VISIBILITYMAP_VALID_BITS);

without having changed "vmbuffer" from its initial value of
InvalidBuffer (which is zero, hence the "buf=0" in frame 3).  It looks
that way at both frame 3 and frame 4:

(gdb) f 4
#4  0x000000000055cd87 in heap_update (relation=0x7f67b2837330,
    otid=0x7f67b274744c, newtup=0x7f67b2747448, cid=<optimized out>,
    crosscheck=<optimized out>, wait=<optimized out>, tmfd=0x7ffecf4d5700,
    lockmode=0x7ffecf4d56fc) at heapam.c:3993
3993                    visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
(gdb) i locals
...
vmbuffer = 0
vmbuffer_new = 0
...

It is also hard to doubt that somebody broke this in the last-minute
commit blizzard.  There are no reports of this PANIC in the buildfarm for
the last month, but we're now up to four (last I checked) since Thursday.

While the first thing that comes to mind is a logic bug in heap_update
itself, that code doesn't seem to have changed much in the last few days.
Moreover, why is it that this only seems to be happening within
do_analyze_rel -> update_attstats?  (We only have two stack traces
positively showing that, but all the buildfarm reports look like the
failure is happening within manual or auto analyze of a system catalog.
Fishy as heck.)

Just eyeing the evidence on hand, I'm wondering if something has decided
it can start setting the page-all-visible bit without adequate lock,
perhaps only in system catalogs.  heap_update is clearly assuming that
that flag won't change underneath it, and if it did, it's clear how this
symptom would ensue.
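
For reference, the control flow in heap_update is roughly as follows
(heavily abridged from heapam.c; the comments are mine).  The VM page
gets pinned only when the all-visible flag is observed to be set, and
the critical section trusts that observation:

    Buffer      vmbuffer = InvalidBuffer;

    /* Pin the VM page first if the page already looks all-visible */
    if (PageIsAllVisible(page))
        visibilitymap_pin(relation, block, &vmbuffer);

    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

    /*
     * Recheck after locking: if the bit got set while we weren't
     * looking, unlock, pin, and re-lock.  Past this point the code
     * assumes the flag cannot become newly set, i.e. that setters
     * hold this buffer's lock.
     */
    if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
    {
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        visibilitymap_pin(relation, block, &vmbuffer);
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    }

    /* ... much later, at heapam.c:3993 ... */
    START_CRIT_SECTION();
    ...
    if (PageIsAllVisible(page))
    {
        all_visible_cleared = true;
        PageClearAllVisible(page);
        /* PANICs if the flag appeared without vmbuffer being pinned */
        visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
                            vmbuffer, VISIBILITYMAP_VALID_BITS);
    }

On that reading, a stray setter of PD_ALL_VISIBLE that doesn't hold
the buffer's content lock would let the final PageIsAllVisible test
succeed with vmbuffer still InvalidBuffer, which is exactly what the
trace shows.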

Too tired to take it further tonight ... discuss among yourselves.

            regards, tom lane


