Re: PANIC: wrong buffer passed to visibilitymap_clear - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: PANIC: wrong buffer passed to visibilitymap_clear
Msg-id: 2644575.1618117467@sss.pgh.pa.us
In response to: Re: PANIC: wrong buffer passed to visibilitymap_clear (Andres Freund <andres@anarazel.de>)
Responses: Re: PANIC: wrong buffer passed to visibilitymap_clear
List: pgsql-hackers
I've managed to reproduce this locally, by dint of running the src/bin/scripts tests over and over and tweaking the timing by trying different "taskset" parameters to vary the number of CPUs available. I find that I duplicated the report from spurfowl, particularly

(gdb) bt
#0  0x00007f67bb6807d5 in raise () from /lib64/libc.so.6
#1  0x00007f67bb669895 in abort () from /lib64/libc.so.6
#2  0x000000000094ce37 in errfinish (filename=<optimized out>,
    lineno=<optimized out>,
    funcname=0x9ac120 <__func__.1> "visibilitymap_clear") at elog.c:680
#3  0x0000000000488b8c in visibilitymap_clear (
    rel=rel@entry=0x7f67b2837330, heapBlk=<optimized out>,
    buf=buf@entry=0, flags=flags@entry=3 '\003')
    ^^^^^^^^^^^^^^^
    at visibilitymap.c:155
#4  0x000000000055cd87 in heap_update (relation=0x7f67b2837330,
    otid=0x7f67b274744c, newtup=0x7f67b2747448, cid=<optimized out>,
    crosscheck=<optimized out>, wait=<optimized out>,
    tmfd=0x7ffecf4d5700, lockmode=0x7ffecf4d56fc) at heapam.c:3993
#5  0x000000000055dd61 in simple_heap_update (
    relation=relation@entry=0x7f67b2837330,
    otid=otid@entry=0x7f67b274744c, tup=tup@entry=0x7f67b2747448)
    at heapam.c:4211
#6  0x00000000005e531c in CatalogTupleUpdate (heapRel=0x7f67b2837330,
    otid=0x7f67b274744c, tup=0x7f67b2747448) at indexing.c:309
#7  0x00000000006420f9 in update_attstats (relid=1255, inh=false,
    natts=natts@entry=30, vacattrstats=vacattrstats@entry=0x19c9fc0)
    at analyze.c:1758
#8  0x00000000006430dd in update_attstats (vacattrstats=0x19c9fc0,
    natts=30, inh=false, relid=<optimized out>) at analyze.c:1646
#9  do_analyze_rel (onerel=<optimized out>, params=0x7ffecf4d5e50,
    va_cols=0x0, acquirefunc=<optimized out>, relpages=86,
    inh=<optimized out>, in_outer_xact=false, elevel=13)
    at analyze.c:589
#10 0x00000000006447a1 in analyze_rel (relid=<optimized out>,
    relation=<optimized out>, params=params@entry=0x7ffecf4d5e50,
    va_cols=0x0, in_outer_xact=<optimized out>,
    bstrategy=<optimized out>) at analyze.c:261
#11 0x00000000006a5718 in vacuum (relations=0x19c8158,
    params=0x7ffecf4d5e50, bstrategy=<optimized out>,
    isTopLevel=<optimized out>) at vacuum.c:478
#12 0x00000000006a5c94 in ExecVacuum (pstate=pstate@entry=0x1915970,
    vacstmt=vacstmt@entry=0x18ed5c8, isTopLevel=isTopLevel@entry=true)
    at vacuum.c:254
#13 0x000000000083c32c in standard_ProcessUtility (pstmt=0x18ed918,
    queryString=0x18eca20 "ANALYZE pg_catalog.pg_proc;",
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
    dest=0x18eda08, qc=0x7ffecf4d61c0) at utility.c:826

I'd not paid much attention to that point before, but now it seems there is no question that heap_update is reaching line 3993

    visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
                        vmbuffer, VISIBILITYMAP_VALID_BITS);

without having changed "vmbuffer" from its initial value of InvalidBuffer. It looks that way both at frame 3 and frame 4:

(gdb) f 4
#4  0x000000000055cd87 in heap_update (relation=0x7f67b2837330,
    otid=0x7f67b274744c, newtup=0x7f67b2747448, cid=<optimized out>,
    crosscheck=<optimized out>, wait=<optimized out>,
    tmfd=0x7ffecf4d5700, lockmode=0x7ffecf4d56fc) at heapam.c:3993
3993            visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
(gdb) i locals
...
vmbuffer = 0
vmbuffer_new = 0
...

It is also hard to doubt that somebody broke this in the last-minute commit blizzard. There are no reports of this PANIC in the buildfarm for the last month, but we're now up to four (last I checked) since Thursday.

While the first thing that comes to mind is a logic bug in heap_update itself, that code doesn't seem to have changed much in the last few days.
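For readers without heapam.c in their heads, here is a hedged sketch of the code shape at issue, condensed and paraphrased from PostgreSQL's heapam.c and visibilitymap.c: the function and macro names are PostgreSQL's, but the wrapper function, its arguments, and the elisions are this sketch's invention, not verbatim source.

    #include "postgres.h"
    #include "access/visibilitymap.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "utils/rel.h"

    /* Condensed rendition of heap_update()'s visibility-map handling. */
    static void
    vm_handling_sketch(Relation relation, BlockNumber block)
    {
        Buffer      buffer;
        Buffer      vmbuffer = InvalidBuffer;   /* i.e. 0, as in the trace */
        Page        page;

        buffer = ReadBuffer(relation, block);
        page = BufferGetPage(buffer);

        /* The vm page is pinned only if the heap page looks all-visible. */
        if (PageIsAllVisible(page))
            visibilitymap_pin(relation, block, &vmbuffer);

        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

        /* ... tuple update work, WAL logging, etc. elided ... */

        START_CRIT_SECTION();
        if (PageIsAllVisible(page))
        {
            PageClearAllVisible(page);

            /*
             * visibilitymap_clear() raises "wrong buffer passed to
             * visibilitymap_clear" unless vmbuffer is the pinned vm page
             * covering "block", and inside a critical section that
             * elog(ERROR) is promoted to PANIC.  Reaching here with
             * vmbuffer still InvalidBuffer is exactly the observed crash.
             */
            visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
                                vmbuffer, VISIBILITYMAP_VALID_BITS);
        }
        END_CRIT_SECTION();

        UnlockReleaseBuffer(buffer);
        if (BufferIsValid(vmbuffer))
            ReleaseBuffer(vmbuffer);
    }

So for the crash to occur, the PageIsAllVisible() test guarding visibilitymap_pin() must have found the bit clear while the one guarding visibilitymap_clear() found it set.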
Moreover, why is it that this only seems to be happening within do_analyze_rel -> update_attstats? (We only have two stack traces positively showing that, but all the buildfarm reports look like the failure is happening within manual or auto analyze of a system catalog. Fishy as heck.)

Just eyeing the evidence on hand, I'm wondering if something has decided it can start setting the page-all-visible bit without adequate lock, perhaps only in system catalogs. heap_update is clearly assuming that that flag won't change underneath it, and if it did, it's clear how this symptom would ensue.
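Spelled out as a concrete interleaving, purely as an illustration: the second actor and its missing lock are the hypothesis above, not something the traces prove.

    /*
     * Hypothesized interleaving (illustration only):
     *
     *   heap_update backend                hypothesized other actor
     *   --------------------------------   -------------------------------
     *   PageIsAllVisible(page) -> false
     *     => visibilitymap_pin() skipped,
     *        vmbuffer stays InvalidBuffer
     *                                      sets PD_ALL_VISIBLE on the same
     *                                      heap page without the buffer
     *                                      lock that would exclude it
     *   START_CRIT_SECTION()
     *   PageIsAllVisible(page) -> true
     *   visibilitymap_clear() called with
     *   vmbuffer still InvalidBuffer
     *     => elog(ERROR) promoted to PANIC
     */

Under the locking rules heap_update relies on, the right-hand column should be impossible, which is why a path playing fast and loose with the bit would explain both the symptom and its odd localization to catalog analyzes.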
Too tired to take it further tonight ... discuss among yourselves.

			regards, tom lane