Re: Bug about drop index concurrently - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Bug about drop index concurrently
Date
Msg-id 20191022224953.pawkxvmpjwv42mf3@development
Whole thread Raw
In response to Re: Bug about drop index concurrently  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses 回复:Bug about drop index concurrently
List pgsql-hackers
On Fri, Oct 18, 2019 at 05:00:54PM +0200, Tomas Vondra wrote:
>Hi,
>
>I can trivially reproduce this - it's enough to create a master-standby
>setup, and then do this on the master
>
> CREATE TABLE t (a int, b int);
> INSERT INTO t SELECT i, i FROM generate_series(1,10000) s(i);
>
>and run pgbench with this script
>
> DROP INDEX CONCURRENTLY IF EXISTS t_a_idx;
> CREATE INDEX t_a_idx ON t(a);
>
>while on the standby there's another pgbench running this script
>
> EXPLAIN ANALYZE SELECT * FROM t WHERE a = 10000;
>
>and it fails pretty fast for me. With an extra assert(false) added to
>src/backend/access/common/relation.c I get a backtrace like this:
>
>   Program terminated with signal SIGABRT, Aborted.
>   #0  0x00007c32e458fe35 in raise () from /lib64/libc.so.6
>   Missing separate debuginfos, use: dnf debuginfo-install glibc-2.29-22.fc30.x86_64
>   (gdb) bt
>   #0  0x00007c32e458fe35 in raise () from /lib64/libc.so.6
>   #1  0x00007c32e457a895 in abort () from /lib64/libc.so.6
>   #2  0x0000000000a58579 in ExceptionalCondition (conditionName=0xacd9bc "!(0)", errorType=0xacd95b
"FailedAssertion",fileName=0xacd950 "relation.c", lineNumber=64) at assert.c:54
 
>   #3  0x000000000048d1bd in relation_open (relationId=38216, lockmode=1) at relation.c:64
>   #4  0x00000000005082e4 in index_open (relationId=38216, lockmode=1) at indexam.c:130
>   #5  0x000000000080ac3f in get_relation_info (root=0x21698b8, relationObjectId=16385, inhparent=false,
rel=0x220ce60)at plancat.c:196
 
>   #6  0x00000000008118c6 in build_simple_rel (root=0x21698b8, relid=1, parent=0x0) at relnode.c:292
>   #7  0x00000000007d485d in add_base_rels_to_query (root=0x21698b8, jtnode=0x2169478) at initsplan.c:114
>   #8  0x00000000007d48a3 in add_base_rels_to_query (root=0x21698b8, jtnode=0x21693e0) at initsplan.c:122
>   #9  0x00000000007d8fad in query_planner (root=0x21698b8, qp_callback=0x7ded97 <standard_qp_callback>,
qp_extra=0x7fffa4834f10)at planmain.c:168
 
>   #10 0x00000000007dc316 in grouping_planner (root=0x21698b8, inheritance_update=false, tuple_fraction=0) at
planner.c:2048
>   #11 0x00000000007da7ca in subquery_planner (glob=0x220d078, parse=0x2168f78, parent_root=0x0, hasRecursion=false,
tuple_fraction=0)at planner.c:1012
 
>   #12 0x00000000007d942c in standard_planner (parse=0x2168f78, cursorOptions=256, boundParams=0x0) at planner.c:406
>   #13 0x00000000007d91e8 in planner (parse=0x2168f78, cursorOptions=256, boundParams=0x0) at planner.c:275
>   #14 0x00000000008e1b0d in pg_plan_query (querytree=0x2168f78, cursorOptions=256, boundParams=0x0) at
postgres.c:878
>   #15 0x0000000000658683 in ExplainOneQuery (query=0x2168f78, cursorOptions=256, into=0x0, es=0x220cd90,
queryString=0x21407b8"explain analyze select * from t where a = 10000;", params=0x0, queryEnv=0x0) at explain.c:367
 
>   #16 0x0000000000658386 in ExplainQuery (pstate=0x220cc28, stmt=0x2141728, queryString=0x21407b8 "explain analyze
select* from t where a = 10000;", params=0x0, queryEnv=0x0, dest=0x220cb90) at explain.c:255
 
>   #17 0x00000000008ea218 in standard_ProcessUtility (pstmt=0x21425c0, queryString=0x21407b8 "explain analyze select *
fromt where a = 10000;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x220cb90,
 
>       completionTag=0x7fffa48355c0 "") at utility.c:675
>   #18 0x00000000008e9a45 in ProcessUtility (pstmt=0x21425c0, queryString=0x21407b8 "explain analyze select * from t
wherea = 10000;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x220cb90,
 
>       completionTag=0x7fffa48355c0 "") at utility.c:360
>   #19 0x00000000008e8a0c in PortalRunUtility (portal=0x219c278, pstmt=0x21425c0, isTopLevel=true,
setHoldSnapshot=true,dest=0x220cb90, completionTag=0x7fffa48355c0 "") at pquery.c:1175
 
>   #20 0x00000000008e871a in FillPortalStore (portal=0x219c278, isTopLevel=true) at pquery.c:1035
>   #21 0x00000000008e8075 in PortalRun (portal=0x219c278, count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x21efb90,altdest=0x21efb90, completionTag=0x7fffa48357b0 "") at pquery.c:765
 
>   #22 0x00000000008e207c in exec_simple_query (query_string=0x21407b8 "explain analyze select * from t where a =
10000;")at postgres.c:1215
 
>   #23 0x00000000008e636e in PostgresMain (argc=1, argv=0x216c600, dbname=0x216c4e0 "test", username=0x213c3f8 "user")
atpostgres.c:4236
 
>   #24 0x000000000083c71e in BackendRun (port=0x2165850) at postmaster.c:4437
>   #25 0x000000000083beef in BackendStartup (port=0x2165850) at postmaster.c:4128
>   #26 0x0000000000838313 in ServerLoop () at postmaster.c:1704
>   #27 0x0000000000837bbf in PostmasterMain (argc=3, argv=0x213a360) at postmaster.c:1377
>   #28 0x0000000000759643 in main (argc=3, argv=0x213a360) at main.c:228
>
>So my guess is the root cause is pretty simple - we close/unlock the
>indexes after completing the query, but then EXPLAIN tries to open it
>again when producing the explain plan.
>

D'oh! I've just looked at this issue more carefully, and I realize this
suggestion (that it's due to releasing a lock too early) is just bogus.
Sorry about the confusion :-(

In fact, I think you've been 100% correct in your analysis regarding the
root cause, which is that we don't realize there's a query on a standby,
using the index that we're dropping (and REINDEX CONCURRENTLY seems to
have exactly the same issue).

I've reproduced this on all releases back to 10, I suppose it affects
all releases with DROP INDEX CONCURRENTLY (I haven't tried, but I don't
see why it wouldn't).

I still think it's a bug, and we'll need to fix it somehow. Not sure
how, though.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



pgsql-hackers by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: stress test for parallel workers
Next
From: Michael Paquier
Date:
Subject: Re: Missing error_context_stack = NULL in AutoVacWorkerMain()