Re: Reply: Reply: Bug about drop index concurrently - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Reply: Reply: Bug about drop index concurrently |
Date | |
Msg-id | 20191023220422.n2ctyc4n7c2uq33m@development |
In response to | Reply: Reply: Bug about drop index concurrently ("李杰(慎追)" <adger.lj@alibaba-inc.com>) |
Responses | Reply: Reply: Reply: Bug about drop index concurrently |
List | pgsql-hackers |
On Wed, Oct 23, 2019 at 02:38:45PM +0800, 李杰(慎追) wrote:
>>
>>I'm a bit confused. You shouldn't see any crashes and/or core files in
>>this scenario, for two reasons. Firstly, I assume you're running a
>>regular build without asserts. Secondly, I had to add an extra assert
>>to trigger the failure. So what core are you talking about?
>>
>Sorry, I should explain it more clearly. I saw the core file because I
>modified the postgres source code and added an Assert to it.
>

OK

>>Also, it's not clear to me what do you mean by "bug in the standby" or
>>no lock in the drop index concurrently. Can you explain?
>>
>"bug in the standby" means that we built a master-slave instance, and
>when we executed a large number of queries on the standby, we executed
>'drop index concurrently' on the master and got an 'error' on the
>standby. It does not happen 100% of the time, but it does appear.
>"no lock in the drop index concurrently": I think this is because there
>are not enough advanced locks when executing 'drop index concurrently'.
>

OK, thanks for the clarification. Yes, it won't appear every time, it's
likely timing-sensitive (I'll explain in a minute).

>>Hmm, so you observe the issue with regular queries, not just EXPLAIN
>>ANALYZE?
>
>yeah, we have seen this error frequently.
>

That suggests you're doing a lot of 'drop index concurrently', right?

>>>Of course, we considered applying the method of waiting to detect the
>>>query lock on the master to the standby, but worried about affecting
>>>the standby application log delay, so we gave up that.
>>>
>>I don't understand? What method?
>>
>
>I analyzed this problem. To find its cause, I also executed 'drop index
>concurrently' and 'explain select * from xxx' on the master, but the
>bug did not appear as expected. So I went to analyze the source code.
>I found that there is a mechanism on the master such that when 'drop
>index concurrently' is executed, it waits until every transaction that
>saw the old index state has finished. The source code is as follows:
>
>WaitForLockers(heaplocktag, AccessExclusiveLock);
>
>Therefore, I think that if this method were also available on the
>standby, the error would not appear. But I worried about affecting the
>standby application log delay, so we gave up on that.
>

Yes, but we can't really do that, I'm afraid.

We certainly can't do that on the master, because we simply don't have
the necessary information about locks from the standby, and we really
don't want to have it, because with a busy standby that might be quite
a bit of data (plus the standby would have to wait for the master to
confirm each lock acquisition, I think, which seems pretty terrible).

On the standby, we don't really have any idea that there's a drop index
running - we only get information about AE locks, and a bunch of catalog
updates. I don't think we have a way to determine this is a drop index
in concurrent mode :-(

More precisely, the master sends information about AccessExclusiveLock
in the XLOG_STANDBY_LOCK WAL record (in the xl_standby_locks struct).
And when the standby replays that, it should acquire those locks.

For regular DROP INDEX we send this:

  rmgr: Standby ... desc: LOCK xid 8888 db 16384 rel 16385
  rmgr: Standby ... desc: LOCK xid 8888 db 16384 rel 20573
  ... catalog changes ...
  rmgr: Transaction ... desc: COMMIT 2019-10-23 22:42:27.950995 CEST;
      rels: base/16384/20573; inval msgs: catcache 32 catcache 7
      catcache 6 catcache 50 catcache 49 relcache 20573 relcache 16385
      snapshot 2608

while for DROP INDEX CONCURRENTLY we send this:

  rmgr: Heap ... desc: INPLACE ... catalog update
  rmgr: Standby ... desc: INVALIDATIONS ; inval msgs: catcache 32
      relcache 21288 relcache 16385
  rmgr: Heap ... desc: INPLACE ... catalog update
  rmgr: Standby ... desc: INVALIDATIONS ; inval msgs: catcache 32
      relcache 21288 relcache 16385
  rmgr: Standby ... desc: LOCK xid 10326 db 16384 rel 21288
  ... catalog updates ...
  rmgr: Transaction ... desc: COMMIT 2019-10-23 23:47:10.042568 CEST;
      rels: base/16384/21288; inval msgs: catcache 32 catcache 7
      catcache 6 catcache 50 catcache 49 relcache 21288 relcache 16385
      snapshot 2608

So there's just a single lock on the index, but no lock on the relation
itself (which makes sense, because for DROP INDEX CONCURRENTLY we don't
get an exclusive lock on the table).

I'm not quite familiar with this part of the code, but the SELECT
backends are clearly getting a stale list of indexes from relcache, and
try to open an index which was already removed by the redo. We do
acquire the lock on the index itself, but that's not sufficient :-(

Not sure how to fix this. I wonder if we could invalidate the relcache
for the relation at some point. Or maybe we could add additional
information to the WAL to make the redo wait for all lock waiters, just
like on the master. But that might be tricky because of deadlocks, and
because the redo could easily get "stuck" waiting for long-running
queries.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
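
The difference between the two WAL sequences quoted in the message can be checked mechanically. The following is only a rough sketch in Python, not anything from the PostgreSQL tree: it takes the record summaries exactly as quoted (with the "..." elisions left in place) and lists which relations a standby would be told to lock. The helper name locked_relations and the regular expression are made up for this illustration, and reading 16385 as the table and 20573 / 21288 as the indexes is an inference from the message, not something the records state directly.

```python
import re

# Record summaries copied from the message; "..." elisions are kept as-is.
DROP_INDEX = """\
rmgr: Standby ... desc: LOCK xid 8888 db 16384 rel 16385
rmgr: Standby ... desc: LOCK xid 8888 db 16384 rel 20573
"""

DROP_INDEX_CONCURRENTLY = """\
rmgr: Heap ... desc: INPLACE ... catalog update
rmgr: Standby ... desc: INVALIDATIONS ; inval msgs: catcache 32 relcache 21288 relcache 16385
rmgr: Heap ... desc: INPLACE ... catalog update
rmgr: Standby ... desc: INVALIDATIONS ; inval msgs: catcache 32 relcache 21288 relcache 16385
rmgr: Standby ... desc: LOCK xid 10326 db 16384 rel 21288
"""

# Hypothetical pattern for the "Standby ... LOCK" summary lines shown above.
LOCK_RE = re.compile(r"^rmgr: Standby .*desc: LOCK xid (\d+) db (\d+) rel (\d+)")


def locked_relations(wal_text):
    """Return the relation OIDs named in Standby LOCK records, in order."""
    rels = []
    for line in wal_text.splitlines():
        match = LOCK_RE.match(line)
        if match:
            rels.append(int(match.group(3)))
    return rels


TABLE_OID = 16385  # assumed to be the table, based on the message

for name, wal in [("DROP INDEX", DROP_INDEX),
                  ("DROP INDEX CONCURRENTLY", DROP_INDEX_CONCURRENTLY)]:
    rels = locked_relations(wal)
    print(name, "locks", rels, "| table locked:", TABLE_OID in rels)
```

With the regular DROP INDEX records the table appears among the locked relations; with the CONCURRENTLY records only the index does, which matches the observation that queries on the standby are never blocked on the table while the index goes away.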