Killing dead index tuples before they get vacuumed - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Killing dead index tuples before they get vacuumed |
Date | |
Msg-id | 21507.1021999719@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: Killing dead index tuples before they get vacuumed
|
List | pgsql-hackers |
I'm planning to tackle the problem of killing index tuples for dead rows during normal processing (ie, before VACUUM). We've noticed repeatedly that visits to dead heap rows consume a lot of time during indexscans of heavily-updated tables. This problem has been discussed before, so the general outline of the solution is apparent, but several design decisions remain to be made. Here are my present thoughts: 1. The basic idea is for index_getnext, when it retrieves a tuple that turns out to be invisible to the current transaction, to test whether the tuple is dead to *all* transactions; if so, tell the index AM to mark that index tuple as killed. Subsequently the index tuple will be ignored until it's finally vacuumed. (We cannot try to remove the index tuple immediately, because of concurrency issues; but not returning it out of the index AM during an indexscan should largely solve the performance problem.) Under normal circumstances the time window between "dead to my transaction" and "dead to all transactions" should not be very large, so this approach should not cause very many extra tuple-visibility tests to be performed. 2. The second visibility test is the same as VACUUM's: is the tuple committed dead (or never good) and older than any running transaction's xmin? To call HeapTupleSatisfiesVacuum we need an idea of the global xmin, but we surely don't want index_getnext calling GetOldestXmin() every time it does this. (Quite aside from the speed penalty, I'm worried about possible deadlocks due to having to grab SInvalLock there.) Instead I propose that we modify GetSnapshotData() to compute the current global xmin as a byproduct of its existing computation (which it can do almost for free) and stash that in a global variable someplace. index_getnext can then use the global variable to call HeapTupleSatisfiesVacuum. This will effectively mean that we do index-tuple killing on the basis of the global xmin as it stood when we started the current transaction. In some cases that might be a little out of date, but using an old xmin cannot cause incorrect behavior; at worst an index entry will survive a little longer than it really needs to. 3. What should the API to the index AMs look like? I propose adding two fields to the IndexScanDesc data structure: bool kill_prior_tuple; /* true if previously returned tuple is dead */ bool ignore_killed_tuples; /* true to not return killed entries */ kill_prior_tuple is always set false during RelationGetIndexScan and at the start of index_getnext. It's set true when index_getnext detects a dead tuple and loops around to call the index AM again. So the index AM may interpret it as "kill the tuple you last returned, ie, the one indicated by currentItemData". Performing this action as part of amgetnext minimizes the extra overhead needed to kill a tuple --- we don't need an extra cycle of re-locking the current index page and re-finding our place. ignore_killed_tuples will be set true in RelationGetIndexScan, but could be set false by callers that want to see the killed index tuples. (Offhand I think only VACUUM would want to do that.) Within the index AMs, both kill_prior_tuple and ignore_killed_tuples would be examined only by the topmost amgetnext routine. A "killed" entry behaves normally with respect to all internal operations of the index AM; we just don't return it to callers when ignore_killed_tuples is true. This will minimize the risk of introducing bugs into the index AMs. As long as we can loop around for the next index tuple before we've released page locks inside the AM, we should get most of the possible performance benefit with just a localized change. 4. How exactly should a killed index tuple be marked on-disk? While there is one free bit available in IndexTupleData.t_info, I would prefer to use that bit to expand the index tuple size field to 14 bits instead of 13. (This would allow btree index entries to be up to 10K when BLCKSZ is 32K, rather than being hard-limited to 8K.) What I am thinking of doing is using the LP_DELETE bit in ItemIdData.lp_flags --- this appears to be unused for index tuples. (I'm not sure it's ever set for heap tuples either, actually, but it definitely looks free for index tuples.) Whichever bit we use, the index AM can simply set it and mark the buffer dirty with SetBufferCommitInfoNeedsSave. We do not need to WAL-log the action, just as we do not WAL-log marking heap tuple commit status bits, because the action could be done over by someone else if it were lost. Comments? Anyone see any flaws or better ideas? regards, tom lane
pgsql-hackers by date: