Thread: Killing dead index tuples before they get vacuumed
I'm planning to tackle the problem of killing index tuples for dead rows during normal processing (ie, before VACUUM). We've noticed repeatedly that visits to dead heap rows consume a lot of time during indexscans of heavily-updated tables. This problem has been discussed before, so the general outline of the solution is apparent, but several design decisions remain to be made. Here are my present thoughts:

1. The basic idea is for index_getnext, when it retrieves a tuple that turns out to be invisible to the current transaction, to test whether the tuple is dead to *all* transactions; if so, tell the index AM to mark that index tuple as killed. Subsequently the index tuple will be ignored until it's finally vacuumed. (We cannot try to remove the index tuple immediately, because of concurrency issues; but not returning it out of the index AM during an indexscan should largely solve the performance problem.) Under normal circumstances the time window between "dead to my transaction" and "dead to all transactions" should not be very large, so this approach should not cause very many extra tuple-visibility tests to be performed.

2. The second visibility test is the same as VACUUM's: is the tuple committed dead (or never good) and older than any running transaction's xmin? To call HeapTupleSatisfiesVacuum we need an idea of the global xmin, but we surely don't want index_getnext calling GetOldestXmin() every time it does this. (Quite aside from the speed penalty, I'm worried about possible deadlocks due to having to grab SInvalLock there.) Instead I propose that we modify GetSnapshotData() to compute the current global xmin as a byproduct of its existing computation (which it can do almost for free) and stash that in a global variable someplace. index_getnext can then use the global variable to call HeapTupleSatisfiesVacuum. This will effectively mean that we do index-tuple killing on the basis of the global xmin as it stood when we started the current transaction. In some cases that might be a little out of date, but using an old xmin cannot cause incorrect behavior; at worst an index entry will survive a little longer than it really needs to.

3. What should the API to the index AMs look like? I propose adding two fields to the IndexScanDesc data structure:

	bool	kill_prior_tuple;		/* true if previously returned tuple is dead */
	bool	ignore_killed_tuples;	/* true to not return killed entries */

kill_prior_tuple is always set false during RelationGetIndexScan and at the start of index_getnext. It's set true when index_getnext detects a dead tuple and loops around to call the index AM again. So the index AM may interpret it as "kill the tuple you last returned, ie, the one indicated by currentItemData". Performing this action as part of amgetnext minimizes the extra overhead needed to kill a tuple --- we don't need an extra cycle of re-locking the current index page and re-finding our place.

ignore_killed_tuples will be set true in RelationGetIndexScan, but could be set false by callers that want to see the killed index tuples. (Offhand I think only VACUUM would want to do that.)

Within the index AMs, both kill_prior_tuple and ignore_killed_tuples would be examined only by the topmost amgetnext routine. A "killed" entry behaves normally with respect to all internal operations of the index AM; we just don't return it to callers when ignore_killed_tuples is true. This will minimize the risk of introducing bugs into the index AMs. As long as we can loop around for the next index tuple before we've released page locks inside the AM, we should get most of the possible performance benefit with just a localized change.

4. How exactly should a killed index tuple be marked on-disk? While there is one free bit available in IndexTupleData.t_info, I would prefer to use that bit to expand the index tuple size field to 14 bits instead of 13. (This would allow btree index entries to be up to 10K when BLCKSZ is 32K, rather than being hard-limited to 8K.) What I am thinking of doing is using the LP_DELETE bit in ItemIdData.lp_flags --- this appears to be unused for index tuples. (I'm not sure it's ever set for heap tuples either, actually, but it definitely looks free for index tuples.)

Whichever bit we use, the index AM can simply set it and mark the buffer dirty with SetBufferCommitInfoNeedsSave. We do not need to WAL-log the action, just as we do not WAL-log marking heap tuple commit status bits, because the action could be done over by someone else if it were lost.

Comments? Anyone see any flaws or better ideas?

			regards, tom lane
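[To make the proposed control flow concrete, here is a minimal standalone sketch of how the two IndexScanDesc flags and the kill loop in index_getnext could fit together. The stub routines, the simplified structs, and the RecentGlobalXmin variable name are illustrative stand-ins, not existing backend functions or the actual patch.]

#include <stdbool.h>
#include <stddef.h>

typedef unsigned int TransactionId;

typedef struct HeapTupleData
{
	int		dummy;					/* stand-in; the real struct is far richer */
} HeapTupleData;

typedef struct IndexScanDescData
{
	bool	kill_prior_tuple;		/* true if previously returned tuple is dead */
	bool	ignore_killed_tuples;	/* true to not return killed entries */
	/* ... existing fields such as currentItemData elided ... */
} IndexScanDescData;

typedef IndexScanDescData *IndexScanDesc;

/* global xmin stashed as a byproduct of GetSnapshotData() (assumed name) */
static TransactionId RecentGlobalXmin;

/* Stubs standing in for the real AM and visibility routines */
extern HeapTupleData *index_am_getnext(IndexScanDesc scan);
extern bool tuple_visible_to_my_snapshot(HeapTupleData *tup);
extern bool tuple_dead_to_all_transactions(HeapTupleData *tup,
										   TransactionId oldest_xmin);

/*
 * index_getnext: return the next heap tuple visible to our snapshot,
 * telling the AM to kill entries whose heap tuples are dead to everyone.
 */
HeapTupleData *
index_getnext(IndexScanDesc scan)
{
	scan->kill_prior_tuple = false;		/* reset at the start of each call */

	for (;;)
	{
		HeapTupleData *tup = index_am_getnext(scan);	/* AM consumes the flag */

		scan->kill_prior_tuple = false;

		if (tup == NULL)
			return NULL;				/* end of scan */

		if (tuple_visible_to_my_snapshot(tup))
			return tup;					/* the normal case */

		/*
		 * Invisible to us: apply the same test VACUUM would, using the
		 * global xmin captured when our snapshot was taken.  If the tuple
		 * is dead to all transactions, ask the AM to kill the index entry
		 * it just returned when we loop around for the next one.
		 */
		if (tuple_dead_to_all_transactions(tup, RecentGlobalXmin))
			scan->kill_prior_tuple = true;
	}
}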
> 4. How exactly should a killed index tuple be marked on-disk? While there
> is one free bit available in IndexTupleData.t_info, I would prefer to use
> that bit to expand the index tuple size field to 14 bits instead of 13.
> (This would allow btree index entries to be up to 10K when BLCKSZ is 32K,
> rather than being hard-limited to 8K.)

While I agree that it might be handy to save this bit for future use, I do not see any value in increasing the max key length from 8k, especially when the new limit is then 10k. The limit is already 32 * the max key size of some other db's, and even those 256 bytes are usually sufficient.

Andreas
On Wed, 2002-05-22 at 12:28, Zeugswetter Andreas SB SD wrote:
> > 4. How exactly should a killed index tuple be marked on-disk? While there
> > is one free bit available in IndexTupleData.t_info, I would prefer to use
> > that bit to expand the index tuple size field to 14 bits instead of 13.
> > (This would allow btree index entries to be up to 10K when BLCKSZ is 32K,
> > rather than being hard-limited to 8K.)
>
> While I agree that it might be handy to save this bit for future use,
> I do not see any value in increasing the max key length from 8k,
> especially when the new limit is then 10k. The limit is already 32 *
> the max key size of some other db's, and even those 256 bytes are usually
> sufficient.

I'm not sure if it applies here, but key length for GIST indexes may benefit from 2x increase (14bits = 16k). IIRC limited key length is one reason for intarray indexes being 'lossy'.

And we can even make it bigger if we start measuring keys in words or dwords instead of bytes - 16k x dword = 64kb

--------------
Hannu
Hannu Krosing <hannu@tm.ee> writes:
> On Wed, 2002-05-22 at 12:28, Zeugswetter Andreas SB SD wrote:
>> While I agree that it might be handy to save this bit for future use,
>> I do not see any value in increasing the max key length from 8k,

> I'm not sure if it applies here, but key length for GIST indexes may
> benefit from 2x increase (14bits = 16k). IIRC limited key length is one
> reason for intarray indexes being 'lossy'.

Since there seems to be some dissension about that, I'll leave the t_info bit unused for now, instead of absorbing it into the length field.

Since 13 bits is sufficient for 8K, people would not see any benefit anyway unless they use a nonstandard BLCKSZ. So I'm not that concerned about raising it --- just wanted to throw out the idea and see if people liked it.

In the long run it'd be possible to not store length in IndexTupleData at all, but rely on the length from the item header, same as we do for heap tuples. So if we ever need more bits in IndexTupleData, there's a way out.

			regards, tom lane
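[For reference, the IndexTupleData.t_info layout under discussion looks roughly like this: 13 bits of size plus two flag bits, with one bit currently unassigned. The mask names and values below are paraphrased from itup.h as I read it and should be treated as an approximation, not a quotation.]

/*
 * Approximate sketch of the t_info bit layout (not copied verbatim):
 *
 *   bit 15           bit 14          bit 13       bits 12..0
 *   INDEX_NULL_MASK  INDEX_VAR_MASK  (free bit)   tuple size
 */
#define INDEX_SIZE_MASK  0x1FFF		/* 13-bit tuple length: max 8191 bytes */
#define INDEX_VAR_MASK   0x4000		/* tuple has variable-width attribute(s) */
#define INDEX_NULL_MASK  0x8000		/* tuple has null attribute(s) */

/*
 * 0x2000 is the free bit being discussed: absorbing it into the size
 * field would give a 14-bit length (max 16383), which only matters for
 * BLCKSZ larger than the default 8K.
 */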
Tom Lane wrote:
> Hannu Krosing <hannu@tm.ee> writes:
> > On Wed, 2002-05-22 at 12:28, Zeugswetter Andreas SB SD wrote:
> >> While I agree that it might be handy to save this bit for future use,
> >> I do not see any value in increasing the max key length from 8k,
>
> > I'm not sure if it applies here, but key length for GIST indexes may
> > benefit from 2x increase (14bits = 16k). IIRC limited key length is one
> > reason for intarray indexes being 'lossy'.
>
> Since there seems to be some dissension about that, I'll leave the
> t_info bit unused for now, instead of absorbing it into the length
> field.
>
> Since 13 bits is sufficient for 8K, people would not see any benefit
> anyway unless they use a nonstandard BLCKSZ. So I'm not that concerned
> about raising it --- just wanted to throw out the idea and see if people
> liked it.

Also, in btree haven't we had some problems with index page splits when using entries large enough so that not at least 3 of them fit on a page?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <janwieck@yahoo.com> writes:
> Also, in btree haven't we had some problems with index page
> splits when using entries large enough so that not at least
> 3 of them fit on a page?

Right, that's why I said that the limit would only go up to ~10K anyway; btree won't take keys > 1/3 page.

			regards, tom lane
Tom Lane wrote:
> Jan Wieck <janwieck@yahoo.com> writes:
> > Also, in btree haven't we had some problems with index page
> > splits when using entries large enough so that not at least
> > 3 of them fit on a page?
>
> Right, that's why I said that the limit would only go up to ~10K anyway;
> btree won't take keys > 1/3 page.

What's the point then? I mean, someone who needs more than 8K will outgrow 10K in no time, and those cases are topics for comp.databases.abuse.brutal ...

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <janwieck@yahoo.com> writes:
> Tom Lane wrote:
>> Right, that's why I said that the limit would only go up to ~10K anyway;
>> btree won't take keys > 1/3 page.

> What's the point then?

Well, btree's not the only index access method we have. I'm not sure whether gist or rtree allow larger tuples...

			regards, tom lane
On Tue, 21 May 2002 12:48:39 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>4. How exactly should a killed index tuple be marked on-disk? While there
>is one free bit available in IndexTupleData.t_info, I would prefer to use
>that bit to expand the index tuple size field to 14 bits instead of 13.
>(This would allow btree index entries to be up to 10K when BLCKSZ is 32K,
>rather than being hard-limited to 8K.) What I am thinking of doing is
>using the LP_DELETE bit in ItemIdData.lp_flags --- this appears to be
>unused for index tuples. (I'm not sure it's ever set for heap tuples
>either, actually, but it definitely looks free for index tuples.)

AFAICS LP_DELETE is not used at all. The only place where something seems to happen to it is in PageRepairFragmentation() in bufpage.c:

	if ((*lp).lp_flags & LP_DELETE) /* marked for deletion */
		(*lp).lp_flags &= ~(LP_USED | LP_DELETE);

but there is no place where this bit is set. There's also a macro definition in itemid.h:

	#define ItemIdDeleted(itemId) \
		(((itemId)->lp_flags & LP_DELETE) != 0)

which is *always* used in this context:

	if (!ItemIdIsUsed(lp) || ItemIdDeleted(lp))

So it looks safe to use this bit for marking dead tuples. Wouldn't it be even possible to simply reset LP_USED instead of setting LP_DELETE? If you do not use LP_DELETE, I'd vote for cleaning up the source and removing it completely.

Yet another idea: set ItemIdData.lp_len = 0 for killed index tuples. Will this free space be used by subsequent inserts?

Servus
 Manfred
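[A minimal standalone sketch of what marking a killed entry via the spare lp_flags bit could look like, using the ItemId macros Manfred cites. The struct layout mirrors itemid.h as I read it; ItemIdMarkDead() and item_should_be_returned() are hypothetical helpers for illustration, not functions in the source tree.]

#include <stdbool.h>

typedef struct ItemIdData		/* line pointer on a disk page */
{
	unsigned	lp_off:15,		/* offset of tuple within page */
				lp_flags:2,		/* LP_USED, LP_DELETE */
				lp_len:15;		/* length of tuple */
} ItemIdData;

typedef ItemIdData *ItemId;

#define LP_USED			0x01	/* this line pointer is in use */
#define LP_DELETE		0x02	/* spare bit: proposed "killed entry" marker */

#define ItemIdIsUsed(itemId)	(((itemId)->lp_flags & LP_USED) != 0)
#define ItemIdDeleted(itemId)	(((itemId)->lp_flags & LP_DELETE) != 0)

/* Hypothetical helper: what the AM would do when kill_prior_tuple is set */
static void
ItemIdMarkDead(ItemId itemId)
{
	itemId->lp_flags |= LP_DELETE;	/* item stays present, just flagged */
	/* caller would then mark the buffer dirty (no WAL record needed,
	 * since losing the hint just means someone sets it again later) */
}

/* Scan-side test: skip killed entries unless the caller wants to see them */
static bool
item_should_be_returned(ItemId itemId, bool ignore_killed_tuples)
{
	if (!ItemIdIsUsed(itemId))
		return false;
	if (ignore_killed_tuples && ItemIdDeleted(itemId))
		return false;
	return true;
}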
Manfred Koizar <mkoi-pg@aon.at> writes:
> So it looks safe to use this bit for marking dead tuples. Wouldn't it
> be even possible to simply reset LP_USED instead of setting
> LP_DELETE?

Mmmm ... I don't think so. That would cause the tuple to actually disappear from the perspective of the index AM internals, which seems like a bad idea. (For example, if another backend had an indexscan stopped on that same tuple, it would fail to re-find its place when it tried to continue the indexscan.)

> Yet another idea: set ItemIdData.lp_len = 0 for killed index tuples.

See above. This is *not* a substitute for vacuuming.

			regards, tom lane