Thread: Killing dead index tuples before they get vacuumed

Killing dead index tuples before they get vacuumed

From
Tom Lane
Date:
I'm planning to tackle the problem of killing index tuples for dead rows
during normal processing (ie, before VACUUM).  We've noticed repeatedly
that visits to dead heap rows consume a lot of time during indexscans
of heavily-updated tables.  This problem has been discussed before,
so the general outline of the solution is apparent, but several design
decisions remain to be made.  Here are my present thoughts:

1. The basic idea is for index_getnext, when it retrieves a tuple that
turns out to be invisible to the current transaction, to test whether
the tuple is dead to *all* transactions; if so, tell the index AM to mark
that index tuple as killed.  Subsequently the index tuple will be ignored
until it's finally vacuumed.  (We cannot try to remove the index tuple
immediately, because of concurrency issues; but not returning it out of
the index AM during an indexscan should largely solve the performance
problem.)  Under normal circumstances the time window between "dead to
my transaction" and "dead to all transactions" should not be very large,
so this approach should not cause very many extra tuple-visibility tests
to be performed.

2. The second visibility test is the same as VACUUM's: is the tuple
committed dead (or never good) and older than any running transaction's
xmin?  To call HeapTupleSatisfiesVacuum we need an idea of the global
xmin, but we surely don't want index_getnext calling GetOldestXmin()
every time it does this.  (Quite aside from the speed penalty, I'm worried
about possible deadlocks due to having to grab SInvalLock there.)  Instead
I propose that we modify GetSnapshotData() to compute the current global
xmin as a byproduct of its existing computation (which it can do almost
for free) and stash that in a global variable someplace.  index_getnext
can then use the global variable to call HeapTupleSatisfiesVacuum.  This
will effectively mean that we do index-tuple killing on the basis of the
global xmin as it stood when we started the current transaction.  In some
cases that might be a little out of date, but using an old xmin cannot
cause incorrect behavior; at worst an index entry will survive a little
longer than it really needs to.

3. What should the API to the index AMs look like?  I propose adding
two fields to the IndexScanDesc data structure:

bool    kill_prior_tuple;  /* true if previously returned tuple is dead */
bool    ignore_killed_tuples; /* true to not return killed entries */

kill_prior_tuple is always set false during RelationGetIndexScan and at
the start of index_getnext.  It's set true when index_getnext detects
a dead tuple and loops around to call the index AM again.  So the index
AM may interpret it as "kill the tuple you last returned, ie, the one
indicated by currentItemData".  Performing this action as part of
amgetnext minimizes the extra overhead needed to kill a tuple --- we don't
need an extra cycle of re-locking the current index page and re-finding
our place.

ignore_killed_tuples will be set true in RelationGetIndexScan, but could
be set false by callers that want to see the killed index tuples.
(Offhand I think only VACUUM would want to do that.)

Within the index AMs, both kill_prior_tuple and ignore_killed_tuples would
be examined only by the topmost amgetnext routine.  A "killed" entry
behaves normally with respect to all internal operations of the index AM;
we just don't return it to callers when ignore_killed_tuples is true.
This will minimize the risk of introducing bugs into the index AMs.
As long as we can loop around for the next index tuple before we've
released page locks inside the AM, we should get most of the possible
performance benefit with just a localized change.

4. How exactly should a killed index tuple be marked on-disk?  While there
is one free bit available in IndexTupleData.t_info, I would prefer to use
that bit to expand the index tuple size field to 14 bits instead of 13.
(This would allow btree index entries to be up to 10K when BLCKSZ is 32K,
rather than being hard-limited to 8K.)  What I am thinking of doing is
using the LP_DELETE bit in ItemIdData.lp_flags --- this appears to be
unused for index tuples.  (I'm not sure it's ever set for heap tuples
either, actually, but it definitely looks free for index tuples.)

Whichever bit we use, the index AM can simply set it and mark the buffer
dirty with SetBufferCommitInfoNeedsSave.  We do not need to WAL-log the
action, just as we do not WAL-log marking heap tuple commit status bits,
because the action could be done over by someone else if it were lost.

Comments?  Anyone see any flaws or better ideas?
        regards, tom lane


Re: Killing dead index tuples before they get vacuumed

From
"Zeugswetter Andreas SB SD"

Date:
> 4. How exactly should a killed index tuple be marked on-disk? While there
> is one free bit available in IndexTupleData.t_info, I would prefer to use
> that bit to expand the index tuple size field to 14 bits instead of 13.
> (This would allow btree index entries to be up to 10K when BLCKSZ is 32K,
> rather than being hard-limited to 8K.)

While I agree that it might be handy to save this bit for future use,
I do not see any value in increasing the max key length beyond 8K,
especially when the new limit would only be 10K.  The limit is already
32 times the max key size of some other databases, and even those
256 bytes are usually sufficient.

Andreas


Re: Killing dead index tuples before they get vacuumed

From
Hannu Krosing
Date:
On Wed, 2002-05-22 at 12:28, Zeugswetter Andreas SB SD wrote:
> > 4. How exactly should a killed index tuple be marked on-disk? While there
> > is one free bit available in IndexTupleData.t_info, I would prefer to use
> > that bit to expand the index tuple size field to 14 bits instead of 13.
> > (This would allow btree index entries to be up to 10K when BLCKSZ is 32K,
> > rather than being hard-limited to 8K.)
> 
> While I agree that it might be handy to save this bit for future use,
> I do not see any value in increasing the max key length from 8k,
> especially when the new limit is then 10k. The limit is already 32 *
> the max key size of some other db's, and even those 256 bytes are usually 
> sufficient.

I'm not sure if it applies here, but key length for GIST indexes may
benefit from 2x increase (14bits = 16k). IIRC limited key length is one
reason for intarray indexes being 'lossy'.

And we could make it even bigger if we measured key length in words or
dwords instead of bytes: 16K dwords = 64KB.

--------------
Hannu




Re: Killing dead index tuples before they get vacuumed

From
Tom Lane
Date:
Hannu Krosing <hannu@tm.ee> writes:
> On Wed, 2002-05-22 at 12:28, Zeugswetter Andreas SB SD wrote:
>> While I agree that it might be handy to save this bit for future use,
>> I do not see any value in increasing the max key length from 8k,

> I'm not sure if it applies here, but key length for GIST indexes may
> benefit from 2x increase (14bits = 16k). IIRC limited key length is one
> reason for intarray indexes being 'lossy'.

Since there seems to be some dissension about that, I'll leave the
t_info bit unused for now, instead of absorbing it into the length
field.

Since 13 bits is sufficient for 8K, people would not see any benefit
anyway unless they use a nonstandard BLCKSZ.  So I'm not that concerned
about raising it --- just wanted to throw out the idea and see if people
liked it.

In the long run it'd be possible to not store length in IndexTupleData
at all, but rely on the length from the item header, same as we do for
heap tuples.  So if we ever need more bits in IndexTupleData, there's
a way out.
        regards, tom lane


Re: Killing dead index tuples before they get vacuumed

From
Jan Wieck
Date:
Tom Lane wrote:
> Hannu Krosing <hannu@tm.ee> writes:
> > On Wed, 2002-05-22 at 12:28, Zeugswetter Andreas SB SD wrote:
> >> While I agree that it might be handy to save this bit for future use,
> >> I do not see any value in increasing the max key length from 8k,
>
> > I'm not sure if it applies here, but key length for GIST indexes may
> > benefit from 2x increase (14bits = 16k). IIRC limited key length is one
> > reason for intarray indexes being 'lossy'.
>
> Since there seems to be some dissension about that, I'll leave the
> t_info bit unused for now, instead of absorbing it into the length
> field.
>
> Since 13 bits is sufficient for 8K, people would not see any benefit
> anyway unless they use a nonstandard BLCKSZ.  So I'm not that concerned
> about raising it --- just wanted to throw out the idea and see if people
> liked it.
    Also, in btree haven't we had some problems with index page
    splits when using entries large enough so that not at least
    3 of them fit on a page?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #




Re: Killing dead index tuples before they get vacuumed

From
Tom Lane
Date:
Jan Wieck <janwieck@yahoo.com> writes:
>     Also,  in  btree haven't we had some problems with index page
> >     splits when using entries large enough so that not at least
>     3 of them fit on a page?

Right, that's why I said that the limit would only go up to ~10K anyway;
btree won't take keys > 1/3 page.
        regards, tom lane


Re: Killing dead index tuples before they get vacuumed

From
Jan Wieck
Date:
Tom Lane wrote:
> Jan Wieck <janwieck@yahoo.com> writes:
> >     Also,  in  btree haven't we had some problems with index page
> >     splits when using entries large enough so that not at least
> >     3 of them fit on a page?
>
> Right, that's why I said that the limit would only go up to ~10K anyway;
> btree won't take keys > 1/3 page.
    What's the point then?  I mean, someone who needs more than 8K
    will outgrow 10K in no time, and those cases are topics for
    comp.databases.abuse.brutal ...


Jan





Re: Killing dead index tuples before they get vacuumed

From
Tom Lane
Date:
Jan Wieck <janwieck@yahoo.com> writes:
> Tom Lane wrote:
>> Right, that's why I said that the limit would only go up to ~10K anyway;
>> btree won't take keys > 1/3 page.

>     What's the point then?

Well, btree's not the only index access method we have.  I'm not sure
whether gist or rtree allow larger tuples...
        regards, tom lane


Re: Killing dead index tuples before they get vacuumed

From
Manfred Koizar
Date:
On Tue, 21 May 2002 12:48:39 -0400, Tom Lane <tgl@sss.pgh.pa.us>
wrote:
>4. How exactly should a killed index tuple be marked on-disk?  While there
>is one free bit available in IndexTupleData.t_info, I would prefer to use
>that bit to expand the index tuple size field to 14 bits instead of 13.
>(This would allow btree index entries to be up to 10K when BLCKSZ is 32K,
>rather than being hard-limited to 8K.)  What I am thinking of doing is
>using the LP_DELETE bit in ItemIdData.lp_flags --- this appears to be
>unused for index tuples.  (I'm not sure it's ever set for heap tuples
>either, actually, but it definitely looks free for index tuples.)

AFAICS LP_DELETE is not used at all.  The only place where something
seems to happen to it is in PageRepairFragmentation() in bufpage.c:

    if ((*lp).lp_flags & LP_DELETE)     /* marked for deletion */
        (*lp).lp_flags &= ~(LP_USED | LP_DELETE);

but there is no place where this bit is set.  There's also a macro
definition in itemid.h:

    #define ItemIdDeleted(itemId) \
        (((itemId)->lp_flags & LP_DELETE) != 0)

which is *always* used in this context:

    if (!ItemIdIsUsed(lp) || ItemIdDeleted(lp))

So it looks safe to use this bit for marking dead tuples.  Wouldn't it
be even possible to simply reset LP_USED instead of setting
LP_DELETE?

If you do not use LP_DELETE, I'd vote for cleaning up the source and
removing it completely.

Yet another idea: set ItemIdData.lp_len = 0 for killed index tuples.

Will this free space be used by subsequent inserts?

Servus,
Manfred


Re: Killing dead index tuples before they get vacuumed

From
Tom Lane
Date:
Manfred Koizar <mkoi-pg@aon.at> writes:
> So it looks safe to use this bit for marking dead tuples.  Wouldn't it
> be even possible to simply reset LP_USED instead of setting
> LP_DELETE?

Mmmm ... I don't think so.  That would cause the tuple to actually
disappear from the perspective of the index AM internals, which seems
like a bad idea.  (For example, if another backend had an indexscan
stopped on that same tuple, it would fail to re-find its place when it
tried to continue the indexscan.)

> Yet another idea: set ItemIdData.lp_len = 0 for killed index tuples.

See above.  This is *not* a substitute for vacuuming.
        regards, tom lane