Thread: Duplicating transaction information in indexes and performing in memory vacuum

Duplicating transaction information in indexes and performing in memory vacuum

From

Shridhar Daithankar

Date:

27 October 2003, 09:54:06

Hi,

Last week there was a thread on whether a purely in-memory vacuum can be
performed. (OK, that was only part of the thread, but anyway.)

I suggested that a page be vacuumed when it is pushed out of the buffer cache. Tom
pointed out that this cannot be done, because index tuples store the heap tuple ID
and depend on the heap tuple for transaction information.

I asked whether it is feasible to add transaction information to the index tuple,
and the answer was no.

I searched the hackers archive, and the following is the only thread I could find
in this context.

http://archives.postgresql.org/pgsql-hackers/2000-09/msg00513.php
http://archives.postgresql.org/pgsql-hackers/2001-09/msg00409.php

The thread does not consider vacuum at all.

What are (more) reasons for not adding transaction information to index tuple, 
in addition to heap tuple?

Cons are bloated indexes. The index tuple size will be close to 30 bytes minimum.
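
As a rough sanity check on that figure, here is a minimal sketch of the arithmetic. It assumes the existing index tuple header is a 6-byte heap tuple ID plus a 2-byte info word, and that the duplicated transaction information would be xmin, cmin, xmax and cmax at 4 bytes each. The choice of fields is an assumption made for illustration; alignment padding and the per-item line pointer are ignored.

```python
import struct

# Existing index tuple header: heap tuple ID (three uint16: block hi, block lo,
# offset) followed by a uint16 info word. "=" means standard sizes, no padding.
EXISTING_HEADER = "=HHHH"

# ASSUMPTION: the duplicated transaction information is xmin, cmin, xmax, cmax,
# each a 4-byte unsigned integer. The original post does not list the fields.
WITH_XACT_INFO = EXISTING_HEADER + "IIII"

existing_header_size = struct.calcsize(EXISTING_HEADER)
header_size = struct.calcsize(WITH_XACT_INFO)

# Smallest useful entry: the widened header plus a single 4-byte integer key.
int4_entry_size = header_size + struct.calcsize("=i")

print(existing_header_size, header_size, int4_entry_size)
```

Under these assumptions the header grows from 8 to 24 bytes, and an entry with a 4-byte key comes to 28 bytes, which is in the neighbourhood of the 30-byte estimate.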

On pro* side of this, no more vacuum required (at least for part of data that is 
being used. If data isn't used, it does not need vacuum anyway) and space bloat 
is stopped right in memory, without incurring overhead of additional IO vacuum 
demands.

Given the recent trend of pushing PG to higher and higher scale (judging from
performance list traffic, that is), I think this could be a worthwhile addition.

So what are the cons I missed so far?
 Bye  Shridhar

Re: Duplicating transaction information in indexes and performing in memory vacuum

From

Tom Lane

Date:

27 October 2003, 14:34:46

Shridhar Daithankar <shridhar_daithankar@myrealbox.com> writes:
> What are (more) reasons for not adding transaction information to
> index tuple, in addition to heap tuple?

> Cons are bloated indexes. The index tuple size will be close to 30
> bytes minimum.

And extra time to perform an update or delete, and extra time for
readers of the index to process and perhaps update the extra copies
of the row's state.  And atomicity concerns, since you can't possibly
update the row and all its index entries simultaneously.  I'm not
certain that the latter issue is insoluble, but it surely is a big risk.

> On pro* side of this, no more vacuum required (at least for part of
> data that is being used. If data isn't used, it does not need vacuum
> anyway) and space bloat is stopped right in memory, without incurring
> overhead of additional IO vacuum demands.

I do not believe either of those claims.  For starters, if you don't
remove a row's index entries when the row itself is removed, won't that
make index bloat a lot worse?  When exactly *will* you remove the index
entries ... and won't that process look a lot like VACUUM?
        regards, tom lane

Re: Duplicating transaction information in indexes and

From

Shridhar Daithankar

Date:

28 October 2003, 04:41:18

Tom Lane wrote:
> Shridhar Daithankar <shridhar_daithankar@myrealbox.com> writes:
> 
>>What are (more) reasons for not adding transaction information to
>>index tuple, in addition to heap tuple?
> 
> 
>>Cons are bloated indexes. The index tuple size will be close to 30
>>bytes minimum.
> 
> 
> And extra time to perform an update or delete, and extra time for
> readers of the index to process and perhaps update the extra copies
> of the row's state.  And atomicity concerns, since you can't possibly
> update the row and all its index entries simultaneously.  I'm not
> certain that the latter issue is insoluble, but it surely is a big risk.

The additional information going into the index is, I assume, already available
while updating the index. So the extra time required is the IO for pushing that
page to disk.

As far as updating each index row is concerned, I was under the impression that all
relevant indexes are updated when a row is updated. Isn't that right?

>>On pro* side of this, no more vacuum required (at least for part of
>>data that is being used. If data isn't used, it does not need vacuum
>>anyway) and space bloat is stopped right in memory, without incurring
>>overhead of additional IO vacuum demands.

OK, "no more vacuum required" is marketing speak. It is not strictly true.

> I do not believe either of those claims.  For starters, if you don't
> remove a row's index entries when the row itself is removed, won't that
> make index bloat a lot worse?  When exactly *will* you remove the index
> entries ... and won't that process look a lot like VACUUM?

If a heap row is removed and its index rows are not, it would not make any
difference, because each index row would contain all the information needed to
infer that it is dead and can be removed.

The dead index row would be removed when the index page, having been fetched into
the buffer cache, is being pushed out, just like a heap tuple. It would not need
the heap tuple(s) to clean the index page.

The index bloat would not be any worse than it is now: since all the information
is available in the index itself, vacuum can clean up the dead index entries as well.

And yes, it is essentially vacuum. But with some differences.

* It will operate on buffer pages only, not on entire database objects. That makes
it a CPU-bound operation, which is cheap compared to the IO incurred by vacuum. If
we assume CPU to be cheap enough, the additional processing would not affect
regular operation that much.
* It will operate continuously, unlike vacuum, which needs a trigger. That could
lower overall throughput a little, but throughput would be much more consistent
than the peaks and troughs shown by the triggered vacuum approach.
* It will not clean up entire database objects, only the pages in question. So
some bloat might be left on disk, in indexes and in heaps. But whatever gets used
will be cleaned up. Assuming caching works normally, it will keep the frequently
used data set clean.
* It is out of order in a sense: index and heap will not be cleaned in sync. The
extra information in the index is there to make sure that this can happen.
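
The points above can be sketched roughly as follows. This is only an illustration of the idea, not PostgreSQL code: the names `Entry`, `Page`, `is_dead` and `prune_on_eviction` are hypothetical, commit status is modelled as a plain set of transaction IDs, and aborted inserters, transaction ID wraparound and locking are all ignored. The point it shows is that once every entry carries its own xmin/xmax, the same check works on a heap page or an index page without consulting the other.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Entry:
    key: int
    xmin: int                    # ID of the inserting transaction
    xmax: Optional[int] = None   # ID of the deleting transaction, None if live


@dataclass
class Page:
    # Stands in for either a heap page or an index page (hypothetical model).
    entries: List[Entry] = field(default_factory=list)


def is_dead(entry: Entry, committed: Set[int], oldest_active_xid: int) -> bool:
    """An entry is dead to everyone if its deleter committed and is older
    than every transaction that is still running."""
    return (
        entry.xmax is not None
        and entry.xmax in committed
        and entry.xmax < oldest_active_xid
    )


def prune_on_eviction(page: Page, committed: Set[int], oldest_active_xid: int) -> int:
    """Drop dead entries from a page that is about to leave the buffer cache.
    Uses only information stored on the page itself. Returns the number of
    entries removed."""
    before = len(page.entries)
    page.entries = [
        e for e in page.entries if not is_dead(e, committed, oldest_active_xid)
    ]
    return before - len(page.entries)
```

Because the check needs nothing beyond the page and the transaction state, a heap page and the index pages pointing into it could be pruned at different times, which is the out-of-order property mentioned in the last point.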

This will not really eliminate vacuum, but rather reduce its significance. Right
now, a write/update-heavy database will die horribly if not vacuumed aggressively.
Hopefully the situation will be much better with such an approach.
 Bye  Shridhar