Thread: HOT for PostgreSQL 8.3

HOT for PostgreSQL 8.3

From
"Simon Riggs"
Date:
Heap Only Tuples ("HOT") is a simplification of earlier proposals for
improving the way the server handles frequent updates, based upon what's
been learned and feedback received.

Heap Only Tuples
----------------

The basic idea is that when a tuple is UPDATEd we can, in certain
circumstances, avoid inserting index tuples for a tuple. Such tuples are
marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
other tuples. The pre-conditions for allowing a HOT UPDATE are
- UPDATE doesn't change any indexed columns
- there is space on the same block as the tuple being updated

There is no restriction on tuple length changes, nor any requirement for
an additional field in the tuple header; as a result this change does
not require activation by an additional WITH parameter and this
technique can be used on *all* tables. 

HOT will, in some cases, perform better in conjunction with the use of
the fillfactor storage parameter. For smaller tables, this will seldom
be required, so database tuning will not increase in complexity (in
comparison with carefully planned VACUUM strategies in earlier
releases). In many cases, the update rate will cause a steady state to
be reached, with on-block space being reused cyclically.
At the same time we insert the HEAP_ONLY_TUPLE, the just-updated tuple
will be marked HEAP_UPDATE_ROOT. When we use an index to locate a heap
tuple, we start from this root tuple and hop forwards using the ctid
chain until we find the appropriate tuple. 

CREATE INDEX requires some careful work to allow it to identify and
correctly index HEAP_ONLY_TUPLEs that need to become ~HEAP_ONLY_TUPLE as
a result of the new index. This will cause additional work to be
required for those cases. CREATE INDEX on a newly loaded table will be
completely unaffected. There is some complexity there, though we don't
go into detail on those issues here. Please read on!

To allow HOT to work effectively we need to consider how we will VACUUM,
noting that in many cases we can remove HEAP_ONLY_TUPLEs much more
easily because they have no index tuples referencing them. There are
various options at this stage, but for clarity only one of those options
is presented here. 

When we try to UPDATE a tuple and the new tuple version doesn't fit on
the block, we get the BufferCleanupLock if possible and then perform a
single-block VACUUM. Any tuple that is both HEAP_DEAD & HEAP_ONLY_TUPLE
can be removed completely. This is possible by changing the t_ctid field
so that it points at the first visible-to-someone tuple in the chain, so
it points "over" the previous HOT tuples. The root tuple is also dead -
it cannot be removed completely, so it is replaced it with "just a
TupleHeader", which is referred to as a TupleStub. (Credit to Itagaki
for this concept).

e.g.

t1 (t_ctid: t2 ) - info HEAP_UPDATE_ROOT status HEAPTUPLE_DEAD
t2 (t_ctid: t3 ) - info HEAP_ONLY        status HEAPTUPLE_DEAD
t3 (t_ctid:self) - info HEAP_ONLY        status HEAPTUPLE_LIVE

after single-page VACUUM

t1 (t_ctid: t3 ) - info HEAP_UPDATE_ROOT & HEAP_TUPLE_STUB                - status HEAPTUPLE_RECENTLY_DEAD
 - t1 is now a TupleStub only
 
t3 (t_ctid:self) - info HEAP_ONLY        status HEAPTUPLE_LIVE

Status shown is the return value from HeapTupleSatisfiesVacuum()

The single-block VACUUM would alter *all* tuple chains on the block, not
just the one for the current tuple being UPDATEd.

This technique means that a tuple never changes its CTID, so everything
that currently uses CTID can continue normally. SeqScan would also work
identically to the way it works today.

It also means that we can't easily remove the root tuple, even if it is
now just a TupleStub (unless the whole chain is also removable because
of DELETE). Removing the root tuple will require a VACUUM *FULL*. Even
so, this option is still space-neutral in the worst-case, in comparison
with inserting index tuples. 

When we perform the single-block VACUUM we don't change the FSM, nor do
we try to check/increment the table's freezelimit. HOT would alter
slightly the way that UPDATEs are signalled to stats, so that these HOT
UPDATEs don't count towards the threshold for autovacuuming - so that a
frequently HOT-updated table may only very seldom require a normal
VACUUM.

The number of itempointers would increase in many cases, though this
would still be limited by current maximums.

Various tweaks on this basic idea exist, which can be considered in more
detail if the basic concept is accepted. 

- - - 

This design is aimed at being a no-frills version of the code that has
already been written. The existing version is available for testing now
and will be made available on community.enterprisedb.com shortly.

Four PostgreSQL developers have various amounts of time to contribute to
developing the above solution and customising it further according to
the wishes of the Community. That is myself, Heikki Linnakangas, Pavan
Deolasee and Nikhil Sontakke. Taken together, it seems possible to craft
something that can be acceptable for PostgreSQL 8.3

Plan from here is to publish the WIP patch(es) weekly until 8.3 code
freeze, together with any performance results.


Your comments are welcome.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT for PostgreSQL 8.3

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> The basic idea is that when a tuple is UPDATEd we can, in certain
> circumstances, avoid inserting index tuples for a tuple. Such tuples are
> marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
> other tuples.

What is VACUUM FULL going to do when it wants to move one of these things?

> CREATE INDEX requires some careful work to allow it to identify and
> correctly index HEAP_ONLY_TUPLEs that need to become ~HEAP_ONLY_TUPLE as
> a result of the new index.

I think you've glossed over the CREATE INDEX problem much too easily.
The difficulty with that is going to be that de-HOT-ifying a tuple
is going to require multiple updates that can't possibly be put into
a single WAL record, and I don't think that WAL replay can clean up
after an incomplete update (since it can't run user-defined functions
and hence cannot be expected to compute index entries for itself).
So I don't think you can do that while preserving crash safety.

> Removing the root tuple will require a VACUUM *FULL*.

That seems unacceptable ... it won't take too long for your table to
fill up with stubs, and we don't want to return to the bad old days
when periodic VACUUM FULL was unavoidable.

ISTM we could fix that by extending the index VACUUM interface to
include two concepts: aside from "remove these TIDs when you find them",
there could be "replace these TIDs with those TIDs when you find them".
This would allow pointer-swinging to one of the child tuples, after
which the old root could be removed.  This has got the same atomicity
problem as for CREATE INDEX, because it's the same thing: you're
de-HOT-ifying the child.  So if you can solve the former, I think you
can make this work too.
        regards, tom lane


Re: HOT for PostgreSQL 8.3

From
"Simon Riggs"
Date:
On Wed, 2007-02-07 at 14:17 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > The basic idea is that when a tuple is UPDATEd we can, in certain
> > circumstances, avoid inserting index tuples for a tuple. Such tuples are
> > marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
> > other tuples.
> 
> What is VACUUM FULL going to do when it wants to move one of these things?

This question stands out from the others. I'm not sure which aspect
you're thinking of - do you see some failure cases?

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT for PostgreSQL 8.3

From
"Joshua D. Drake"
Date:
Simon Riggs wrote:
> Heap Only Tuples ("HOT") is a simplification of earlier proposals for
> improving the way the server handles frequent updates, based upon what's
> been learned and feedback received.
> 
> Heap Only Tuples
> ----------------
> 
> The basic idea is that when a tuple is UPDATEd we can, in certain
> circumstances, avoid inserting index tuples for a tuple. Such tuples are
> marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
> other tuples. The pre-conditions for allowing a HOT UPDATE are
> - UPDATE doesn't change any indexed columns

Uhmmm... how often is that the case? Don't get me wrong, bravo but that
seems a rather large limitation.

Considering it, this would certainly be a boon in web space where you
have things like Rails doing:

UPDATE foo SET first_name = 'Barney' WHERE id = 1;

That would qualify correct?

Sincerely,

Joshua D. Drake


-- 
     === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997            http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



Re: HOT for PostgreSQL 8.3

From
"Merlin Moncure"
Date:
On 2/7/07, Joshua D. Drake <jd@commandprompt.com> wrote:
> Simon Riggs wrote:
> > Heap Only Tuples ("HOT") is a simplification of earlier proposals for
> > improving the way the server handles frequent updates, based upon what's
> > been learned and feedback received.

> Uhmmm... how often is that the case? Don't get me wrong, bravo but that
> seems a rather large limitation.
>
> Considering it, this would certainly be a boon in web space where you
> have things like Rails doing:

HOT is great for tables that are updated frequently via triggers or
cron right?  so it this eliminate the need to vacuum foo following
executing:
update foo set v =1 where id =1;
where v is not an index, right?

if so, it would be great all kinds of things, especially
materialization techniques.  or any situation where a table is updated
so frequently autovac can't keep up.  I can think of tons of places
where this would be useful.

merlin


Re: HOT for PostgreSQL 8.3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> The basic idea is that when a tuple is UPDATEd we can, in certain
>> circumstances, avoid inserting index tuples for a tuple. Such tuples are
>> marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
>> other tuples.
> 
> What is VACUUM FULL going to do when it wants to move one of these things?

I suppose it could move the whole chain to the same target page, if 
there is one with enough space to accommodate the whole chain. Or we 
could just cop out and not move tuples marked with HEAP_ONLY_TUPLE. I 
think that would be acceptable; after the last transaction that can see 
the old tuple is finished, the old tuple is dead. After that, VACUUM 
FULL can remove the old tuple and move the remaining tuple as usual.

>> CREATE INDEX requires some careful work to allow it to identify and
>> correctly index HEAP_ONLY_TUPLEs that need to become ~HEAP_ONLY_TUPLE as
>> a result of the new index.
> 
> I think you've glossed over the CREATE INDEX problem much too easily.
> The difficulty with that is going to be that de-HOT-ifying a tuple
> is going to require multiple updates that can't possibly be put into
> a single WAL record, and I don't think that WAL replay can clean up
> after an incomplete update (since it can't run user-defined functions
> and hence cannot be expected to compute index entries for itself).
> So I don't think you can do that while preserving crash safety.

Yeah, chilling tuples from HOT state to normal tuples is not easy.

One solution I thought of is to add another flag to heap tuples, 
CHILL_IN_PROGRESS. To chill a tuple, you would:

1. Mark heap tuple with CHILL_IN_PROGRESS
2. Insert missing index entries
3. Clear CHILL_IN_PROGRESS and HEAP_ONLY_TUPLE flags

Index scans would ignore tuples with CHILL_IN_PROGRESS and directly 
pointed to from the index. That way if we crash in the middle of step 2, 
scans and updates would work normally after replay, as if the index 
entries weren't there. CREATE INDEX would have to fail if there's any 
CHILL_IN_PROGRESS tuples, because we wouldn't know which index entries 
need to be inserted; some might already be there. To clear the 
CHILL_IN_PROGRESS flag, a vacuum would be needed. Vacuum would remove 
all index entries for those tuples, but instead of removing the heap 
tuple in the 2nd scan it would just clear the CHILL_IN_PROGRESS flag, 
bringing us back to where we started.


However, the easiest solution would be to make CREATE INDEX wait until 
the old tuple is dead. That should be ok at least for concurrent CREATE 
INDEX, because it already has that kind of a wait between 1st and 2nd 
phase.

>> Removing the root tuple will require a VACUUM *FULL*.
> 
> That seems unacceptable ... it won't take too long for your table to
> fill up with stubs, and we don't want to return to the bad old days
> when periodic VACUUM FULL was unavoidable.

Agreed.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: HOT for PostgreSQL 8.3

From
"Simon Riggs"
Date:
On Wed, 2007-02-07 at 14:17 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > The basic idea is that when a tuple is UPDATEd we can, in certain
> > circumstances, avoid inserting index tuples for a tuple. Such tuples are
> > marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
> > other tuples.
> 
> What is VACUUM FULL going to do when it wants to move one of these things?

In addition to others suggested, one option is to rework VACUUM FULL:

Use case for VACUUM FULL is very low these days. It has an appallingly
long execution time. This can be speeded up by dropping and re-creating
indexes, says the manual, but it is still lengthy. It is even faster to
drop the indexes, do a CREATE TABLE AS SELECT * FROM table, drop the old
table and then rebuild the indexes. When moving into the new relation
the space requirement is not double, since we only copy useful
data/space and we also do this using the space the indexes occupied, so
the actual space overhead isn't that high. VACUUM FULL also generates
lots of WAL and ties up lots of memory while it operates - and it uses
memory without any constraint. The CTAS technique doesn't carry across
all of the visible tuple chains, but then to be brutally frank, by the
time VACUUM FULL has actually finished executing, it is very frequently
the oldest transaction anyway, so we needn't really have gone to all the
trouble of moving the tuple chains.

So the main use case for VACUUM FULL is when the space to be freed
inside the table is low enough to make defraging the table quicker than
a CTAS, yet still high enough that we were worried enough to do a VACUUM
FULL. Thats a very narrow use case, and if it exists at all there's a
narrow time window associated with it - only a heavily updated/deleted
table needs vacuuming anyway - that means most often be performing the
VF when we're already into the zone where the CTAS approach is quicker.
VACUUM FULL also forces us to handle various failure cases that leave
half-moved tuples scattered across tables. (Incomplete VACUUM FULLs are
actually fairly common because of its incredibly long run time).

So the radical suggestion is to continue to support the VACUUM FULL
command, but using a newly coded technique that is both higher
performance and more robust. We can either make VACUUM FULL wait until
it actually is the oldest transaction, or we can mark the table in some
way so that a lock cannot be obtained upon it by an earlier Xid. We
should be able to compact the files of a large table one physical file
at a time, so the space overhead is only ever MAX_PHYSICAL_FILESIZE and
the space overhead may become a net space gain as the VF continues. This
is of course (almost) identical to the approach already in use for
CLUSTER, so it seems like that should be acceptable. As a result, its
really not that much code and can still be accomplished on time.

Note also that we would not have to drop and re-add Foreign Keys, since
nothing can have changed while we have the table locked.

Doing this also frees up two heap info bits and simplifies many of the
HeapTupleSatisfies code, which could probably use the helping hand.
Tuples moved to the new files would retain their info bit settings as
they are copied across.

So overall, it seems a lot easier to completely replace VF than to fight
through its complexities and failure cases. If we do the above, then
we'll speed up VACUUM FULL and we'll be able to handle HOT tuples
easily.

> > CREATE INDEX requires some careful work to allow it to identify and
> > correctly index HEAP_ONLY_TUPLEs that need to become ~HEAP_ONLY_TUPLE as
> > a result of the new index.
> 
> I think you've glossed over the CREATE INDEX problem much too easily.
> The difficulty with that is going to be that de-HOT-ifying a tuple
> is going to require multiple updates that can't possibly be put into
> a single WAL record, and I don't think that WAL replay can clean up
> after an incomplete update (since it can't run user-defined functions
> and hence cannot be expected to compute index entries for itself).
> So I don't think you can do that while preserving crash safety.

No intention to gloss, just wanted to get past first post and onto the
really complex stuff. You're absolutely right that this is where much of
the thinking/work needs to take place.

> > Removing the root tuple will require a VACUUM *FULL*.
> 
> That seems unacceptable ... it won't take too long for your table to
> fill up with stubs, and we don't want to return to the bad old days
> when periodic VACUUM FULL was unavoidable.

Completely agree. I wanted to start right at the very beginning, so
everybody would understand the issues, rather than jump straight in
again with additional complexity.

> ISTM we could fix that by extending the index VACUUM interface to
> include two concepts: aside from "remove these TIDs when you find them",
> there could be "replace these TIDs with those TIDs when you find them".
> This would allow pointer-swinging to one of the child tuples, after
> which the old root could be removed.  This has got the same atomicity
> problem as for CREATE INDEX, because it's the same thing: you're
> de-HOT-ifying the child.  So if you can solve the former, I think you
> can make this work too.

This is looking like the best option out of the many, since it doesn't
have any serious restrictions or penalties. Let's see what Pavan thinks,
since he's been working on this aspect.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT for PostgreSQL 8.3

From
"Pavan Deolasee"
Date:


On 2/9/07, Simon Riggs <simon@2ndquadrant.com > wrote:
On Wed, 2007-02-07 at 14:17 -0500, Tom Lane wrote:

> ISTM we could fix that by extending the index VACUUM interface to
> include two concepts: aside from "remove these TIDs when you find them",
> there could be "replace these TIDs with those TIDs when you find them".
> This would allow pointer-swinging to one of the child tuples, after
> which the old root could be removed.  This has got the same atomicity
> problem as for CREATE INDEX, because it's the same thing: you're
> de-HOT-ifying the child.  So if you can solve the former, I think you
> can make this work too.

This is looking like the best option out of the many, since it doesn't
have any serious restrictions or penalties. Let's see what Pavan thinks,
since he's been working on this aspect.

ISTM that there two related issues that we need to solve to make
progress.

- We must make de-HOTifying or CHILLing crash safe
- Concurrent index scans should work correctly with CHILLing operations

I think the first issue can be addressed on the lines of what Heikki suggested.
We can CHILL one tuple at a time. I am thinking of a two step process.
In the first step, the root-tuple and the heap-only tuple (which needs CHILLing)
are marked with a special flag, CHILL_IN_PROGRESS. This operation is
WAL logged. We then insert appropriate index entries for the tuple under
consideration.

In the second step, the HEAP_UPDATED_ROOT and HEAP_ONLY_TUPLE
flags on the heap tuples are adjusted and CHILL_IN_PROGRESS flags are cleared.

During normal operations, if CHILL_IN_PROGRESS flag is found set, we might
need to do some more work to figure out whether the index insert operations
were successful or not. If we find that there are missing index entries for the tuple
under consideration for CHILLing, then those could be added now and flags
are set/reset appropriately.

The second problem of concurrent index scans seems a bit more complex.
We need a mechanism so that no tuples are missed or tuples are
not returned twice. Since CHILLing of a tuple adds a new access path to the
tuple from the index, a concurrent index scan may return a tuple twice.

How about grabbing a AccessExclusiveLock during CHILLing
operation ? This would prevent any concurrent index scans. Since CHILLing
of a large table can take a long time, the operation can be spread across
time with periodic acquire/release of the lock. This would prevent starvation
of other backends. Since CHILLing is required only for CREATE INDEX
and stub-cleanup, I am assuming that its ok for it to be lazy in nature.
 
Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: HOT for PostgreSQL 8.3

From
"Simon Riggs"
Date:
On Thu, 2007-02-08 at 14:47 +0000, Heikki Linnakangas wrote:

> However, the easiest solution would be to make CREATE INDEX wait until 
> the old tuple is dead. That should be ok at least for concurrent CREATE 
> INDEX, because it already has that kind of a wait between 1st and 2nd 
> phase.

I'm not sure this part of the idea is possible; the rest sounded good.
Looking at DefineIndex() the wait happens only for transactions that
already have a lock on the table being indexed, which may not be very
many. AFAICS the ref page for CREATE INDEX CONCURRENTLY isn't fully
accurate (any more?) when it says "[CREATE INDEX] must wait for all
existing transactions to terminate".

Waiting until an arbitrary Xid dies could be deadlock-prone, if the lock
isn't taken carefully. Imagine a pg_dump coming towards you and then
waiting on the locked table. You'd need to wait at the beginning of the
command, before locks were taken. However, that would means CREATE INDEX
would only be possible outside of transaction blocks, which I don't
think is acceptable.

I wanted this for VACUUM FULL also, but same problem exists.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT for PostgreSQL 8.3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> ISTM we could fix that by extending the index VACUUM interface to
> include two concepts: aside from "remove these TIDs when you find them",
> there could be "replace these TIDs with those TIDs when you find them".
> This would allow pointer-swinging to one of the child tuples, after
> which the old root could be removed.  

Implementing the "replace these TIDs" operation atomically would be 
simple, except for the new bitmap index am. It should be possible there 
as well, but if the old and new tid happen to be on a different bitmap 
page, it requires some care to avoid deadlocks.

Also, we'd need more work mem for vacuum.

> This has got the same atomicity
> problem as for CREATE INDEX, because it's the same thing: you're
> de-HOT-ifying the child.

Not exactly. De-HOT-ifying, or chilling, a child means inserting new 
index entries. But if we're just replacing the tids from the existing 
index entries, it's ok if we crash after replacing some but not all of 
them. The next vacuum would replace the rest of the pointers, and remove 
the old root tuple.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: HOT for PostgreSQL 8.3

From
Teodor Sigaev
Date:
> Implementing the "replace these TIDs" operation atomically would be 
> simple, except for the new bitmap index am. It should be possible there 

That isn't simple (may be, even possible) from GIN.
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: HOT for PostgreSQL 8.3

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
>> Implementing the "replace these TIDs" operation atomically would be 
>> simple, except for the new bitmap index am. It should be possible there 

> That isn't simple (may be, even possible) from GIN.

I suspect that those pushing this idea only care about btrees anyway,
so one possible answer is that HOT is only possible when the table has
only btree indexes --- or at least, only indexes of AMs that support the
replace-these-TIDs operation.  (Yet another pg_am flag...)
        regards, tom lane


Re: HOT for PostgreSQL 8.3

From
"Simon Riggs"
Date:
On Fri, 2007-02-09 at 10:17 -0500, Tom Lane wrote:
> Teodor Sigaev <teodor@sigaev.ru> writes:
> >> Implementing the "replace these TIDs" operation atomically would be 
> >> simple, except for the new bitmap index am. It should be possible there 
> 
> > That isn't simple (may be, even possible) from GIN.
> 
> I suspect that those pushing this idea only care about btrees anyway,
> so one possible answer is that HOT is only possible when the table has
> only btree indexes --- or at least, only indexes of AMs that support the
> replace-these-TIDs operation.  (Yet another pg_am flag...)

Well, thats me. Yes, I think b-trees-only is acceptable.

Realistically, very frequent updating and full text indexing are easily
separable use cases, at least into separate tables.

HOT should be of use in Data Warehousing applications also, when summary
tables are maintained alongside detailed data, but that also sounds like
HOT and bitmap indexes would be separable at the table level without
difficulty.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT for PostgreSQL 8.3

From
"Simon Riggs"
Date:
On Fri, 2007-02-09 at 13:39 +0000, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > ISTM we could fix that by extending the index VACUUM interface to
> > include two concepts: aside from "remove these TIDs when you find them",
> > there could be "replace these TIDs with those TIDs when you find them".
> > This would allow pointer-swinging to one of the child tuples, after
> > which the old root could be removed.  
> 
> Implementing the "replace these TIDs" operation atomically would be 
> simple, except for the new bitmap index am. It should be possible there 
> as well, but if the old and new tid happen to be on a different bitmap 
> page, it requires some care to avoid deadlocks.

Grouped Item Indexes cope with this easily also, yes?

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT for PostgreSQL 8.3

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2007-02-09 at 13:39 +0000, Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> ISTM we could fix that by extending the index VACUUM interface to
>>> include two concepts: aside from "remove these TIDs when you find them",
>>> there could be "replace these TIDs with those TIDs when you find them".
>>> This would allow pointer-swinging to one of the child tuples, after
>>> which the old root could be removed.  
>> Implementing the "replace these TIDs" operation atomically would be 
>> simple, except for the new bitmap index am. It should be possible there 
>> as well, but if the old and new tid happen to be on a different bitmap 
>> page, it requires some care to avoid deadlocks.
> 
> Grouped Item Indexes cope with this easily also, yes?

Yes, as long as the old and the new tid point to the same page.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: HOT for PostgreSQL 8.3

From
Hannu Krosing
Date:
Ühel kenal päeval, R, 2007-02-09 kell 13:39, kirjutas Heikki
Linnakangas:
> Tom Lane wrote:
> > ISTM we could fix that by extending the index VACUUM interface to
> > include two concepts: aside from "remove these TIDs when you find them",
> > there could be "replace these TIDs with those TIDs when you find them".
> > This would allow pointer-swinging to one of the child tuples, after
> > which the old root could be removed.  
> 
> Implementing the "replace these TIDs" operation atomically would be 
> simple, except for the new bitmap index am. It should be possible there 
> as well, but if the old and new tid happen to be on a different bitmap 
> page, it requires some care to avoid deadlocks.
> 
> Also, we'd need more work mem for vacuum.

Why do we need to muck around with indexes at all ?

What are the problems with just shuffling the last (and only visible)
tuple to replace the HOT-hain root and be done with it ?

Can there be some problems with seqscans moving one tuple at a time or
doing revisits of the "same" (by TID) tuple ? If there are can't these
be fixed by creative use of ctid chains form the original live one to
the new live one ?

> > This has got the same atomicity
> > problem as for CREATE INDEX, because it's the same thing: you're
> > de-HOT-ifying the child.
> 
> Not exactly. De-HOT-ifying, or chilling, a child means inserting new 
> index entries. 

If we can just move the tuple inside the page we can avoid even that.

> But if we're just replacing the tids from the existing 
> index entries, it's ok if we crash after replacing some but not all of 
> them. The next vacuum would replace the rest of the pointers, and remove 
> the old root tuple.
> 
-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: HOT for PostgreSQL 8.3

From
Tom Lane
Date:
Hannu Krosing <hannu@skype.net> writes:
> What are the problems with just shuffling the last (and only visible)
> tuple to replace the HOT-hain root and be done with it ?

ctid stops being a reliable identifier.
        regards, tom lane


Re: HOT for PostgreSQL 8.3

From
"Simon Riggs"
Date:
On Fri, 2007-02-09 at 13:47 -0500, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > What are the problems with just shuffling the last (and only visible)
> > tuple to replace the HOT-hain root and be done with it ?
> 
> ctid stops being a reliable identifier.

Yes, that sums it up.

The issue can be overcome in the internals, but the only means of doing
so that seemed workable required changing the output of the CTID special
column. Some applications use the tid datatype directly to relocate rows
within a transaction using the tidscan. ODBC driver uses that concept to
implement Updateable Cursors from the client. We can change that, but
we'd break all the other ones we don't know about. 

This issue was one of the major contributing factors to the size of the
previous HOT prototype, making it more invasive than is really
desirable.

Pointer-swinging avoids those issues and seems workable, even if it is a
pain to have to visit the index during VACUUM. So changing CTID isn't a
bridge we really need to cross, for which I'm glad.

Just as a matter of record, the tuple identifier would need to include
(block, itemid, xmin, cmin) to make this idea work, with the itemid
being that of the root tuple and the xmin and cmin being the tuple in
the chain that is being referenced. This would've then allowed callers
to relocate a specific tuple, even when the update chain had changed
between block accesses. Anyway, glad we're not going there anymore.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT for PostgreSQL 8.3

From
"Simon Riggs"
Date:
On Fri, 2007-02-09 at 13:16 +0530, Pavan Deolasee wrote:

> The second problem of concurrent index scans seems a bit more complex.
> We need a mechanism so that no tuples are missed or tuples are 
> not returned twice. Since CHILLing of a tuple adds a new access path
> to the tuple from the index, a concurrent index scan may return a
> tuple twice.
> 
> How about grabbing a AccessExclusiveLock during CHILLing
> operation ? This would prevent any concurrent index scans. Since
> CHILLing of a large table can take a long time, the operation can be
> spread across time with periodic acquire/release of the lock. This
> would prevent starvation of other backends. Since CHILLing is required
> only for CREATE INDEX and stub-cleanup, I am assuming that its ok for
> it to be lazy in nature.

We've just spoken about this, so just wanted to add those thoughts here.

A pointer-swing operation will begin when we see a tuple that is both
status of HEAPTUPLE_DEAD and is marked HEAP_UPDATE_ROOT. Perhaps that
requires a new status from HeapTupleSatisfiesVacuum()? We chill all
tuples in the chain, up to the new root. We mark those tuples, so that
HeapTupleSatisfiesVacuum() will be describe them as
HEAPTUPLE_RECENTLY_DEAD. So the current Vacuum won't immediately remove
them, but they'll go away in the future as part of an on-demand block
vacuum or Vacuum. That's similar to the way we handle HALF_DEAD index
pages.

The index scans are page-at-a-time, so when we pointer-swing from the
root tuple to one of the HOT tuples we'll be OK. We're switching a
specific index tuple, so there's no multi-page locking on the index to
consider. 

Right now, I wouldn't want to assume that the way tuples are marked
prior to pointer-swinging is exactly the same as the chilling required
by CREATE INDEX: CHILL_IN_PROGRESS. It may well be, but I'm wary that we
assume they are exactly the same and introduce a subtle bug.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT for PostgreSQL 8.3

From
Bruce Momjian
Date:
Tom Lane wrote:
> > Removing the root tuple will require a VACUUM *FULL*.
> 
> That seems unacceptable ... it won't take too long for your table to
> fill up with stubs, and we don't want to return to the bad old days
> when periodic VACUUM FULL was unavoidable.
> 
> ISTM we could fix that by extending the index VACUUM interface to
> include two concepts: aside from "remove these TIDs when you find them",
> there could be "replace these TIDs with those TIDs when you find them".
> This would allow pointer-swinging to one of the child tuples, after
> which the old root could be removed.  This has got the same atomicity
> problem as for CREATE INDEX, because it's the same thing: you're
> de-HOT-ifying the child.  So if you can solve the former, I think you
> can make this work too.

I need clarification here.  Is removing dead heap tuple always going to
require an index scan, or was this just for chilling a row (adding an
index)?

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: HOT for PostgreSQL 8.3

From
"Simon Riggs"
Date:
On Fri, 2007-02-09 at 18:10 -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > > Removing the root tuple will require a VACUUM *FULL*.
> > 
> > That seems unacceptable ... it won't take too long for your table to
> > fill up with stubs, and we don't want to return to the bad old days
> > when periodic VACUUM FULL was unavoidable.
> > 
> > ISTM we could fix that by extending the index VACUUM interface to
> > include two concepts: aside from "remove these TIDs when you find them",
> > there could be "replace these TIDs with those TIDs when you find them".
> > This would allow pointer-swinging to one of the child tuples, after
> > which the old root could be removed.  This has got the same atomicity
> > problem as for CREATE INDEX, because it's the same thing: you're
> > de-HOT-ifying the child.  So if you can solve the former, I think you
> > can make this work too.
> 
> I need clarification here.  Is removing dead heap tuple always going to
> require an index scan, or was this just for chilling a row (adding an
> index)?

We can remove a tupled marked HEAP_ONLY_TUPLE when it is status
HEAPTUPLE_DEAD. The HEAP_UPDATE_ROOT tuple can be reduced to a
TupleStub, but not removed. Multiple tuples in the chain can be removed,
though the HEAP_UPDATE_ROOT's t_ctid must be modified to point to the
first non-removed tuple in the chain. All of that can be done when we
hold a CleanupLock on the block, without reference to the indexes; this
can be performed on-demand, when we attempt an UPDATE. This is similar
to what already happens prior to a btree split operation. (This could
also be performed by bgwriter, but that isn't proposed at this time
because the buffer transit time through the cache is often not long
enough to allow tuples to die and get benefit from space reuse).

TupleStubs can be marked for removal by a pointer-swing operation during
normal VACUUM, i.e. it will require touching the indexes.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT for PostgreSQL 8.3

From
Bruce Momjian
Date:
Simon Riggs wrote:
> > I need clarification here.  Is removing dead heap tuple always going to
> > require an index scan, or was this just for chilling a row (adding an
> > index)?
> 
> We can remove a tupled marked HEAP_ONLY_TUPLE when it is status
> HEAPTUPLE_DEAD. The HEAP_UPDATE_ROOT tuple can be reduced to a
> TupleStub, but not removed. Multiple tuples in the chain can be removed,
> though the HEAP_UPDATE_ROOT's t_ctid must be modified to point to the
> first non-removed tuple in the chain. All of that can be done when we
> hold a CleanupLock on the block, without reference to the indexes; this
> can be performed on-demand, when we attempt an UPDATE. This is similar
> to what already happens prior to a btree split operation. (This could
> also be performed by bgwriter, but that isn't proposed at this time
> because the buffer transit time through the cache is often not long
> enough to allow tuples to die and get benefit from space reuse).
> 
> TupleStubs can be marked for removal by a pointer-swing operation during
> normal VACUUM, i.e. it will require touching the indexes.

OK, that sounds like a good plan.  Thanks.

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: HOT for PostgreSQL 8.3

From
Hannu Krosing
Date:
Ühel kenal päeval, K, 2007-02-07 kell 17:38, kirjutas Simon Riggs:

> When we try to UPDATE a tuple and the new tuple version doesn't fit on
> the block, we get the BufferCleanupLock if possible and then perform a
> single-block VACUUM. Any tuple that is both HEAP_DEAD &
> HEAP_ONLY_TUPLE
> can be removed completely. This is possible by changing the t_ctid
> field
> so that it points at the first visible-to-someone tuple in the chain,
> so
> it points "over" the previous HOT tuples. The root tuple is also dead
> -
> it cannot be removed completely, so it is replaced it with "just a
> TupleHeader", which is referred to as a TupleStub. (Credit to Itagaki
> for this concept).

What if we would just reuse the root tuple directly instead of turning
it into a stub ?

This would create a cycle of ctid pointers, which changes the lookup
process from 'follow ctid chaint until the end' to 'follow the tid chain
until you reach the start'.

It also allows one to chill tuples for which all other tuples except the
root are dead by just flipping the HOT flag bit (or juts making the ctid
chain ptr to point to self.)

> The single-block VACUUM would alter *all* tuple chains on the block,
> not just the one for the current tuple being UPDATEd.
> 
> This technique means that a tuple never changes its CTID, so
> everything that currently uses CTID can continue normally. SeqScan
> would also work identically to the way it works today.
> 
> It also means that we can't easily remove the root tuple, even if it
> is now just a TupleStub (unless the whole chain is also removable
> because of DELETE). Removing the root tuple will require a VACUUM
> *FULL*. Even so, this option is still space-neutral in the worst-case,
> in comparison with inserting index tuples. 

if you can afford changing xmin and xmax, you can actually do it in 2
steps - first do an "update" that moves the tuple to hot chain root
position, and then, after the original is dead for all backends, just
mark it as not hot.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: HOT for PostgreSQL 8.3

From
Tom Lane
Date:
Hannu Krosing <hannu@skype.net> writes:
> What if we would just reuse the root tuple directly instead of turning
> it into a stub ?
> This would create a cycle of ctid pointers, which changes the lookup
> process from 'follow ctid chaint until the end' to 'follow the tid chain
> until you reach the start'.

How do you know which one is newest?  What happens when you have to put
a newer version off-page for lack of space?
        regards, tom lane


Re: HOT for PostgreSQL 8.3

From
Hannu Krosing
Date:
Ühel kenal päeval, P, 2007-02-11 kell 12:35, kirjutas Tom Lane:
> Hannu Krosing <hannu@skype.net> writes:
> > What if we would just reuse the root tuple directly instead of turning
> > it into a stub ?
> > This would create a cycle of ctid pointers, which changes the lookup
> > process from 'follow ctid chaint until the end' to 'follow the tid chain
> > until you reach the start'.
> 
> How do you know which one is newest? 

By xmin,cmin of course .

> What happens when you have to put a newer version off-page for lack of space?

Then this scheme won't work.

How about adding a new 2-byte field to header for in-page c_tid poiner
for HOT ?

It grows header from 26 to 28 bytes, but for MAXALIGN=4 the space usage
would stay the same.

>             regards, tom lane
-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



Re: HOT for PostgreSQL 8.3

From
"Pavan Deolasee"
Date:

On 2/11/07, Hannu Krosing <hannu@skype.net> wrote:
Ühel kenal päeval, P, 2007-02-11 kell 12:35, kirjutas Tom Lane:
> Hannu Krosing <hannu@skype.net> writes:
> > What if we would just reuse the root tuple directly instead of turning
> > it into a stub ?
> > This would create a cycle of ctid pointers, which changes the lookup
> > process from 'follow ctid chaint until the end' to 'follow the tid chain
> > until you reach the start'.
>
> How do you know which one is newest?

By xmin,cmin of course .

This sounds interesting. But we might have trouble supporting HOT-update when
tuple length changes. May be we can release the space consumed by the dead root
tuple and have fresh allocation if the tuple length increases.


Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: HOT for PostgreSQL 8.3

From
Heikki Linnakangas
Date:
Pavan Deolasee wrote:
> On 2/11/07, Hannu Krosing <hannu@skype.net> wrote:
>>
>> Ühel kenal päeval, P, 2007-02-11 kell 12:35, kirjutas Tom Lane:
>> > Hannu Krosing <hannu@skype.net> writes:
>> > > What if we would just reuse the root tuple directly instead of 
>> turning
>> > > it into a stub ?
>> > > This would create a cycle of ctid pointers, which changes the lookup
>> > > process from 'follow ctid chaint until the end' to 'follow the tid
>> chain
>> > > until you reach the start'.
>> >
>> > How do you know which one is newest?
>>
>> By xmin,cmin of course .
> 
> 
> This sounds interesting. But we might have trouble supporting HOT-update
> when
> tuple length changes. May be we can release the space consumed by the dead
> root
> tuple and have fresh allocation if the tuple length increases.

I don't see a problem with length changes. We're free to rearrange data 
on the page whenever we can acquire the vacuum lock.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: HOT for PostgreSQL 8.3

From
Heikki Linnakangas
Date:
Hannu Krosing wrote:
> Ühel kenal päeval, P, 2007-02-11 kell 12:35, kirjutas Tom Lane:
>> Hannu Krosing <hannu@skype.net> writes:
>>> What if we would just reuse the root tuple directly instead of turning
>>> it into a stub ?
>>> This would create a cycle of ctid pointers, which changes the lookup
>>> process from 'follow ctid chaint until the end' to 'follow the tid chain
>>> until you reach the start'.
>> How do you know which one is newest? 
> 
> By xmin,cmin of course .
> 
>> What happens when you have to put a newer version off-page for lack of space?
> 
> Then this scheme won't work.

Couldn't the ctid of the latest tuple point to the off-page tuple as usual?

As long as the index points to any tuple in the update chain, an index 
scan can even scan the whole page to find all the earlier tuples in the 
chain. The latest version can always be found by following the c_tid 
pointers, so that would only be needed to find an older version of the row.

> How about adding a new 2-byte field to header for in-page c_tid poiner
> for HOT ?
> 
> It grows header from 26 to 28 bytes, but for MAXALIGN=4 the space usage
> would stay the same.

Actually the tuple header is now 23 bytes, thanks to combo (aka phantom) 
cids. Adding a new 2-byte field would take it over 24 bytes again, 
making it either 28 or 32 bytes depending on MAXALIGN.

It might be acceptable, if it was only stored on those tuples that are 
(HOT) updated. But it's not clear to me what you're proposing to do with 
the field, anyway, Which tuples would have it, and what would it point to?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: HOT for PostgreSQL 8.3

From
"Pavan Deolasee"
Date:

On 2/12/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
Hannu Krosing wrote:
> Ühel kenal päeval, P, 2007-02-11 kell 12:35, kirjutas Tom Lane:
>> Hannu Krosing <hannu@skype.net> writes:
>>> What if we would just reuse the root tuple directly instead of turning
>>> it into a stub ?
>>> This would create a cycle of ctid pointers, which changes the lookup
>>> process from 'follow ctid chaint until the end' to 'follow the tid chain
>>> until you reach the start'.
>> How do you know which one is newest?
>
> By xmin,cmin of course .
>
>> What happens when you have to put a newer version off-page for lack of space?
>
> Then this scheme won't work.

Couldn't the ctid of the latest tuple point to the off-page tuple as usual?

It could. But then you may lose reference for older version(s). We can do
the whole page scans to locate older versions, but that might prove too
costly
 

It might be acceptable, if it was only stored on those tuples that are
(HOT) updated. But it's not clear to me what you're proposing to do with
the field, anyway, Which tuples would have it, and what would it point to?

My guess what Hannu is suggesting is to have a  circular chain of HOT-updated
tuples in a page. The index can point to any tuple in the chain. When
update goes off-page or is a COLD update, t_ctid points to the newer version
as usual. So a tuple which is COLD updated would need two pointers, one
which points to the oldest version in the circular on-page chain and other which
points to the new version.
 
Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: HOT for PostgreSQL 8.3

From
Tom Lane
Date:
Hannu Krosing <hannu@skype.net> writes:
> How about adding a new 2-byte field to header for in-page c_tid poiner
> for HOT ?

We just finished sweating blood to get the tuple header size down to 23
bytes from 27 (which saves 8 bytes not 4 if MAXALIGN=8).  We are not
going to blow that again on HOT.
        regards, tom lane


Re: HOT for PostgreSQL 8.3

From
"Zeugswetter Andreas ADI SD"
Date:
> > > What are the problems with just shuffling the last (and only
visible)
> > > tuple to replace the HOT-hain root and be done with it ?
> >
> > ctid stops being a reliable identifier.

A backend selecting the row by ctid would need to take one step to
the root slot to get at the tuple. This does seem possible if the
tuples got swapped.
The old "current version" slot would need to become RECENTLY_DEAD
and could not be removed right away. It also needs a flag to
denote that the one step is necessary.

It sure would be nice if a root could be reused during regular vacuum,
(even if it were only possible under very restricted conditions,
that eventually apply).

Andreas


Re: HOT for PostgreSQL 8.3

From
mark@mark.mielke.cc
Date:
On Mon, Feb 12, 2007 at 12:48:06AM -0500, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > How about adding a new 2-byte field to header for in-page c_tid poiner
> > for HOT ?
> We just finished sweating blood to get the tuple header size down to 23
> bytes from 27 (which saves 8 bytes not 4 if MAXALIGN=8).  We are not
> going to blow that again on HOT.

I haven't had enough time to follow all of the details here - but if the
ability to update a row with minimal overhead, as long as there is extra
room in the same block is a great idea (it sounds appealing to me) - could
it be done with just a 1 byte list? 24 instead of 23 for the tuple size.

I'll try to catch up at some point. :-)

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all                      and in the darkness
bindthem...
 
                          http://mark.mielke.cc/



Re: HOT for PostgreSQL 8.3

From
Heikki Linnakangas
Date:
mark@mark.mielke.cc wrote:
> On Mon, Feb 12, 2007 at 12:48:06AM -0500, Tom Lane wrote:
>> We just finished sweating blood to get the tuple header size down to 23
>> bytes from 27 (which saves 8 bytes not 4 if MAXALIGN=8).  We are not
>> going to blow that again on HOT.
> 
> I haven't had enough time to follow all of the details here - but if the
> ability to update a row with minimal overhead, as long as there is extra
> room in the same block is a great idea (it sounds appealing to me) - could
> it be done with just a 1 byte list? 24 instead of 23 for the tuple size.

Assuming 8k pages, you could in theory store reference to a line pointer 
in just 1 byte.

But actually that 1 free byte in the header is not currently just waste 
of space. If you have any nulls in your tuple, there's going to be a 
null bitmap in addition to the header. 1 byte is conveniently enough to 
store the null bitmap for a table with max 8 columns, and if a table has 
more than 8 columns, the extra 4 or 8 bytes needed for the null bitmap 
probably doesn't matter so much because the tuples are quite wide anyway.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: HOT for PostgreSQL 8.3

From
Hannu Krosing
Date:
Ühel kenal päeval, E, 2007-02-12 kell 17:23, kirjutas Heikki
Linnakangas:
> mark@mark.mielke.cc wrote:
> > On Mon, Feb 12, 2007 at 12:48:06AM -0500, Tom Lane wrote:
> >> We just finished sweating blood to get the tuple header size down to 23
> >> bytes from 27 (which saves 8 bytes not 4 if MAXALIGN=8).  We are not
> >> going to blow that again on HOT.
> > 
> > I haven't had enough time to follow all of the details here - but if the
> > ability to update a row with minimal overhead, as long as there is extra
> > room in the same block is a great idea (it sounds appealing to me) - could
> > it be done with just a 1 byte list? 24 instead of 23 for the tuple size.
> 
> Assuming 8k pages, you could in theory store reference to a line pointer 
> in just 1 byte.
> 
> But actually that 1 free byte in the header is not currently just waste 
> of space. If you have any nulls in your tuple, there's going to be a 
> null bitmap in addition to the header. 1 byte is conveniently enough to 
> store the null bitmap for a table with max 8 columns,

Are we actually doing that ? I.E are null bitmaps really allocated in 1
byte steps nowadays ?

> and if a table has 
> more than 8 columns, the extra 4 or 8 bytes needed for the null bitmap 
> probably doesn't matter so much because the tuples are quite wide anyway.
> 
-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: HOT for PostgreSQL 8.3

From
Heikki Linnakangas
Date:
Hannu Krosing wrote:
>> But actually that 1 free byte in the header is not currently just waste 
>> of space. If you have any nulls in your tuple, there's going to be a 
>> null bitmap in addition to the header. 1 byte is conveniently enough to 
>> store the null bitmap for a table with max 8 columns,
> 
> Are we actually doing that ? I.E are null bitmaps really allocated in 1
> byte steps nowadays ?

Yes.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: HOT for PostgreSQL 8.3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Hannu Krosing wrote:
>> Are we actually doing that ? I.E are null bitmaps really allocated in 1
>> byte steps nowadays ?

> Yes.

Not really; we still have to MAXALIGN at the end of the bitmap.  The
point is that you can get 8 bits in there before paying the first
additional MAXALIGN increment.

It's all moot anyway since 8 bits isn't enough for a pointer ...
        regards, tom lane


Re: HOT for PostgreSQL 8.3

From
"Pavan Deolasee"
Date:


On 2/13/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Hannu Krosing wrote:
>> Are we actually doing that ? I.E are null bitmaps really allocated in 1
>> byte steps nowadays ?

> Yes.

Not really; we still have to MAXALIGN at the end of the bitmap.  The
point is that you can get 8 bits in there before paying the first
additional MAXALIGN increment.

It's all moot anyway since 8 bits isn't enough for a pointer ...


We could live with 8 bits actually. We can store only the least
significant 8 bits of the pointer. It would point to a set of tuples and
we may need to search within that set to find the required tuple.
This would still be better than scanning the entire page.

But I agree that utilizing those 8 bits would result in a penalty
for tables with few columns.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: HOT for PostgreSQL 8.3

From
Hannu Krosing
Date:
Ühel kenal päeval, T, 2007-02-13 kell 09:38, kirjutas Tom Lane:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > Hannu Krosing wrote:
> >> Are we actually doing that ? I.E are null bitmaps really allocated in 1
> >> byte steps nowadays ?
> 
> > Yes.
> 
> Not really; we still have to MAXALIGN at the end of the bitmap.  The
> point is that you can get 8 bits in there before paying the first
> additional MAXALIGN increment.
> 
> It's all moot anyway since 8 bits isn't enough for a pointer ...

With 8k pages and MAXALIGN=8 we just barely can, as with current page
structure (tuple headers together with data) the minimal tuple size for
that case is 24 for header + 8 for data = 32 bytes which means 256
tuples per page (minus page header) and so we can store tuple number in
8 bits.

OTOH, for same page HOT tuples, we have the command and trx ids stored
twice first as cmax,xmax of the old tuple and as cmin,xmin of the
updated tuple. One of these could probably be used for in-page HOT tuple
pointer.

As we can't rely on get-xmax-from-next-in-chain behaviour for off-page
tuples, we need to move the other way, that is storing a pointer to
previous version and getting xmin from there. That means keeping a chain
back pointers in xmin fields for all but the oldest tuple in HOT
chain/cycle.

This requires a little shuffling around of xmin/xmax info when removing
dead HOT chain members, but this would void the need to do any index
changes during HOT updates. Avoiding all index updates is probably good,
as it means we dont need to do any index page locking.

So the proposed structure for HOT chain is like this:

* Oldest tuple in chain has real xmin/xmax, this needs a flag to be
recognisable, of we may just start transaction counter at 64K so that
any xmin that fits in 2 bytes is immediately recognized as hot back
pointer. Or start cmin at 1 and use cmin=0 as pointer flag.

* All newer tuples have real xmax, and in-page pointer to previous
version in place of xmin. the real xmin is xmax of the previous tuple.
When previous tuple is removed, xmin is stored in the new oldest tuple.

* Index can point to any tuple in HOT chain. Finding the tuple by index

* To find a tuple, one errives to (possibly) center of HOT chain, and
the tuple may be found either by following the in-page pointer stored in
xmin field or ctid pointers.

* If the tuples ctid is pointing to off page, which means it must thus
be the newest one in HOT chain and also end of it. The chain ends also
when the tuple is inserted by an rollbacked transaction.

* there is a special case when the index pointer points to a rollbacked
tuple. In this case the tuple can't be removed, but should be reused on 
next update. But this is the case for any HOT rollbacks.

The proposed structure should enables cyclic reuse of dead tuples as
they fall out of visibility window and avoids updating index pointers.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: HOT for PostgreSQL 8.3

From
"Pavan Deolasee"
Date:

On 2/14/07, Hannu Krosing <hannu@skype.net> wrote:

OTOH, for same page HOT tuples, we have the command and trx ids stored
twice first as cmax,xmax of the old tuple and as cmin,xmin of the
updated tuple. One of these could probably be used for in-page HOT tuple
pointer.

I think we recently merged cmin/cmax into a single combo cid. So I don't
think we can use cmin for storing back pointer. Idea of using xmin  seems
feasible though.

As we can't rely on get-xmax-from-next-in-chain behaviour for off-page
tuples, we need to move the other way, that is storing a pointer to
previous version and getting xmin from there. That means keeping a chain
back pointers in xmin fields for all but the oldest tuple in HOT
chain/cycle.

Why can't we store the real xmin/xmax at the end of the chain and store
forward pointers in the xmax ? I find the idea of having back pointers more
compelling though.
 

This requires a little shuffling around of xmin/xmax info when removing
dead HOT chain members, but this would void the need to do any index
changes during HOT updates. Avoiding all index updates is probably good,
as it means we dont need to do any index page locking.

I agree. If we can design a solution that does not require any index updates,
that would save us a lot. One problem with such a approach though is that we
might have to live with a dead tuple/stub until another update occurs and
the tuple/stub is reused.

So the proposed structure for HOT chain is like this:

* Oldest tuple in chain has real xmin/xmax, this needs a flag to be
recognisable, of we may just start transaction counter at 64K so that
any xmin that fits in 2 bytes is immediately recognized as hot back
pointer. Or start cmin at 1 and use cmin=0 as pointer flag.

* All newer tuples have real xmax, and in-page pointer to previous
version in place of xmin. the real xmin is xmax of the previous tuple.
When previous tuple is removed, xmin is stored in the new oldest tuple.

* Index can point to any tuple in HOT chain. Finding the tuple by index

* To find a tuple, one errives to (possibly) center of HOT chain, and
the tuple may be found either by following the in-page pointer stored in
xmin field or ctid pointers.

Can we quickly figure out based on the given snapshot and the root tuple
xmin/xmax, whether to follow the forward or the backward pointer to arrive
at a visible tuple ?
 

* If the tuples ctid is pointing to off page, which means it must thus
be the newest one in HOT chain and also end of it. The chain ends also
when the tuple is inserted by an rollbacked transaction.

* there is a special case when the index pointer points to a rollbacked
tuple. In this case the tuple can't be removed, but should be reused on
next update. But this is the case for any HOT rollbacks.


One problem is with aborted HOT-updates. According to this scheme, xmin
of the aborted version is stored in the previous version. What happens when
that gets updated again and committed ? If we follow the back pointer from
the aborted tuple to fetch the xmin, we would actually be fetching the txid
of the committed transaction which later updated the tuple. This would fool
us to incorrectly believe that the aborted tuple is LIVE.

May be we can set XMIN_INVALID for the next tuple in the chain when
a tuple being updated has t_ctid pointing to a tuple in the same page,
other than itself.

I am sure there are interesting interactions between vacuum-ing that needs
to be considered as well.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: HOT for PostgreSQL 8.3

From
Tom Lane
Date:
Hannu Krosing <hannu@skype.net> writes:
> Ühel kenal päeval, T, 2007-02-13 kell 09:38, kirjutas Tom Lane:
>> It's all moot anyway since 8 bits isn't enough for a pointer ...

> With 8k pages and MAXALIGN=8 we just barely can, as with current page
> structure (tuple headers together with data) the minimal tuple size for
> that case is 24 for header + 8 for data = 32 bytes which means 256
> tuples per page (minus page header) and so we can store tuple number in
> 8 bits.

But neither of those assumptions is acceptable --- I would think in fact
that people would become *more* interested in having large pages with
HOT, because it would give more room for updates-on-the-same-page.
Besides, you forgot the case of an empty tuple (<= 8 columns, all NULL).

> OTOH, for same page HOT tuples, we have the command and trx ids stored
> twice first as cmax,xmax of the old tuple and as cmin,xmin of the
> updated tuple. One of these could probably be used for in-page HOT tuple
> pointer.

This proposal seems awfully fragile, because the existing
tuple-chain-following logic *depends for correctness* on comparing each
tuple's xmin to prior xmax.  I don't think you can just wave your hands
and say we don't need that cross-check.  Furthermore it seems to me you
haven't fixed the problem, which is that you can't remove the chain
member that is being pointed at by off-page links (either index entries
or a previous generation of the same tuple).  As described, you've made
that problem worse because you're trying to say we don't know which of
the chain entries is pointed at.
        regards, tom lane


Re: HOT for PostgreSQL 8.3

From
Hannu Krosing
Date:
Ühel kenal päeval, K, 2007-02-14 kell 10:41, kirjutas Tom Lane:
> Hannu Krosing <hannu@skype.net> writes:
> > OTOH, for same page HOT tuples, we have the command and trx ids stored
> > twice first as cmax,xmax of the old tuple and as cmin,xmin of the
> > updated tuple. One of these could probably be used for in-page HOT tuple
> > pointer.
> 
> This proposal seems awfully fragile, because the existing
> tuple-chain-following logic *depends for correctness* on comparing each
> tuple's xmin to prior xmax.  

What kinds of correctnes guarantees does this give for same-page tuples?

The comparing of each tuple's xmin to prior xmax should stay for
inter-page ctid links.

Mostly you can think of the same-page HOT chain as one extended tuple
when looking at it from outside of that page.

> I don't think you can just wave your hands and say we don't need that cross-check.  

> Furthermore it seems to me you
> haven't fixed the problem, which is that you can't remove the chain
> member that is being pointed at by off-page links (either index entries
> or a previous generation of the same tuple).  

You can't remove any tuples before they are invisible for all
transactions (i.e. dead). And being dead implies that all previous
versions are dead as well. So if I can remove a tuple, I can also remove
all its previous versions as well. Or are you trying to say that VACUUM
follows ctid links of dead tuples for some purpose ?

The problem I am trying to fix is reusing in-page space without need to
touch indexes.

> As described, you've made
> that problem worse because you're trying to say we don't know which of
> the chain entries is pointed at.

There should be a flag, say HOT_CHAIN_ENTRY for the tuple the index(es)
point at. And this should be the preferred CTID for inserting new
versions once the old one is dead.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: HOT for PostgreSQL 8.3

From
"Pavan Deolasee"
Date:

On 2/16/07, Hannu Krosing <hannu@skype.net> wrote:
Ühel kenal päeval, K, 2007-02-14 kell 10:41, kirjutas Tom Lane:
> Hannu Krosing <hannu@skype.net> writes:
> > OTOH, for same page HOT tuples, we have the command and trx ids stored
> > twice first as cmax,xmax of the old tuple and as cmin,xmin of the
> > updated tuple. One of these could probably be used for in-page HOT tuple
> > pointer.
>
> This proposal seems awfully fragile, because the existing
> tuple-chain-following logic *depends for correctness* on comparing each
> tuple's xmin to prior xmax.

What kinds of correctnes guarantees does this give for same-page tuples?

I agree with Tom that xmin/xmax check does help to guarantee correctness.
I myself have used it often during HOT development to find/fix bugs. But
ISTM that we don't need atleast for in-page tuple chain, if we are
careful. So if removing this buys us something important, I am all for it.

The comparing of each tuple's xmin to prior xmax should stay for
inter-page ctid links.

Agree.

Mostly you can think of the same-page HOT chain as one extended tuple
when looking at it from outside of that page.

> I don't think you can just wave your hands and say we don't need that cross-check.

> Furthermore it seems to me you
> haven't fixed the problem, which is that you can't remove the chain
> member that is being pointed at by off-page links (either index entries
> or a previous generation of the same tuple).

You can't remove any tuples before they are invisible for all
transactions (i.e. dead). And being dead implies that all previous
versions are dead as well. So if I can remove a tuple, I can also remove
all its previous versions as well. Or are you trying to say that VACUUM
follows ctid links of dead tuples for some purpose ?

The only exception to this would be the case of aborted updates. In that
case a tuple is dead, but the one pointing to it is still live. But I don't see
any reason somebody would want to follow a chain past a live tuple. Not
sure about the VACUUM FULL code path though. Thats the only
place other than EvalPlanQual where we follow ctid chain.
 

The problem I am trying to fix is reusing in-page space without need to
touch indexes.

Can we do some kind of indirection from the root line pointer ? Haven't
completely thought through yet, but the basic idea is to release the actual
space consumed by the root tuple once it becomes dead, but store the
offnum of the new root in the line pointer of the original root tuple. We
may need to flag the line pointer for that, but if I am not wrong, LP_DELETE
is not used for heap tuples.

We would waste  4 bytes of line pointer until the tuple is COLD updated and
the entire chain and the associated index entry is removed.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: HOT for PostgreSQL 8.3

From
"Zeugswetter Andreas ADI SD"
Date:
> > As described, you've made
> > that problem worse because you're trying to say we don't know which
of
> > the chain entries is pointed at.
>
> There should be a flag, say HOT_CHAIN_ENTRY for the tuple the

it's called HEAP_UPDATE_ROOT

> index(es) point at. And this should be the preferred CTID for
> inserting new versions once the old one is dead.

This is not possible, see my reply to Bruce (maybe unless the whole hot
chain is dead).
(because that would need a back pointer, so readers arriving at the root
find the visible tuple)

Andreas


Re: HOT for PostgreSQL 8.3

From
"Pavan Deolasee"
Date:

On 2/16/07, Zeugswetter Andreas ADI SD <ZeugswetterA@spardat.at> wrote:

> > As described, you've made
> > that problem worse because you're trying to say we don't know which
of
> > the chain entries is pointed at.
>
> There should be a flag, say HOT_CHAIN_ENTRY for the tuple the

it's called HEAP_UPDATE_ROOT


Just to avoid any confusion with the patch I sent out this week, we are
setting HEAP_UPDATE_ROOT on all tuples which are HOT-updated.

We set HEAP_ONLY_TUPLE for all tuples which does not have index
reference. So may be combination of (HEAP_UPDATE_ROOT & ~HEAP_ONLY_TUPLE)
can be used to identify index referred tuple in a HOT-update chain.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: HOT for PostgreSQL 8.3

From
"Zeugswetter Andreas ADI SD"
Date:
> Just to avoid any confusion with the patch I sent out this
> week, we are setting HEAP_UPDATE_ROOT on all tuples which are
> HOT-updated.
>
> We set HEAP_ONLY_TUPLE for all tuples which does not have
> index reference. So may be combination of (HEAP_UPDATE_ROOT &
> ~HEAP_ONLY_TUPLE) can be used to identify index referred
> tuple in a HOT-update chain.

Oh sorry. Thanks for the clarification. Imho HEAP_UPDATE_ROOT should be
renamed for this meaning then (or what does ROOT mean here ?).
Maybe HEAP_UPDATE_CHAIN ?

Andreas


Re: HOT for PostgreSQL 8.3

From
"Pavan Deolasee"
Date:

On 2/16/07, Zeugswetter Andreas ADI SD <ZeugswetterA@spardat.at> wrote:

Oh sorry. Thanks for the clarification. Imho HEAP_UPDATE_ROOT should be
renamed for this meaning then (or what does ROOT mean here ?).
Maybe HEAP_UPDATE_CHAIN ?

Yes, you are right. There is some disconnect between what Simon had
originally posted and the patch I sent out. Hopefully as we discuss HOT
more here, everything will converge.

Thanks,
Pavan
 
--

EnterpriseDB     http://www.enterprisedb.com

Re: HOT for PostgreSQL 8.3

From
"Simon Riggs"
Date:
On Mon, 2007-02-12 at 09:24 +0530, Pavan Deolasee wrote:

> On 2/12/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
>         Hannu Krosing wrote:
>         > Ühel kenal päeval, P, 2007-02-11 kell 12:35, kirjutas Tom
>         Lane:
>         >> Hannu Krosing <hannu@skype.net> writes:
>         >>> What if we would just reuse the root tuple directly
>         instead of turning
>         >>> it into a stub ?
>         >>> This would create a cycle of ctid pointers, which changes
>         the lookup
>         >>> process from 'follow ctid chaint until the end' to 'follow
>         the tid chain
>         >>> until you reach the start'.
>         >> How do you know which one is newest?
>         >
>         > By xmin,cmin of course .
>         >
>         >> What happens when you have to put a newer version off-page
>         for lack of space?
>         >
>         > Then this scheme won't work.
>
>         Couldn't the ctid of the latest tuple point to the off-page
>         tuple as usual?
>
> It could. But then you may lose reference for older version(s). We can
> do
> the whole page scans to locate older versions, but that might prove
> too
> costly
>
>
>         It might be acceptable, if it was only stored on those tuples
>         that are
>         (HOT) updated. But it's not clear to me what you're proposing
>         to do with
>         the field, anyway, Which tuples would have it, and what would
>         it point to?
>
> My guess what Hannu is suggesting is to have a  circular chain of
> HOT-updated
> tuples in a page. The index can point to any tuple in the chain. When
> update goes off-page or is a COLD update, t_ctid points to the newer
> version
> as usual. So a tuple which is COLD updated would need two pointers,
> one
> which points to the oldest version in the circular on-page chain and
> other which
> points to the new version.

I very much like Hannu's idea, but it does present some issues.

My feeling is we need to get some clarity on whether to go this route,
or not, because it takes us away from the pointer-swinging idea. To
maximise our chances of a something good for 8.3 we really need to pick
just one idea now. Maybe we've already done that, by default.

I've tried to analyse all aspects of the discussion to see if we can get
something out of this:

Hannu has described the possibility of having the index point to a tuple
version which isn't the earliest one on the block. That raises the issue
of how will we locate earlier tuple versions when the current Snapshot
cannot see the root tuple?

First off, we only need to locate the earlier tuple when the root was
re-placed by a HOT update - we can tell that, so we're good so far.

Second, we would like to be able to validate the xmax == xmin rule. If
the original root tuple was dead, so will the prior versions that are
off-block, so nobody will ever follow the chain to the root tuple again
as part of EvalPlanQual. So we need to check newroot.xmax ==
nextinchain.xmin only.

Third we need to locate the appropriate tuple version.

We have a number of approaches:

a) Block Seq Scan: using a scan to get there is possible, but not good.
The best mechanism is to scan the block backwards (i.e. highest item to
lowest item) picking up the tuple which points to the root, then
scanning backwards picking up the parts of the chain as they are seen
until we get to a visible tuple that we can validate as part of the
chain. If we scanned forwards we'd need to remember the whole chain as
we went, which would be less useful. *But* this doesn't allow us to
validate the xmax == xmin rule, so that kills this sub-option, IMHO.

b) Additional tuple fields

We could have additional fields on the root tuple to help us locate the
"true root" or oldest version of this tuple on the block. Those
additional header fields are only required on the root tuple - no other
tuple needs this info. So we need not complain about space
savings/losses - the comparison is between extra tuple headers and a
TupleStub, which is a full header width (24 bytes). This would need to
include a copy of the xmin of the original root tuple, to allow us to
validate the xmax == xmin rule (4 bytes). It would also need to include
an in-page pointer to the true root (2 bytes). So additional tuple
fields of 6 bytes would be added to the root tuple, probably maxaligning
to 8 more bytes than normal - a saving of 16 bytes in comparison with
the TupleStub approach. Since there is no separate TupleStub, we can
replace the root when required, not just at full table VACUUM time. The
presence, or not, of those fields can be marked using various info bits.
The NULL bitmap is effectively an optional header field, so the idea
seems workable - or at least as workable as pointer swinging.

The in-page pointer is only ever followed as a result of an index scan.
In all those cases we don't record the xmin,  so we don't check it - we
only need to store the xmin of the tuple pointed to by the in-page
pointer. There may be some off-block tuple versions that point at the
new root, but those links will never be followed because they are
by-definition dead rows.

I didn't start this analysis with any preconceived outcome, but ISTM now
that Hannu's idea can be made to work robustly, whilst avoiding the need
for pointer swinging, which thus allows retail VACUUMs, *and* saving
space on-block overall. What it requires is two optional fields on any
root tuple that has been replaced following a HOT update. If that is
acceptable, then we have a better design as a result of Hannu's
suggestion.

Comments?

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT for PostgreSQL 8.3

From
"Pavan Deolasee"
Date:

On 2/21/07, Simon Riggs <simon@2ndquadrant.com> wrote:

I very much like Hannu's idea, but it does present some issues.


I too liked Hannu's idea initially, but Tom raised a valid concern
that it does not address the basic issue of root tuples. According
to the idea, a DEAD root tuple can be used for a subsequent
update of the same row. For a very large table, even if its updated
frequently, it is not unlikely that the same row might not be updated
for a long time. Even when the update happens we would be
constrained by the length of the new version being same or less
than the root tuple OR ability to perform retail-vacuum of the block.

Did you or anybody else got a chance to think about the other idea
I proposed of having indirection from the root line pointer ? As I
mentioned earlier, I myself haven't thought through it completely,
but at the face of it, it looks doable. It would add a four-byte
overhead per live tuple-chain, but IMHO would be much simpler
to implement and not too invasive.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: HOT for PostgreSQL 8.3

From
"Simon Riggs"
Date:
On Thu, 2007-02-22 at 00:00 +0530, Pavan Deolasee wrote:
> 
> On 2/21/07, Simon Riggs <simon@2ndquadrant.com> wrote:
>         
>         I very much like Hannu's idea, but it does present some
>         issues.
>         
> 
> I too liked Hannu's idea initially, but Tom raised a valid concern
> that it does not address the basic issue of root tuples. According 
> to the idea, a DEAD root tuple can be used for a subsequent
> update of the same row. For a very large table, even if its updated
> frequently, it is not unlikely that the same row might not be updated
> for a long time. 

That's a fair point and pointer swinging would address that.

However, the space wastage is identical to the current situation. We
need to choose which use-case we are optimising for: the case you
mention would be optimising for high volume but very thinly spread
updates. Personally, I don't see that as the most important use case out
of the many possibilities. The problem we are trying to address is rows
that *are* very frequently updated. There are so many sub-cases its easy
to get distracted about which ones are actually causing usage problems
right now.

Anyway I take the point that pointer swinging is long term necessary,
but my question is whether we need this yet.

> Even when the update happens we would be 
> constrained by the length of the new version being same or less
> than the root tuple OR ability to perform retail-vacuum of the block.

Well, thats a more important question: surely we have agreed that retail
VACUUM is both possible and beneficial?

> Did you or anybody else got a chance to think about the other idea 
> I proposed of having indirection from the root line pointer ? 

Well yes, I saw that, but I was pointing out that we don't need to use
just a single byte if we have a part of the tuple header that only
exists in these circumstances.

> As I
> mentioned earlier, I myself haven't thought through it completely,
> but at the face of it, it looks doable. It would add a four-byte
> overhead per live tuple-chain, but IMHO would be much simpler 
> to implement and not too invasive.

Cool.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: HOT for PostgreSQL 8.3

From
"Zeugswetter Andreas ADI SD"
Date:
> > I very much like Hannu's idea, but it does present some issues.
> >
> >
> I too liked Hannu's idea initially, but Tom raised a valid
> concern that it does not address the basic issue of root
> tuples. According to the idea, a DEAD root tuple can be used
> for a subsequent update of the same row.

If you are reusing the existing slot of a root tuple how will that
slot likely have room for an extra pointer and a live tuple ?
If the idea does not cover root reuse we don't need pointers.

Imho we should follow the swing idea. It should even be possible
to point the root to the newest dead tuple during update (if it
was index path), and reuse an older dead slot from the chain.
Then we can limit the chain to number of potentially visible
tuples + root + 2 without vacuum.

Andreas


Re: HOT for PostgreSQL 8.3

From
"Pavan Deolasee"
Date:

On 2/22/07, Zeugswetter Andreas ADI SD <ZeugswetterA@spardat.at> wrote:

> > I very much like Hannu's idea, but it does present some issues.
> >
> >
> I too liked Hannu's idea initially, but Tom raised a valid
> concern that it does not address the basic issue of root
> tuples. According to the idea, a DEAD root tuple can be used
> for a subsequent update of the same row.

If you are reusing the existing slot of a root tuple how will that
slot likely have room for an extra pointer and a live tuple ?
If the idea does not cover root reuse we don't need pointers.

Hannu talked about using one of xmin/xmax for storing
back-pointers. There were objections to that since it breaks
the xmax/xmin matching robustness that we have today.
 
Imho we should follow the swing idea.

Yes, thats one option. Though given a choice I would waste
four bytes in the heap-page than inserting a new index entry.
The heap tuples can be vacuumed rather easily than the index
entries which, if I am mistaken, can not be reused even after
marked LP_DELETEd.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: HOT for PostgreSQL 8.3

From
"Zeugswetter Andreas ADI SD"
Date:
> > Imho we should follow the swing idea.
>
>
> Yes, thats one option. Though given a choice I would waste
> four bytes in the heap-page than inserting a new index entry.

No question about that. My point was, that it would mean wasting
the 2 (2 must be enough for a slot pointer) bytes on every heap tuple,
hot or not. And then the decision is not so obvious anymore.
If you don't have the room for the back pointer on every slot,
there is no room to add one later.

Andreas


Re: HOT for PostgreSQL 8.3

From
"Pavan Deolasee"
Date:

On 2/22/07, Zeugswetter Andreas ADI SD <ZeugswetterA@spardat.at> wrote:
>
> Yes, thats one option. Though given a choice I would waste
> four bytes in the heap-page than inserting a new index entry.

No question about that. My point was, that it would mean wasting
the 2 (2 must be enough for a slot pointer) bytes on every heap tuple,
hot or not. And then the decision is not so obvious anymore.
If you don't have the room for the back pointer on every slot,
there is no room to add one later.


Oh yes, I agree. I was referring to the idea of line pointer redirection
which would waste four bytes (for the root line pointer)  per
hot-update chain. That occurs only when a tuple is hot-updated.
So there is no overhead for normal tuples. Also, since its outside
the tuple header, we don't have issues of additional space wastage
because of alignment.

We would need to teach the code to ignore all such pointers which
don't point to a real tuple, but only redirects us to another line pointer.
We arrive at this line pointer from the index and then get redirected
to another line pointer on the same page. Vacuum would need to
delay freeing this line pointer until the hot-update chain is dead.

I am waiting for feedback on this since I would like to work on
this next. Anybody sees any issue with this approach ? Comments
please.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: HOT for PostgreSQL 8.3

From
"Zeugswetter Andreas ADI SD"
Date:
> > > Yes, thats one option. Though given a choice I would waste four
> > > bytes in the heap-page than inserting a new index entry.
> >
> > No question about that. My point was, that it would mean wasting the
2
> > (2 must be enough for a slot pointer) bytes on every heap tuple, hot

> > or not. And then the decision is not so obvious anymore.
> > If you don't have the room for the back pointer on every slot, there

> > is no room to add one later.
> >
> >
> Oh yes, I agree. I was referring to the idea of line pointer
> redirection which would waste four bytes (for the root line
> pointer)  per hot-update chain. That occurs only when a tuple
> is hot-updated.
> So there is no overhead for normal tuples.

I think you are still misunderstanding me, sorry if I am not beeing
clear
enough. When the row is hot-updated it is too late. You do not have room

in the root for the line pointer.

Say a table's tuple size is typically 40 bytes. Your root slot has room
for exactly 40 bytes. When the new tuple also (as expected) needs 40
bytes
there is no room for the line pointer.

Or are you suggesting, that vacuum resizes all root tuples to make room
for
the extra bytes, and only then you can use it ? Imho that would not be
good.

Imho the "relink root to newest dead tuple during update" path is more
promising, and does not depend on vacuum.

If we do reuse dead tuples without vacuum we should probably, as already
suggested,
disconnect the "what is dead enough for reuse/vacuum" from global xmin
right from the
start.

Andreas



Re: HOT for PostgreSQL 8.3

From
"Pavan Deolasee"
Date:

On 2/22/07, Zeugswetter Andreas ADI SD <ZeugswetterA@spardat.at> wrote:


I think you are still misunderstanding me, sorry if I am not beeing
clear
enough. When the row is hot-updated it is too late. You do not have room

in the root for the line pointer.

I think the word "line pointer" is causing some confusion here. Let me
explain the idea again: Each page has a set of line pointers OR item-ids
as they are referred in the code (I shall use the word item-id here after).
The item-id stores the offset(15 bits), length (15 bits) and two
flags, LP_USED and LP_DELETE (2 bits).

The root tuple is pointed to by some item-id (say, I1) on the page. When
the tuple is hot updated, a heap-only tuple is added which is linked to
the root tuple by its t_ctid and is pointed by another item-id I2 on
the page.

Now, when the root tuple subsequently becomes DEAD, I am proposing
to store I2 in the offset field of I1. The length field in I1 can be set to
zero (or some other special value) to mark that I1 is now just a
redirection to I2 and does not point to any real tuple, dead or live.
The actual storage used by the root tuple is now released (or we can
add another item pointer to keep track of it so that it can be reused
without vacuum of the page. But lets defer that for the time being).

In short, the pointer is NOT stored in the tuple header, but in the
item-id.

If we do reuse dead tuples without vacuum we should probably, as already
suggested,
disconnect the "what is dead enough for reuse/vacuum" from global xmin
right from the
start.

I did not get that completely. Can you elaborate on that ?

Once we find a heap-only DEAD tuple, we remove it from the
tuple-chain (and thus remove its only reference) and set LP_DELETE
on its item-id. When we run out of free-space in a page, we search for
an item-id whose LP_DELETE is set and whose length is atleast equal
to the new tuple length. We reuse that item-id and the associated
storage for storing the new tuple.
 
I hope things would be more clear once I post next version of the
HOT-patch. But I really appreciate these comments.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: HOT for PostgreSQL 8.3

From
"Zeugswetter Andreas ADI SD"
Date:
> I think the word "line pointer" is causing some confusion
> here. Let me explain the idea again: Each page has a set of
> line pointers OR item-ids as they are referred in the code (I
> shall use the word item-id here after).
> The item-id stores the offset(15 bits), length (15 bits) and
> two flags, LP_USED and LP_DELETE (2 bits).
>
> The root tuple is pointed to by some item-id (say, I1) on the
> page. When the tuple is hot updated, a heap-only tuple is
> added which is linked to the root tuple by its t_ctid and is
> pointed by another item-id I2 on the page.
>
> Now, when the root tuple subsequently becomes DEAD, I am
> proposing to store I2 in the offset field of I1. The length
> field in I1 can be set to zero (or some other special value)
> to mark that I1 is now just a redirection to I2 and does not
> point to any real tuple, dead or live.

Oh, thanks for explaining. I was really misunderstanding.
Is that possible ? I thought item-id's (slots) need to be in physical
order.

> If we do reuse dead tuples without vacuum we should probably,
> as already
> > suggested,
> > disconnect the "what is dead enough for reuse/vacuum" from
> global xmin
> > right from the start.
>
>
> I did not get that completely. Can you elaborate on that ?
>
> Once we find a heap-only DEAD tuple, we remove it from the
> tuple-chain (and thus remove its only reference) and set LP_DELETE
> on its item-id. When we run out of free-space in a page, we search for
> an item-id whose LP_DELETE is set and whose length is atleast equal
> to the new tuple length. We reuse that item-id and the associated
> storage for storing the new tuple.

With the recent discussion about flashback (explicitly setting a
snapshot
to 5 min ago) and forensics, I think we should have a way to delay
what is considered "DEAD and available for immediate reuse".
I know this can reduce the efficiency of HOT when delayed too far,
but I think we should have that possibility if we reuse without vacuum.

Andreas