Thread: HOT for PostgreSQL 8.3
Heap Only Tuples ("HOT") is a simplification of earlier proposals for improving the way the server handles frequent updates, based upon what's been learned and the feedback received.

Heap Only Tuples
----------------

The basic idea is that when a tuple is UPDATEd we can, in certain circumstances, avoid inserting index tuples for the new tuple version. Such tuples are marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to other tuples. The pre-conditions for allowing a HOT UPDATE are:

- the UPDATE doesn't change any indexed columns
- there is space on the same block as the tuple being updated

There is no restriction on tuple length changes, nor any requirement for an additional field in the tuple header; as a result this change does not require activation by an additional WITH parameter, and the technique can be used on *all* tables. HOT will, in some cases, perform better in conjunction with the fillfactor storage parameter. For smaller tables this will seldom be required, so database tuning will not increase in complexity (in comparison with the carefully planned VACUUM strategies of earlier releases). In many cases the update rate will reach a steady state, with on-block space being reused cyclically.

At the same time we insert the HEAP_ONLY_TUPLE, the just-updated tuple will be marked HEAP_UPDATE_ROOT. When we use an index to locate a heap tuple, we start from this root tuple and hop forwards along the ctid chain until we find the appropriate tuple.

CREATE INDEX requires some careful work to allow it to identify and correctly index HEAP_ONLY_TUPLEs that need to become ~HEAP_ONLY_TUPLE as a result of the new index. This will cause additional work for those cases; CREATE INDEX on a newly loaded table will be completely unaffected. There is some complexity there, though we don't go into detail on those issues here. Please read on!
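[The two pre-conditions above can be sketched as a simple check. This is a minimal illustration only: all names here (ColumnChange, hot_update_allowed, etc.) are invented for the sketch, not the actual server functions.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int   attnum;        /* column number */
    bool  changed;       /* did the UPDATE modify this column? */
} ColumnChange;

/* true if none of the changed columns is covered by any index */
static bool no_indexed_column_changed(const ColumnChange *cols, size_t ncols,
                                      const int *indexed_attnums, size_t nindexed)
{
    for (size_t i = 0; i < ncols; i++) {
        if (!cols[i].changed)
            continue;
        for (size_t j = 0; j < nindexed; j++)
            if (cols[i].attnum == indexed_attnums[j])
                return false;   /* an indexed column changed: no HOT */
    }
    return true;
}

/* The two pre-conditions: indexed columns untouched, and the new
 * tuple version fits on the same block as the old one. */
static bool hot_update_allowed(const ColumnChange *cols, size_t ncols,
                               const int *indexed_attnums, size_t nindexed,
                               size_t free_space_on_block, size_t new_tuple_len)
{
    return no_indexed_column_changed(cols, ncols, indexed_attnums, nindexed)
        && new_tuple_len <= free_space_on_block;
}
```

[Note that failing either condition simply means falling back to a normal UPDATE with index inserts, which is why no WITH parameter is needed.]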
To allow HOT to work effectively we need to consider how we will VACUUM, noting that in many cases we can remove HEAP_ONLY_TUPLEs much more easily because they have no index tuples referencing them. There are various options at this stage, but for clarity only one of them is presented here.

When we try to UPDATE a tuple and the new tuple version doesn't fit on the block, we get the BufferCleanupLock if possible and then perform a single-block VACUUM. Any tuple that is both HEAP_DEAD & HEAP_ONLY_TUPLE can be removed completely. This is done by changing the t_ctid field so that it points at the first visible-to-someone tuple in the chain, so it points "over" the previous HOT tuples. The root tuple is also dead - it cannot be removed completely, so it is replaced with "just a TupleHeader", which is referred to as a TupleStub. (Credit to Itagaki for this concept.)

e.g.
  t1 (t_ctid: t2  ) - info HEAP_UPDATE_ROOT   status HEAPTUPLE_DEAD
  t2 (t_ctid: t3  ) - info HEAP_ONLY          status HEAPTUPLE_DEAD
  t3 (t_ctid: self) - info HEAP_ONLY          status HEAPTUPLE_LIVE

after single-page VACUUM

  t1 (t_ctid: t3  ) - info HEAP_UPDATE_ROOT & HEAP_TUPLE_STUB
                    - status HEAPTUPLE_RECENTLY_DEAD
                    - t1 is now a TupleStub only
  t3 (t_ctid: self) - info HEAP_ONLY          status HEAPTUPLE_LIVE

Status shown is the return value from HeapTupleSatisfiesVacuum().

The single-block VACUUM would alter *all* tuple chains on the block, not just the one for the tuple currently being UPDATEd.

This technique means that a tuple never changes its CTID, so everything that currently uses CTID can continue normally. SeqScan would also work identically to the way it works today. It also means that we can't easily remove the root tuple, even if it is now just a TupleStub (unless the whole chain is also removable because of DELETE). Removing the root tuple will require a VACUUM *FULL*. Even so, this option is still space-neutral in the worst case, in comparison with inserting index tuples.
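[The single-page pruning step in the t1/t2/t3 example can be modelled as below. This is a toy sketch, not the patch code: the Tuple struct, flag values and prune_chain() are invented for illustration, with tuples represented as array slots and t_ctid as an array index.]

```c
#include <assert.h>
#include <stdbool.h>

enum { HEAP_UPDATE_ROOT = 1, HEAP_ONLY = 2, HEAP_TUPLE_STUB = 4 };
enum { TUPLE_DEAD, TUPLE_RECENTLY_DEAD, TUPLE_LIVE };

typedef struct {
    int  ctid;     /* index of next tuple in chain; == self means chain end */
    int  info;     /* flag bits */
    int  status;   /* what HeapTupleSatisfiesVacuum() would report */
    bool removed;  /* storage reclaimed by the single-page vacuum */
} Tuple;

/* Prune one chain starting at 'root': reclaim dead heap-only tuples,
 * reduce a dead root to a stub pointing over them.
 * Returns index of the first surviving tuple. */
static int prune_chain(Tuple *t, int root)
{
    /* walk forward past dead heap-only tuples, reclaiming them */
    int next = t[root].ctid;
    while (next != t[next].ctid && t[next].status == TUPLE_DEAD
           && (t[next].info & HEAP_ONLY)) {
        int after = t[next].ctid;
        t[next].removed = true;   /* both DEAD and HEAP_ONLY: fully removable */
        next = after;
    }
    /* a dead root can't be removed (indexes point at it): make it a stub */
    if (t[root].status == TUPLE_DEAD) {
        t[root].ctid = next;
        t[root].info |= HEAP_TUPLE_STUB;
        t[root].status = TUPLE_RECENTLY_DEAD;
    }
    return next;
}
```

[Running this on the example chain redirects t1 straight to t3 and reclaims t2, matching the before/after diagram above.]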
When we perform the single-block VACUUM we don't change the FSM, nor do we try to check/increment the table's freezelimit.

HOT would slightly alter the way that UPDATEs are signalled to stats, so that HOT UPDATEs don't count towards the autovacuum threshold - so a frequently HOT-updated table may only very seldom require a normal VACUUM. The number of itempointers would increase in many cases, though this would still be limited by the current maximums.

Various tweaks on this basic idea exist, which can be considered in more detail if the basic concept is accepted.

- - -

This design is aimed at being a no-frills version of the code that has already been written. The existing version is available for testing now and will be made available on community.enterprisedb.com shortly.

Four PostgreSQL developers have various amounts of time to contribute to developing the above solution and customising it further according to the wishes of the Community: myself, Heikki Linnakangas, Pavan Deolasee and Nikhil Sontakke. Taken together, it seems possible to craft something that can be acceptable for PostgreSQL 8.3.

The plan from here is to publish the WIP patch(es) weekly until 8.3 code freeze, together with any performance results.

Your comments are welcome.

-- 
Simon Riggs
EnterpriseDB  http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes:
> The basic idea is that when a tuple is UPDATEd we can, in certain
> circumstances, avoid inserting index tuples for a tuple. Such tuples are
> marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
> other tuples.

What is VACUUM FULL going to do when it wants to move one of these things?

> CREATE INDEX requires some careful work to allow it to identify and
> correctly index HEAP_ONLY_TUPLEs that need to become ~HEAP_ONLY_TUPLE as
> a result of the new index.

I think you've glossed over the CREATE INDEX problem much too easily. The difficulty with that is going to be that de-HOT-ifying a tuple is going to require multiple updates that can't possibly be put into a single WAL record, and I don't think that WAL replay can clean up after an incomplete update (since it can't run user-defined functions and hence cannot be expected to compute index entries for itself). So I don't think you can do that while preserving crash safety.

> Removing the root tuple will require a VACUUM *FULL*.

That seems unacceptable ... it won't take too long for your table to fill up with stubs, and we don't want to return to the bad old days when periodic VACUUM FULL was unavoidable.

ISTM we could fix that by extending the index VACUUM interface to include two concepts: aside from "remove these TIDs when you find them", there could be "replace these TIDs with those TIDs when you find them". This would allow pointer-swinging to one of the child tuples, after which the old root could be removed. This has got the same atomicity problem as for CREATE INDEX, because it's the same thing: you're de-HOT-ifying the child. So if you can solve the former, I think you can make this work too.

			regards, tom lane
On Wed, 2007-02-07 at 14:17 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > The basic idea is that when a tuple is UPDATEd we can, in certain
> > circumstances, avoid inserting index tuples for a tuple. Such tuples are
> > marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
> > other tuples.
>
> What is VACUUM FULL going to do when it wants to move one of these things?

This question stands out from the others. I'm not sure which aspect you're thinking of - do you see some failure cases?

-- 
Simon Riggs
EnterpriseDB  http://www.enterprisedb.com
Simon Riggs wrote:
> Heap Only Tuples ("HOT") is a simplification of earlier proposals for
> improving the way the server handles frequent updates, based upon what's
> been learned and feedback received.
>
> Heap Only Tuples
> ----------------
>
> The basic idea is that when a tuple is UPDATEd we can, in certain
> circumstances, avoid inserting index tuples for a tuple. Such tuples are
> marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
> other tuples. The pre-conditions for allowing a HOT UPDATE are
> - UPDATE doesn't change any indexed columns

Uhmmm... how often is that the case? Don't get me wrong, bravo, but that seems a rather large limitation.

Considering it, this would certainly be a boon in web space where you have things like Rails doing:

  UPDATE foo SET first_name = 'Barney' WHERE id = 1;

That would qualify, correct?

Sincerely,

Joshua D. Drake

-- 
      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
On 2/7/07, Joshua D. Drake <jd@commandprompt.com> wrote:
> Simon Riggs wrote:
> > Heap Only Tuples ("HOT") is a simplification of earlier proposals for
> > improving the way the server handles frequent updates, based upon what's
> > been learned and feedback received.
>
> Uhmmm... how often is that the case? Don't get me wrong, bravo but that
> seems a rather large limitation.
>
> Considering it, this would certainly be a boon in web space where you
> have things like Rails doing:

HOT is great for tables that are updated frequently via triggers or cron, right? So does this eliminate the need to vacuum foo after executing:

  update foo set v = 1 where id = 1;

where v is not indexed, right? If so, it would be great for all kinds of things, especially materialization techniques, or any situation where a table is updated so frequently that autovac can't keep up. I can think of tons of places where this would be useful.

merlin
Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> The basic idea is that when a tuple is UPDATEd we can, in certain
>> circumstances, avoid inserting index tuples for a tuple. Such tuples are
>> marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
>> other tuples.
>
> What is VACUUM FULL going to do when it wants to move one of these things?

I suppose it could move the whole chain to the same target page, if there is one with enough space to accommodate the whole chain.

Or we could just cop out and not move tuples marked with HEAP_ONLY_TUPLE. I think that would be acceptable; after the last transaction that can see the old tuple is finished, the old tuple is dead. After that, VACUUM FULL can remove the old tuple and move the remaining tuple as usual.

>> CREATE INDEX requires some careful work to allow it to identify and
>> correctly index HEAP_ONLY_TUPLEs that need to become ~HEAP_ONLY_TUPLE as
>> a result of the new index.
>
> I think you've glossed over the CREATE INDEX problem much too easily.
> The difficulty with that is going to be that de-HOT-ifying a tuple
> is going to require multiple updates that can't possibly be put into
> a single WAL record, and I don't think that WAL replay can clean up
> after an incomplete update (since it can't run user-defined functions
> and hence cannot be expected to compute index entries for itself).
> So I don't think you can do that while preserving crash safety.

Yeah, chilling tuples from HOT state to normal tuples is not easy. One solution I thought of is to add another flag to heap tuples, CHILL_IN_PROGRESS. To chill a tuple, you would:

1. Mark the heap tuple with CHILL_IN_PROGRESS
2. Insert the missing index entries
3. Clear the CHILL_IN_PROGRESS and HEAP_ONLY_TUPLE flags

Index scans would ignore tuples that have CHILL_IN_PROGRESS set and are directly pointed to from the index. That way, if we crash in the middle of step 2, scans and updates would work normally after replay, as if the index entries weren't there.
CREATE INDEX would have to fail if there are any CHILL_IN_PROGRESS tuples, because we wouldn't know which index entries need to be inserted; some might already be there. To clear the CHILL_IN_PROGRESS flag, a vacuum would be needed. Vacuum would remove all index entries for those tuples, but instead of removing the heap tuple in the 2nd scan it would just clear the CHILL_IN_PROGRESS flag, bringing us back to where we started.

However, the easiest solution would be to make CREATE INDEX wait until the old tuple is dead. That should be ok at least for concurrent CREATE INDEX, because it already has that kind of a wait between the 1st and 2nd phase.

>> Removing the root tuple will require a VACUUM *FULL*.
>
> That seems unacceptable ... it won't take too long for your table to
> fill up with stubs, and we don't want to return to the bad old days
> when periodic VACUUM FULL was unavoidable.

Agreed.

-- 
Heikki Linnakangas
EnterpriseDB  http://www.enterprisedb.com
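[The three-step chilling protocol above can be sketched as follows. The flag names mirror the discussion; everything else (the struct, the index model) is invented for illustration. The point being demonstrated: until step 3 clears the flags, an index scan ignores direct index pointers to a CHILL_IN_PROGRESS tuple, so a crash between any of the steps leaves visible behavior unchanged.]

```c
#include <assert.h>
#include <stdbool.h>

enum { HEAP_ONLY_TUPLE = 1, CHILL_IN_PROGRESS = 2 };

typedef struct {
    int  flags;
    bool index_entry_present;  /* does the new index point directly at it? */
} HeapTuple;

/* Step 1: mark the tuple (WAL-logged in the real thing) */
static void chill_mark(HeapTuple *t) { t->flags |= CHILL_IN_PROGRESS; }

/* Step 2: insert the missing index entries */
static void chill_insert_index_entries(HeapTuple *t) { t->index_entry_present = true; }

/* Step 3: clear both flags; the tuple is now a normal, indexed tuple */
static void chill_finish(HeapTuple *t)
{
    t->flags &= ~(CHILL_IN_PROGRESS | HEAP_ONLY_TUPLE);
}

/* An index scan only trusts a direct index pointer once chilling is done */
static bool index_scan_returns(const HeapTuple *t)
{
    return t->index_entry_present && !(t->flags & CHILL_IN_PROGRESS);
}
```

[A crash after step 1 or step 2 is harmless under this scheme: the scan predicate treats the tuple exactly as if the new index entries did not exist.]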
On Wed, 2007-02-07 at 14:17 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > The basic idea is that when a tuple is UPDATEd we can, in certain
> > circumstances, avoid inserting index tuples for a tuple. Such tuples are
> > marked HEAP_ONLY_TUPLE, but their storage is otherwise identical to
> > other tuples.
>
> What is VACUUM FULL going to do when it wants to move one of these things?

In addition to the others suggested, one option is to rework VACUUM FULL:

The use case for VACUUM FULL is very low these days. It has an appallingly long execution time. This can be sped up by dropping and re-creating indexes, says the manual, but it is still lengthy. It is even faster to drop the indexes, do a CREATE TABLE AS SELECT * FROM table, drop the old table and then rebuild the indexes. When moving into the new relation, the space requirement is not double, since we only copy useful data/space and we also do this using the space the indexes occupied, so the actual space overhead isn't that high. VACUUM FULL also generates lots of WAL and ties up lots of memory while it operates - and it uses memory without any constraint.

The CTAS technique doesn't carry across all of the visible tuple chains, but to be brutally frank, by the time VACUUM FULL has actually finished executing it is very frequently the oldest transaction anyway, so we needn't really have gone to all the trouble of moving the tuple chains.

So the main use case for VACUUM FULL is when the space to be freed inside the table is low enough to make defragging the table quicker than a CTAS, yet still high enough that we were worried enough to do a VACUUM FULL. That's a very narrow use case, and if it exists at all there's a narrow time window associated with it - only a heavily updated/deleted table needs vacuuming anyway - which means we'd most often be performing the VF when we're already into the zone where the CTAS approach is quicker.
VACUUM FULL also forces us to handle various failure cases that leave half-moved tuples scattered across tables. (Incomplete VACUUM FULLs are actually fairly common because of its incredibly long run time.)

So the radical suggestion is to continue to support the VACUUM FULL command, but using a newly coded technique that is both higher performance and more robust. We can either make VACUUM FULL wait until it actually is the oldest transaction, or we can mark the table in some way so that a lock cannot be obtained upon it by an earlier Xid. We should be able to compact the files of a large table one physical file at a time, so the space overhead is only ever MAX_PHYSICAL_FILESIZE, and the space overhead may become a net space gain as the VF continues. This is of course (almost) identical to the approach already in use for CLUSTER, so it seems like that should be acceptable. As a result, it's really not that much code and can still be accomplished on time. Note also that we would not have to drop and re-add Foreign Keys, since nothing can have changed while we have the table locked.

Doing this also frees up two heap info bits and simplifies much of the HeapTupleSatisfies code, which could probably use the helping hand. Tuples moved to the new files would retain their info bit settings as they are copied across.

So overall, it seems a lot easier to completely replace VF than to fight through its complexities and failure cases. If we do the above, then we'll speed up VACUUM FULL and we'll be able to handle HOT tuples easily.

> > CREATE INDEX requires some careful work to allow it to identify and
> > correctly index HEAP_ONLY_TUPLEs that need to become ~HEAP_ONLY_TUPLE as
> > a result of the new index.
>
> I think you've glossed over the CREATE INDEX problem much too easily.
> The difficulty with that is going to be that de-HOT-ifying a tuple
> is going to require multiple updates that can't possibly be put into
> a single WAL record, and I don't think that WAL replay can clean up
> after an incomplete update (since it can't run user-defined functions
> and hence cannot be expected to compute index entries for itself).
> So I don't think you can do that while preserving crash safety.

No intention to gloss, just wanted to get past the first post and onto the really complex stuff. You're absolutely right that this is where much of the thinking/work needs to take place.

> > Removing the root tuple will require a VACUUM *FULL*.
>
> That seems unacceptable ... it won't take too long for your table to
> fill up with stubs, and we don't want to return to the bad old days
> when periodic VACUUM FULL was unavoidable.

Completely agree. I wanted to start right at the very beginning, so everybody would understand the issues, rather than jump straight in again with additional complexity.

> ISTM we could fix that by extending the index VACUUM interface to
> include two concepts: aside from "remove these TIDs when you find them",
> there could be "replace these TIDs with those TIDs when you find them".
> This would allow pointer-swinging to one of the child tuples, after
> which the old root could be removed. This has got the same atomicity
> problem as for CREATE INDEX, because it's the same thing: you're
> de-HOT-ifying the child. So if you can solve the former, I think you
> can make this work too.

This is looking like the best option out of the many, since it doesn't have any serious restrictions or penalties. Let's see what Pavan thinks, since he's been working on this aspect.

-- 
Simon Riggs
EnterpriseDB  http://www.enterprisedb.com
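[The rewrite-style VACUUM FULL proposed above - copy only useful data into fresh storage, CLUSTER-style, instead of moving tuples within the old file - can be modelled with a toy sketch. A "file" here is just an array of tuples; all names are illustrative, not the proposed code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  payload;   /* stand-in for the tuple's data */
    bool live;      /* stand-in for visibility/infomask state */
} Tup;

/* Copy only the live tuples from old_file into new_file, in order;
 * dead tuples (and stubs) are simply not copied, so the new file is
 * compact. In the real proposal, info bit settings would be carried
 * across unchanged. Returns the number of tuples copied. */
static size_t rewrite_table(const Tup *old_file, size_t n, Tup *new_file)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++)
        if (old_file[i].live)
            new_file[out++] = old_file[i];
    return out;
}
```

[Because dead tuples never make it into the new file, there are no half-moved-tuple failure cases: a crash before the final swap just leaves the old file intact.]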
On 2/9/07, Simon Riggs <simon@2ndquadrant.com> wrote:
ISTM that there are two related issues that we need to solve to make
progress.
- We must make de-HOTifying or CHILLing crash safe
- Concurrent index scans should work correctly with CHILLing operations
I think the first issue can be addressed on the lines of what Heikki suggested.
We can CHILL one tuple at a time. I am thinking of a two step process.
In the first step, the root-tuple and the heap-only tuple (which needs CHILLing)
are marked with a special flag, CHILL_IN_PROGRESS. This operation is
WAL logged. We then insert appropriate index entries for the tuple under
consideration.
In the second step, the HEAP_UPDATE_ROOT and HEAP_ONLY_TUPLE
flags on the heap tuples are adjusted and the CHILL_IN_PROGRESS flags are cleared.

During normal operations, if the CHILL_IN_PROGRESS flag is found set, we might
need to do some more work to figure out whether the index insert operations
were successful or not. If we find that there are missing index entries for the tuple
under consideration for CHILLing, then those can be added now and the flags
set/reset appropriately.
The second problem, concurrent index scans, seems a bit more complex.
We need a mechanism so that no tuples are missed and no tuple is
returned twice. Since CHILLing of a tuple adds a new access path to the
tuple from the index, a concurrent index scan may return a tuple twice.

How about grabbing an AccessExclusiveLock during the CHILLing
operation? This would prevent any concurrent index scans. Since CHILLing
of a large table can take a long time, the operation can be spread across
time with periodic acquire/release of the lock. This would prevent starvation
of other backends. Since CHILLing is required only for CREATE INDEX
and stub-cleanup, I am assuming that it's ok for it to be lazy in nature.
Thanks,
Pavan
-- 
Pavan Deolasee
EnterpriseDB  http://www.enterprisedb.com
On Thu, 2007-02-08 at 14:47 +0000, Heikki Linnakangas wrote:
> However, the easiest solution would be to make CREATE INDEX wait until
> the old tuple is dead. That should be ok at least for concurrent CREATE
> INDEX, because it already has that kind of a wait between 1st and 2nd
> phase.

I'm not sure this part of the idea is possible; the rest sounded good.

Looking at DefineIndex(), the wait happens only for transactions that already have a lock on the table being indexed, which may not be very many. AFAICS the ref page for CREATE INDEX CONCURRENTLY isn't fully accurate (any more?) when it says "[CREATE INDEX] must wait for all existing transactions to terminate".

Waiting until an arbitrary Xid dies could be deadlock-prone, if the lock isn't taken carefully. Imagine a pg_dump coming towards you and then waiting on the locked table. You'd need to wait at the beginning of the command, before locks were taken. However, that would mean CREATE INDEX would only be possible outside of transaction blocks, which I don't think is acceptable.

I wanted this for VACUUM FULL also, but the same problem exists.

-- 
Simon Riggs
EnterpriseDB  http://www.enterprisedb.com
Tom Lane wrote:
> ISTM we could fix that by extending the index VACUUM interface to
> include two concepts: aside from "remove these TIDs when you find them",
> there could be "replace these TIDs with those TIDs when you find them".
> This would allow pointer-swinging to one of the child tuples, after
> which the old root could be removed.

Implementing the "replace these TIDs" operation atomically would be simple, except for the new bitmap index am. It should be possible there as well, but if the old and new tid happen to be on different bitmap pages, it requires some care to avoid deadlocks.

Also, we'd need more work_mem for vacuum.

> This has got the same atomicity
> problem as for CREATE INDEX, because it's the same thing: you're
> de-HOT-ifying the child.

Not exactly. De-HOT-ifying, or chilling, a child means inserting new index entries. But if we're just replacing the tids in the existing index entries, it's ok if we crash after replacing some but not all of them. The next vacuum would replace the rest of the pointers, and remove the old root tuple.

-- 
Heikki Linnakangas
EnterpriseDB  http://www.enterprisedb.com
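[The crash-safety argument above - replacing TIDs in existing index entries is per-entry atomic, so a partial replacement is harmless and the next vacuum finishes it - can be sketched with an index modelled as a flat array of TIDs. The structs and index_replace_tids() are invented for the sketch; the 'limit' parameter simulates a crash part-way through.]

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned block; unsigned offset; } Tid;

static int tid_eq(Tid a, Tid b) { return a.block == b.block && a.offset == b.offset; }

/* Replace up to 'limit' entries matching old_tid with new_tid
 * (a small limit simulates crashing mid-operation).
 * Each replacement is a single-item overwrite: atomic per entry.
 * Returns how many entries were rewritten on this pass. */
static size_t index_replace_tids(Tid *entries, size_t n,
                                 Tid old_tid, Tid new_tid, size_t limit)
{
    size_t done = 0;
    for (size_t i = 0; i < n && done < limit; i++) {
        if (tid_eq(entries[i], old_tid)) {
            entries[i] = new_tid;
            done++;
        }
    }
    return done;
}
```

[Because already-replaced entries no longer match old_tid, re-running the operation after a "crash" is idempotent: it touches only the entries that were missed.]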
> Implementing the "replace these TIDs" operation atomically would be
> simple, except for the new bitmap index am. It should be possible there

That isn't simple (and maybe not even possible) for GIN.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                   WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
>> Implementing the "replace these TIDs" operation atomically would be
>> simple, except for the new bitmap index am. It should be possible there

> That isn't simple (may be, even possible) from GIN.

I suspect that those pushing this idea only care about btrees anyway, so one possible answer is that HOT is only possible when the table has only btree indexes --- or at least, only indexes of AMs that support the replace-these-TIDs operation. (Yet another pg_am flag...)

			regards, tom lane
On Fri, 2007-02-09 at 10:17 -0500, Tom Lane wrote:
> Teodor Sigaev <teodor@sigaev.ru> writes:
> >> Implementing the "replace these TIDs" operation atomically would be
> >> simple, except for the new bitmap index am. It should be possible there
>
> > That isn't simple (may be, even possible) from GIN.
>
> I suspect that those pushing this idea only care about btrees anyway,
> so one possible answer is that HOT is only possible when the table has
> only btree indexes --- or at least, only indexes of AMs that support the
> replace-these-TIDs operation. (Yet another pg_am flag...)

Well, that's me. Yes, I think b-trees-only is acceptable.

Realistically, very frequent updating and full text indexing are easily separable use cases, at least into separate tables. HOT should be of use in Data Warehousing applications also, when summary tables are maintained alongside detailed data, but there too HOT and bitmap indexes should be separable at the table level without difficulty.

-- 
Simon Riggs
EnterpriseDB  http://www.enterprisedb.com
On Fri, 2007-02-09 at 13:39 +0000, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > ISTM we could fix that by extending the index VACUUM interface to
> > include two concepts: aside from "remove these TIDs when you find them",
> > there could be "replace these TIDs with those TIDs when you find them".
> > This would allow pointer-swinging to one of the child tuples, after
> > which the old root could be removed.
>
> Implementing the "replace these TIDs" operation atomically would be
> simple, except for the new bitmap index am. It should be possible there
> as well, but if the old and new tid happen to be on a different bitmap
> page, it requires some care to avoid deadlocks.

Grouped Item Indexes cope with this easily also, yes?

-- 
Simon Riggs
EnterpriseDB  http://www.enterprisedb.com
Simon Riggs wrote:
> On Fri, 2007-02-09 at 13:39 +0000, Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> ISTM we could fix that by extending the index VACUUM interface to
>>> include two concepts: aside from "remove these TIDs when you find them",
>>> there could be "replace these TIDs with those TIDs when you find them".
>>> This would allow pointer-swinging to one of the child tuples, after
>>> which the old root could be removed.
>> Implementing the "replace these TIDs" operation atomically would be
>> simple, except for the new bitmap index am. It should be possible there
>> as well, but if the old and new tid happen to be on a different bitmap
>> page, it requires some care to avoid deadlocks.
>
> Grouped Item Indexes cope with this easily also, yes?

Yes, as long as the old and the new tid point to the same page.

-- 
Heikki Linnakangas
EnterpriseDB  http://www.enterprisedb.com
On Fri, 2007-02-09 at 13:39, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > ISTM we could fix that by extending the index VACUUM interface to
> > include two concepts: aside from "remove these TIDs when you find them",
> > there could be "replace these TIDs with those TIDs when you find them".
> > This would allow pointer-swinging to one of the child tuples, after
> > which the old root could be removed.
>
> Implementing the "replace these TIDs" operation atomically would be
> simple, except for the new bitmap index am. It should be possible there
> as well, but if the old and new tid happen to be on a different bitmap
> page, it requires some care to avoid deadlocks.
>
> Also, we'd need more work mem for vacuum.

Why do we need to muck around with indexes at all?

What are the problems with just shuffling the last (and only visible) tuple to replace the HOT-chain root and be done with it?

Can there be some problems with seqscans moving one tuple at a time, or doing revisits of the "same" (by TID) tuple? If there are, can't these be fixed by creative use of ctid chains from the original live tuple to the new live one?

> > This has got the same atomicity
> > problem as for CREATE INDEX, because it's the same thing: you're
> > de-HOT-ifying the child.
>
> Not exactly. De-HOT-ifying, or chilling, a child means inserting new
> index entries.

If we can just move the tuple inside the page, we can avoid even that.

> But if we're just replacing the tids from the existing
> index entries, it's ok if we crash after replacing some but not all of
> them. The next vacuum would replace the rest of the pointers, and remove
> the old root tuple.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Hannu Krosing <hannu@skype.net> writes:
> What are the problems with just shuffling the last (and only visible)
> tuple to replace the HOT-chain root and be done with it?

ctid stops being a reliable identifier.

			regards, tom lane
On Fri, 2007-02-09 at 13:47 -0500, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > What are the problems with just shuffling the last (and only visible)
> > tuple to replace the HOT-chain root and be done with it?
>
> ctid stops being a reliable identifier.

Yes, that sums it up. The issue can be overcome in the internals, but the only means of doing so that seemed workable required changing the output of the CTID special column.

Some applications use the tid datatype directly to relocate rows within a transaction using the tidscan; the ODBC driver uses that concept to implement Updateable Cursors from the client. We could change that, but we'd break all the other ones we don't know about. This issue was one of the major contributing factors to the size of the previous HOT prototype, making it more invasive than is really desirable. Pointer-swinging avoids those issues and seems workable, even if it is a pain to have to visit the index during VACUUM. So changing CTID isn't a bridge we really need to cross, for which I'm glad.

Just as a matter of record, the tuple identifier would need to include (block, itemid, xmin, cmin) to make this idea work, with the itemid being that of the root tuple and the xmin and cmin being those of the tuple in the chain that is being referenced. This would then have allowed callers to relocate a specific tuple, even when the update chain had changed between block accesses. Anyway, glad we're not going there anymore.

-- 
Simon Riggs
EnterpriseDB  http://www.enterprisedb.com
On Fri, 2007-02-09 at 13:16 +0530, Pavan Deolasee wrote:
> The second problem of concurrent index scans seems a bit more complex.
> We need a mechanism so that no tuples are missed and no tuples are
> returned twice. Since CHILLing of a tuple adds a new access path
> to the tuple from the index, a concurrent index scan may return a
> tuple twice.
>
> How about grabbing an AccessExclusiveLock during the CHILLing
> operation ? This would prevent any concurrent index scans. Since
> CHILLing of a large table can take a long time, the operation can be
> spread across time with periodic acquire/release of the lock. This
> would prevent starvation of other backends. Since CHILLing is required
> only for CREATE INDEX and stub-cleanup, I am assuming that it's ok for
> it to be lazy in nature.

We've just spoken about this, so I just wanted to add those thoughts
here.

A pointer-swing operation will begin when we see a tuple that has status
HEAPTUPLE_DEAD and is marked HEAP_UPDATE_ROOT. Perhaps that requires a
new status from HeapTupleSatisfiesVacuum()?

We chill all tuples in the chain, up to the new root. We mark those
tuples, so that HeapTupleSatisfiesVacuum() will describe them as
HEAPTUPLE_RECENTLY_DEAD. So the current VACUUM won't immediately remove
them, but they'll go away in the future as part of an on-demand block
vacuum or a later VACUUM. That's similar to the way we handle HALF_DEAD
index pages.

The index scans are page-at-a-time, so when we pointer-swing from the
root tuple to one of the HOT tuples we'll be OK. We're switching a
specific index tuple, so there's no multi-page locking on the index to
consider.

Right now, I wouldn't want to assume that the way tuples are marked prior
to pointer-swinging is exactly the same as the chilling required by
CREATE INDEX: CHILL_IN_PROGRESS. It may well be, but I'm wary that we
assume they are exactly the same and introduce a subtle bug.

-- 
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote:
> > Removing the root tuple will require a VACUUM *FULL*.
>
> That seems unacceptable ... it won't take too long for your table to
> fill up with stubs, and we don't want to return to the bad old days
> when periodic VACUUM FULL was unavoidable.
>
> ISTM we could fix that by extending the index VACUUM interface to
> include two concepts: aside from "remove these TIDs when you find them",
> there could be "replace these TIDs with those TIDs when you find them".
> This would allow pointer-swinging to one of the child tuples, after
> which the old root could be removed. This has got the same atomicity
> problem as for CREATE INDEX, because it's the same thing: you're
> de-HOT-ifying the child. So if you can solve the former, I think you
> can make this work too.

I need clarification here. Is removing a dead heap tuple always going to
require an index scan, or was this just for chilling a row (adding an
index)?

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Fri, 2007-02-09 at 18:10 -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > > Removing the root tuple will require a VACUUM *FULL*.
> >
> > That seems unacceptable ... it won't take too long for your table to
> > fill up with stubs, and we don't want to return to the bad old days
> > when periodic VACUUM FULL was unavoidable.
> >
> > ISTM we could fix that by extending the index VACUUM interface to
> > include two concepts: aside from "remove these TIDs when you find them",
> > there could be "replace these TIDs with those TIDs when you find them".
> > This would allow pointer-swinging to one of the child tuples, after
> > which the old root could be removed. This has got the same atomicity
> > problem as for CREATE INDEX, because it's the same thing: you're
> > de-HOT-ifying the child. So if you can solve the former, I think you
> > can make this work too.
>
> I need clarification here. Is removing a dead heap tuple always going to
> require an index scan, or was this just for chilling a row (adding an
> index)?

We can remove a tuple marked HEAP_ONLY_TUPLE when it has status
HEAPTUPLE_DEAD. The HEAP_UPDATE_ROOT tuple can be reduced to a TupleStub,
but not removed. Multiple tuples in the chain can be removed, though the
HEAP_UPDATE_ROOT's t_ctid must be modified to point to the first
non-removed tuple in the chain. All of that can be done while we hold a
CleanupLock on the block, without reference to the indexes; this can be
performed on-demand, when we attempt an UPDATE. This is similar to what
already happens prior to a btree split operation. (This could also be
performed by bgwriter, but that isn't proposed at this time because the
buffer transit time through the cache is often not long enough to allow
tuples to die and get benefit from space reuse.)

TupleStubs can be marked for removal by a pointer-swing operation during
normal VACUUM, i.e. it will require touching the indexes.

-- 
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
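The on-demand single-block prune Simon describes can be sketched as a toy model. This is illustrative Python, not PostgreSQL's C code; the dict fields (`id`, `status`, `ctid`, `stub`) are invented stand-ins for line pointers, HeapTupleSatisfiesVacuum() results, and t_ctid. It mirrors the example in the original proposal: a dead HEAP_ONLY member is removed outright, while the dead root survives as a stub whose ctid is swung over the removed members.

```python
# Toy model of a single-block HOT prune: dead heap-only tuples carry no
# index entries and can be dropped; a dead root is index-referenced, so
# it is only reduced to a stub that points past the removed members.

DEAD, LIVE = "DEAD", "LIVE"

def prune_chain(chain):
    """chain: list of tuple dicts; chain[0] is the HEAP_UPDATE_ROOT.
    Returns the pruned chain. No index access is needed."""
    root, rest = dict(chain[0]), chain[1:]
    # Dead heap-only members have no index entries: remove completely.
    survivors = [t for t in rest if t["status"] != DEAD]
    if root["status"] == DEAD:
        # Root stays because indexes point at it; keep a header-only stub.
        root = {"stub": True, "status": DEAD}
    # Swing the root's ctid over the removed members.
    root["ctid"] = survivors[0]["id"] if survivors else None
    return [root] + survivors

# The t1 -> t2 -> t3 example from the proposal: t1 (root) and t2 are
# dead, t3 is the live version.
chain = [{"id": 1, "status": DEAD, "ctid": 2},           # root
         {"id": 2, "status": DEAD, "heap_only": True},   # removable
         {"id": 3, "status": LIVE, "heap_only": True}]   # visible tuple
pruned = prune_chain(chain)
```

After the prune, only the stub and the live tuple remain, and the stub's ctid points straight at the live tuple, matching the "after single-page VACUUM" state in the original write-up.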
Simon Riggs wrote:
> > I need clarification here. Is removing a dead heap tuple always going to
> > require an index scan, or was this just for chilling a row (adding an
> > index)?
>
> We can remove a tuple marked HEAP_ONLY_TUPLE when it has status
> HEAPTUPLE_DEAD. The HEAP_UPDATE_ROOT tuple can be reduced to a
> TupleStub, but not removed. Multiple tuples in the chain can be removed,
> though the HEAP_UPDATE_ROOT's t_ctid must be modified to point to the
> first non-removed tuple in the chain. All of that can be done while we
> hold a CleanupLock on the block, without reference to the indexes; this
> can be performed on-demand, when we attempt an UPDATE. This is similar
> to what already happens prior to a btree split operation. (This could
> also be performed by bgwriter, but that isn't proposed at this time
> because the buffer transit time through the cache is often not long
> enough to allow tuples to die and get benefit from space reuse.)
>
> TupleStubs can be marked for removal by a pointer-swing operation during
> normal VACUUM, i.e. it will require touching the indexes.

OK, that sounds like a good plan. Thanks.

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Wed, 2007-02-07 at 17:38, Simon Riggs wrote:
> When we try to UPDATE a tuple and the new tuple version doesn't fit on
> the block, we get the BufferCleanupLock if possible and then perform a
> single-block VACUUM. Any tuple that is both HEAP_DEAD & HEAP_ONLY_TUPLE
> can be removed completely. This is possible by changing the t_ctid field
> so that it points at the first visible-to-someone tuple in the chain, so
> it points "over" the previous HOT tuples. The root tuple is also dead -
> it cannot be removed completely, so it is replaced with "just a
> TupleHeader", which is referred to as a TupleStub. (Credit to Itagaki
> for this concept).

What if we just reused the root tuple directly instead of turning it into
a stub?

This would create a cycle of ctid pointers, which changes the lookup
process from 'follow the ctid chain until the end' to 'follow the ctid
chain until you reach the start'.

It also allows one to chill tuples for which all tuples other than the
root are dead by just flipping the HOT flag bit (or just making the ctid
chain pointer point to self).

> The single-block VACUUM would alter *all* tuple chains on the block,
> not just the one for the current tuple being UPDATEd.
>
> This technique means that a tuple never changes its CTID, so
> everything that currently uses CTID can continue normally. SeqScan
> would also work identically to the way it works today.
>
> It also means that we can't easily remove the root tuple, even if it
> is now just a TupleStub (unless the whole chain is also removable
> because of DELETE). Removing the root tuple will require a VACUUM
> *FULL*. Even so, this option is still space-neutral in the worst-case,
> in comparison with inserting index tuples.

If you can afford changing xmin and xmax, you can actually do it in 2
steps - first do an "update" that moves the tuple to the HOT chain root
position, and then, after the original is dead for all backends, just
mark it as not hot.

-- 
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Hannu Krosing <hannu@skype.net> writes:
> What if we would just reuse the root tuple directly instead of turning
> it into a stub ?
> This would create a cycle of ctid pointers, which changes the lookup
> process from 'follow the ctid chain until the end' to 'follow the ctid
> chain until you reach the start'.

How do you know which one is newest? What happens when you have to put a
newer version off-page for lack of space?

			regards, tom lane
On Sun, 2007-02-11 at 12:35, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > What if we would just reuse the root tuple directly instead of turning
> > it into a stub ?
> > This would create a cycle of ctid pointers, which changes the lookup
> > process from 'follow the ctid chain until the end' to 'follow the ctid
> > chain until you reach the start'.
>
> How do you know which one is newest?

By xmin, cmin of course.

> What happens when you have to put a newer version off-page for lack of space?

Then this scheme won't work.

How about adding a new 2-byte field to the header for an in-page ctid
pointer for HOT?

It grows the header from 26 to 28 bytes, but for MAXALIGN=4 the space
usage would stay the same.

> 			regards, tom lane

-- 
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
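Hannu's circular-chain lookup ("follow the chain until you reach the start", newest found by xmin, cmin) can be sketched as a toy model. This is illustrative Python under the assumption that all versions stay on one page; the names `newest_in_ring` and the dict layout are invented for the example, and it deliberately ignores the off-page case Tom raises, where the scheme breaks down.

```python
# Toy sketch of a circular ctid chain: same-page versions form a ring,
# and with no end-of-chain marker the newest version is the member with
# the highest (xmin, cmin).

def newest_in_ring(page, start):
    """Walk the ctid ring from any entry point (e.g. wherever the index
    happens to point) and return the slot of the newest version."""
    best, slot = start, page[start]["ctid"]
    while slot != start:                      # stop when the ring closes
        if (page[slot]["xmin"], page[slot]["cmin"]) > \
           (page[best]["xmin"], page[best]["cmin"]):
            best = slot
        slot = page[slot]["ctid"]
    return best

# Three versions in a ring: slot 1 -> 2 -> 3 -> back to 1.
page = {1: {"xmin": 100, "cmin": 0, "ctid": 2},
        2: {"xmin": 104, "cmin": 1, "ctid": 3},
        3: {"xmin": 109, "cmin": 0, "ctid": 1}}
```

Note the cost this makes visible: every lookup must traverse the whole ring and compare transaction ids, instead of simply stopping at the chain's end as with the stub scheme.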
On 2/11/07, Hannu Krosing <hannu@skype.net> wrote:
On Sun, 2007-02-11 at 12:35, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > What if we would just reuse the root tuple directly instead of turning
> > it into a stub ?
> > This would create a cycle of ctid pointers, which changes the lookup
> > process from 'follow the ctid chain until the end' to 'follow the ctid chain
> > until you reach the start'.
>
> How do you know which one is newest?
By xmin, cmin of course.
This sounds interesting. But we might have trouble supporting HOT-update
when the tuple length changes. Maybe we can release the space consumed by
the dead root tuple and do a fresh allocation if the tuple length
increases.
Pavan
--
EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee wrote:
> On 2/11/07, Hannu Krosing <hannu@skype.net> wrote:
>> On Sun, 2007-02-11 at 12:35, Tom Lane wrote:
>> > Hannu Krosing <hannu@skype.net> writes:
>> > > What if we would just reuse the root tuple directly instead of
>> > > turning it into a stub ?
>> > > This would create a cycle of ctid pointers, which changes the lookup
>> > > process from 'follow the ctid chain until the end' to 'follow the
>> > > ctid chain until you reach the start'.
>> >
>> > How do you know which one is newest?
>>
>> By xmin, cmin of course.
>
> This sounds interesting. But we might have trouble supporting HOT-update
> when the tuple length changes. Maybe we can release the space consumed
> by the dead root tuple and do a fresh allocation if the tuple length
> increases.

I don't see a problem with length changes. We're free to rearrange data
on the page whenever we can acquire the vacuum lock.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Hannu Krosing wrote:
> On Sun, 2007-02-11 at 12:35, Tom Lane wrote:
>> Hannu Krosing <hannu@skype.net> writes:
>>> What if we would just reuse the root tuple directly instead of turning
>>> it into a stub ?
>>> This would create a cycle of ctid pointers, which changes the lookup
>>> process from 'follow the ctid chain until the end' to 'follow the ctid
>>> chain until you reach the start'.
>> How do you know which one is newest?
>
> By xmin, cmin of course.
>
>> What happens when you have to put a newer version off-page for lack of space?
>
> Then this scheme won't work.

Couldn't the ctid of the latest tuple point to the off-page tuple as
usual? As long as the index points to any tuple in the update chain, an
index scan can even scan the whole page to find all the earlier tuples in
the chain. The latest version can always be found by following the ctid
pointers, so that would only be needed to find an older version of the
row.

> How about adding a new 2-byte field to the header for an in-page ctid
> pointer for HOT ?
>
> It grows the header from 26 to 28 bytes, but for MAXALIGN=4 the space
> usage would stay the same.

Actually the tuple header is now 23 bytes, thanks to combo (aka phantom)
cids. Adding a new 2-byte field would take it over 24 bytes again, making
it either 28 or 32 bytes depending on MAXALIGN.

It might be acceptable, if it was only stored on those tuples that are
(HOT) updated. But it's not clear to me what you're proposing to do with
the field, anyway. Which tuples would have it, and what would it point
to?

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 2/12/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
Hannu Krosing wrote:
> On Sun, 2007-02-11 at 12:35, Tom Lane wrote:
>> Hannu Krosing <hannu@skype.net> writes:
>>> What if we would just reuse the root tuple directly instead of turning
>>> it into a stub ?
>>> This would create a cycle of ctid pointers, which changes the lookup
>>> process from 'follow the ctid chain until the end' to 'follow the ctid chain
>>> until you reach the start'.
>> How do you know which one is newest?
>
> By xmin, cmin of course.
>
>> What happens when you have to put a newer version off-page for lack of space?
>
> Then this scheme won't work.
Couldn't the ctid of the latest tuple point to the off-page tuple as usual?
It could. But then you may lose the reference to the older version(s). We
could do whole-page scans to locate older versions, but that might prove
too costly.
It might be acceptable, if it was only stored on those tuples that are
(HOT) updated. But it's not clear to me what you're proposing to do with
the field, anyway, Which tuples would have it, and what would it point to?
My guess is that Hannu is suggesting a circular chain of HOT-updated
tuples in a page. The index can point to any tuple in the chain. When an
update goes off-page or is a COLD update, t_ctid points to the newer
version as usual. So a tuple which is COLD updated would need two
pointers: one which points to the oldest version in the circular on-page
chain, and another which points to the new version.
Thanks,
Pavan
EnterpriseDB http://www.enterprisedb.com
Hannu Krosing <hannu@skype.net> writes:
> How about adding a new 2-byte field to the header for an in-page ctid
> pointer for HOT ?

We just finished sweating blood to get the tuple header size down to 23
bytes from 27 (which saves 8 bytes not 4 if MAXALIGN=8). We are not
going to blow that again on HOT.

			regards, tom lane
> > > What are the problems with just shuffling the last (and only visible)
> > > tuple to replace the HOT-chain root and be done with it ?
> >
> > ctid stops being a reliable identifier.

A backend selecting the row by ctid would need to take one step to the
root slot to get at the tuple. This does seem possible if the tuples got
swapped. The old "current version" slot would need to become
RECENTLY_DEAD and could not be removed right away. It also needs a flag
to denote that the one step is necessary.

It sure would be nice if a root could be reused during regular vacuum
(even if it were only possible under very restricted conditions that
eventually apply).

Andreas
On Mon, Feb 12, 2007 at 12:48:06AM -0500, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > How about adding a new 2-byte field to the header for an in-page ctid
> > pointer for HOT ?
>
> We just finished sweating blood to get the tuple header size down to 23
> bytes from 27 (which saves 8 bytes not 4 if MAXALIGN=8). We are not
> going to blow that again on HOT.

I haven't had enough time to follow all of the details here - but if the
ability to update a row with minimal overhead, as long as there is extra
room in the same block, is a great idea (it sounds appealing to me) -
could it be done with just a 1-byte list? 24 instead of 23 for the tuple
size.

I'll try to catch up at some point. :-)

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com | Neighbourhood Coder
Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them
all and in the darkness bind them...

http://mark.mielke.cc/
mark@mark.mielke.cc wrote:
> On Mon, Feb 12, 2007 at 12:48:06AM -0500, Tom Lane wrote:
>> We just finished sweating blood to get the tuple header size down to 23
>> bytes from 27 (which saves 8 bytes not 4 if MAXALIGN=8). We are not
>> going to blow that again on HOT.
>
> I haven't had enough time to follow all of the details here - but if the
> ability to update a row with minimal overhead, as long as there is extra
> room in the same block, is a great idea (it sounds appealing to me) -
> could it be done with just a 1-byte list? 24 instead of 23 for the
> tuple size.

Assuming 8k pages, you could in theory store a reference to a line
pointer in just 1 byte.

But actually that 1 free byte in the header is not currently just a waste
of space. If you have any nulls in your tuple, there's going to be a null
bitmap in addition to the header. 1 byte is conveniently enough to store
the null bitmap for a table with at most 8 columns, and if a table has
more than 8 columns, the extra 4 or 8 bytes needed for the null bitmap
probably don't matter so much because the tuples are quite wide anyway.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
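The size arithmetic being argued about here can be made concrete with a small sketch. This is a back-of-the-envelope Python model of the numbers quoted in the thread (23-byte fixed header, one bit per column for the null bitmap, rounding up to MAXALIGN); the function names are invented, and real header layout details are of course more involved.

```python
# Sketch of the tuple-header size arithmetic from this exchange:
# 23 fixed bytes + optional null bitmap (1 bit per column) + any
# proposed extra field, rounded up to MAXALIGN.

def maxalign(size, alignment):
    """Round size up to the next multiple of alignment."""
    return -(-size // alignment) * alignment

def header_size(ncols_with_nulls, align, extra=0):
    """Aligned header size; ncols_with_nulls > 0 adds a null bitmap of
    ceil(ncols/8) bytes, extra models a proposed new header field."""
    bitmap = -(-ncols_with_nulls // 8) if ncols_with_nulls else 0
    return maxalign(23 + bitmap + extra, align)

# Up to 8 columns, the null bitmap rides in the 24th byte "for free":
assert header_size(0, 8) == header_size(8, 8) == 24
# Past 8 columns, the bitmap costs the next MAXALIGN increment:
assert header_size(9, 8) == 32
# Heikki's point: a 2-byte in-page pointer pushes past 24 bytes,
# giving 28 or 32 bytes depending on MAXALIGN.
assert header_size(0, 4, extra=2) == 28
assert header_size(0, 8, extra=2) == 32
```

This also shows why Tom calls the 1-byte idea moot in isolation: the byte exists only when the null bitmap doesn't claim it, and 8 bits is too small for a general pointer anyway.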
On Mon, 2007-02-12 at 17:23, Heikki Linnakangas wrote:
> mark@mark.mielke.cc wrote:
> > On Mon, Feb 12, 2007 at 12:48:06AM -0500, Tom Lane wrote:
> >> We just finished sweating blood to get the tuple header size down to 23
> >> bytes from 27 (which saves 8 bytes not 4 if MAXALIGN=8). We are not
> >> going to blow that again on HOT.
> >
> > I haven't had enough time to follow all of the details here - but if the
> > ability to update a row with minimal overhead, as long as there is extra
> > room in the same block, is a great idea (it sounds appealing to me) -
> > could it be done with just a 1-byte list? 24 instead of 23 for the
> > tuple size.
>
> Assuming 8k pages, you could in theory store a reference to a line
> pointer in just 1 byte.
>
> But actually that 1 free byte in the header is not currently just a
> waste of space. If you have any nulls in your tuple, there's going to be
> a null bitmap in addition to the header. 1 byte is conveniently enough
> to store the null bitmap for a table with at most 8 columns,

Are we actually doing that? I.e. are null bitmaps really allocated in
1-byte steps nowadays?

> and if a table has more than 8 columns, the extra 4 or 8 bytes needed
> for the null bitmap probably doesn't matter so much because the tuples
> are quite wide anyway.

-- 
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Hannu Krosing wrote:
>> But actually that 1 free byte in the header is not currently just a
>> waste of space. If you have any nulls in your tuple, there's going to
>> be a null bitmap in addition to the header. 1 byte is conveniently
>> enough to store the null bitmap for a table with at most 8 columns,
>
> Are we actually doing that? I.e. are null bitmaps really allocated in
> 1-byte steps nowadays?

Yes.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Hannu Krosing wrote:
>> Are we actually doing that? I.e. are null bitmaps really allocated in
>> 1-byte steps nowadays?

> Yes.

Not really; we still have to MAXALIGN at the end of the bitmap. The
point is that you can get 8 bits in there before paying the first
additional MAXALIGN increment.

It's all moot anyway since 8 bits isn't enough for a pointer ...

			regards, tom lane
On 2/13/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Hannu Krosing wrote:
>> Are we actually doing that ? I.E are null bitmaps really allocated in 1
>> byte steps nowadays ?
> Yes.
Not really; we still have to MAXALIGN at the end of the bitmap. The
point is that you can get 8 bits in there before paying the first
additional MAXALIGN increment.
It's all moot anyway since 8 bits isn't enough for a pointer ...
We could live with 8 bits actually. We can store only the least
significant 8 bits of the pointer. It would point to a set of tuples and
we may need to search within that set to find the required tuple.
This would still be better than scanning the entire page.
But I agree that utilizing those 8 bits would result in a penalty
for tables with few columns.
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
On Tue, 2007-02-13 at 09:38, Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > Hannu Krosing wrote:
> >> Are we actually doing that? I.e. are null bitmaps really allocated in
> >> 1-byte steps nowadays?
>
> > Yes.
>
> Not really; we still have to MAXALIGN at the end of the bitmap. The
> point is that you can get 8 bits in there before paying the first
> additional MAXALIGN increment.
>
> It's all moot anyway since 8 bits isn't enough for a pointer ...

With 8k pages and MAXALIGN=8 we just barely can, as with the current page
structure (tuple headers together with data) the minimal tuple size for
that case is 24 bytes for the header + 8 for data = 32 bytes, which means
256 tuples per page (minus page header), and so we can store a tuple
number in 8 bits.

OTOH, for same-page HOT tuples, we have the command and transaction ids
stored twice: first as cmax,xmax of the old tuple and then as cmin,xmin
of the updated tuple. One of these could probably be used for an in-page
HOT tuple pointer.

As we can't rely on get-xmax-from-next-in-chain behaviour for off-page
tuples, we need to move the other way, that is, storing a pointer to the
previous version and getting xmin from there. That means keeping a chain
of back pointers in the xmin fields for all but the oldest tuple in the
HOT chain/cycle.

This requires a little shuffling around of xmin/xmax info when removing
dead HOT chain members, but it would avoid the need to make any index
changes during HOT updates. Avoiding all index updates is probably good,
as it means we don't need to do any index page locking.

So the proposed structure for the HOT chain is like this:

* The oldest tuple in the chain has a real xmin/xmax. This needs a flag
to be recognisable, or we may just start the transaction counter at 64K
so that any xmin that fits in 2 bytes is immediately recognized as a HOT
back pointer. Or start cmin at 1 and use cmin=0 as the pointer flag.

* All newer tuples have a real xmax, and an in-page pointer to the
previous version in place of xmin. The real xmin is the xmax of the
previous tuple. When the previous tuple is removed, its xmin is stored in
the new oldest tuple.

* The index can point to any tuple in the HOT chain.

Finding the tuple by index:

* To find a tuple, one arrives at (possibly) the middle of the HOT chain,
and the tuple may be found either by following the in-page pointer stored
in the xmin field or the ctid pointers.

* If the tuple's ctid points off-page, it must be the newest one in the
HOT chain and thus its end. The chain also ends when the tuple was
inserted by a rolled-back transaction.

* There is a special case when the index pointer points to a rolled-back
tuple. In this case the tuple can't be removed, but it should be reused
on the next update. But this is the case for any HOT rollbacks.

The proposed structure should enable cyclic reuse of dead tuples as they
fall out of the visibility window, and avoids updating index pointers.

-- 
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
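The back-pointer layout proposed above can be sketched as a toy model. This is illustrative Python, not a design for the actual header encoding; the `xmin_kind` tag stands in for whatever flag or reserved-value trick distinguishes a real xmin from a back pointer. It shows the core invariant: a newer tuple's real xmin is recoverable as the xmax of its predecessor.

```python
# Toy model of the back-pointer HOT chain: only the oldest member stores
# a real xmin; each newer member reuses the xmin field as an in-page
# pointer to its predecessor.

BACKPTR = "backptr"   # marks an xmin field holding a slot number

def real_xmin(page, slot):
    """Recover a tuple's true xmin, following the back pointer if the
    xmin field holds one: our xmin == predecessor's xmax."""
    t = page[slot]
    if t["xmin_kind"] == BACKPTR:
        return page[t["xmin"]]["xmax"]
    return t["xmin"]

# Slot 1 is the oldest version (real xmin); slots 2 and 3 are newer
# same-page HOT versions whose xmin fields are back pointers.
page = {1: {"xmin_kind": "real", "xmin": 100, "xmax": 104},
        2: {"xmin_kind": BACKPTR, "xmin": 1, "xmax": 109},
        3: {"xmin_kind": BACKPTR, "xmin": 2, "xmax": None}}
```

The sketch also makes Tom's later objection easy to see: the usual xmin-equals-prior-xmax cross-check disappears, because the xmin is now *defined* as the predecessor's xmax rather than stored independently and compared against it.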
On 2/14/07, Hannu Krosing <hannu@skype.net> wrote:
OTOH, for same page HOT tuples, we have the command and trx ids stored
twice first as cmax,xmax of the old tuple and as cmin,xmin of the
updated tuple. One of these could probably be used for in-page HOT tuple
pointer.
I think we recently merged cmin/cmax into a single combo cid. So I don't
think we can use cmin for storing a back pointer. The idea of using xmin
seems feasible though.
As we can't rely on get-xmax-from-next-in-chain behaviour for off-page
tuples, we need to move the other way, that is storing a pointer to
previous version and getting xmin from there. That means keeping a chain
back pointers in xmin fields for all but the oldest tuple in HOT
chain/cycle.
Why can't we store the real xmin/xmax at the end of the chain and store
forward pointers in the xmax? I find the idea of having back pointers
more compelling though.
This requires a little shuffling around of xmin/xmax info when removing
dead HOT chain members, but it would avoid the need to make any index
changes during HOT updates. Avoiding all index updates is probably good,
as it means we don't need to do any index page locking.
I agree. If we can design a solution that does not require any index updates,
that would save us a lot. One problem with such a approach though is that we
might have to live with a dead tuple/stub until another update occurs and
the tuple/stub is reused.
So the proposed structure for HOT chain is like this:
* Oldest tuple in chain has real xmin/xmax, this needs a flag to be
recognisable, or we may just start the transaction counter at 64K so that
any xmin that fits in 2 bytes is immediately recognized as hot back
pointer. Or start cmin at 1 and use cmin=0 as pointer flag.
* All newer tuples have real xmax, and in-page pointer to previous
version in place of xmin. the real xmin is xmax of the previous tuple.
When previous tuple is removed, xmin is stored in the new oldest tuple.
* Index can point to any tuple in HOT chain. Finding the tuple by index
* To find a tuple, one arrives at (possibly) the middle of the HOT chain, and
the tuple may be found either by following the in-page pointer stored in
xmin field or ctid pointers.
Can we quickly figure out, based on the given snapshot and the root
tuple's xmin/xmax, whether to follow the forward or the backward pointer
to arrive at a visible tuple?
* If the tuple's ctid points off-page, it must be the newest one in the
HOT chain and thus its end. The chain also ends when the tuple was
inserted by a rolled-back transaction.
* There is a special case when the index pointer points to a rolled-back
tuple. In this case the tuple can't be removed, but it should be reused
on the next update. But this is the case for any HOT rollbacks.
One problem is with aborted HOT-updates. According to this scheme, the
xmin of the aborted version is stored in the previous version. What
happens when that gets updated again and committed? If we follow the back
pointer from the aborted tuple to fetch the xmin, we would actually be
fetching the txid of the committed transaction which later updated the
tuple. This would fool us into incorrectly believing that the aborted
tuple is LIVE.
Maybe we can set XMIN_INVALID for the next tuple in the chain when a
tuple being updated has t_ctid pointing to a tuple in the same page,
other than itself.
I am sure there are interesting interactions with vacuuming that need to
be considered as well.
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
Hannu Krosing <hannu@skype.net> writes:
> On Tue, 2007-02-13 at 09:38, Tom Lane wrote:
>> It's all moot anyway since 8 bits isn't enough for a pointer ...

> With 8k pages and MAXALIGN=8 we just barely can, as with the current
> page structure (tuple headers together with data) the minimal tuple
> size for that case is 24 bytes for the header + 8 for data = 32 bytes,
> which means 256 tuples per page (minus page header), and so we can
> store a tuple number in 8 bits.

But neither of those assumptions is acceptable --- I would think in fact
that people would become *more* interested in having large pages with
HOT, because it would give more room for updates-on-the-same-page.
Besides, you forgot the case of an empty tuple (<= 8 columns, all NULL).

> OTOH, for same-page HOT tuples, we have the command and transaction ids
> stored twice: first as cmax,xmax of the old tuple and then as cmin,xmin
> of the updated tuple. One of these could probably be used for an
> in-page HOT tuple pointer.

This proposal seems awfully fragile, because the existing
tuple-chain-following logic *depends for correctness* on comparing each
tuple's xmin to the prior xmax. I don't think you can just wave your
hands and say we don't need that cross-check. Furthermore it seems to me
you haven't fixed the problem, which is that you can't remove the chain
member that is being pointed at by off-page links (either index entries
or a previous generation of the same tuple). As described, you've made
that problem worse because you're trying to say we don't know which of
the chain entries is pointed at.

			regards, tom lane
Ühel kenal päeval, K, 2007-02-14 kell 10:41, kirjutas Tom Lane: > Hannu Krosing <hannu@skype.net> writes: > > OTOH, for same page HOT tuples, we have the command and trx ids stored > > twice first as cmax,xmax of the old tuple and as cmin,xmin of the > > updated tuple. One of these could probably be used for in-page HOT tuple > > pointer. > > This proposal seems awfully fragile, because the existing > tuple-chain-following logic *depends for correctness* on comparing each > tuple's xmin to prior xmax. What kinds of correctnes guarantees does this give for same-page tuples? The comparing of each tuple's xmin to prior xmax should stay for inter-page ctid links. Mostly you can think of the same-page HOT chain as one extended tuple when looking at it from outside of that page. > I don't think you can just wave your hands and say we don't need that cross-check. > Furthermore it seems to me you > haven't fixed the problem, which is that you can't remove the chain > member that is being pointed at by off-page links (either index entries > or a previous generation of the same tuple). You can't remove any tuples before they are invisible for all transactions (i.e. dead). And being dead implies that all previous versions are dead as well. So if I can remove a tuple, I can also remove all its previous versions as well. Or are you trying to say that VACUUM follows ctid links of dead tuples for some purpose ? The problem I am trying to fix is reusing in-page space without need to touch indexes. > As described, you've made > that problem worse because you're trying to say we don't know which of > the chain entries is pointed at. There should be a flag, say HOT_CHAIN_ENTRY for the tuple the index(es) point at. And this should be the preferred CTID for inserting new versions once the old one is dead. 
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
On 2/16/07, Hannu Krosing <hannu@skype.net> wrote:
On Wed, 2007-02-14 at 10:41, Tom Lane wrote:
> Hannu Krosing <hannu@skype.net> writes:
> > OTOH, for same page HOT tuples, we have the command and trx ids stored
> > twice first as cmax,xmax of the old tuple and as cmin,xmin of the
> > updated tuple. One of these could probably be used for in-page HOT tuple
> > pointer.
>
> This proposal seems awfully fragile, because the existing
> tuple-chain-following logic *depends for correctness* on comparing each
> tuple's xmin to prior xmax.
> What kinds of correctness guarantees does this give for same-page
> tuples?

I agree with Tom that the xmin/xmax check does help to guarantee
correctness. I myself have used it often during HOT development to
find/fix bugs. But ISTM that we don't need it, at least for the in-page
tuple chain, if we are careful. So if removing this buys us something
important, I am all for it.

> The comparing of each tuple's xmin to prior xmax should stay for
> inter-page ctid links.

Agree.

> Mostly you can think of the same-page HOT chain as one extended tuple
> when looking at it from outside of that page.

> > I don't think you can just wave your hands and say we don't need
> > that cross-check.  Furthermore it seems to me you haven't fixed the
> > problem, which is that you can't remove the chain member that is
> > being pointed at by off-page links (either index entries or a
> > previous generation of the same tuple).

> You can't remove any tuples before they are invisible for all
> transactions (i.e. dead). And being dead implies that all previous
> versions are dead as well. So if I can remove a tuple, I can also
> remove all its previous versions as well. Or are you trying to say
> that VACUUM follows ctid links of dead tuples for some purpose ?

The only exception to this would be the case of aborted updates. In
that case a tuple is dead, but the one pointing to it is still live.
But I don't see any reason somebody would want to follow a chain past a
live tuple. Not sure about the VACUUM FULL code path, though. That's
the only place other than EvalPlanQual where we follow the ctid chain.

> The problem I am trying to fix is reusing in-page space without need
> to touch indexes.

Can we do some kind of indirection from the root line pointer ? I
haven't completely thought it through yet, but the basic idea is to
release the actual space consumed by the root tuple once it becomes
dead, but store the offnum of the new root in the line pointer of the
original root tuple. We may need to flag the line pointer for that,
but if I am not wrong, LP_DELETE is not used for heap tuples.

We would waste 4 bytes of line pointer until the tuple is COLD updated
and the entire chain and the associated index entry is removed.
Pavan
--
EnterpriseDB http://www.enterprisedb.com
> > As described, you've made
> > that problem worse because you're trying to say we don't know which
> > of the chain entries is pointed at.
>
> There should be a flag, say HOT_CHAIN_ENTRY for the tuple the

it's called HEAP_UPDATE_ROOT

> index(es) point at. And this should be the preferred CTID for
> inserting new versions once the old one is dead.

This is not possible, see my reply to Bruce (maybe unless the whole HOT
chain is dead). (Because that would need a back pointer, so readers
arriving at the root find the visible tuple.)

Andreas
On 2/16/07, Zeugswetter Andreas ADI SD <ZeugswetterA@spardat.at> wrote:
> > As described, you've made
> > that problem worse because you're trying to say we don't know which of
> > the chain entries is pointed at.
>
> There should be a flag, say HOT_CHAIN_ENTRY for the tuple the
> it's called HEAP_UPDATE_ROOT
Just to avoid any confusion with the patch I sent out this week: we are
setting HEAP_UPDATE_ROOT on all tuples which are HOT-updated. We set
HEAP_ONLY_TUPLE for all tuples which do not have an index reference. So
maybe the combination (HEAP_UPDATE_ROOT & ~HEAP_ONLY_TUPLE) can be used
to identify the index-referenced tuple in a HOT-update chain.
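[Editorial note: a minimal sketch of the flag rule Pavan describes. The bit values and the helper name are invented for illustration — the real infomask bits differ — only the boolean rule (HEAP_UPDATE_ROOT & ~HEAP_ONLY_TUPLE) comes from the message above.]

```c
#include <assert.h>
#include <stdint.h>

/* Bit values are invented for illustration; the real infomask bits
 * in the patch differ. */
#define HEAP_UPDATE_ROOT 0x0001  /* tuple has been HOT-updated */
#define HEAP_ONLY_TUPLE  0x0002  /* tuple has no index entry of its own */

/* Per the description above: the member an index points at is the one
 * in a HOT chain that was HOT-updated but is not itself heap-only. */
static int is_index_referenced_root(uint16_t infomask)
{
    return (infomask & HEAP_UPDATE_ROOT) && !(infomask & HEAP_ONLY_TUPLE);
}
```

A middle-of-chain member carries both bits, the live tail carries only HEAP_ONLY_TUPLE (until it is itself updated), and a never-HOT-updated tuple carries neither; only the chain root satisfies the predicate.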
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
> Just to avoid any confusion with the patch I sent out this
> week, we are setting HEAP_UPDATE_ROOT on all tuples which are
> HOT-updated.
>
> We set HEAP_ONLY_TUPLE for all tuples which does not have
> index reference. So may be combination of (HEAP_UPDATE_ROOT &
> ~HEAP_ONLY_TUPLE) can be used to identify index referred
> tuple in a HOT-update chain.

Oh sorry. Thanks for the clarification. Imho HEAP_UPDATE_ROOT should be
renamed for this meaning then (or what does ROOT mean here?). Maybe
HEAP_UPDATE_CHAIN?

Andreas
On 2/16/07, Zeugswetter Andreas ADI SD <ZeugswetterA@spardat.at> wrote:
> Oh sorry. Thanks for the clarification. Imho HEAP_UPDATE_ROOT should be
> renamed for this meaning then (or what does ROOT mean here ?).
> Maybe HEAP_UPDATE_CHAIN ?
Yes, you are right. There is some disconnect between what Simon had
originally posted and the patch I sent out. Hopefully as we discuss HOT
more here, everything will converge.
Thanks,
Pavan
EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-02-12 at 09:24 +0530, Pavan Deolasee wrote:
> On 2/12/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> > Hannu Krosing wrote:
> > > On Sun, 2007-02-11 at 12:35, Tom Lane wrote:
> > > > Hannu Krosing <hannu@skype.net> writes:
> > > > > What if we would just reuse the root tuple directly instead of
> > > > > turning it into a stub ?  This would create a cycle of ctid
> > > > > pointers, which changes the lookup process from 'follow ctid
> > > > > chain until the end' to 'follow the tid chain until you reach
> > > > > the start'.
> > > > How do you know which one is newest?
> > >
> > > By xmin,cmin of course .
> > >
> > > > What happens when you have to put a newer version off-page for
> > > > lack of space?
> > >
> > > Then this scheme won't work.
> >
> > Couldn't the ctid of the latest tuple point to the off-page tuple as
> > usual?
>
> It could. But then you may lose reference for older version(s). We can
> do the whole page scans to locate older versions, but that might prove
> too costly.
>
> > It might be acceptable, if it was only stored on those tuples that
> > are (HOT) updated. But it's not clear to me what you're proposing to
> > do with the field, anyway. Which tuples would have it, and what
> > would it point to?
>
> My guess what Hannu is suggesting is to have a circular chain of
> HOT-updated tuples in a page. The index can point to any tuple in the
> chain. When update goes off-page or is a COLD update, t_ctid points to
> the newer version as usual. So a tuple which is COLD updated would
> need two pointers, one which points to the oldest version in the
> circular on-page chain and other which points to the new version.

I very much like Hannu's idea, but it does present some issues.

My feeling is we need to get some clarity on whether to go this route,
or not, because it takes us away from the pointer-swinging idea. To
maximise our chances of something good for 8.3 we really need to pick
just one idea now.
Maybe we've already done that, by default. I've tried to analyse all
aspects of the discussion to see if we can get something out of this:

Hannu has described the possibility of having the index point to a
tuple version which isn't the earliest one on the block. That raises
the issue of how we will locate earlier tuple versions when the current
Snapshot cannot see the root tuple.

First off, we only need to locate the earlier tuple when the root was
re-placed by a HOT update - we can tell that, so we're good so far.

Second, we would like to be able to validate the xmax == xmin rule. If
the original root tuple was dead, so will be the prior versions that
are off-block, so nobody will ever follow the chain to the root tuple
again as part of EvalPlanQual. So we need to check newroot.xmax ==
nextinchain.xmin only.

Third, we need to locate the appropriate tuple version. We have a
number of approaches:

a) Block Seq Scan: using a scan to get there is possible, but not good.
The best mechanism is to scan the block backwards (i.e. highest item to
lowest item), picking up the tuple which points to the root, then
scanning backwards picking up the parts of the chain as they are seen
until we get to a visible tuple that we can validate as part of the
chain. If we scanned forwards we'd need to remember the whole chain as
we went, which would be less useful. *But* this doesn't allow us to
validate the xmax == xmin rule, so that kills this sub-option, IMHO.

b) Additional tuple fields: we could have additional fields on the root
tuple to help us locate the "true root" or oldest version of this tuple
on the block. Those additional header fields are only required on the
root tuple - no other tuple needs this info. So we need not complain
about space savings/losses - the comparison is between extra tuple
headers and a TupleStub, which is a full header width (24 bytes).
This would need to include a copy of the xmin of the original root
tuple, to allow us to validate the xmax == xmin rule (4 bytes). It
would also need to include an in-page pointer to the true root
(2 bytes). So additional tuple fields of 6 bytes would be added to the
root tuple, probably maxaligning to 8 more bytes than normal - a saving
of 16 bytes in comparison with the TupleStub approach. Since there is
no separate TupleStub, we can replace the root when required, not just
at full table VACUUM time.

The presence, or not, of those fields can be marked using various info
bits. The NULL bitmap is effectively an optional header field, so the
idea seems workable - or at least as workable as pointer swinging.

The in-page pointer is only ever followed as a result of an index scan.
In all those cases we don't record the xmin, so we don't check it - we
only need to store the xmin of the tuple pointed to by the in-page
pointer. There may be some off-block tuple versions that point at the
new root, but those links will never be followed because they are
by-definition dead rows.

I didn't start this analysis with any preconceived outcome, but ISTM
now that Hannu's idea can be made to work robustly, whilst avoiding the
need for pointer swinging, which thus allows retail VACUUMs, *and*
saving space on-block overall.

What it requires is two optional fields on any root tuple that has been
replaced following a HOT update. If that is acceptable, then we have a
better design as a result of Hannu's suggestion.

Comments?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
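[Editorial note: Simon's space arithmetic can be checked with a tiny sketch. The sizes (4-byte xmin copy, 2-byte in-page pointer, 24-byte TupleStub, 8-byte MAXALIGN) are the ones stated in his message; the function names are invented.]

```c
#include <assert.h>

#define MAXALIGN_OF 8

static int maxalign(int len)
{
    return (len + MAXALIGN_OF - 1) & ~(MAXALIGN_OF - 1);
}

/* The proposed optional fields: a 4-byte copy of the original root's
 * xmin plus a 2-byte in-page pointer to the true root, maxaligned. */
static int optional_fields_size(void)
{
    return maxalign(4 + 2);   /* 6 bytes rounds up to 8 */
}

/* A TupleStub costs a full tuple header. */
static int tuple_stub_size(void)
{
    return 24;
}

/* Per-root saving of the optional-fields design versus keeping a
 * TupleStub around. */
static int saving_vs_stub(void)
{
    return tuple_stub_size() - optional_fields_size();
}
```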
On 2/21/07, Simon Riggs <simon@2ndquadrant.com> wrote:
> I very much like Hannu's idea, but it does present some issues.
I too liked Hannu's idea initially, but Tom raised a valid concern
that it does not address the basic issue of root tuples. According
to the idea, a DEAD root tuple can be used for a subsequent
update of the same row. But for a very large table, even if it is
updated frequently, it is not unlikely that a given row might not be
updated again for a long time. Even when the update happens, we would
be constrained by the length of the new version being the same or less
than the root tuple, OR by the ability to perform a retail vacuum of
the block.
Did you or anybody else get a chance to think about the other idea
I proposed, of having indirection from the root line pointer ? As I
mentioned earlier, I myself haven't thought it through completely,
but on the face of it, it looks doable. It would add a four-byte
overhead per live tuple-chain, but IMHO would be much simpler
to implement and not too invasive.
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
On Thu, 2007-02-22 at 00:00 +0530, Pavan Deolasee wrote:
> On 2/21/07, Simon Riggs <simon@2ndquadrant.com> wrote:
> > I very much like Hannu's idea, but it does present some issues.
>
> I too liked Hannu's idea initially, but Tom raised a valid concern
> that it does not address the basic issue of root tuples. According
> to the idea, a DEAD root tuple can be used for a subsequent
> update of the same row. For a very large table, even if its updated
> frequently, it is not unlikely that the same row might not be updated
> for a long time.

That's a fair point and pointer swinging would address that. However,
the space wastage is identical to the current situation. We need to
choose which use-case we are optimising for: the case you mention would
be optimising for high volume but very thinly spread updates.
Personally, I don't see that as the most important use case out of the
many possibilities. The problem we are trying to address is rows that
*are* very frequently updated. There are so many sub-cases it's easy to
get distracted about which ones are actually causing usage problems
right now.

Anyway, I take the point that pointer swinging is necessary in the long
term, but my question is whether we need it yet.

> Even when the update happens we would be
> constrained by the length of the new version being same or less
> than the root tuple OR ability to perform retail-vacuum of the block.

Well, that's a more important question: surely we have agreed that
retail VACUUM is both possible and beneficial?

> Did you or anybody else got a chance to think about the other idea
> I proposed of having indirection from the root line pointer ?

Well yes, I saw that, but I was pointing out that we don't need to use
just a single byte if we have a part of the tuple header that only
exists in these circumstances.

> As I
> mentioned earlier, I myself haven't thought through it completely,
> but at the face of it, it looks doable.
> It would add a four-byte
> overhead per live tuple-chain, but IMHO would be much simpler
> to implement and not too invasive.

Cool.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
> > I very much like Hannu's idea, but it does present some issues.
>
> I too liked Hannu's idea initially, but Tom raised a valid
> concern that it does not address the basic issue of root
> tuples. According to the idea, a DEAD root tuple can be used
> for a subsequent update of the same row.

If you are reusing the existing slot of a root tuple, how will that
slot likely have room for an extra pointer and a live tuple ?
If the idea does not cover root reuse, we don't need pointers.

Imho we should follow the swing idea. It should even be possible to
point the root to the newest dead tuple during update (if it was the
index path), and reuse an older dead slot from the chain. Then we can
limit the chain to the number of potentially visible tuples + root + 2
without vacuum.

Andreas
On 2/22/07, Zeugswetter Andreas ADI SD <ZeugswetterA@spardat.at> wrote:
> > I very much like Hannu's idea, but it does present some issues.
> >
> >
> I too liked Hannu's idea initially, but Tom raised a valid
> concern that it does not address the basic issue of root
> tuples. According to the idea, a DEAD root tuple can be used
> for a subsequent update of the same row.
If you are reusing the existing slot of a root tuple how will that
slot likely have room for an extra pointer and a live tuple ?
If the idea does not cover root reuse we don't need pointers.
Hannu talked about using one of xmin/xmax for storing
back-pointers. There were objections to that since it breaks
the xmax/xmin matching robustness that we have today.
> Imho we should follow the swing idea.
Yes, that's one option. Though given a choice, I would rather waste
four bytes in the heap page than insert a new index entry.
The heap tuples can be vacuumed rather more easily than the index
entries which, if I am not mistaken, can not be reused even after
being marked LP_DELETEd.
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
> > Imho we should follow the swing idea.
>
> Yes, thats one option. Though given a choice I would waste
> four bytes in the heap-page than inserting a new index entry.

No question about that. My point was that it would mean wasting the
2 bytes (2 must be enough for a slot pointer) on every heap tuple, hot
or not. And then the decision is not so obvious anymore. If you don't
have the room for the back pointer on every slot, there is no room to
add one later.

Andreas
On 2/22/07, Zeugswetter Andreas ADI SD <ZeugswetterA@spardat.at> wrote:
> > Yes, that's one option. Though given a choice I would waste
> > four bytes in the heap-page than inserting a new index entry.
>
> No question about that. My point was, that it would mean wasting
> the 2 (2 must be enough for a slot pointer) bytes on every heap tuple,
> hot or not. And then the decision is not so obvious anymore.
> If you don't have the room for the back pointer on every slot,
> there is no room to add one later.
Oh yes, I agree. I was referring to the idea of line pointer
redirection, which would waste four bytes (for the root line pointer)
per hot-update chain. That occurs only when a tuple is hot-updated,
so there is no overhead for normal tuples. Also, since it's outside
the tuple header, we don't have issues of additional space wastage
because of alignment.

We would need to teach the code to ignore all such pointers which
don't point to a real tuple, but only redirect us to another line
pointer. We arrive at this line pointer from the index and then get
redirected to another line pointer on the same page. Vacuum would need
to delay freeing this line pointer until the hot-update chain is dead.
I am waiting for feedback on this since I would like to work on
this next. Does anybody see any issue with this approach ? Comments
please.
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
> > > Yes, thats one option. Though given a choice I would waste four
> > > bytes in the heap-page than inserting a new index entry.
> >
> > No question about that. My point was, that it would mean wasting
> > the 2 (2 must be enough for a slot pointer) bytes on every heap
> > tuple, hot or not. And then the decision is not so obvious anymore.
> > If you don't have the room for the back pointer on every slot,
> > there is no room to add one later.
>
> Oh yes, I agree. I was referring to the idea of line pointer
> redirection which would waste four bytes (for the root line
> pointer) per hot-update chain. That occurs only when a tuple
> is hot-updated.
> So there is no overhead for normal tuples.

I think you are still misunderstanding me, sorry if I am not being
clear enough. When the row is hot-updated it is too late. You do not
have room in the root for the line pointer.

Say a table's tuple size is typically 40 bytes. Your root slot has room
for exactly 40 bytes. When the new tuple also (as expected) needs 40
bytes, there is no room for the line pointer. Or are you suggesting
that vacuum resizes all root tuples to make room for the extra bytes,
and only then can you use it ? Imho that would not be good.

Imho the "relink root to newest dead tuple during update" path is more
promising, and does not depend on vacuum.

If we do reuse dead tuples without vacuum we should probably, as
already suggested, disconnect the "what is dead enough for
reuse/vacuum" from global xmin right from the start.

Andreas
On 2/22/07, Zeugswetter Andreas ADI SD <ZeugswetterA@spardat.at> wrote:
> I think you are still misunderstanding me, sorry if I am not being
> clear enough. When the row is hot-updated it is too late. You do not
> have room in the root for the line pointer.
I think the word "line pointer" is causing some confusion here. Let me
explain the idea again: each page has a set of line pointers OR
item-ids, as they are referred to in the code (I shall use the word
item-id hereafter). The item-id stores the offset (15 bits), the
length (15 bits) and two flags, LP_USED and LP_DELETE (2 bits).
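[Editorial note: the item-id layout described here can be sketched as a bit-packing exercise. The exact bit positions, flag values and helper names below are assumptions for illustration, not the server's actual ItemIdData encoding; only the field widths (15 + 15 + 2 bits) come from the message.]

```c
#include <assert.h>
#include <stdint.h>

/* Flag values are invented for the sketch. */
#define LP_USED   0x1
#define LP_DELETE 0x2

/* Pack offset (15 bits), length (15 bits) and flags (2 bits) into one
 * 32-bit word, low to high.  Bit positions are an assumption. */
static uint32_t pack_itemid(unsigned off, unsigned len, unsigned flags)
{
    return (off & 0x7FFFu) | ((len & 0x7FFFu) << 15) | ((flags & 0x3u) << 30);
}

static unsigned itemid_offset(uint32_t id) { return id & 0x7FFF; }
static unsigned itemid_length(uint32_t id) { return (id >> 15) & 0x7FFF; }
static unsigned itemid_flags(uint32_t id)  { return (id >> 30) & 0x3; }
```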
The root tuple is pointed to by some item-id (say, I1) on the page.
When the tuple is hot-updated, a heap-only tuple is added which is
linked to the root tuple by its t_ctid and is pointed to by another
item-id, I2, on the page.
Now, when the root tuple subsequently becomes DEAD, I am proposing
to store I2 in the offset field of I1. The length field in I1 can be set to
zero (or some other special value) to mark that I1 is now just a
redirection to I2 and does not point to any real tuple, dead or live.
The actual storage used by the root tuple is now released (or we can
add another item pointer to keep track of it so that it can be reused
without a vacuum of the page, but let's defer that for the time being).
In short, the pointer is NOT stored in the tuple header, but in the
item-id.
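[Editorial note: a toy version of the redirection rule just described — a zero length marks the item-id as a redirect, and its offset field then holds the target item number rather than a byte position. The struct and names are invented stand-ins for the real item-id, and only a single redirect hop is followed, as in the proposal.]

```c
#include <assert.h>
#include <stdint.h>

/* A simplified item-id: offset and length only. */
typedef struct {
    uint16_t offset;
    uint16_t length;   /* 0 marks this slot as a redirect */
} ItemIdSketch;

static int is_redirect(ItemIdSketch id)
{
    return id.length == 0;
}

/* Resolve an item number arriving from an index: follow at most one
 * redirect (the redirect target is a real tuple). */
static unsigned resolve_item(const ItemIdSketch *items, unsigned itemno)
{
    if (is_redirect(items[itemno]))
        return items[itemno].offset;  /* offset field holds I2's number */
    return itemno;
}

/* Tiny demo page: item 1 (the old root, I1) redirects to item 2 (I2),
 * which holds a real tuple at byte offset 8000 with length 40. */
static unsigned demo_resolve(unsigned itemno)
{
    ItemIdSketch items[3] = {
        { 0, 0 },       /* item 0: unused in this demo */
        { 2, 0 },       /* item 1: redirect to item 2 */
        { 8000, 40 },   /* item 2: real tuple */
    };
    return resolve_item(items, itemno);
}
```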
> If we do reuse dead tuples without vacuum we should probably, as
> already suggested, disconnect the "what is dead enough for
> reuse/vacuum" from global xmin right from the start.
I did not get that completely. Can you elaborate on that ?
Once we find a heap-only DEAD tuple, we remove it from the
tuple-chain (and thus remove its only reference) and set LP_DELETE
on its item-id. When we run out of free space in a page, we search for
an item-id whose LP_DELETE is set and whose length is at least equal
to the new tuple length. We reuse that item-id and the associated
storage for storing the new tuple.
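[Editorial note: the reuse search just described — find a slot whose LP_DELETE flag is set and whose length is at least the new tuple's — can be sketched as a linear scan. Types and names are invented for illustration; a real implementation would also consider alignment and first-fit-versus-best-fit policy, which the message leaves open.]

```c
#include <assert.h>
#include <stdint.h>

/* A simplified slot: length plus an LP_DELETE flag. */
typedef struct {
    uint16_t length;
    int lp_delete;
} SlotSketch;

/* When a page runs out of free space, scan for a slot whose LP_DELETE
 * is set and whose length is at least the new tuple's length.
 * Returns the first qualifying slot number, or -1 if none. */
static int find_reusable_slot(const SlotSketch *slots, int nslots,
                              uint16_t need)
{
    for (int i = 0; i < nslots; i++)
        if (slots[i].lp_delete && slots[i].length >= need)
            return i;
    return -1;
}

/* Demo page: one live tuple and two deleted slots of 32 and 64
 * reclaimable bytes. */
static int demo_find(uint16_t need)
{
    SlotSketch slots[3] = {
        { 40, 0 },   /* live tuple: not reusable */
        { 32, 1 },   /* deleted, 32 bytes reclaimable */
        { 64, 1 },   /* deleted, 64 bytes reclaimable */
    };
    return find_reusable_slot(slots, 3, need);
}
```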
I hope things will be more clear once I post the next version of the
HOT-patch. But I really appreciate these comments.
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
> I think the word "line pointer" is causing some confusion
> here. Let me explain the idea again: Each page has a set of
> line pointers OR item-ids as they are referred in the code (I
> shall use the word item-id here after).
> The item-id stores the offset(15 bits), length (15 bits) and
> two flags, LP_USED and LP_DELETE (2 bits).
>
> The root tuple is pointed to by some item-id (say, I1) on the
> page. When the tuple is hot updated, a heap-only tuple is
> added which is linked to the root tuple by its t_ctid and is
> pointed by another item-id I2 on the page.
>
> Now, when the root tuple subsequently becomes DEAD, I am
> proposing to store I2 in the offset field of I1. The length
> field in I1 can be set to zero (or some other special value)
> to mark that I1 is now just a redirection to I2 and does not
> point to any real tuple, dead or live.

Oh, thanks for explaining. I was really misunderstanding. Is that
possible ? I thought item-ids (slots) need to be in physical order.

> > If we do reuse dead tuples without vacuum we should probably,
> > as already suggested, disconnect the "what is dead enough for
> > reuse/vacuum" from global xmin right from the start.
>
> I did not get that completely. Can you elaborate on that ?
>
> Once we find a heap-only DEAD tuple, we remove it from the
> tuple-chain (and thus remove its only reference) and set LP_DELETE
> on its item-id. When we run out of free-space in a page, we search for
> an item-id whose LP_DELETE is set and whose length is atleast equal
> to the new tuple length. We reuse that item-id and the associated
> storage for storing the new tuple.

With the recent discussion about flashback (explicitly setting a
snapshot to 5 min ago) and forensics, I think we should have a way to
delay what is considered "DEAD and available for immediate reuse". I
know this can reduce the efficiency of HOT when delayed too far, but I
think we should have that possibility if we reuse without vacuum.

Andreas