Thread: HOT WIP Patch - version 1
This is a WIP patch based on the recent posting by Simon and discussions
thereafter. We are trying to do one piece at a time and intention is to post
the work ASAP so that we could get early and continuous feedback from
the community. We could then incorporate those suggestions in the next
WIP patch.
To start with, this patch implements HOT-update for a simple case
when there is enough free space in the same block so that it can
accommodate the new version of the tuple. A necessary condition for
doing HOT-update is that none of the index columns is changed.
The old version is marked as HEAP_UPDATE_ROOT and the new
version is marked as HEAP_ONLY_TUPLE. If a tuple is HOT-updated,
no new index entry is added.
When fetching a tuple using an index, if the root tuple is not visible to
the given snapshot, the ctid chain is followed until a visible tuple is found or
end of HOT-update chain is reached. The prior_xmax/next_xmin chain
is validated while following the ctid chain.
This patch is generated on the current CVS head. It passes all the regression
tests, but I haven't measured any performance impact since thats not the
goal for posting this early version. There are several things that are not
yet implemented and there are few unresolved issues for which I am looking
for community help and feedback.
Open Issues:
------------------
- CREATE INDEX needs more work in the HOT context. The existing HOT
tuples may require chilling for the CREATE INDEX to work correctly. There
are concerns about the crash-safety on chilling operation. Few suggestions
were posted in this regard. We need to conclude that and post a working
design/patch.
- We need to find a way to handle DEAD root tuples, either convert them into
stubs or overwrite them with a new version. We can also perform pointer
swinging from the index. Again there are concerns about crash-safety and
concurrent index-scans working properly. We don't have a community
consensus on any of the suggestions in this regard. But hopefully we
would converge on some design soon.
- Retail VACUUM. We need to implement the block-level vacuum for
UPDATEs to find enough free space in the block to do HOT-update.
Though we are still discussing how to handle the dead root tuples, we
should be able to remove any intermediate dead tuples in the HOT-update
chain safely. If we do so without fixing the root tuple, the
prior_xmax/next_xmin chain would be broken. A similar problem exists
with freezing HOT tuples.
Whats Next:
-----------------
In the current implementation, an HOT-updated tuple can not be vacuumed
because it might be in the middle of the access path to the heap-only visible tuple.
This can cause the table to grow rapidly even if autovacuum is turned on. The
HOT-update chain also keeps growing if there is enough free space in the block.
I am thinking of implementing some sort of HOT-update chain squeezing logic
so that intermediate dead tuples can be retired and vacuumed away. This would
also help us keep the HOT-update chain small enough so that the chain following
does not become unduly costly.
I am thinking of squeezing the HOT-update chain while following it in the index fetch.
If the root tuple is dead, we follow the chain until the first LIVE or
RECENTLY_DEAD tuple is found. The ctid pointer in the root tuple is made
point to the first LIVE or RECENTLY_DEAD tuple. All the intermediate
DEAD tuples are marked ~HEAP_UPDATE_ROOT so that they can be vacuumed
in the next cycle. We hold an exclusive lock on the page while doing so. That should
avoid any race conditions. This infrastructure should also help us retail vacuum the
block later.
Please let me know your comments.
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
Attachment
Pavan Deolasee wrote: > - We need to find a way to handle DEAD root tuples, either convert them > into > stubs or overwrite them with a new version. We can also perform pointer > swinging from the index. Again there are concerns about crash-safety and > concurrent index-scans working properly. We don't have a community > consensus on any of the suggestions in this regard. But hopefully we > would converge on some design soon. This seems to be the most fundamental problem we have at the moment. If we stick to the basic rule we have now that a live tuple's ctid doesn't change, the only way to get rid of a dead tuple at the root of the update chain is by changing the index pointer. The backward-pointers Hannu suggested or the "scan the whole page to find the previous tuple" would allow reuse of those dead tuples for new tuples in the chain, but even those methods wouldn't completely eliminate the problem. We could say that in some scenarios we just leave behind some dead tuples/stubs that can never be reclaimed. What do you guys think, if we can bring it down to just an extra line pointer, would that be acceptable? We could also do that for now and implement the pointer-swinging later if it turns out to be a problem in practice. What's the verdict on relaxing the "live tuple's ctid doesn't change rule"? If we did allow that within a page, what would we need to change? Inside the backend we'd have to make sure thatwhenever a ctid is used, the page is kept pinned. How likely is it that it would brake any external projects, and how difficult would it be to detect the broken usage pattern? It would mean that the ctid system column could change within a transaction, unless we change it so that it returns the ctid of the root tuple + xmin + cmin, but it'd be a user-visible change anyhow. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > What's the verdict on relaxing the "live tuple's ctid doesn't change > rule"? I think that's unacceptable; it is known that that will break the ODBC and JDBC drivers, as well as any other programs that make use of the ctid for re-finding a tuple they read earlier in the same transaction. We have not only never deprecated client-side use of ctid for this, but actively encouraged it, for instance by going out of our way to support fast access for queries "WHERE ctid = 'constant'". What's more, your proposal would break plain old UPDATE and DELETE, as well as SELECT FOR UPDATE, none of which promise to hold a pin continuously on every page containing a tuple they might decide to revisit (by ctid) later. Are you prepared to disallow hash join and sort/merge join in all such queries? regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> What's the verdict on relaxing the "live tuple's ctid doesn't change >> rule"? > > I think that's unacceptable; it is known that that will break the ODBC > and JDBC drivers, as well as any other programs that make use of the > ctid for re-finding a tuple they read earlier in the same transaction. AFAIK the JDBC driver doesn't use ctid. But ODBC and other programs do. > We have not only never deprecated client-side use of ctid for this, but > actively encouraged it, for instance by going out of our way to support > fast access for queries "WHERE ctid = 'constant'". The idea I had was to change what the ctid system column returns to root ctid + xmin + cmin. As long as programs treat the ctid as an opaque string, it should work. Tid scan would use that to locate the original tuple. > What's more, your proposal would break plain old UPDATE and DELETE, > as well as SELECT FOR UPDATE, none of which promise to hold a pin > continuously on every page containing a tuple they might decide to > revisit (by ctid) later. Are you prepared to disallow hash join and > sort/merge join in all such queries? No, of course not. We'd have to do the same thing here; use root tid + xmin + cmin instead of just ctid. But now that I think of it, how do we get the root tid of a tuple? I suppose we'd be back to having backpointers or scanning the whole page... I guess pointer-swinging it is, then. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 2/14/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Not that I am suggesting we do this, but I believe we had some solution
to this problem in the earlier version of HOT. The live tuple when copied-back
to the root tuple, the tuple is marked with a HEAP_COPIED_BACK flag.
HeapTupleSatisfiesUpdate() checks for this flag and if set returns a new
return code, HeapTupleCopiedBack. heap_update() returns the same
to ExecUpdate along with the ctid of the root tuple. The UPDATE/DELETE
operation then retried on the root tuple, very similar to read-committed
update/delete. The xmax of the copied-back tuple is set so that its
not vacuumed away until all the current transactions are completed.
Though I have tested this patch several times and it seems to work fine,
I probably don''t have insight into the code as much others on this list
has. So if someone wants to take a look and see if it would work fine,
I would be more than happy to post the latest HOT patch (older design).
Thanks,
Pavan
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> What's the verdict on relaxing the "live tuple's ctid doesn't change
> rule"?
I think that's unacceptable; it is known that that will break the ODBC
and JDBC drivers, as well as any other programs that make use of the
ctid for re-finding a tuple they read earlier in the same transaction.
We have not only never deprecated client-side use of ctid for this, but
actively encouraged it, for instance by going out of our way to support
fast access for queries "WHERE ctid = 'constant'".
What's more, your proposal would break plain old UPDATE and DELETE,
as well as SELECT FOR UPDATE, none of which promise to hold a pin
continuously on every page containing a tuple they might decide to
revisit (by ctid) later. Are you prepared to disallow hash join and
sort/merge join in all such queries?
Not that I am suggesting we do this, but I believe we had some solution
to this problem in the earlier version of HOT. The live tuple when copied-back
HeapTupleSatisfiesUpdate() checks for this flag and if set returns a new
return code, HeapTupleCopiedBack. heap_update() returns the same
to ExecUpdate along with the ctid of the root tuple. The UPDATE/DELETE
operation then retried on the root tuple, very similar to read-committed
update/delete. The xmax of the copied-back tuple is set so that its
not vacuumed away until all the current transactions are completed.
Though I have tested this patch several times and it seems to work fine,
I probably don''t have insight into the code as much others on this list
has. So if someone wants to take a look and see if it would work fine,
I would be more than happy to post the latest HOT patch (older design).
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
> What's the verdict on relaxing the "live tuple's ctid doesn't > change rule"? If we did allow that within a page, what would > we need to change? I already said this, but why would this need to be visible from the outside ? A few assumptions:no back pointersindexes only point at slots marked as roots (and non hot tuples) During vacuum, you swap the tuples and keep a stub at the slot that the user's ctid might be pointing at. You mark the stub to detect this situation. When a select/update by ctid comes along it needs to do one step to the root and use that tuple instead. It needs a second vacuum (or a per page vacuum during update) to remove the extra stub when it is dead and not recently dead. I fail to see the hole. Andreas
> But now that I think of it, how do we get the root tid of a > tuple? I suppose we'd be back to having backpointers or > scanning the whole page... I guess pointer-swinging it is, then. During vacuum you see a root [stub] not recently dead. You follow the chain to detect if you find a live tuple that can replace the root. You replace the root. You replace the original with a stub that points at the root and mark it recently dead (and HEAP_COPIED_BACK aka Pavan). ... (see prev post) No need for anyone but vacuum to find a root. Andreas
Just to summarize: o Every tuple gets a heap ctido Only the root tuple gets an index entryo We can easily remove dead tuples that aren'tthe root because by definition, nothing points to them, including backends and indexes The problem is that a dead root tuple has to stay around because while no backends can see it, the index does. We could move a live tuple into root ctid slot, but if we do that, the live tuple changes its ctid while it is visible. Could we insert index tuples for the live tuple and then remove the root tuple, perhaps later? So basically we break the chain at that time. The problem there is that we basically have nothing better than what we have now --- we are just delaying the index insert, and I don't see what that buys us. Could a _new_ tuple take over the root tuple slot? It is new, so it doesn't have a ctid yet to change. I think that means the index walking could go forward or backward, but on the same page. To illustrate, with ctid slot numbers:[1] root INSERT[2][3][1] root INSERT[2] UPDATE[3][1] root INSERT (dead)[2] UPDATE 1[3][1] root UPDATE2[2] UPDATE 1[3] --------------------------------------------------------------------------- Heikki Linnakangas wrote: > Pavan Deolasee wrote: > > - We need to find a way to handle DEAD root tuples, either convert them > > into > > stubs or overwrite them with a new version. We can also perform pointer > > swinging from the index. Again there are concerns about crash-safety and > > concurrent index-scans working properly. We don't have a community > > consensus on any of the suggestions in this regard. But hopefully we > > would converge on some design soon. > > This seems to be the most fundamental problem we have at the moment. If > we stick to the basic rule we have now that a live tuple's ctid doesn't > change, the only way to get rid of a dead tuple at the root of the > update chain is by changing the index pointer. The backward-pointers > Hannu suggested or the "scan the whole page to find the previous tuple" > would allow reuse of those dead tuples for new tuples in the chain, but > even those methods wouldn't completely eliminate the problem. > > We could say that in some scenarios we just leave behind some dead > tuples/stubs that can never be reclaimed. What do you guys think, if we > can bring it down to just an extra line pointer, would that be > acceptable? We could also do that for now and implement the > pointer-swinging later if it turns out to be a problem in practice. > > What's the verdict on relaxing the "live tuple's ctid doesn't change > rule"? If we did allow that within a page, what would we need to change? > Inside the backend we'd have to make sure that whenever a ctid is > used, the page is kept pinned. How likely is it that it would brake any > external projects, and how difficult would it be to detect the broken > usage pattern? It would mean that the ctid system column could change > within a transaction, unless we change it so that it returns the ctid of > the root tuple + xmin + cmin, but it'd be a user-visible change anyhow. > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Wed, Feb 14, 2007 at 01:56:03PM -0500, Bruce Momjian wrote: > Could we insert index tuples for the live tuple and then remove the root > tuple, perhaps later? So basically we break the chain at that time. > The problem there is that we basically have nothing better than what we > have now --- we are just delaying the index insert, and I don't see what > that buys us. At some point - inserting into the block would not be possible, as there is no free space. Would that be a good time to do the index insert? Then, a later vacuum would eventually prune out the whole old chain. As long as vacuuming the intermediate entries in the chain keeps the block with free space, there is no need to remove the root tuple. If space ever runs out (vacuum not running frequently enough - many updates performed in the same interval) - fall back to the mechanism that is being used today. I see it buying increased performance for rows that are frequently updated. If it can delay modifying the indices to only once every 10 or more updates, it seems to me that the improvement should be significant. Perhaps PostgreSQL could be used for page hit counters again... :-) Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bindthem... http://mark.mielke.cc/
Zeugswetter Andreas ADI SD wrote: > A few assumptions: > no back pointers > indexes only point at slots marked as roots (and non hot tuples) > > During vacuum, you swap the tuples and keep a stub at the slot that the > user's ctid might be pointing at. You mark the stub to detect this > situation. > When a select/update by ctid comes along it needs to do one step to the > root > and use that tuple instead. As Pavan pointed out, that's more or less what he ended up doing originally. You need to mark the stub with the current most recent xid, and wait until that's no longer running. Only after that you can remove the stub. > It needs a second vacuum (or a per page vacuum during update) to remove > the > extra stub when it is dead and not recently dead. Requiring two vacuums to remove the tuple sounds bad at first, but it's actually not so bad since both steps could by done by retail vacuum, or even normal scans while. > I fail to see the hole. The only potential problem I can see is how to make sure that a heap scan or a bitmap heap scan doesn't visit the tuple twice. If we make sure that the page is scanned in one go while keeping the buffer pinned, we're good. We already do that except for system catalogs, so I believe we'd have to forbid hot updates on system tables, like we forbid bitmap scans. To me this sounds like the best idea this far. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Bruce Momjian wrote: > Just to summarize: > > o Every tuple gets a heap ctid > o Only the root tuple gets an index entry > o We can easily remove dead tuples that aren't the root because > by definition, nothing points to them, including backends and > indexes I am still wondering about the "easily" here. Basically this needs some kind of wal entry to be crash safe. Else some later tx might reuse the slot:- some update on page produces page image in wal- slot removed- slot reused and comitted-page not written- crash- wal fullpage restores the page to the version before slot removed(- would need a wal replay for slot removed from hot chain here)- wal restores slot reuse, but the slot is now partof a wrong hot chain and the chain is broken (unless we have the above step) Do we have this wal entry ? > The problem is that a dead root tuple has to stay around > because while no backends can see it, the index does. We > could move a live tuple into root ctid slot, but if we do > that, the live tuple changes its ctid while it is visible. > > Could we insert index tuples for the live tuple and then > remove the root tuple, perhaps later? So basically we break > the chain at that time. > The problem there is that we basically have nothing better > than what we have now --- we are just delaying the index > insert, and I don't see what that buys us. yes, not really promising. > Could a _new_ tuple take over the root tuple slot? It is > new, so it doesn't have a ctid yet to change. I think that > means the index walking > could go forward or backward, but on the same page. To illustrate, > with ctid slot numbers: > > [1] root INSERT > [2] > [3] > > [1] root INSERT > [2] UPDATE > [3] > > [1] root INSERT (dead) > [2] UPDATE 1 > [3] > > [1] root UPDATE 2 > [2] UPDATE 1 > [3] Imho no, because that would need a back pointer in [1] to [2] so that readers arriving at [1] (e.g. through index) can see [2] until [1] is visible. I think we don't want backpointers. (rather go the tuple swapping route) Andreas
Zeugswetter Andreas ADI SD wrote: > I am still wondering about the "easily" here. Basically this > needs some kind of wal entry to be crash safe. > > Else some later tx might reuse the slot: > - some update on page produces page image in wal > - slot removed > - slot reused and comitted > - page not written > - crash > - wal fullpage restores the page to the version before slot > removed > (- would need a wal replay for slot removed from hot chain here) > - wal restores slot reuse, but the slot is now part of a wrong > hot chain > and the chain is broken (unless we have the above step) > > Do we have this wal entry ? We already log tuple removals by normal vacuums. We can't use that wal entry as it is: if a dead tuple is in the middle of an update chain, it needs to be unlinked from the chain. But I don't see any particular problem with that, it just needs to be wal logged like every other data changing operation. Do we actually ever want to remove dead tuples from the middle of the chain? If a tuple in the middle of the chain is dead, surely every tuple before it in the chain is dead as well, and we want to remove them as well. I'm thinking, removing tuples from the middle of the chain can be problematic, because we'd need to fiddle with the xmin/xmax of the other tuples to make them match. Or change the tuple-following logic to not do the xmin=xmax check, but it's a nice robustness feature. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 2/15/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
I am precisely working on this right now. In the next patch version that I
intend to send shortly, I am thinking of removing the dead tuples in the
middle of the chain. We don't have agreement on how to deal with the
root tuple, but we can safely remove the intermediate dead tuples and
vacuum them. Also when all the tuples in the chain are dead because the
last tuple is either deleted or COLD updated, the entire chain along with
the root tuple and the index entry can be vacuumed.
The operation must be WAL logged and you caught the xmin/xmax
problem very rightly. One option is to change the xmax of root tuple
to the xmin of the first live/recently-dead tuple, if we remove a set of
intermediate dead tuples. This xmin of the first live/recently-dead tuple
is also the xmax of the last dead tuple we removed and hence must
be older than the oldtestXmin. So assigning that to the root tuple
should not break any visibility rules for the root tuple (it would still
be dead).
Do we see any problem with this ?
Thanks,
Pavan
Do we actually ever want to remove dead tuples from the middle of the
chain? If a tuple in the middle of the chain is dead, surely every tuple
before it in the chain is dead as well, and we want to remove them as
well. I'm thinking, removing tuples from the middle of the chain can be
problematic, because we'd need to fiddle with the xmin/xmax of the other
tuples to make them match. Or change the tuple-following logic to not do
the xmin=xmax check, but it's a nice robustness feature.
I am precisely working on this right now. In the next patch version that I
intend to send shortly, I am thinking of removing the dead tuples in the
middle of the chain. We don't have agreement on how to deal with the
root tuple, but we can safely remove the intermediate dead tuples and
vacuum them. Also when all the tuples in the chain are dead because the
last tuple is either deleted or COLD updated, the entire chain along with
the root tuple and the index entry can be vacuumed.
The operation must be WAL logged and you caught the xmin/xmax
problem very rightly. One option is to change the xmax of root tuple
to the xmin of the first live/recently-dead tuple, if we remove a set of
intermediate dead tuples. This xmin of the first live/recently-dead tuple
is also the xmax of the last dead tuple we removed and hence must
be older than the oldtestXmin. So assigning that to the root tuple
should not break any visibility rules for the root tuple (it would still
be dead).
Do we see any problem with this ?
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
Ühel kenal päeval, N, 2007-02-15 kell 10:49, kirjutas Heikki Linnakangas: > We already log tuple removals by normal vacuums. We can't use that wal > entry as it is: if a dead tuple is in the middle of an update chain, it > needs to be unlinked from the chain. But I don't see any particular > problem with that, it just needs to be wal logged like every other data > changing operation. > > Do we actually ever want to remove dead tuples from the middle of the > chain? If a tuple in the middle of the chain is dead, surely every tuple > before it in the chain is dead as well, and we want to remove them as > well. I'm thinking, removing tuples from the middle of the chain can be > problematic, because we'd need to fiddle with the xmin/xmax of the other > tuples to make them match. Or change the tuple-following logic to not do > the xmin=xmax check, but it's a nice robustness feature. What kind of robustness does it provide ? In other words - what failure scenario does this guard against ? I can't see the case where the xmin=xmax check can not succeed, at least not for same page tuples. -- ---------------- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
On Fri, 2007-02-16 at 09:36 +0200, Hannu Krosing wrote: > Ühel kenal päeval, N, 2007-02-15 kell 10:49, kirjutas Heikki > Linnakangas: > > > We already log tuple removals by normal vacuums. We can't use that wal > > entry as it is: if a dead tuple is in the middle of an update chain, it > > needs to be unlinked from the chain. But I don't see any particular > > problem with that, it just needs to be wal logged like every other data > > changing operation. > > > > Do we actually ever want to remove dead tuples from the middle of the > > chain? If a tuple in the middle of the chain is dead, surely every tuple > > before it in the chain is dead as well, and we want to remove them as > > well. I'm thinking, removing tuples from the middle of the chain can be > > problematic, because we'd need to fiddle with the xmin/xmax of the other > > tuples to make them match. Or change the tuple-following logic to not do > > the xmin=xmax check, but it's a nice robustness feature. > > What kind of robustness does it provide ? In other words - what failure > scenario does this guard against ? > > I can't see the case where the xmin=xmax check can not succeed, at least > not for same page tuples. The xmin == xmax case was required to support a complex VACUUM FULL failure case discovered some time ago. I'll post the details. With HOT, ISTM that testing this becomes even more important in some form because of the extra complexity we're putting into that part of the code. We need some way of detecting a corruption introduced by some weird use case we haven't thought of. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Thu, 2007-02-15 at 10:49 +0000, Heikki Linnakangas wrote: > Do we actually ever want to remove dead tuples from the middle of the > chain? If a tuple in the middle of the chain is dead, surely every tuple > before it in the chain is dead as well, and we want to remove them as > well. I'm thinking, removing tuples from the middle of the chain can be > problematic, because we'd need to fiddle with the xmin/xmax of the other > tuples to make them match. Or change the tuple-following logic to not do > the xmin=xmax check, but it's a nice robustness feature. I think the phrase "middle of the chain" is being used in two ways here. If a tuple in the chain is dead, then all prior tuples are dead also. That gives us a live portion of the chain and a dead portion of the chain. We will only ever need to follow the chain along the live portion, so breaking the live portion of the chain is not allowable. Breaking the chain in the dead portion *is* allowable and that is what is being proposed here. We break the chain rather than simply remove all of it because there may still be index pointers to some of the dead rows. Pavan's code does this now. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com