Thread: HOT patch - version 14
Hi All,
Please see the version 14 of HOT patch attached. I have also
attached a differential patch from the last version posted. This
now includes support for partial and expression indexes. As per
the discussion, we collect all the index columns appearing in
the partial index predicates and/or expression index expressions.
If any of these columns are being updated, the update does not
qualify for HOT update.
I also took this opportunity to remove the modularity invasion caused
by heap_check_idxupdate() since it was using resultRelInfo. We now
build the list of attributes that must be checked to satisfy HOT update.
This list includes all the index columns, columns in the partial index
predicates and expression index expressions and is built in the
executor.
heap_check_idxupdate() has been renamed to HeapSatisfiesHOTUpdate().
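For illustration, the check might look roughly like this (a sketch only; the
helper heap_tuple_attr_equals() and the exact signature are assumptions, not
necessarily what the patch does):

    /*
     * Sketch (not the actual patch code): an UPDATE qualifies as HOT only
     * if none of the "interesting" attributes -- index columns, columns in
     * partial index predicates, and columns used in index expressions --
     * changed between the old and new tuple versions.
     */
    static bool
    HeapSatisfiesHOTUpdate(Relation relation, List *hot_attrs,
                           HeapTuple oldtup, HeapTuple newtup)
    {
        ListCell   *lc;

        foreach(lc, hot_attrs)
        {
            int         attnum = lfirst_int(lc);

            /* hypothetical helper: binary comparison of one attribute */
            if (!heap_tuple_attr_equals(RelationGetDescr(relation),
                                        attnum, oldtup, newtup))
                return false;   /* an interesting column changed */
        }
        return true;            /* update can be done the HOT way */
    }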
The patch also includes a bug fix in the CREATE INDEX code path.
There was a race between CREATE INDEX and a concurrent chain pruning
operation. Since CREATE INDEX works on SnapshotAny, a tuple returned
by the heap-scan may get pruned (and marked ~LP_USED) before it can
be inspected later. So we must check for LP_USED after reacquiring
the buffer lock.
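The fix amounts to a recheck along these lines (a sketch with assumed
variable names, not the patch verbatim):

    /*
     * A tuple returned by a SnapshotAny heap-scan may be pruned (its line
     * pointer marked !LP_USED) while we do not hold the buffer lock, so we
     * must validate the line pointer after reacquiring the lock.  This
     * fragment is assumed to run inside the scan loop.
     */
    LockBuffer(buffer, BUFFER_LOCK_SHARE);
    lp = PageGetItemId(BufferGetPage(buffer),
                       ItemPointerGetOffsetNumber(&tuple->t_self));
    if (!ItemIdIsUsed(lp))
    {
        /* concurrently pruned away: skip this tuple */
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        continue;
    }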
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On 8/20/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> I see that you have a separate bitmapset to keep track of indexes on
>> system attributes. But having an index on a system attribute doesn't
>> make any sense, does it?
>
> Counterexample: OID.
Right. Further, we allow creating indexes on system attributes. So we
must support those.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > Please see the version 14 of HOT patch attached. I expected to find either a large new README, or some pretty substantial additions to existing README files, to document how this all works. The comments included do not represent nearly enough documentation. One thing I was unable to discern from the comments is how CREATE INDEX can work at all. A new index might mean that tuples that could formerly legally be part of the same hot-chain no longer can. I can't find any code here that looks like it's addressing that. I also don't think I believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated tuples: what if the index completion commits, but the concurrent delete rolls back? Then you've got a valid tuple that's not in the index. > I also took this opportunity to remove the modularity invasion caused > by heap_check_idxupdate() since it was using resultRelInfo. We now > build the list of attributes that must be checked to satisfy HOT update. > This list includes all the index columns, columns in the partial index > predicates and expression index expressions and is built in the > executor. The executor is the wrong place for that: I'm not sure why you think that making heapam depend on the executor is a modularity improvement. Furthermore this approach requires recalculating the list during each query, which is wasteful when it could only change during schema updates. I'd suggest making the relcache responsible for computing and saving this data, along the same lines as RelationGetIndexList(). (That also means that heapam can get the data from the relcache, saving a lot of API-bloating from passing around Attrids explicitly.) Also, rather than inventing Attrids, I'd suggest just using one Bitmapset with the convention that its contents are offset by FirstLowInvalidHeapAttributeNumber. The redefinition of the value of MaxHeapTuplesPerPage seems pretty ugly and unprincipled. I think it'd be better to leave it as-is, and just enforce that we don't allow more than that many line pointers on a heap page (as indeed you have to do anyway, so it's not clear what the change is buying). Is it really necessary to add hot_update to xl_heap_update? Seems the information should be available from the tuple header fields. Have you demonstrated that the "prune_hard" logic is worth its weight? Considering that many of us want to drop VACUUM FULL, adding more complexity to it doesn't seem like a profitable use of time. Is it really safe, or productive, to run heap_page_prune when the buffer is not locked for cleanup (ie, there are other people with pins on it)? Even if it's safe, ISTM what you're mostly accomplishing there is to expend a lot of cycles while holding exclusive lock on the page, when there is good reason to think that you're blocking other people who are interested in using the page. Eliminating the separation between that and cleanup would also allow eliminating the separate "PD_FRAGMENTED" page state. PlanSetValidForThisTransaction is completely bogus --- it's not re-entrant, and it needs to be. I think you need some state in PlannerGlobal instead. rd_avgfsm seems fairly bogus ... when does it get updated? regards, tom lane
On 8/30/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> Please see the version 14 of HOT patch attached.
I expected to find either a large new README, or some pretty substantial
additions to existing README files, to document how this all works.
The comments included do not represent nearly enough documentation.
I shall take that up. There are couple of documents posted by Heikki and Greg,
apart from several email threads and original design doc by Simon. I shall
consolidate everything in a single README.
> One thing I was unable to discern from the comments is how CREATE INDEX
> can work at all. A new index might mean that tuples that could formerly
> legally be part of the same hot-chain no longer can. I can't find any
> code here that looks like it's addressing that.
You are right - a new index might mean that an existing HOT chain
is broken as far as the new index is concerned. The way we address
that is by indexing the root tuple of the chain, but the index key is
extracted from the last tuple in the chain. The index is marked "unusable"
for all those existing transactions which can potentially see any intermediate
tuples in the chain.
Please see this document written by Greg Stark. He has nicely summarized
how CREATE INDEX and CREATE INDEX CONCURRENTLY work with HOT:
http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php
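To illustrate the scheme (a rough sketch; the variable names and exact call
sequence are assumptions, not the patch's code): the index entry carries the
root tuple's TID while the key is computed from the tuple at the end of the
chain.

    /* key values are extracted from the tuple at the end of the chain */
    FormIndexDatum(indexInfo, slot_for_chain_tail, estate, values, isnull);

    /* ...but the entry points at the root line pointer of the chain */
    ItemPointerSet(&rootTid,
                   ItemPointerGetBlockNumber(&tailtup->t_self),
                   root_offnum);
    index_insert(indexRelation, values, isnull, &rootTid,
                 heapRelation, checkUnique);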
> I also don't think I
> believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated
> tuples: what if the index completion commits, but the concurrent delete
> rolls back? Then you've got a valid tuple that's not in the index.
Since CREATE INDEX works with ShareLock on the relation, we can
safely assume that we can't find any DELETE_IN_PROGRESS tuples except
those deleted by our own transaction. The only exception is system
relations, but we don't do HOT updates on system relations. Given that,
it seems OK to me to ignore these tuples, because if the transaction
aborts, CREATE INDEX is aborted as well. Am I overlooking something here?
There is a comment to this effect in the current code as well.
>> I also took this opportunity to remove the modularity invasion caused
>> by heap_check_idxupdate() since it was using resultRelInfo. We now
>> build the list of attributes that must be checked to satisfy HOT update.
>> This list includes all the index columns, columns in the partial index
>> predicates and expression index expressions and is built in the
>> executor.
>
> The executor is the wrong place for that: I'm not sure why you think
> that making heapam depend on the executor is a modularity improvement.
Earlier (in the previous version of HOT patch) we were passing ResultRelInfo to heap_update, which was more ugly. The comment is in that context :-)
> Furthermore this approach requires recalculating the list during
> each query, which is wasteful when it could only change during schema
> updates. I'd suggest making the relcache responsible for computing and
> saving this data, along the same lines as RelationGetIndexList().
> (That also means that heapam can get the data from the relcache, saving
> a lot of API-bloating from passing around Attrids explicitly.)
>
> Also, rather than inventing Attrids, I'd suggest just using one
> Bitmapset with the convention that its contents are offset by
> FirstLowInvalidHeapAttributeNumber.
I liked all these suggestions. I know I thought about computing
the attribute list in the relcache, but probably avoided it to keep
things simple. I shall make these changes.
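The offset convention can be sketched like this (illustrative helpers only,
not actual patch code):

    /*
     * A single Bitmapset can hold both regular and system attribute
     * numbers if each attno is offset by FirstLowInvalidHeapAttributeNumber
     * (a negative constant), mapping system attributes onto non-negative
     * bit positions.
     */
    static Bitmapset *
    add_interesting_attr(Bitmapset *attrs, AttrNumber attnum)
    {
        return bms_add_member(attrs,
                              attnum - FirstLowInvalidHeapAttributeNumber);
    }

    static bool
    attr_is_interesting(Bitmapset *attrs, AttrNumber attnum)
    {
        return bms_is_member(attnum - FirstLowInvalidHeapAttributeNumber,
                             attrs);
    }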
> The redefinition of the value of MaxHeapTuplesPerPage seems pretty
> ugly and unprincipled. I think it'd be better to leave it as-is,
> and just enforce that we don't allow more than that many line pointers
> on a heap page (as indeed you have to do anyway, so it's not clear
> what the change is buying).
The only reason to redefine MaxHeapTuplesPerPage upwards is that
HOT allows us to accommodate more tuples per page because of
redirect-dead line pointers. For a table with sufficiently large tuples,
the original bound works well, but for very small tuples we might
hit the line pointer limit even when there is free space available in the
page. Doubling the value was chosen as a balance between heap page
utilization, line pointer bloat, and overhead for bitmap scans. But
I agree that the factor choice is rather arbitrary.
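For reference, the existing bound is derived roughly as follows (quoted from
memory from access/htup.h, so treat it as approximate); the patch merely
doubles the result:

    /*
     * Roughly the existing definition: a page can hold at most this many
     * tuples even if they carry no data at all, since each one costs at
     * least an aligned tuple header plus a line pointer.
     */
    #define MaxHeapTuplesPerPage    \
        ((int) ((BLCKSZ - sizeof(PageHeaderData)) / \
                (MAXALIGN(offsetof(HeapTupleHeaderData, t_bits)) + \
                 sizeof(ItemIdData))))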
> Is it really necessary to add hot_update to xl_heap_update? Seems the
> information should be available from the tuple header fields.
I think it's not necessary. The reason I did it that way is that the new
block might be backed up in the WAL record, and hence we might have
recorded the new tuple infomasks. But we can surely handle that corner
case. I shall fix this.
> Have you demonstrated that the "prune_hard" logic is worth its weight?
> Considering that many of us want to drop VACUUM FULL, adding more
> complexity to it doesn't seem like a profitable use of time.
The prune_hard code is a lot simpler in this version (it was horrible
before Heikki and I reworked the pruning logic). We just follow the HOT
chain to the last definitely DEAD tuple and remove all tuples preceding
it. This simplifies the second phase of VACUUM FULL, since
we don't need to handle any HOT chains broken by intermediate
DEAD tuples. Also, prune_hard allows us to clean up the redirected
line pointers.
Another reason for doing this is to avoid any significant changes to the
VACUUM FULL code, since that code is complex and we might
just drop it in the future. But until then, we need to make sure that
HOT works fine with VACUUM FULL.
> Is it really safe, or productive, to run heap_page_prune when the buffer
> is not locked for cleanup (ie, there are other people with pins on it)?
It looks safe to me. We don't move tuples around during chain pruning.
We only fix the HOT chains by removing intermediate DEAD tuples.
Other places where we follow HOT chains, we do so while holding
at least a SHARE lock on the buffer. So there are no race conditions here.
A place where this bit me is that a heap-scan with SnapshotAny
may return a tuple which gets pruned (and marked ~LP_USED)
before the caller can see it. Fortunately there are only limited users
of SnapshotAny. We make sure that IndexBuildHeapScan confirms
ItemIdIsValid() after acquiring the lock on the buffer. CLUSTER works with
an exclusive lock on the relation, and hence no concurrent pruning is
possible.
> Even if it's safe, ISTM what you're mostly accomplishing there is to
> expend a lot of cycles while holding exclusive lock on the page, when
> there is good reason to think that you're blocking other people who are
> interested in using the page. Eliminating the separation between that
> and cleanup would also allow eliminating the separate "PD_FRAGMENTED"
> page state.
The reason we did it that way is that repairing fragmentation seems
much more costly than pruning. Please note that we prune a single
chain during an index fetch. It's only for heap-scans (and VACUUM) that
we try to prune all chains in the page. So unless we are doing a heap-scan,
I am not sure we are spending too much time holding the exclusive
lock. I agree we don't have any specific numbers to prove that, though.
Another reason for separating these two steps is that pruning
requires an exclusive lock whereas repairing fragmentation requires
a cleanup lock. Also, we want to prune the chains aggressively,
because the visible tuple is most likely at the end of the chain. The
longer the chain, the greater the cost to fetch the tuple.
> PlanSetValidForThisTransaction is completely bogus --- it's not
> re-entrant, and it needs to be. I think you need some state in
> PlannerGlobal instead.
I was not happy with this - frankly, I don't understand the planning code
much. Thanks for pointing me to the appropriate code. I shall try to fix
this, unless you would like to do it yourself.
> rd_avgfsm seems fairly bogus ... when does it get updated?

Yeah, it's not the best solution. What I wanted to do was delay repairing
page fragmentation as far as possible. But I noticed that fsmrel->avgRequest
(rd_avgfsm is initialized to it) starts with a small value (I guess 256).
For a table with large tuples, we may not repair fragmentation at all, thus
causing unnecessary COLD updates.
rd_avgfsm is updated when the relcache is rebuilt. Since we don't
send out any relcache invalidation when fsmrel->avgRequest changes,
rd_avgfsm may never get updated in a session.
Should we just revert to checking whether the free space is less than a
certain percentage of BLCKSZ and trigger defragmentation? Or maybe do a
combination of both?
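In code, the alternative heuristic might look something like this (purely
illustrative; the function name and the 10% figure are made up):

    /*
     * Illustration only: trigger pruning/defragmentation when the free
     * space on the page falls below a fixed fraction of BLCKSZ, instead
     * of (or in addition to) comparing against rd_avgfsm.
     */
    if (PageGetFreeSpace(page) < BLCKSZ / 10)
        try_to_prune_and_defragment(relation, buffer);  /* hypothetical */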
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > "Pavan Deolasee" <pavan.deolasee@gmail.com> writes: >> Please see the version 14 of HOT patch attached. > > I expected to find either a large new README, or some pretty substantial > additions to existing README files, to document how this all works. > The comments included do not represent nearly enough documentation. The Heikki and I posted a two-part README of sorts: http://archives.postgresql.org/pgsql-patches/2007-07/msg00142.php http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php > One thing I was unable to discern from the comments is how CREATE INDEX > can work at all. A new index might mean that tuples that could formerly > legally be part of the same hot-chain no longer can. I can't find any > code here that looks like it's addressing that. This is one of the weirdest things in HOT and one of the hardest problems it faced. The best solution proposed was to index the head of the HOT chain and just ban older transactions which might be able to see the older versions fro using the index. That's the purpose of the indcreatexid column. It's set in index_set_createxid() and checked in get_relation_info(). > I also don't think I believe the reasoning for not indexing > DELETE_IN_PROGRESS hot-updated tuples: what if the index completion commits, > but the concurrent delete rolls back? Then you've got a valid tuple that's > not in the index. You're talking about the concurrent index build case? Wouldn't the second pass pick up that tuple? I have to look back at it to see for sure. > The redefinition of the value of MaxHeapTuplesPerPage seems pretty > ugly and unprincipled. I think it'd be better to leave it as-is, > and just enforce that we don't allow more than that many line pointers > on a heap page (as indeed you have to do anyway, so it's not clear > what the change is buying). Note that in a heavily updated table basically every tuple will have two line pointers, the head which just redirects to another line pointer, and the line pointer for the actual tuple. It's hard to get rid of the actual head line pointer without declaring that the tid might sometimes not change after an update. Not sure what this translates to for MaxHeapTuplesPerPage though. The rest I know less about and will leave to Pavan and Heikki (or anyone else who was following those details more closely). -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
"Gregory Stark" <stark@enterprisedb.com> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > >> "Pavan Deolasee" <pavan.deolasee@gmail.com> writes: >>> Please see the version 14 of HOT patch attached. >> >> I expected to find either a large new README, or some pretty substantial >> additions to existing README files, to document how this all works. >> The comments included do not represent nearly enough documentation. > > The Heikki and I posted a two-part README of sorts: Er, editing error there. Has a ring to it though. > http://archives.postgresql.org/pgsql-patches/2007-07/msg00142.php > http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php ... >> I also don't think I believe the reasoning for not indexing >> DELETE_IN_PROGRESS hot-updated tuples: what if the index completion commits, >> but the concurrent delete rolls back? Then you've got a valid tuple that's >> not in the index. > > You're talking about the concurrent index build case? Wouldn't the second pass > pick up that tuple? I have to look back at it to see for sure. Sorry, that's misguided. The concurrent index build uses snapshots now so it can't see DELETE_IN_PROGRESS. And the non-concurrent index build has an lock so it ought to be back to the way it was before I messed with it where there was an assert against finding *_IN_PROGRESS (except as Pavan points out in the same transaction). -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > You are right - a new index might mean that an existing HOT chain > is broken as far as the new index is concerned. The way we address > that is by indexing the root tuple of the chain, but the index key is > extracted from the last tuple in the chain. The index is marked "unusable" > for all those existing transactions which can potentially see any > intermediate tuples in the chain. I don't think that works --- what if the last tuple in the chain isn't committed good yet? If its inserter ultimately rolls back, you've indexed the wrong value. > Please see this document written by Greg Stark. He has nicely summarized > how CREATE INDEX and CREATE INDEX CONCURRENTLY works with HOT. > http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php Isn't the extra machination for C.I.C. just useless complication? What's the point of avoiding creation of new broken HOT chains when you still have to deal with existing ones? >> I also don't think I >> believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated >> tuples: what if the index completion commits, but the concurrent delete >> rolls back? Then you've got a valid tuple that's not in the index. > Since CREATE INDEX works with ShareLock on the relation, we can > safely assume that we can't find any DELETE_IN_PROGRESS tuples except > those deleted by our own transaction. The only exception is system relation, > but we don't do HOT updates on system relation. That chain of reasoning is way too shaky for me. Essentially what you're saying is you'll promise not to corrupt non-system indexes. Nor am I really thrilled about having to disable HOT for system catalogs. > The only reason to redefine MaxHeapTuplesPerPage to higher side is > because HOT allows us to accommodate more tuples per page because > of redirect-dead line pointers. No, it doesn't allow more tuples per page. It might mean there can be more line pointers than that on a page, but not more actual tuples. The question is whether there is any real use in allowing more line pointers than that --- the limit is already unrealistically high, since it assumes no data content in any of the tuples. If there is a rationale for it then you should invent a different constant MaxLinePointersPerPage or some such, but I really think there's no point. > Doubling the value is chosen as a balance between heap page > utilization, line pointer bloating and overhead for bitmap scans. I'd say it allows a ridiculous amount of line pointer bloat. >> Even if it's safe, ISTM what you're mostly accomplishing there is to >> expend a lot of cycles while holding exclusive lock on the page, when >> there is good reason to think that you're blocking other people who are >> interested in using the page. Eliminating the separation between that >> and cleanup would also allow eliminating the separate "PD_FRAGMENTED" >> page state. > The reason we did it that way because repairing fragmentation seems > much more costly that pruning. Please note that we prune a single > chain during index fetch. Its only for heap-scans (and VACUUM) that > we try to prune all chains in the page. So unless we are doing heap-scan, > I am not sure if we are spending too much time holding the exclusive > lock. I agree we don't have any specific numbers to prove that though. If you don't have numbers proving that this extra complication has a benefit, I'd vote for simplifying it. The SnapshotAny case is going to bite other people besides you in future. 
> Another reasoning behind separating these two steps is: pruning > requires exclusive lock whereas repairing fragmentation requires > cleanup lock. This is nonsense. Those are the same lock. If you have the former and not the latter, it just means that you *know* there is contention for the page. It seems to me that performing optional maintenance work in such a situation is completely wrong. regards, tom lane
On 8/30/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> You are right - a new index might mean that an existing HOT chain
> is broken as far as the new index is concerned. The way we address
> that is by indexing the root tuple of the chain, but the index key is
> extracted from the last tuple in the chain. The index is marked "unusable"
> for all those existing transactions which can potentially see any
> intermediate tuples in the chain.
I don't think that works --- what if the last tuple in the chain isn't
committed good yet? If its inserter ultimately rolls back, you've
indexed the wrong value.
I am confused. How could we get ShareLock on the relation while
there is some open transaction which has inserted a tuple in the
table? (Of course, I am not considering the system tables here.)
If the transaction which inserted the last tuple in the chain is already
aborted, we are not indexing that tuple (even if that tuple is at the
end). We would rather index the last committed-good tuple in the
chain. Running the tuples through HeapTupleSatisfiesVacuum guarantees
that. Doesn't it?
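The selection Pavan describes can be sketched like this (simplified; not the
patch's actual loop):

    /*
     * Classify each chain member; only tuples whose inserting transaction
     * is known committed (LIVE or RECENTLY_DEAD) can become the version
     * we index.  An aborted tail comes back DEAD and never qualifies.
     */
    switch (HeapTupleSatisfiesVacuum(tuple->t_data, OldestXmin, buffer))
    {
        case HEAPTUPLE_LIVE:
        case HEAPTUPLE_RECENTLY_DEAD:
            candidate = tuple;      /* last committed-good tuple so far */
            break;
        case HEAPTUPLE_DEAD:
            break;                  /* aborted or long dead: skip */
        default:
            break;                  /* in-progress cases handled elsewhere */
    }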
>> Please see this document written by Greg Stark. He has nicely summarized
>> how CREATE INDEX and CREATE INDEX CONCURRENTLY work with HOT:
>> http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php
>
> Isn't the extra machination for C.I.C. just useless complication?
> What's the point of avoiding creation of new broken HOT chains when
> you still have to deal with existing ones?
IMHO the extra step in C.I.C. simplifies the index build. The transaction-wait
takes care of the existing broken chains, and the early catalog entry for
the index helps us avoid creating new broken chains. I am not against
doing it a different way, but this is the best solution we could come up
with when we worked on CIC.
>>> I also don't think I
>>> believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated
>>> tuples: what if the index completion commits, but the concurrent delete
>>> rolls back? Then you've got a valid tuple that's not in the index.
>
>> Since CREATE INDEX works with ShareLock on the relation, we can
>> safely assume that we can't find any DELETE_IN_PROGRESS tuples except
>> those deleted by our own transaction. The only exception is system
>> relations, but we don't do HOT updates on system relations.
>
> That chain of reasoning is way too shaky for me. Essentially what
> you're saying is you'll promise not to corrupt non-system indexes.
> Nor am I really thrilled about having to disable HOT for system
> catalogs.
I am not against supporting HOT for system catalogs. But IMHO
it's not a strict requirement, because system catalogs are small, less
frequently updated, and HOT adds little value to them. If we don't have
a generic solution which works for both system and non-system tables,
that's the best we can get. We can start with non-system
tables and expand the scope later.
>> The only reason to redefine MaxHeapTuplesPerPage upwards is that
>> HOT allows us to accommodate more tuples per page because
>> of redirect-dead line pointers.
>
> No, it doesn't allow more tuples per page. It might mean there can be
> more line pointers than that on a page, but not more actual tuples.
> The question is whether there is any real use in allowing more line
> pointers than that --- the limit is already unrealistically high,
> since it assumes no data content in any of the tuples. If there is a
> rationale for it then you should invent a different constant
> MaxLinePointersPerPage or some such, but I really think there's no
> point.
I agree. I think the current limit on MaxHeapTuplesPerPage is sufficiently
large. Maybe we can keep it the same and tune it later if we have numbers
to prove its worthiness.
>> Doubling the value was chosen as a balance between heap page
>> utilization, line pointer bloat, and overhead for bitmap scans.
>
> I'd say it allows a ridiculous amount of line pointer bloat.
OK. Let's keep MaxHeapTuplesPerPage unchanged.
>>> Even if it's safe, ISTM what you're mostly accomplishing there is to
>>> expend a lot of cycles while holding exclusive lock on the page, when
>>> there is good reason to think that you're blocking other people who are
>>> interested in using the page. Eliminating the separation between that
>>> and cleanup would also allow eliminating the separate "PD_FRAGMENTED"
>>> page state.
>
>> The reason we did it that way is that repairing fragmentation seems
>> much more costly than pruning. Please note that we prune a single
>> chain during an index fetch. It's only for heap-scans (and VACUUM) that
>> we try to prune all chains in the page. So unless we are doing a heap-scan,
>> I am not sure we are spending too much time holding the exclusive
>> lock. I agree we don't have any specific numbers to prove that, though.
>
> If you don't have numbers proving that this extra complication has a
> benefit, I'd vote for simplifying it. The SnapshotAny case is going to
> bite other people besides you in future.
OK. So if I get you correctly, you are suggesting we acquire the cleanup
lock. If we don't get it, we don't do any maintenance work. Otherwise, we
prune and repair fragmentation in one go.
A related question is: should we always prune all chains in the
page? I guess if we are going to repair fragmentation, it would make
more sense to do that.
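In outline, the agreed approach could look like this (a sketch with
simplified, partly hypothetical signatures):

    /*
     * Opportunistically try for the cleanup lock; if we get it, prune and
     * repair fragmentation in one pass, otherwise skip the maintenance
     * work entirely and retry on a later visit to the page.
     */
    if (ConditionalLockBufferForCleanup(buffer))
    {
        Page    page = BufferGetPage(buffer);

        heap_page_prune(relation, buffer, OldestXmin); /* drop dead HOT tuples */
        PageRepairFragmentation(page);                 /* compact free space */
        UnlockReleaseBuffer(buffer);
    }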
>> Another reason for separating these two steps is that pruning
>> requires an exclusive lock whereas repairing fragmentation requires
>> a cleanup lock.
>
> This is nonsense. Those are the same lock. If you have the former and
> not the latter, it just means that you *know* there is contention for
> the page. It seems to me that performing optional maintenance work in
> such a situation is completely wrong.
Oh yes, they are the same lock. The difference is that the chances of
getting one are higher than the other. But I agree with your argument
about contention and maintenance work. I think we can do what you are
suggesting and then fine-tune it.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > On 8/30/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I don't think that works --- what if the last tuple in the chain isn't >> committed good yet? If its inserter ultimately rolls back, you've >> indexed the wrong value. > I am confused. How could we get ShareLock on the relation while > there is some open transaction which has inserted a tuple in the > table ? (Of course, I am not considering the system tables here) Not if someone else releases lock before committing. Which I remind you is a programming technique we use quite a lot with respect to the system catalogs. I'm not prepared to guarantee that there is no code path in core (much less in contrib or third-party addons) that doesn't do it on a user table; even less that no such path will ever appear in future. As things stand it's a pretty harmless error --- the worst consequence I know of is that a VACUUM FULL starting right then might bleat and refuse to do shrinking. What you propose is to turn the case into a cause of silent index corruption. A more robust solution would be to wait out the source transaction if CREATE INDEX comes across an INSERT_IN_PROGRESS or DELETE_IN_PROGRESS HOT tuple. Or you could throw an error, since it's not really likely to happen (except maybe in a system catalog REINDEX operation). But the way the patch approaches this at the moment is far too fragile IMHO. If we approach it this way, we might also be able to jettison some of the other complexity such as idxcreatexid. >> Isn't the extra machination for C.I.C. just useless complication? >> What's the point of avoiding creation of new broken HOT chains when >> you still have to deal with existing ones? > IMHO the extra step in C.I.C simplifies the index build. If you make the change suggested above, I think you don't need to do things differently in C.I.C. >>> I also don't think I >>> believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated >>> tuples: what if the index completion commits, but the concurrent delete >>> rolls back? Then you've got a valid tuple that's not in the index. >> >> Since CREATE INDEX works with ShareLock on the relation, we can >> safely assume that we can't find any DELETE_IN_PROGRESS tuples except >> those deleted by our own transaction. The only exception is system >> relation, but we don't do HOT updates on system relation. Same issue as above: this makes correctness utterly dependent on nobody releasing locks early. > OK. So if I get you correctly, you are suggesting to acquire cleanup lock. > If we don't get that, we don't to any maintenance work. Otherwise, we prune > and repair fragmentation in one go. Yeah, that's what I'm thinking; then there's no need to track a separate page state where we've pruned but not defragmented. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: >>> Isn't the extra machination for C.I.C. just useless complication? >>> What's the point of avoiding creation of new broken HOT chains when >>> you still have to deal with existing ones? > >> IMHO the extra step in C.I.C simplifies the index build. > > If you make the change suggested above, I think you don't need to do > things differently in C.I.C. It seems to me if you wait out transactions as you come across them you could end up waiting a whole lot longer than the way it works now where it waits them all out at the end of the first pass. >> OK. So if I get you correctly, you are suggesting to acquire cleanup lock. >> If we don't get that, we don't to any maintenance work. Otherwise, we prune >> and repair fragmentation in one go. > > Yeah, that's what I'm thinking; then there's no need to track a separate > page state where we've pruned but not defragmented. Note that you still need to do it in two steps, you just get to postpone the pruning work to the same point in time as the defragmenting. You can never "acquire" the cleanup lock when it comes time to insert a new update version since the executor still holds references to the original tuple. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:
>> If you make the change suggested above, I think you don't need to do
>> things differently in C.I.C.
>
> It seems to me if you wait out transactions as you come across them, you
> could end up waiting a whole lot longer than the way it works now, where
> it waits them all out at the end of the first pass.

I think you might have misread that --- I intended to say that C.I.C.
could still work the way it does today, not that it would be exactly like
regular CREATE INDEX. The wait-out business should only be needed for a
regular CREATE INDEX, and in that case there's no cumulative waiting
effect because no new conflicting transactions are coming in. C.I.C.
should be able to fix things up in its second pass instead of waiting
during the first one.

regards, tom lane
Tom Lane wrote:
> "Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
>> On 8/30/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I don't think that works --- what if the last tuple in the chain isn't
>>> committed good yet? If its inserter ultimately rolls back, you've
>>> indexed the wrong value.
>
>> I am confused. How could we get ShareLock on the relation while
>> there is some open transaction which has inserted a tuple in the
>> table? (Of course, I am not considering the system tables here.)
>
> Not if someone else releases lock before committing.

FWIW, a red flag was raised for me here, though maybe it is irrelevant or
unimportant. Currently, VACUUM acquires an exclusive lock for truncating
the table. The lock is kept till commit. However, I am proposing that it
be released before commit.

Now, VACUUM never inserts rows. But I don't claim I understand what's
going on here.

--
Alvaro Herrera   http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> Not if someone else releases lock before committing.
>
> FWIW, a red flag was raised for me here, though maybe it is irrelevant or
> unimportant. Currently, VACUUM acquires an exclusive lock for
> truncating the table. The lock is kept till commit. However, I am
> proposing that it be released before commit.

I think that's all right, because it's dealing with a different set of
concerns. AFAICS the only issue for truncation is to prevent physical
access to the blocks in question until we can get rid of them. Once
they're gone, if there wasn't an active seqscan (with an
already-established notion of the max block number to scan to), there
would be no reason for anyone to try to touch them.

regards, tom lane
On 8/31/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
Not if someone else releases lock before committing. Which I remind you
is a programming technique we use quite a lot with respect to the system
catalogs. I'm not prepared to guarantee that there is no code path in
core (much less in contrib or third-party addons) that doesn't do it on
a user table; even less that no such path will ever appear in future.
As things stand it's a pretty harmless error --- the worst consequence
I know of is that a VACUUM FULL starting right then might bleat and
refuse to do shrinking. What you propose is to turn the case into a
cause of silent index corruption.
A more robust solution would be to wait out the source transaction if
CREATE INDEX comes across an INSERT_IN_PROGRESS or DELETE_IN_PROGRESS
HOT tuple. Or you could throw an error, since it's not really likely to
happen (except maybe in a system catalog REINDEX operation). But the
way the patch approaches this at the moment is far too fragile IMHO.
OK. I looked at the code again. In CVS HEAD we assume that we
should never see a DELETE_IN_PROGRESS/INSERT_IN_PROGRESS tuple
unless it's deleted/inserted by our own transaction or we are indexing a
system table. Except for these cases, we throw an error.

With HOT, we keep it the same, i.e. if we see a
DELETE_IN_PROGRESS/INSERT_IN_PROGRESS tuple, we throw an error,
unless it's deleted/inserted by us or this is a system table.
What we change is: if we find a DELETE_IN_PROGRESS tuple deleted
by our own transaction and it's HOT-updated, we skip that tuple. This is
fine because if the CREATE INDEX commits, the DELETE is also committed
and the tuple is not visible (I am still assuming the indcreatexid
mechanism is in place).

So if we don't do HOT updates on system tables, the current algorithm
should work fine, because we should never find a HOT-updated tuple
in a system table, and the error-out mechanism should ensure
that we don't corrupt user tables.

So I guess what you are suggesting is to turn on HOT for system tables
and then wait out any DELETING/INSERTING transaction if we find such
a tuple during CREATE INDEX. I think this would work, and since we are
talking about system tables, the wait-out business won't be harmful - I
remember there were objections when I suggested this as a general solution.
> If we approach it this way, we might also be able to jettison some of
> the other complexity such as idxcreatexid.
I doubt it, unless we replace it with something better. I think
indcreatexid serves another purpose. While building an index, we skip any
RECENTLY_DEAD HOT-updated tuples. This is required to handle the existing
HOT chains which are broken with respect to the new index. Essentially,
what it means is we don't want to index anything other than the last
committed-good tuple in the HOT chain. So when we skip any intermediate
RECENTLY_DEAD tuples, which are potentially visible to the existing
running transactions, we want those transactions NOT to use this new index.
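A sketch of the kind of planner check this implies (the field name comes
from the discussion above; the test itself is an assumption, not the
patch's code):

    /*
     * Hypothetical check in get_relation_info(): if the index was created
     * by a transaction that our snapshot might still consider in-progress,
     * existing transactions could see intermediate tuples of a broken HOT
     * chain, so the index is unusable for them.
     */
    if (TransactionIdIsValid(index->indcreatexid) &&
        !TransactionIdPrecedes(index->indcreatexid, snapshot->xmin))
        continue;       /* skip this index for the current transaction */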
>> IMHO the extra step in C.I.C. simplifies the index build.
>
> If you make the change suggested above, I think you don't need to do
> things differently in C.I.C.
I am not sure I follow you correctly here. The issue with CIC is that
it works with a snapshot. So during the initial index build, we might index
a tuple which is in the middle of a broken HOT chain. In the second phase,
we don't just need to index the tuple at the head of the chain, but also
remove the previously inserted tuple; otherwise there would be two
references from the index to the same root heap tuple.
The additional step does two things:
1) Create the catalog entry - stops any new broken HOT chains from being created
2) Wait out existing transactions - makes sure that when the index is built,
we only index the last committed-good tuple in the chain.
>>> Since CREATE INDEX works with ShareLock on the relation, we can
>>> safely assume that we can't find any DELETE_IN_PROGRESS tuples except
>>> those deleted by our own transaction. The only exception is system
>>> relations, but we don't do HOT updates on system relations.
>
> Same issue as above: this makes correctness utterly dependent on nobody
> releasing locks early.
As I said above, we in fact throw an error for user tables. But if we want
to support HOT updates on system tables, we may need to do the
wait-out business. In fact, now that I think about it, there is no other
fundamental reason not to support HOT on system tables. So we
can very well do what you are suggesting.
>> OK. So if I get you correctly, you are suggesting we acquire the cleanup lock.
>> If we don't get it, we don't do any maintenance work. Otherwise, we prune
>> and repair fragmentation in one go.
>
> Yeah, that's what I'm thinking; then there's no need to track a separate
> page state where we've pruned but not defragmented.
I agree. Let's keep it simple and we can always improve it later. The
only thing that worries me is how to balance repairing fragmentation
(which is a costly operation since it involves several memmoves) against
chain pruning. It might be all right to delay the repair operation, but if
we end up with long chains, fetches might suffer.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On 8/31/07, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> In fact, now that I think about it, there is no other
> fundamental reason not to support HOT on system tables. So we
> can very well do what you are suggesting.
On second thought, I wonder if there is really much to gain by
supporting HOT on system tables and whether it would justify all
the complexity. Initially I thought about CatalogUpdateIndexes,
which we would need to teach about HOT. Later I also got worried about
building the HOT attribute lists for system tables and handling
all the corner cases for bootstrapping and catalog REINDEX.
It might turn out to be straightforward, but I am not able to
establish that with my limited knowledge of the area.
I would still vote for disabling HOT on catalogs unless you see
strong value in it.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On Fri, Aug 31, 2007 at 12:53:51PM +0530, Pavan Deolasee wrote:
> On 8/31/07, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>> In fact, now that I think about it, there is no other
>> fundamental reason not to support HOT on system tables. So we
>> can very well do what you are suggesting.
>
> On second thought, I wonder if there is really much to gain by
> supporting HOT on system tables and whether it would justify all
> the complexity. Initially I thought about CatalogUpdateIndexes,
> which we would need to teach about HOT. Later I also got worried about
> building the HOT attribute lists for system tables and handling
> all the corner cases for bootstrapping and catalog REINDEX.
> It might turn out to be straightforward, but I am not able to
> establish that with my limited knowledge of the area.
>
> I would still vote for disabling HOT on catalogs unless you see
> strong value in it.

What about ANALYZE? Doesn't that do a lot of updates?

BTW, I'm 100% in favor of pushing system catalog HOT until later; it'd
be silly to risk not getting HOT in 8.3 because of catalog HOT.

--
Decibel!, aka Jim Nasby   decibel@decibel.org
EnterpriseDB http://enterprisedb.com   512.569.9461 (cell)
On 9/1/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I see this the other way around: if it doesn't work on system catalogs,
> it probably doesn't work, period. I'm not in favor of treating the
> catalogs differently.
Now that I hear you, I know what to do next :-)
I don't think there is any fundamental problem with system catalogs,
it's only the additional complexity that I was worried about. Anyway,
I will rework things as per your suggestion. And I take your point that
making it work on all tables will give us more confidence in the code.
Thanks,
Pavan
P.S. Next week is bad for me :-( I am on vacation on Thursday/Friday
and for the remaining days I may not be able to spend extra cycles,
apart from regular working hours.
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > "Pavan Deolasee" <pavan.deolasee@gmail.com> writes: >> Please see the version 14 of HOT patch attached. > > I expected to find either a large new README, or some pretty substantial > additions to existing README files, to document how this all works. Here's an updated version of the README I posted earlier. It now reflects the changes to how pruning works. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com Use case -------- The best use case for HOT is a table that's frequently UPDATEd, and is large enough that VACUUM is painful. On small tables that fit in cache, running VACUUM every few minutes isn't a problem. Heap-only tuples ---------------- When a HOT update is performed, the new tuple is placed on the same page as the old one, marked with the HEAP_ONLY_TUPLEflag. HEAP_ONLY_TUPLE means that there's no index pointers to the tuple, which allows pruning the chain inthe future. The old tuple is marked with HEAP_HOT_UPDATE-flag, which means that the tuple pointed to by t_ctid is a heap-onlytuple. That needs to be taken into account when vacuuming, so that we don't remove the root tuple in the updatechain, when there's no index pointers to the later tuples. When doing an index scan, whenever we reach a non-visible tuple, we need to check if the tuple has been HOT-updated (== HEAP_HOT_UPDATEflag is set). If so, we need to follow the ctid pointer until we reach a visible one, or one that hasn't beenHOT-updated. Sequential scans (and bitmap heap scans with a lossy bitmap) don't need to pay attention to the flags. The pre-requirements for doing a HOT update is that none of the indexed columns are changed. That's checked at executiontime, comparing the binary representation of the old and new values. That means that dummy updates, like "UPDATEfoo SET col1 = ?", where ? is the same as the old value can be HOT. In addition to the above, there needs to be room on the page for the new tuple. If the page is full, we try to make roomby pruning the page. Pruning ------- Pruning is a lightweight vacuum operation that can be run on a single page, with no need to scan indexes, but it only removesdead HOT tuples. Other dead tuples are truncated, leaving only a redirected dead line pointer. The removed tuplesare compacted away using PageRepairFragmentation, like in normal vacuum. There's two reasons to prune a page: to makeroom on the page for future updates, and to shorten HOT chains to make index lookups cheaper. When accessing a page with HOT updated tuples on it, and less than a certain threshold of free space, we try to prune it.To do that, we need to take a vacuum strength lock on the buffer. If that fails, we don't prune; the theory is that youusually do get the lock, and if you don't, you'll get to try again next time. It would be more logical to do the pruningin heap_update when the page is full, but by the time we get there we have already pinned the page and have referencesto tuples on it, so we can't start moving tuples around it. Also, that alone wouldn't address the desire to keepHOT chains short, to avoid overhead of traversing long chains on index lookups. To reclaim the index-visible (i.e. first) tuple in a HOT chain, the line pointer is turned into a redirecting line pointerthat points to the line pointer of the next tuple in the chain. When the last live tuple in an update chain becomes dead (after a DELETE or a cold update), the redirecting line pointeris marked as redirected dead. That allows us to immediately reuse the space, sans the line pointer itself. 
We've effectively resurrected the "truncate dead tuples to just line
pointer" idea that has been proposed and rejected before because of fear
of line pointer bloat. To limit the damage in the worst case, and to
keep numerous arrays as well as the bitmaps in bitmap scans reasonably
sized, the maximum number of line pointers (MaxHeapTuplesPerPage) is
somewhat arbitrarily capped at 2 * what it was before.

VACUUM FULL
-----------

To make vacuum full work, any DEAD tuples in the middle of an update
chain need to be removed (see comments at the top of
heap_prune_hotchain_hard for details). Vacuum full performs a more
aggressive pruning that not only removes dead tuples at the beginning of
an update chain, it scans the whole chain and removes any intermediate
dead tuples as well.

Vacuum
------

There are not many changes to regular vacuum. It removes dead HOT
tuples, like pruning, and cleans up any redirected dead line pointers.

In lazy vacuum, we must not freeze a tuple that's in the middle of an
update chain. That can happen when a tuple has xmin > xmax; it's the
same scenario that requires "hard pruning" in VACUUM FULL. Freezing such
tuples will break the check that xmin and xmax match when following the
chain. It's not a problem without HOT, because the preceding tuple in
the chain must be dead as well so no one will try to follow the chain,
but with HOT the preceding tuple would be DEAD_CHAIN, and someone might
still need to follow the chain to find the live tuple. We avoid that by
just not freezing such tuples. They can be frozen eventually, when the
xmax of the preceding tuple is < OldestXmin as well.

Statistics
----------

XXX: How do HOT-updates affect statistics? How often do we need to run
autovacuum?

CREATE INDEX
CREATE INDEX CONCURRENTLY
-------------------------

I'm not very familiar with how these work, so I'll just shut up..

Glossary
--------

Heap-only tuple
    A heap tuple with no index pointers. Marked with HEAP_ONLY_TUPLE
    flag.

HOT-updated tuple
    An updated tuple, so that the next tuple in the chain is a heap-only
    tuple. Marked with HEAP_HOT_UPDATE flag.

Redirecting line pointer
    A line pointer that points to another line pointer. lp_len is set to
    a magic value (ITEMID_REDIRECTED), and lp_off is the OffsetNumber of
    the line pointer it points to.

Redirected dead line pointer
    A stub line pointer, that doesn't point to anything, but can't be
    removed or reused yet because there are index pointers to it.
    Semantically same as a dead tuple.

Root tuple
    The first tuple in a HOT update chain, that indexes point to.

Update chain
    A chain of updated tuples, so that each tuple's ctid points to the
    next tuple in the chain. A HOT update chain is an update chain that
    consists of a root tuple and one or more heap-only tuples. An update
    chain can contain both HOT and non-HOT (cold) updated tuples.

Cold update
    A normal, non-HOT update.

DEAD_CHAIN (HEAPTUPLE_DEAD_CHAIN)
    New return value for HeapTupleSatisfiesVacuum, which means that the
    tuple is not visible to anyone, but it's been HOT updated so we
    can't remove it yet because the following tuples in the chain would
    become inaccessible from indexes.
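As a footnote to the lazy-vacuum freezing rule above ("they can be
frozen eventually, when the xmax of the preceding tuple is <
OldestXmin"), here is a minimal C sketch of the guard condition. Plain
integer comparison stands in for real transaction-id arithmetic, and the
function and parameter names are made up for illustration:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t Xid;

    /*
     * A chain member may be frozen only once its predecessor's xmax is
     * older than the global horizon, so the xmin/xmax match along the
     * chain can no longer be needed by anyone following it.
     */
    static bool
    safe_to_freeze(Xid tuple_xmin, bool has_predecessor,
                   Xid predecessor_xmax, Xid oldest_xmin)
    {
        if (tuple_xmin >= oldest_xmin)
            return false;         /* not freezable yet at all */
        if (has_predecessor && predecessor_xmax >= oldest_xmin)
            return false;         /* chain may still be followed through us */
        return true;
    }

    int
    main(void)
    {
        printf("%d\n", safe_to_freeze(50, true, 120, 100));  /* 0: must wait */
        printf("%d\n", safe_to_freeze(50, true, 80, 100));   /* 1: ok to freeze */
        return 0;
    }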
Heikki Linnakangas wrote:
> Tom Lane wrote:
> > "Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> >> Please see the version 14 of HOT patch attached.
> > I expected to find either a large new README, or some pretty substantial
> > additions to existing README files, to document how this all works.
> Here's an updated version of the README I posted earlier. It now
> reflects the changes to how pruning works.

I have taken this, and Pavan's documentation about CREATE INDEX, and
worked up an updated README.  Comments?  Corrections?

I plan to put this in src/backend/access/heap/README.HOT.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


                        Heap Only Tuples (HOT)

Introduction
------------

Added in PostgreSQL 8.3, Heap Only Tuples (HOT) allow the reuse of space
taken by DELETEd or obsoleted UPDATEd tuples without performing a
table-wide vacuum.  It does this by allowing single-page vacuuming, also
called "pruning".

Technical Challenges
--------------------

PostgreSQL's MVCC system makes single-page vacuums (pruning) quite
difficult because rows must remain after UPDATE or DELETE until all
transaction snapshots that were active during the command have
completed.  Traditionally, VACUUM must be run at a later time which
sequentially scans the table and collects information about all obsolete
rows.  VACUUM also removes index entries for obsolete rows.
Unfortunately, VACUUM can be an expensive operation because of its full
table scan and index cleanup requirement.

HOT adds several features that allow space reuse on a per-heap-page
basis, particularly by eliminating the problem of index cleanup
necessary to reuse the space taken by obsolete rows.

UPDATE Chains With a Single Index Entry
---------------------------------------

Without HOT, every version of a row in an UPDATE chain has its own index
entry, even if all indexed columns are the same.  With HOT, a new tuple
placed on the same page and with all indexed columns the same does not
get a new index entry and is marked with the HEAP_ONLY_TUPLE flag.  (The
old row is marked as HEAP_HOT_UPDATE.)  This allows the space taken by
UPDATEd row versions to be reused during a single-page vacuum (pruning)
when they is no longer visible to any running transactions.  This is
possible because there is only one index entry for the entire UPDATE
chain on the heap page.

Internally, things are a bit more complicated:

    Index points to 1
    ctid [1]  [2]

    [111111111]->[2222222222]

In the above diagram, the index points to ctid 1, and is marked as
HEAP_HOT_UPDATE.  Row versions 2 is a HOT UPDATE, meaning it has no
index row pointing to it, and is marked as HEAP_HOT_UPDATE.  Later, even
if row 1 is no longer visible to any transaction, its ctid pointer
cannot be removed by pruning because concurrent index/heap lookup
activity might be happening on the page and removing it might interfere
with other backends.
However, the heap space for row 1 can be reused:

    Index points to 1
    ctid [1]->[2]

    [2222222222]

In this case the ctid pointer 1 points to ctid 2, which points to heap
row version 2.

If row 2 is updated to version 3, it looks like this:

    [Index points to 1]
    --------------------------------------------------------
    ctid [1]->[2]  [3]

    [2222222222]->[3333333333]

The arrow from 2 to 3 is part of the UPDATE chain already present on all
update rows.

At some later time when no transaction can see row 2 in its snapshot,
the space taken by heap row 2 _and_ its ctid can be reused during a
pruning, e.g.

    Index points to 1
    ctid [1]------>[3]

    [3333333333]

Notice that HEAP_HOT_UPDATE row 1 now points to row 3, and row 2 is now
gone.  Again, this is possible because row 2 did not have an index
entry.

Pruning occurs when a row is UPDATEd and there is no free space on the
page containing the old row.  Pruning scans the entire page looking for
HEAP_HOT_UPDATE and HEAP_ONLY_TUPLE rows that can be removed.

Row version 4 would look like this:

    Index points to 1
    ctid [1]------>[3]  [4]

    [3333333333]->[4444444444]

and when row 3 is no longer visible, this:

    Index points to 1
    ctid [1]----------->[4]

    [4444444444]

As you can see, ctid 1 has to remain, but the space taken by a ctid is
small compare to a heap row.

The requirements for doing a HOT update is that none of the indexed
columns are changed.  That is checked at execution time, comparing the
binary representation of the old and new values.  That means that dummy
updates, like "UPDATE foo SET col1 = ?" where ? is the same as the old
value, can be HOT updated.

Index/Sequential Scans
----------------------

When doing an index scan, whenever we reach a non-visible tuple, we need
to check if the tuple is HEAP_HOT_UPDATE.  If so, we need to follow the
ctid pointer until we reach a visible one, or one that has not been
HOT-updated.

Sequential scans (and bitmap heap scans with a lossy bitmap) do not need
to pay attention to the flags.

Pruning
-------

Pruning is a lightweight vacuum operation that can be run on a single
page, with no need to scan indexes.  Pruning a page makes room on the
page for future updates and shortens HOT chains to make index lookups
faster.

Pruning removes more than just dead HOT tuples.  Other dead tuples, such
as those from DELETEs and aborted transactions, are truncated and leave
behind only a dead line pointer.  In the illustration below, ctid 1 is
dead and points to no heap row.

    ctid [1D]  [2]

    [2222222222]

The removed tuples are compacted away using PageRepairFragmentation,
like in normal vacuum.

Pruning happens when accessing a page with HOT updated tuples and with
less than a certain threshold of free space.  To prune, we need to take
a vacuum strength lock on the buffer.  If that fails, we do not prune;
the theory is that you usually do get the lock, and if you do not, you
can get to try again next time.  It would be more logical to do the
pruning in heap_update() when the page is full, but by the time we get
there we have already pinned the page and have references to tuples on
it, so we cannot start moving tuples around it.  Also, that alone would
not address the desire to keep HOT chains short to avoid the overhead of
traversing long chains on index lookups.

To reclaim the index-pointed (HEAP_HOT_UPDATE) tuple in a HOT chain, its
line pointer is turned into a redirecting line pointer that points to
the line pointer of the next tuple in the chain and its heap space is
reused.

When the last live tuple in an update chain becomes dead (after a DELETE
or a cold update), the redirecting line pointer is marked as redirected
dead.  That allows us to immediately reuse the heap space (but not the
line pointer itself).

We have effectively implemented the "truncate dead tuples to just line
pointer" idea that has been proposed and rejected before because of fear
of line pointer bloat.
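A rough sketch, in self-contained C, of the prune-on-access heuristic
just described: prune only when the page has HOT tuples and is low on
free space, and only if a cleanup lock can be had without blocking.
Everything here (the ModelPage struct, the threshold) is invented for
illustration; the real code works on shared buffers with a non-blocking
cleanup-lock primitive.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct ModelPage
    {
        bool has_hot_tuples;
        int  free_space;
        bool cleanup_lock_available;  /* stands in for a non-blocking lock try */
    } ModelPage;

    /*
     * Prune only if the page carries HOT-updated tuples and free space
     * is below a threshold.  If the cleanup lock cannot be taken
     * immediately, give up; the next visit to the page will try again.
     */
    static bool
    maybe_prune(ModelPage *page, int threshold)
    {
        if (!page->has_hot_tuples || page->free_space >= threshold)
            return false;
        if (!page->cleanup_lock_available)
            return false;             /* contended: postpone housekeeping */
        /* ... prune and compact the page here ... */
        return true;
    }

    int
    main(void)
    {
        ModelPage page = { true, 40, true };
        printf("pruned: %d\n", maybe_prune(&page, 128));
        return 0;
    }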
To limit the damage in the worst case, and to keep numerous arrays as
well as the bitmaps in bitmap scans reasonably sized, the maximum number
of line pointers (MaxHeapTuplesPerPage) is arbitrarily capped at twice
its previous maximum.

VACUUM FULL
-----------

To make vacuum full work, any DEAD tuples in the middle of an update
chain need to be removed (see comments at the top of
heap_prune_hotchain_hard() for details).  Vacuum full performs a more
aggressive pruning that not only removes dead tuples at the beginning of
an update chain, but scans the whole chain and removes any intermediate
dead tuples as well.

VACUUM
------

There is little change to regular vacuum.  It removes dead HOT tuples,
like pruning does, and cleans up any redirected dead line pointers.

In lazy vacuum, we must not freeze a tuple that is in the middle of an
update chain.  That can happen when a tuple has xmin > xmax; it is the
same scenario that requires "hard pruning" in VACUUM FULL.  Freezing
such tuples will break the check that xmin and xmax match when following
the chain.  It is not a problem without HOT, because the preceding tuple
in the chain must be dead as well so no one will try to follow the
chain, but with HOT the preceding tuple would be DEAD_CHAIN, and someone
might still need to follow the chain to find the live tuple.  We avoid
that by just not freezing such tuples.  They can be frozen eventually,
when the xmax of the preceding tuple is < OldestXmin as well.

Statistics
----------

XXX: How do HOT-updates affect statistics?  How often do we need to run
autovacuum?

CREATE INDEX
------------

CREATE INDEX presents a problem for HOT updates.  While the existing HOT
chains all have the same index values, the new index might invalidate
the HOT chain because the columns in the new index might change within
HOT chains.

To address this issue, regular (non-concurrent) CREATE INDEX makes the
new index usable only by transactions newer than the CREATE INDEX
command.  This prevents transactions that can see the inconsistent HOT
chains from trying to use the new index and getting incorrect results.
New transactions can only see the rows visible after the index was
created, hence the HOT chains are consistent for them.

The new index indexes only Root tuples (tuples with current index
pointers) so that our index uses the same index pointers as all other
indexes on the table.  However the row we want to index is actually at
the _end_ of the chain, ie, the most recent tuple on the index chain.
But, again, because new transactions will skip over inconsistent HOT
rows, this is not a problem.  (If we find any HOT-updated tuples with
RECENTLY_DEAD or DELETE_IN_PROGRESS we ignore it assuming that we will
also come across the _end_ of the update chain and index that instead.)

Practically, we prevent old transactions from using the new index by
putting our transaction id in pg_index.indcreatexid after building the
index (but only if HOT chains exist in the table).  Queries check
whether indcreatexid is in their serializable snapshot; if it is not
then the index is not available for them.  This means that transactions
started before the create index commits will not get the benefit of the
new index even if their first scan of the table is only after the index
exists.  However new transactions get the benefit of the new index
immediately because they will always follow the HOT update chain to the
end.  (The old tuples with the possibly different keys will never be
visible to them.)
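The indcreatexid rule boils down to a small visibility test.  Here is a
hedged C sketch of it; the single xmin horizon is a deliberate
simplification of the real snapshot check, and all names below are
invented:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t Xid;
    #define INVALID_XID 0

    /*
     * An index built over existing HOT chains carries the builder's xid
     * in pg_index.indcreatexid; a query may use the index only once that
     * xid is older than everything visible to its snapshot.
     */
    static bool
    index_usable(Xid indcreatexid, Xid snapshot_xmin)
    {
        if (indcreatexid == INVALID_XID)
            return true;          /* no HOT chains existed at build time */
        return indcreatexid < snapshot_xmin;
    }

    int
    main(void)
    {
        printf("%d %d\n", index_usable(0, 100), index_usable(120, 100));
        return 0;
    }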
The tricky case arises with queries executed in the same transaction as
CREATE INDEX.  In the case of a new table created within the same
transaction (such as with pg_dump), the index will be usable because
there will never be any HOT update chains so the indcreatexid will never
be set.  Also in the case of a read-committed transaction new queries
will be able to use the index.  A serializable transaction building an
index on an existing table with HOT updates cannot use the index.

CREATE INDEX CONCURRENTLY
-------------------------

In the concurrent case we take a different approach.  We create the
pg_index entry immediately, before we scan the table.  pg_index is
marked as "not ready for inserts".  Then we commit and wait for any
transactions which have the table open to finish.  This ensures that no
_new_ HOT updates will change the key value for our new index.

After waiting for transactions which had the table open, we build the
index with a snapshot.  Because we waited for anyone who existed before
the index was created, any tuples seen in the snapshot will have only
valid forward-growing HOT chains.  (They might still have older HOT
updates behind them which are broken.)  As above, we point the index
pointers at the Root of the HOT-update chain but we use the key from the
end of the chain.

We mark the index open for inserts (but still not ready for reads) then
we again wait for transactions which have the table open.  Then we take
a second reference snapshot and validate the index.  This searches for
tuples missing from the index --- but it again has to look up the root
of the HOT chains and search for those tuples in the index.

Then we wait until _every_ transaction in progress in the validate_index
reference snapshot is finished.  This ensures that nobody is alive any
longer who could possibly see any of the tuples in a broken HOT chain.

Glossary
--------

Broken HOT Chain
    A HOT chain in which the key value for an index has changed.  This
    is not allowed to occur normally but if a new index is created it
    can happen.  In that case various strategies are used to ensure that
    no transaction for which the older tuples are visible can use the
    index.

Cold update
    A normal, non-HOT update.

DEAD_CHAIN (HEAPTUPLE_DEAD_CHAIN)
    New return value for HeapTupleSatisfiesVacuum, which means that the
    tuple is not visible to anyone, but it has been HOT updated so we
    cannot remove it yet because the following tuples in the chain would
    become inaccessible from indexes.

Heap-only tuple
    A heap tuple with no index pointers.  Marked with HEAP_ONLY_TUPLE
    flag.

HOT update
    An UPDATE where the new tuple becomes a heap-only-tuple, and no
    index entries are made.

HOT-updated tuple
    An updated tuple, so that the next tuple in the chain is a heap-only
    tuple.  Marked with HEAP_HOT_UPDATE flag.

Redirecting line pointer
    A line pointer that points to another line pointer.  lp_len is set
    to a magic value (ITEMID_REDIRECTED), and lp_off is the OffsetNumber
    of the line pointer it points to.

Redirected dead line pointer
    A stub line pointer, that does not point to anything, but cannot be
    removed or reused yet because there are index pointers to it.
    Semantically same as a dead tuple.

Root tuple
    The first tuple in a HOT update chain that indexes point to.

Update chain
    A chain of updated tuples, so that each tuple's ctid points to the
    next tuple in the chain.  A HOT update chain is an update chain that
    consists of a root tuple and one or more heap-only tuples.  An
    update chain can contain both HOT and non-HOT (cold) updated tuples.
"Bruce Momjian" <bruce@momjian.us> writes: > Heikki Linnakangas wrote: > >> Here's an updated version of the README I posted earlier. It now >> reflects the changes to how pruning works. > > I have taken this, and Pavan's documentation about CREATE INDEX, and > worked up an updated README. Comments? Corrections? You should also take the appendix to Heikki's README about CREATE INDEX that I wrote. > > I plan to put this in src/backend/access/heap/README.HOT. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> "Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
>>>> Please see the version 14 of HOT patch attached.
>>> I expected to find either a large new README, or some pretty substantial
>>> additions to existing README files, to document how this all works.
>> Here's an updated version of the README I posted earlier. It now
>> reflects the changes to how pruning works.
>
> I have taken this, and Pavan's documentation about CREATE INDEX, and
> worked up an updated README.  Comments?  Corrections?

Thanks, that's much better. I made some corrections, patch attached.

I clarified the terminology a bit: "row", "row version" and "tuple" were
mixed in some places. "Tuple" and "row version" are synonyms in my mind,
so they can be used interchangeably, but "row" is not. Row is a higher
level concept and refers to what a user sees when he does a "SELECT *
FROM foo". There can be multiple row versions, or tuples, behind a
single row.

I also changed "ctid" to "line pointer" in most places. ctid is a field
on a tuple, I don't think we use that term to refer to line pointers
anywhere.

> I plan to put this in src/backend/access/heap/README.HOT.

Sounds good.

I didn't look at the create index stuff.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

*** README.HOT.Bruce	2007-09-04 12:36:19.000000000 +0100
--- ./-root-HOT	2007-09-04 12:38:20.000000000 +0100
***************
*** 14,24 ****
  --------------------

  PostgreSQL's MVCC system makes single-page vacuums (pruning) quite
! difficult because rows must remain after UPDATE or DELETE until all
  transaction snapshots that were active during the command have
  completed.  Traditionally, VACUUM must be run at a later time which
  sequentially scans the table and collects information about all obsolete
! rows.  VACUUM also removes index entries for obsolete rows.
  Unfortunately, VACUUM can be an expensive operation because of its full
  table scan and index cleanup requirement.

--- 14,24 ----
  --------------------

  PostgreSQL's MVCC system makes single-page vacuums (pruning) quite
! difficult because tuples must remain after UPDATE or DELETE until all
  transaction snapshots that were active during the command have
  completed.  Traditionally, VACUUM must be run at a later time which
  sequentially scans the table and collects information about all obsolete
! row versions.  VACUUM also removes index entries for obsolete rows.
  Unfortunately, VACUUM can be an expensive operation because of its full
  table scan and index cleanup requirement.
***************
*** 35,111 ****
  get a new index entry and is marked with the HEAP_ONLY_TUPLE flag.  (The
  old row is marked as HEAP_HOT_UPDATE.)  This allows the space taken by
  UPDATEd row versions to be reused during a single-page vacuum (pruning)
! when they is no longer visible to any running transactions.  This is
  possible because there is only one index entry for the entire UPDATE
  chain on the heap page.

  Internally, things are a bit more complicated:

      Index points to 1
!     ctid [1]  [2]

      [111111111]->[2222222222]

! In the above diagram, the index points to ctid 1, and is marked as
! HEAP_HOT_UPDATE.  Row versions 2 is a HOT UPDATE, meaning it has no
! index row pointing to it, and is marked as HEAP_HOT_UPDATE.  Later, even
! if row 1 is no longer visible to any transaction, its ctid pointer
  cannot be removed by pruning because concurrent index/heap lookup
  activity might be happening on the page and removing it might interfere
! with other backends.

! However, the heap space for row 1 can be reused:

      Index points to 1
!     ctid [1]->[2]

      [2222222222]

! In this case the ctid pointer 1 points to ctid 2, which points to heap
! row version 2.
!
! If row 2 is updated to version 3, it looks like this:

      [Index points to 1]
      --------------------------------------------------------
!     ctid [1]->[2]  [3]

      [2222222222]->[3333333333]

  The arrow from 2 to 3 is part of the UPDATE chain already present on all
  update rows.

! At some later time when no transaction can see row 2 in its snapshot,
! the space taken by heap row 2 _and_ its ctid can be reused during a
  pruning, e.g.

      Index points to 1
!     ctid [1]------>[3]

      [3333333333]

! Notice that HEAP_HOT_UPDATE row 1 now points to row 3, and row 2 is now
! gone.  Again, this is possible because row 2 did not have an index
! entry.

  Pruning occurs when a row is UPDATEd and there is no free space on the
! page containing the old row.  Pruning scans the entire page looking for
! HEAP_HOT_UPDATE and HEAP_ONLY_TUPLE rows that can be removed.

  Row version 4 would look like this:

      Index points to 1
!     ctid [1]------>[3]  [4]

      [3333333333]->[4444444444]

  and when row 3 is no longer visible, this:

      Index points to 1
!     ctid [1]----------->[4]

      [4444444444]

! As you can see, ctid 1 has to remain, but the space taken by a ctid is
! small compare to a heap row.

  The requirements for doing a HOT update is that none of the indexed
  columns are changed.  That is checked at execution time, comparing the
--- 35,113 ----
  get a new index entry and is marked with the HEAP_ONLY_TUPLE flag.  (The
  old row is marked as HEAP_HOT_UPDATE.)  This allows the space taken by
  UPDATEd row versions to be reused during a single-page vacuum (pruning)
! when they are no longer visible to any running transactions.  This is
  possible because there is only one index entry for the entire UPDATE
  chain on the heap page.

  Internally, things are a bit more complicated:

      Index points to 1
!     lp [1]  [2]

      [111111111]->[2222222222]

! In the above diagram, the index points to line pointer 1, and tuple 1 is marked as
! HEAP_HOT_UPDATE.  Row version 2 is a HOT tuple, meaning it has no
! index row pointing to it, and is marked as HEAP_ONLY_TUPLE.  Later, even
! if tuple 1 is no longer visible to any transaction, its line pointer
  cannot be removed by pruning because concurrent index/heap lookup
  activity might be happening on the page and removing it might interfere
! with other backends.

! However, the space occupied by tuple 1 can be reused:

      Index points to 1
!     lp [1]->[2]

      [2222222222]

! In this case the line pointer 1 points to line pointer 2, which points to heap
! row version 2.  Line pointer 1 is called a redirecting line pointer, because
! it points to another line pointer instead of a tuple.
!
! If the row 2 updated again, to version 3, it looks like this:

      [Index points to 1]
      --------------------------------------------------------
!     lp [1]->[2]  [3]

      [2222222222]->[3333333333]

  The arrow from 2 to 3 is part of the UPDATE chain already present on all
  update rows.

! At some later time when no transaction can see tuple 2 in its snapshot,
! the space taken by tuple 2 _and_ its line pointer can be reused during a
  pruning, e.g.

      Index points to 1
!     lp [1]------>[3]

      [3333333333]

! Notice that the line pointer 1 now points to line pointer 3, and tuple 2
! and its line pointer is now gone.  Again, this is possible because tuple
! 2 did not have an index entry.

  Pruning occurs when a row is UPDATEd and there is no free space on the
! page containing the old row.  (XXX: not accurate. See "pruning" chapter
! for the rules on when we prune)  Pruning scans the entire page looking for
! HEAP_ONLY_TUPLE tuples that can be removed.

  Row version 4 would look like this:

      Index points to 1
!     lp [1]------>[3]  [4]

      [3333333333]->[4444444444]

  and when row 3 is no longer visible, this:

      Index points to 1
!     lp [1]----------->[4]

      [4444444444]

! As you can see, line pointer 1 has to remain, but the space taken by a
! line pointer is small compare to a heap tuple.

  The requirements for doing a HOT update is that none of the indexed
  columns are changed.  That is checked at execution time, comparing the
***************
*** 118,124 ****

  When doing an index scan, whenever we reach a non-visible tuple, we need
  to check if the tuple is HEAP_HOT_UPDATE.  If so, we need to follow the
! ctid pointer until we reach a visible one, or one that has not been
  HOT-updated.

  Sequential scans (and bitmap heap scans with a lossy bitmap) do not need
--- 120,126 ----
  When doing an index scan, whenever we reach a non-visible tuple, we need
  to check if the tuple is HEAP_HOT_UPDATE.  If so, we need to follow the
! line pointer until we reach a visible one, or one that has not been
  HOT-updated.

  Sequential scans (and bitmap heap scans with a lossy bitmap) do not need
***************
*** 134,143 ****

  Pruning removes more than just dead HOT tuples.  Other dead tuples, such
  as those from DELETEs and aborted transactions, are truncated and leave
! behind only a dead line pointer.  In the illustration below, ctid 1 is
! dead and points to no heap row.

!     ctid [1D]  [2]

      [2222222222]
--- 136,145 ----
  Pruning removes more than just dead HOT tuples.  Other dead tuples, such
  as those from DELETEs and aborted transactions, are truncated and leave
! behind only a dead line pointer.  In the illustration below, line pointer
! 1 is dead and points to no heap row.

!     lp [1D]  [2]

      [2222222222]
"Bruce Momjian" <bruce@momjian.us> writes: > I have taken this, and Pavan's documentation about CREATE INDEX, and > worked up an updated README. Comments? Corrections? Oh, you I think the CREATE INDEX documentation you refer to was actually the one I suggested. A few tweaks: > (If we find any HOT-updated tuples with RECENTLY_DEAD or > DELETE_IN_PROGRESS we ignore it assuming that we will also come across > the _end_ of the update chain and index that instead.) There's more to this. We build a mapping telling us the Root tuple for each tuple in the page. Then when we scan tuples looking for the Head of each HOT chain (ie, a tuple that wasn't HOT updated) and index the corresponding Root from the map using the key value from the Head tuple. We treat DELETE_IN_PROGRESS the same way we treat RECENTLY_DEAD (and INSERT_IN_PROGRESS the same as LIVE) because we assume it's been deleted (or inserted) by our own transaction. So while it's not actually committed yet we can assume it is since if its transaction aborts the index creation itself will be aborted. Other transactions cannot be deleting or inserting tuples without having committed or aborted already because we have a lock on the table and the other transactions normally keep their locks until they exit. NOTE: This is something likely to change. Current discussions are leading towards handling DELETE_IN_PROGRESS and INSERT_IN_PROGRESS from other transactions. We would do this by waiting until the transactions owning those tuples exit. This would allow us to index tables being used by transactions which release their locks early to work. In particular this happens for system tables. > The tricky case arises with queries executed in the same transaction as > CREATE INDEX. In the case of a new table created within the same > transaction (such as with pg_dump), the index will be usable because > there will never be any HOT update chains so the indcreatexid will never > be set. This is unclear and perhaps misleading. I think it needs to be more like "In the case of a new table in which rows were inserted but none updated (such as with pg_dump) the index will be usable because ..." > Also in the case of a read-committed transaction new queries will be able to > use the index. A serializable transaction building an index on an existing > table with HOT updates cannot not use the index. I don't think this is clear and I'm not sure it's right. Currently the transaction that actually did the CREATE INDEX has to follow the same rules as other transactions. This means if there were any visible hot updated tuples and the index is therefore marked with our xid in indcreatexid we will *not* be able to use it in the same transaction as our xid is never in our serializable snapshot. This is true even if we're not in serializable mode as we cannot know what earlier snapshots are still in use and may be used with the new plan. NOTE: This again is something likely to change. In many cases it ought to be possible to have the transaction use the index it just built even if there were visible HOT updated tuples in it. In particular in READ COMMITTED transactions which have no outstanding commands using early snapshots then subsequent planning ought to be able to use the index. Even if outstanding commands are using old snapshots if we can be sure they can't use the new plan then it would still be safe to use the index in the new plan. Also in SERIALIZABLE mode those same statements hold for temporary tables. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 9/4/07, Gregory Stark <stark@enterprisedb.com> wrote:
> NOTE: This is something likely to change. Current discussions are leading
> towards handling DELETE_IN_PROGRESS and INSERT_IN_PROGRESS from other
> transactions. We would do this by waiting until the transactions owning those
> tuples exit. This would allow us to index tables being used by transactions
> which release their locks early. In particular this happens for system
> tables.
The latest patch (v15) posted does this for system tables. We still
error-out for non-system tables, just the way we do today.
> > The tricky case arises with queries executed in the same transaction as
> > CREATE INDEX. In the case of a new table created within the same
> > transaction (such as with pg_dump), the index will be usable because
> > there will never be any HOT update chains so the indcreatexid will never
> > be set.
> This is unclear and perhaps misleading. I think it needs to be more like "In
> the case of a new table in which rows were inserted but none updated (such as
> with pg_dump) the index will be usable because ..."
Right.
> > Also in the case of a read-committed transaction new queries will be able to
> > use the index. A serializable transaction building an index on an existing
> > table with HOT updates cannot use the index.
> I don't think this is clear and I'm not sure it's right.
>
> Currently the transaction that actually did the CREATE INDEX has to follow the
> same rules as other transactions. This means if there were any visible hot
> updated tuples and the index is therefore marked with our xid in indcreatexid
> we will *not* be able to use it in the same transaction, as our xid is never in
> our serializable snapshot. This is true even if we're not in serializable mode,
> as we cannot know what earlier snapshots are still in use and may be used with
> the new plan.
>
> NOTE: This again is something likely to change. In many cases it ought to be
> possible to have the transaction use the index it just built even if there
> were visible HOT updated tuples in it.
It would be great if we could achieve that, but we haven't been able to
find a solution that works in all cases and yet is not too complicated.
My guess is any such solution will involve breaking the existing
HOT chains.
Right now we support one of the most common use cases, where
a table is created, loaded with data and one or more indexes are
created - all in the same transaction. In this case, the index will
be usable in the same transaction. And maybe we should be
able to do something better for READ COMMITTED transactions.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On 9/4/07, Bruce Momjian <bruce@momjian.us> wrote:
> I have taken this, and Pavan's documentation about CREATE INDEX, and
> worked up an updated README.  Comments?  Corrections?
Thanks Bruce, Heikki, Greg for helping me with the documentation.
> The requirements for doing a HOT update is that none of the indexed
> columns are changed. That is checked at execution time, comparing the
> binary representation of the old and new values.
It would be worth mentioning that columns appearing in predicates
of partial indexes and expressions of expression indexes are also
checked. If any of these columns are changed, HOT update is not done.
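A sketch of what that check amounts to: the executor collects one set of
"interesting" attributes (plain index columns plus partial-index
predicate and expression-index columns), and the update is HOT-safe
only if every such attribute is unchanged, byte for byte. The C below is
an invented, self-contained model (string comparison stands in for
binary datum comparison, and the function name is made up):

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define MAXATTR 8

    typedef struct ModelRow
    {
        const char *attrs[MAXATTR];  /* stand-ins for column binary images */
    } ModelRow;

    static bool
    hot_update_satisfied(const bool interesting[MAXATTR],
                         const ModelRow *oldrow, const ModelRow *newrow)
    {
        for (int a = 0; a < MAXATTR; a++)
        {
            if (!interesting[a])
                continue;
            if (strcmp(oldrow->attrs[a], newrow->attrs[a]) != 0)
                return false;        /* an indexed/predicate column changed */
        }
        return true;
    }

    int
    main(void)
    {
        bool interesting[MAXATTR] = { true, false, true };
        ModelRow oldrow = { { "1", "foo", "x" } };
        ModelRow newrow = { { "1", "bar", "x" } };  /* only attr 1 changes */
        printf("HOT-safe: %d\n",
               hot_update_satisfied(interesting, &oldrow, &newrow));
        return 0;
    }

Here the changed column is not in the interesting set, so the update
would still qualify as HOT.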
> When the last live tuple in an update chain becomes dead (after a DELETE
> or a cold update), the redirecting line pointer is marked as redirected
> dead. That allows us to immediately reuse the heap space (but not the
> line pointer itself).
A lazy vacuum is required to reclaim redirect-dead line pointers.
> To limit the damage in the worst case, and to
> keep numerous arrays as well as the bitmaps in bitmap scans reasonably
> sized, the maximum number of line pointers (MaxHeapTuplesPerPage) is
> arbitrarily capped at twice its previous maximum.
With the latest patch, we have reverted it back to the original value.
> VACUUM FULL
> -----------
It might be worth mentioning that vacuum full also removes
redirected line pointers by making them directly point to
the first tuple in the HOT chain. We can do so, because vacuum
full works with an exclusive lock on the relation.
> VACUUM
> ------
>
> There is little change to regular vacuum. It removes dead HOT tuples,
> like pruning does, and cleans up any redirected dead line pointers.
One change that is worth mentioning is that with HOT, vacuum needs a
vacuum-strength lock in the first phase (currently it works with a
SHARE lock if no tuples need freezing, or an EXCLUSIVE lock otherwise).
We can improve it a bit by first checking if there is really a need for
pruning and only then going for the cleanup lock. But that's probably
not worth the effort (at least for large tables, where we should
usually get the cleanup lock rather easily).
> Statistics
> ----------
>
> XXX: How do HOT-updates affect statistics? How often do we need to run
> autovacuum?
As the latest patch stands, we track dead-space in the relation and trigger
autovacuum based on the percentage of dead space in the table. We
don't have any mechanism to account for index bloat yet. Autoanalyze
does not change.
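The trigger described here is simple arithmetic; a hedged C sketch (the
threshold value and names are made up for illustration, not the patch's
actual settings):

    #include <stdbool.h>
    #include <stdio.h>

    /* Vacuum when tracked dead space exceeds a fraction of the table. */
    static bool
    needs_autovacuum(double dead_space_bytes, double table_bytes,
                     double dead_fraction_threshold)
    {
        if (table_bytes <= 0)
            return false;
        return dead_space_bytes / table_bytes > dead_fraction_threshold;
    }

    int
    main(void)
    {
        /* 300 MB dead out of 1 GB, 20% threshold => vacuum */
        printf("%d\n", needs_autovacuum(300e6, 1e9, 0.20));
        return 0;
    }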
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote:
>
> "Bruce Momjian" <bruce@momjian.us> writes:
>
> > I have taken this, and Pavan's documentation about CREATE INDEX, and
> > worked up an updated README.  Comments?  Corrections?
>
> Oh, I think the CREATE INDEX documentation you refer to was actually the
> one I suggested.

Sorry, I see now it was you who wrote that section, not Pavan.

> A few tweaks:
>
> > (If we find any HOT-updated tuples with RECENTLY_DEAD or
> > DELETE_IN_PROGRESS we ignore it assuming that we will also come across
> > the _end_ of the update chain and index that instead.)
>
> There's more to this. We build a mapping telling us the Root tuple for each
> tuple in the page. Then we scan tuples looking for the Head of each HOT
> chain (ie, a tuple that wasn't HOT updated) and index the corresponding Root
> from the map using the key value from the Head tuple.
>
> We treat DELETE_IN_PROGRESS the same way we treat RECENTLY_DEAD (and
> INSERT_IN_PROGRESS the same as LIVE) because we assume it's been deleted (or
> inserted) by our own transaction. So while it's not actually committed yet we
> can assume it is since if its transaction aborts the index creation itself
> will be aborted. Other transactions cannot be deleting or inserting tuples
> without having committed or aborted already because we have a lock on the
> table and the other transactions normally keep their locks until they exit.

Updated text:

    Because we have a lock on the table, any RECENTLY_DEAD,
    DELETE_IN_PROGRESS, and INSERT_IN_PROGRESS tuples belong to our own
    transaction.  Therefore we can consider them committed since if the
    CREATE INDEX commits, they will be committed, and if it aborts the
    index is discarded.

> NOTE: This is something likely to change. Current discussions are leading
> towards handling DELETE_IN_PROGRESS and INSERT_IN_PROGRESS from other
> transactions. We would do this by waiting until the transactions owning those
> tuples exit. This would allow us to index tables being used by transactions
> which release their locks early. In particular this happens for system
> tables.

OK, we will just have to update the README then.

> > The tricky case arises with queries executed in the same transaction as
> > CREATE INDEX. In the case of a new table created within the same
> > transaction (such as with pg_dump), the index will be usable because
> > there will never be any HOT update chains so the indcreatexid will never
> > be set.
>
> This is unclear and perhaps misleading. I think it needs to be more like "In
> the case of a new table in which rows were inserted but none updated (such as
> with pg_dump) the index will be usable because ..."

Done.

> > Also in the case of a read-committed transaction new queries will be able to
> > use the index. A serializable transaction building an index on an existing
> > table with HOT updates cannot use the index.
>
> I don't think this is clear and I'm not sure it's right.
>
> Currently the transaction that actually did the CREATE INDEX has to follow the
> same rules as other transactions. This means if there were any visible hot
> updated tuples and the index is therefore marked with our xid in indcreatexid
> we will *not* be able to use it in the same transaction, as our xid is never in
> our serializable snapshot. This is true even if we're not in serializable mode,
> as we cannot know what earlier snapshots are still in use and may be used with
> the new plan.

Updated paragraph:

    The CREATE INDEX transaction cannot use the index if HOT rows exist
    in the table.  Fortunately, many CREATE INDEX transactions use new
    tables in which rows are inserted but not updated (such as with
    pg_dump).  Such tables have no HOT rows so the index will be usable
    because indcreatexid will never be set.

> NOTE: This again is something likely to change. In many cases it ought to be
> possible to have the transaction use the index it just built even if there
> were visible HOT updated tuples in it.
>
> In particular in READ COMMITTED transactions which have no outstanding
> commands using early snapshots, subsequent planning ought to be able to
> use the index. Even if outstanding commands are using old snapshots, if we can
> be sure they can't use the new plan then it would still be safe to use the
> index in the new plan. Also in SERIALIZABLE mode those same statements hold
> for temporary tables.

OK, we can update the README at that point.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Pavan Deolasee wrote:
> It would be worth mentioning that columns appearing in predicates
> of partial indexes and expressions of expression indexes are also
> checked. If any of these columns are changed, HOT update is not done.

Added now:

    Tuples are checked against expression and partial indexes to be
    sure no referenced columns change.

Something about those was in the original version but not the new one I
used.

> > When the last live tuple in an update chain becomes dead (after a DELETE
> > or a cold update), the redirecting line pointer is marked as redirected
> > dead. That allows us to immediately reuse the heap space (but not the
> > line pointer itself).
>
> A lazy vacuum is required to reclaim redirect-dead line pointers.

Updated:

    When the last live tuple in an update chain becomes dead (after a
    DELETE or a cold update), the redirecting line pointer is marked as
    redirected dead.  That allows us to immediately reuse the heap
    space, while VACUUM can reclaim the line pointer space.

> > To limit the damage in the worst case, and to
> > keep numerous arrays as well as the bitmaps in bitmap scans reasonably
> > sized, the maximum number of line pointers (MaxHeapTuplesPerPage) is
> > arbitrarily capped at twice its previous maximum.
>
> With the latest patch, we have reverted it back to the original value.

Updated:

    We have effectively implemented the "truncate dead tuples to just
    line pointer" idea that has been proposed and rejected before
    because of fear of line pointer bloat.  To limit the damage in the
    worst case, and to keep numerous arrays as well as the bitmaps in
    bitmap scans reasonably sized, the maximum number of line pointers
    (MaxHeapTuplesPerPage) is arbitrarily capped.

> > VACUUM FULL
> > -----------
>
> It might be worth mentioning that vacuum full also removes
> redirected line pointers by making them directly point to
> the first tuple in the HOT chain. We can do so, because vacuum
> full works with an exclusive lock on the relation.

Updated:

    To make vacuum full work, any DEAD tuples in the middle of an
    update chain need to be removed (see comments at the top of
    heap_prune_hotchain_hard() for details).  Vacuum full performs a
    more aggressive pruning that not only removes dead tuples at the
    beginning of an update chain, but scans the whole chain and removes
    any intermediate dead tuples as well.  It also removes redirected
    line pointers by making them directly point to the first tuple in
    the HOT chain.  This is possible because vacuum full works with an
    exclusive lock on the relation.

> > VACUUM
> > ------
> >
> > There is little change to regular vacuum. It removes dead HOT tuples,
> > like pruning does, and cleans up any redirected dead line pointers.
>
> One change that is worth mentioning is that with HOT, vacuum needs a
> vacuum-strength lock in the first phase (currently it works with a
> SHARE lock if no tuples need freezing, or an EXCLUSIVE lock otherwise).
> We can improve it a bit by first checking if there is really a need for
> pruning and only then going for the cleanup lock. But that's probably
> not worth the effort (at least for large tables, where we should
> usually get the cleanup lock rather easily).
>
> > Statistics
> > ----------
> >
> > XXX: How do HOT-updates affect statistics? How often do we need to run
> > autovacuum?
>
> As the latest patch stands, we track dead-space in the relation and trigger
> autovacuum based on the percentage of dead space in the table. We
> don't have any mechanism to account for index bloat yet. Autoanalyze
> does not change.

OK.

Current README.HOT version:

    ftp://momjian.us/pub/postgresql/mypatches/README.HOT

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > It would be worth mentioning that columns appearing in predicates > of partial indexes and expressions of expression indexes are also > checked. If any of these columns are changed, HOT update is not done. In discussion a question arose. I don't think we currently compare these columns before when we're building an index and find a visible hot update. We just set hot_visible even if the chain might still be a valid hot chain for the new index right? Should we consider checking the columns in that case too? Or would it be too much extra overhead? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Bruce Momjian <bruce@momjian.us> writes:
> Current README.HOT version:
>     ftp://momjian.us/pub/postgresql/mypatches/README.HOT

I've done some further editorializing and wordsmithing on this, and
marked with XXX some areas that seem like they're still red flags (or at
least inadequately described).

            regards, tom lane


$PostgreSQL$

                        Heap Only Tuples (HOT)

Introduction
------------

The Heap Only Tuple (HOT) feature eliminates redundant index entries and
allows the re-use of space taken by DELETEd or obsoleted UPDATEd tuples
without performing a table-wide vacuum.  It does this by allowing
single-page vacuuming, also called "defragmentation".

Note: there is a Glossary at the end of this document that may be
helpful for first-time readers.

Technical Challenges
--------------------

Page-at-a-time vacuuming is normally impractical because of the costs of
finding and removing the index entries that link to the tuples to be
reclaimed.  Standard vacuuming scans the indexes to ensure all such
index entries are removed, amortizing the index scan cost across as many
dead tuples as possible; this approach does not scale down well to the
case of reclaiming just a few tuples.  In principle one could recompute
the index keys and do standard index searches to find the index entries,
but this is risky in the presence of possibly-buggy user-defined
functions in functional indexes.  An allegedly immutable function that
in fact is not immutable might prevent us from re-finding an index entry
(and we cannot throw an error for not finding it, in view of the fact
that dead index entries are sometimes reclaimed early).  That would lead
to a seriously corrupt index, in the form of entries pointing to tuple
slots that by now contain some unrelated content.  In any case we would
prefer to be able to do vacuuming without invoking any user-written
code.

HOT solves this problem for a restricted but useful special case: where
a tuple is repeatedly updated in ways that do not change its indexed
columns.  (Here, "indexed column" means any column referenced at all in
an index definition, including for example columns that are tested in a
partial-index predicate but are not stored in the index.)

An additional property of HOT is that it reduces index size by avoiding
the creation of identically-keyed index entries.  This improves search
speeds.

UPDATE Chains With a Single Index Entry
---------------------------------------

Without HOT, every version of a row in an UPDATE chain has its own index
entries, even if all indexed columns are the same.  With HOT, a new
tuple placed on the same page and with all indexed columns the same as
its parent row version does not get new index entries.  This means there
is only one index entry for the entire UPDATE chain on the heap page.
An index-less tuple is marked with the HEAP_ONLY_TUPLE flag.  The prior
row version is marked HEAP_HOT_UPDATED, and (as always in an UPDATE
chain) its t_ctid field links forward to the newer version.

For example:

    Index points to 1
    lp [1]  [2]

    [111111111]->[2222222222]

In the above diagram, the index points to line pointer 1, and tuple 1 is
marked as HEAP_HOT_UPDATED.  Tuple 2 is a HOT tuple, meaning it has no
index row pointing to it, and is marked as HEAP_ONLY_TUPLE.  Although
tuple 2 is not directly referenced by the index, it can still be found
by an index search: after traversing from the index to tuple 1, the
index search proceeds forward to child tuples as long as it sees the
HEAP_HOT_UPDATED flag set.
Since we restrict the HOT chain to lie within a single page, this
requires no additional page fetches and doesn't introduce much
performance penalty.

Eventually, tuple 1 will no longer be visible to any transaction.  At
that point its space could be reclaimed, but its line pointer cannot,
since the index still links to that line pointer and we still need to be
able to find tuple 2 in an index search.  HOT handles this by turning
line pointer 1 into a "redirecting line pointer", which links to tuple 2
but has no actual tuple attached.  This state of affairs looks like

    Index points to 1
    lp [1]->[2]

    [2222222222]

If now the row is updated again, to version 3, the page looks like this:

    Index points to 1
    lp [1]->[2]  [3]

    [2222222222]->[3333333333]

At some later time when no transaction can see tuple 2 in its snapshot,
tuple 2 and its line pointer can be pruned entirely:

    Index points to 1
    lp [1]------>[3]

    [3333333333]

This is safe because no index entry points to line pointer 2.
Subsequent insertions into the page can now recycle both line pointer 2
and the space formerly used by tuple 2.

If an update changes any indexed column, or there is not room on the
same page for the new tuple, then the HOT chain ends: the last member
has a regular t_ctid link to the next version and is not marked
HEAP_HOT_UPDATED.  (In principle we could continue a HOT chain across
pages, but this would destroy the desired property of being able to
reclaim space with just page-local manipulations.  Anyway, we don't want
to have to chase through multiple heap pages to get from an index entry
to the desired tuple.)  If further updates occur, the next version could
become the root of a new HOT chain.

Line pointer 1 has to remain as long as there is any non-dead member of
the chain on the page.  When there is not, it is marked "redirected
dead".  This lets us reclaim the last child line pointer and associated
tuple immediately.  The next regular VACUUM pass can reclaim the index
entries pointing at the line pointer and then the line pointer itself.
Since a line pointer is small compared to a tuple, this does not
represent an undue space cost.

Note: we can use a "redirected dead" line pointer for any DELETEd tuple,
whether it was part of a HOT chain or not.  This allows space
reclamation in advance of running VACUUM for plain DELETEs as well as
HOT updates.

The requirement for doing a HOT update is that none of the indexed
columns are changed.  This is checked at execution time by comparing the
binary representation of the old and new values.  We insist on bitwise
equality rather than using datatype-specific equality routines.  The
main reason to avoid the latter is that there might be multiple notions
of equality for a datatype, and we don't know exactly which one is
relevant for the indexes at hand.  We assume that bitwise equality
guarantees equality for all purposes.

Index/Sequential Scans
----------------------

When doing an index scan, whenever we reach a HEAP_HOT_UPDATED tuple
whose xmax is not aborted, we need to follow its t_ctid link and check
that entry as well; possibly repeatedly until we reach the end of the
HOT chain.  (When using an MVCC snapshot it is possible to optimize this
a bit: there can be at most one visible tuple in the chain, so we can
stop when we find it.  This rule does not work for non-MVCC snapshots,
though.)

Sequential scans do not need to pay attention to the HOT flags because
they scan every item pointer on the page anyway.  The same goes for a
bitmap heap scan with a lossy bitmap.
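The chain-following rule, including the MVCC early-stop optimization,
can be modelled in a few lines of self-contained C.  This is an
illustrative toy: the flags, the struct and the visibility test are
stand-ins for the real HeapTupleHeader and snapshot machinery.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define HEAP_HOT_UPDATED 0x01  /* next tuple in chain is heap-only */
    #define HEAP_ONLY_TUPLE  0x02  /* tuple has no index pointer */

    typedef struct ModelTuple
    {
        uint16_t flags;   /* HEAP_* flags above */
        int      next;    /* model of t_ctid: next slot in chain, -1 if none */
        bool     visible; /* stand-in for a snapshot visibility test */
    } ModelTuple;

    /*
     * Walk a HOT chain starting from the slot an index entry points at,
     * returning the first visible member or -1.  With an MVCC snapshot
     * there is at most one visible member, so returning on the first
     * hit is the early stop described above.
     */
    static int
    follow_hot_chain(ModelTuple *page, int slot)
    {
        for (;;)
        {
            if (page[slot].visible)
                return slot;
            if (!(page[slot].flags & HEAP_HOT_UPDATED) || page[slot].next < 0)
                return -1;    /* end of chain, nothing visible */
            slot = page[slot].next;
        }
    }

    int
    main(void)
    {
        ModelTuple page[3] = {
            { HEAP_HOT_UPDATED, 1, false },                   /* root, dead */
            { HEAP_HOT_UPDATED | HEAP_ONLY_TUPLE, 2, false }, /* middle, dead */
            { HEAP_ONLY_TUPLE, -1, true },                    /* live head */
        };
        printf("visible member at slot %d\n", follow_hot_chain(page, 0));
        return 0;
    }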
Pruning
-------

HOT pruning means updating item pointers so that HOT chains are reduced
in length, by collapsing out line pointers for intermediate dead tuples.
Although this makes those line pointers available for re-use, it does
not immediately make the space occupied by their tuples available.

Defragmentation
---------------

Defragmentation centralizes unused space.  After we have converted root
line pointers to redirected line pointers and pruned away any dead
intermediate line pointers, the tuples they linked to are free space.
But unless that space is adjacent to the central "hole" on the page (the
pd_lower-to-pd_upper area) it will not be used by tuple insertion.
Defragmentation moves the surviving tuples to coalesce all the free
space into one "hole".  This is done with the same
PageRepairFragmentation function that regular VACUUM uses.

When can/should we prune or defragment?
---------------------------------------

This is the most interesting question in HOT implementation, since there
is no simple right answer: we must use heuristics to determine when it's
most efficient to perform pruning and/or defragmenting.

We cannot prune or defragment unless we can get a "buffer cleanup lock"
on the target page; otherwise, pruning might destroy line pointers that
other backends have live references to, and defragmenting might move
tuples that other backends have live pointers to.  Thus the general
approach must be to heuristically decide if we should try to prune or
defragment, and if so try to acquire the buffer cleanup lock without
blocking.  If we succeed we can proceed with our housekeeping work.  If
we cannot get the lock (which should not happen often, except under very
heavy contention) then the housekeeping has to be postponed till some
other time.  The worst-case consequence of this is only that an UPDATE
cannot be made HOT but has to link to a new tuple version placed on some
other page, for lack of centralized space on the original page.

Ideally we would do defragmenting only when we are about to attempt
heap_update on a HOT-safe tuple.  The difficulty with this approach is
that the update query has certainly got a pin on the old tuple, and
therefore our attempt to acquire a buffer cleanup lock will always fail.
(This corresponds to the idea that we don't want to move the old tuple
out from under where the query's HeapTuple pointer points.  It might be
possible to finesse that, but it seems fragile.)

Pruning, however, is potentially useful even when we are not about to
insert a new tuple, since shortening a HOT chain reduces the cost of
subsequent index searches.  The currently proposed heuristic is to prune
and defrag when accessing a page with HOT updated tuples and with less
than a certain threshold of free space (typically less than the average
tuple size for that table).  XXX there is much argument about this.

We have effectively implemented the "truncate dead tuples to just line
pointer" idea that has been proposed and rejected before because of fear
of line pointer bloat: we might end up with huge numbers of line
pointers and just a few actual tuples on a page.  To limit the damage in
the worst case, and to keep various work arrays as well as the bitmaps
in bitmap scans reasonably sized, the maximum number of line pointers
per page (MaxHeapTuplesPerPage) is arbitrarily capped.

VACUUM
------

There is little change to regular vacuum.  It removes dead HOT tuples,
like pruning does, and cleans up any redirected dead line pointers.
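To picture what the defragmentation step above does, here is a toy C
compaction over an invented page layout.  It only demonstrates the
"slide live tuples together so the free space coalesces into one hole"
idea, not PageRepairFragmentation itself:

    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 64

    typedef struct { int off, len, live; } Slot;

    /* Copy live tuples to the end of a scratch page, back to front, so
     * everything below the returned offset becomes one contiguous hole. */
    static int
    compact(char *page, Slot *slots, int nslots)
    {
        char scratch[PAGE_SIZE];
        int  upper = PAGE_SIZE;

        for (int i = 0; i < nslots; i++)
        {
            if (!slots[i].live)
                continue;
            upper -= slots[i].len;
            memcpy(scratch + upper, page + slots[i].off, slots[i].len);
            slots[i].off = upper;   /* line pointer now points at new home */
        }
        memcpy(page + upper, scratch + upper, PAGE_SIZE - upper);
        return upper;
    }

    int
    main(void)
    {
        char page[PAGE_SIZE] = "................AAAA....BBBB....";
        Slot slots[] = { { 16, 4, 1 }, { 20, 4, 0 }, { 24, 4, 1 } };
        int  hole_end = compact(page, slots, 3);
        printf("free space is bytes 0..%d\n", hole_end - 1);
        return 0;
    }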
In lazy vacuum, we must not freeze a tuple that is in the middle of an
update chain.  That can happen when a tuple has xmin > xmax; it is the
same scenario that requires "hard pruning" in VACUUM FULL.  Freezing
such tuples will break the check that xmin and xmax match when following
the chain.  It is not a problem without HOT, because the preceding tuple
in the chain must be dead as well so no one will try to follow the
chain, but with HOT the preceding tuple would be DEAD_CHAIN, and someone
might still need to follow the chain to find the live tuple.  We avoid
that by just not freezing such tuples.  They can be frozen eventually,
when the xmax of the preceding tuple is < OldestXmin as well.

XXX doesn't the above completely break the anti-wraparound guarantees?
And why couldn't we avoid the problem by pruning the previous tuple,
which is surely dead?

VACUUM FULL
-----------

To make vacuum full work, any DEAD tuples in the middle of an update
chain need to be removed (see comments at the top of
heap_prune_hotchain_hard() for details).  Vacuum full performs a more
aggressive pruning that not only removes dead tuples at the beginning of
an update chain, but scans the whole chain and removes any intermediate
dead tuples as well.  It also removes redirected line pointers by making
them directly point to the first tuple in the HOT chain.  This causes a
user-visible change in the tuple's CTID, but since VACUUM FULL has
always moved tuple CTIDs, that should not break anything.

XXX any extra complexity here needs justification --- a lot of it.

Statistics
----------

XXX: How do HOT-updates affect statistics?  How often do we need to run
autovacuum and autoanalyze?

As the latest patch stands, we track dead-space in the relation and
trigger autovacuum based on the percentage of dead space in the table.
We don't have any mechanism to account for index bloat yet.  Autoanalyze
does not change.

CREATE INDEX
------------

CREATE INDEX presents a problem for HOT updates.  While the existing HOT
chains all have the same index values, the new index might invalidate
the HOT chain because the columns in the new index might change within a
pre-existing HOT chain.

To address this issue, regular (non-concurrent) CREATE INDEX makes the
new index usable only by transactions newer than the CREATE INDEX
command.  This prevents transactions that can see the inconsistent HOT
chains from trying to use the new index and getting incorrect results.
New transactions can only see the rows visible after the index was
created, hence the HOT chains are consistent for them.

Entries in the new index point to root tuples (tuples with current index
pointers) so that our index uses the same index pointers as all other
indexes on the table.  However the row we want to index is actually at
the *end* of the chain, ie, the most recent live tuple on the HOT chain.
That is the one we compute the index entry values for, but the TID we
put into the index is that of the root tuple.  Since transactions that
will be allowed to use the new index cannot see any of the older tuple
versions in the chain, the fact that they might not match the index
entry isn't a problem.  (Such transactions will check the tuple
visibility information of the older versions and ignore them, without
ever looking at their contents, so the content inconsistency is OK.)
Subsequent updates to the live tuple will be allowed to extend the HOT
chain only if they are HOT-safe for all the indexes.
CREATE INDEX
------------

CREATE INDEX presents a problem for HOT updates.  While the existing HOT
chains all have the same index values, the new index might invalidate
the HOT chain because the columns in the new index might change within a
pre-existing HOT chain.

To address this issue, regular (non-concurrent) CREATE INDEX makes the
new index usable only by transactions newer than the CREATE INDEX
command.  This prevents transactions that can see the inconsistent HOT
chains from trying to use the new index and getting incorrect results.
New transactions can only see the rows visible after the index was
created, hence the HOT chains are consistent for them.

Entries in the new index point to root tuples (tuples with current index
pointers) so that our index uses the same index pointers as all other
indexes on the table.  However the row we want to index is actually at
the *end* of the chain, ie, the most recent live tuple on the HOT chain.
That is the one we compute the index entry values for, but the TID we
put into the index is that of the root tuple.  Since transactions that
will be allowed to use the new index cannot see any of the older tuple
versions in the chain, the fact that they might not match the index
entry isn't a problem.  (Such transactions will check the tuple
visibility information of the older versions and ignore them, without
ever looking at their contents, so the content inconsistency is OK.)
Subsequent updates to the live tuple will be allowed to extend the HOT
chain only if they are HOT-safe for all the indexes.

Because we have ShareLock on the table, any DELETE_IN_PROGRESS or
INSERT_IN_PROGRESS tuples should have come from our own transaction.
Therefore we can consider them committed since if the CREATE INDEX
commits, they will be committed, and if it aborts the index is
discarded.  An exception to this is that early lock release is customary
for system catalog updates, and so we might find such tuples when
reindexing a system catalog.  In that case we deal with it by waiting
for the source transaction to commit or roll back.  (We could do that
for user tables too, but since the case is unexpected we prefer to throw
an error.)

Practically, we prevent old transactions from using the new index by
putting our transaction id in pg_index.indcreatexid after building the
index (but only if live HOT chains exist in the table).  Queries can use
a new index only after indcreatexid is below their TransactionXmin
horizon.  This means in particular that the transaction creating the
index will be unable to use the index.  We alleviate that problem
somewhat by not setting indcreatexid unless the table actually contains
HOT chains with RECENTLY_DEAD members.  (In 8.4 we should be able to
improve on that, at least for non-serializable transactions, because we
expect to be able to advance TransactionXmin intratransaction.)

Another unpleasant consequence is that it is now risky to use
SnapshotAny in an index scan: if the index was created more recently
than the last vacuum, it's possible that some of the visited tuples do
not match the index entry they are linked to.  This does not seem to be
a fatal objection, since there are few users of SnapshotAny and most use
seqscans.  The only exception at this writing is CLUSTER, which is okay
because it does not require perfect ordering of the indexscan readout
(and especially so because CLUSTER tends to write recently-dead tuples
out of order anyway).
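As an illustration, the planner-side gate described above might look
roughly like the sketch below; the indcreatexid field comes from the
description above, while the loop and variable names are assumptions,
not the patch's actual code:

    /*
     * Illustrative sketch: while considering indexes for a query, skip
     * any index whose creating transaction is not yet older than our
     * xmin horizon, since scans through it might reach broken HOT
     * chains.  TransactionXmin is the backend's global xmin.
     */
    ListCell   *lc;

    foreach(lc, candidate_indexes)      /* hypothetical planner loop */
    {
        Form_pg_index indexForm = (Form_pg_index) lfirst(lc);

        if (TransactionIdIsValid(indexForm->indcreatexid) &&
            !TransactionIdPrecedes(indexForm->indcreatexid, TransactionXmin))
            continue;       /* index too new for our snapshot; ignore it */

        /* ... otherwise the index is usable by this query ... */
    }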
CREATE INDEX CONCURRENTLY
-------------------------

In the concurrent case we take a different approach.  We create the
pg_index entry immediately, before we scan the table.  The pg_index
entry is marked as "not ready for inserts".  Then we commit and wait for
any transactions which have the table open to finish.  This ensures that
no *new* HOT updates will change the key value for our new index,
because all transactions will see the existence of the index and will
respect its constraint on which updates can be HOT.

After waiting for transactions which had the table open, we build the
index with a snapshot.  Because we waited for anyone who existed before
the index was created, any tuples seen in the snapshot will have only
valid forward-growing HOT chains.  (They might still have older HOT
updates behind them which are broken.)  As above, we point the index
pointer at the root of the HOT-update chain but we use the key from the
end of the chain.

We mark the index open for inserts (but still not ready for reads) then
we again wait for transactions which have the table open.  Then we take
a second reference snapshot and validate the index.  This searches for
tuples missing from the index --- but it again has to look up the root
of the HOT chains and search for those tuples in the index.

Then we wait until every transaction that could have a snapshot older
than the second reference snapshot is finished.  This ensures that
nobody is alive any longer who could possibly see any of the older
tuples in a broken HOT chain.  (This condition is actually stronger than
what is needed to protect the broken HOT chains --- the problem we are
dealing with is tuples inserted after we took our first reference
snapshot.  Those will be part of valid HOT-chains, but might not be in
the index.)

XXX the above being the case, why should we bother with all this extra
mechanism?  Seems we could dispense with "not ready for inserts".

Limitations and Restrictions
----------------------------

It is worth noting that HOT forever forecloses alternative approaches to
vacuuming, specifically the recompute-the-index-keys approach alluded to
in Technical Challenges above.  It'll be tough to recompute the index
keys for a root line pointer you don't have data for anymore ...

Glossary
--------

Broken HOT Chain

	A HOT chain in which the key value for an index has changed.
	This is not allowed to occur normally but if a new index is
	created it can happen.  In that case various strategies are
	used to ensure that no transaction for which the older tuples
	are visible can use the index.

Cold update

	A normal, non-HOT update, in which index entries are made for
	the new version of the tuple.

HEAPTUPLE_DEAD_CHAIN

	New return value for HeapTupleSatisfiesVacuum, which means that
	the tuple is not visible to anyone, but it has been HOT updated
	so we cannot remove it yet because the following tuples in the
	chain would become inaccessible from indexes.

Heap-only tuple

	A heap tuple with no index pointers, which can only be reached
	from indexes indirectly through its ancestral root tuple.
	Marked with HEAP_ONLY_TUPLE flag.

HOT-safe

	A proposed tuple update is said to be HOT-safe if it changes
	none of the tuple's indexed columns.  It will only become an
	actual HOT update if we can find room on the same page for the
	new tuple version.

HOT update

	An UPDATE where the new tuple becomes a heap-only tuple, and no
	new index entries are made.

HOT-updated tuple

	An updated tuple, for which the next tuple in the chain is a
	heap-only tuple.  Marked with HEAP_HOT_UPDATED flag.

Indexed column

	A column used in an index definition.  The column might not
	actually be stored in the index --- it could be used in a
	functional index's expression, or used in a partial index
	predicate.  HOT treats all these cases alike.

Redirecting line pointer

	A line pointer that points to another line pointer and has no
	associated tuple.  Its lp_len is set to a magic value
	(ITEMID_REDIRECTED), and lp_off is the OffsetNumber of the line
	pointer it points to.

Redirected dead line pointer

	A stub line pointer, that does not point to anything, but
	cannot be removed or reused yet because there are index
	pointers to it.  Semantically same as a dead tuple.

Root tuple

	The first tuple in a HOT update chain; the one that indexes
	point to.

Update chain

	A chain of updated tuples, in which each tuple's ctid points to
	the next tuple in the chain.  A HOT update chain is an update
	chain (or portion of an update chain) that consists of a root
	tuple and one or more heap-only tuples.  A complete update
	chain can contain both HOT and non-HOT (cold) updated tuples.
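To illustrate the HOT-safe test defined in the glossary, here is a
rough C sketch; the function name and the pre-built bitmapset of
indexed-column numbers are this note's own assumptions rather than the
patch's code, and the sketch ignores the attribute-number offsetting
that real code would need to cope with system columns:

    /*
     * Hypothetical sketch: an update is HOT-safe iff no indexed column
     * (plain, expression, or predicate column alike) changes value.
     * Assumes user columns only (attnum >= 1).
     */
    static bool
    update_is_hot_safe(Relation rel, Bitmapset *indexed_attrs,
                       HeapTuple oldtup, HeapTuple newtup)
    {
        TupleDesc   tupdesc = RelationGetDescr(rel);
        int         attnum = -1;

        while ((attnum = bms_next_member(indexed_attrs, attnum)) >= 0)
        {
            Form_pg_attribute att = TupleDescAttr(tupdesc, attnum - 1);
            Datum       oldval, newval;
            bool        oldnull, newnull;

            oldval = heap_getattr(oldtup, attnum, tupdesc, &oldnull);
            newval = heap_getattr(newtup, attnum, tupdesc, &newnull);

            if (oldnull != newnull)
                return false;   /* NULL-ness changed: not HOT-safe */
            if (!oldnull &&
                !datumIsEqual(oldval, newval, att->attbyval, att->attlen))
                return false;   /* an indexed column changed value */
        }
        return true;            /* no indexed column changed */
    }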
On 9/12/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Bruce Momjian <bruce@momjian.us> writes:
> Current README.HOT version:
> ftp://momjian.us/pub/postgresql/mypatches/README.HOT
I've done some further editorializing and wordsmithing on this,
and marked with XXX some areas that seem like they're still red flags
(or at least inadequately described).
Thanks a lot.
CREATE INDEX CONCURRENTLY
-------------------------
In the concurrent case we take a different approach. We create the
pg_index entry immediately, before we scan the table. The pg_index entry
is marked as "not ready for inserts". Then we commit and wait for any
transactions which have the table open to finish. This ensures that no
*new* HOT updates will change the key value for our new index, because all
transactions will see the existence of the index and will respect its
constraint on which updates can be HOT.
After waiting for transactions which had the table open, we build the
index with a snapshot. Because we waited for anyone who existed before
the index was created, any tuples seen in the snapshot will have only
valid forward-growing HOT chains. (They might still have older HOT
updates behind them which are broken.) As above, we point the index
pointer at the root of the HOT-update chain but we use the key from the
end of the chain.
We mark the index open for inserts (but still not ready for reads) then
we again wait for transactions which have the table open. Then we take
a second reference snapshot and validate the index. This searches for
tuples missing from the index --- but it again has to look up the root
of the HOT chains and search for those tuples in the index.
Then we wait until every transaction that could have a snapshot older than
the second reference snapshot is finished. This ensures that nobody is
alive any longer who could possibly see any of the older tuples in a
broken HOT chain. (This condition is actually stronger than what is
needed to protect the broken HOT chains --- the problem we are dealing
with is tuples inserted after we took our first reference snapshot.
Those will be part of valid HOT-chains, but might not be in the index.)
XXX the above being the case, why should we bother with all this extra
mechanism? Seems we could dispense with "not ready for inserts".
The "not ready for inserts" logic is necessary because we have separated
the creation of the catalog entry for the new index and the actual
bottom-up build of the index into separate transactions. My worry was
that once the catalog entry is committed, index inserts from other
backends won't work correctly while the concurrent index build is still
in progress.
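For illustration, a minimal sketch of that gate, under the assumption
that the executor sees a per-index "ready for inserts" flag (the field
and function names here are assumptions, not necessarily the patch's):

    /*
     * Illustrative sketch: while an index is marked "not ready for
     * inserts", other backends still open it (so HOT-safety checks see
     * its columns) but skip inserting entries into it.
     */
    static void
    insert_into_ready_indexes(IndexInfo **indexInfoArray, int numIndexes)
    {
        for (int i = 0; i < numIndexes; i++)
        {
            IndexInfo  *indexInfo = indexInfoArray[i];

            if (!indexInfo->ii_ReadyForInserts)
                continue;   /* CIC phase 1 in progress: skip this index */

            /* ... form the index tuple and call index_insert() here ... */
        }
    }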
I guess I also need to explain why we need to separate the catalog entry
creation and the index build into separate transactions. The primary
purpose is to handle existing broken HOT chains and to avoid any new
broken chains being created while we are building the index. Without
this additional step, while we are building the index in the first phase
using a snapshot, it's possible that a broken HOT chain may be created
growing from the tuple that we index. For example:
Consider a table with two columns, a and b. Say we already have an index
on 'a' and are creating a new index on 'b'. Say there is a tuple
(0, 1): (1, 11) in the table, where (0, 1) is the line pointer (TID) of
the tuple and (1, 11) are the values of a and b.
This is the visible tuple when we took the snapshot in the first phase,
and so we shall index this tuple with key 11 and TID (0, 1). But before
the first phase finishes, the tuple is updated to (0, 2): (1, 22). This
would be a HOT update resulting in a broken chain. The root line pointer
for the new tuple is still (0, 1). CIC now waits for the updating
transaction, takes a fresh snapshot and starts the second phase. In the
second phase, it must index the new tuple (1, 22) using the key 22 and
TID (0, 1), because that's the root line pointer for the chain. But
there is already an index entry in the new index with key 11 and TID
(0, 1). Inserting another index entry with a different key and the same
TID would certainly corrupt the index.
We solve this issue by first creating a catalog entry for the new index
and committing the transaction. We then wait out any conflicting
transactions before taking the snapshot for building the index in the
first phase. This arrangement guarantees that the snapshot used in the
first phase always satisfies the tuples at the heads of the existing HOT
chains. At the same time, no new broken HOT chains are created while we
are building the index, because the new index (even though not yet
ready) is consulted for all subsequent UPDATEs.
Hope this helps.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On 9/12/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
VACUUM
------
There is little change to regular vacuum. It removes dead HOT tuples,
like pruning does, and cleans up any redirected dead line pointers.
In lazy vacuum, we must not freeze a tuple that is in the middle of an
update chain. That can happen when a tuple has xmin > xmax; it is the
same scenario that requires "hard pruning" in VACUUM FULL. Freezing such
tuples will break the check that xmin and xmax match when following
the chain. It is not a problem without HOT, because the preceding tuple
in the chain must be dead as well so no one will try to follow the
chain, but with HOT the preceding tuple would be DEAD_CHAIN, and someone
might still need to follow the chain to find the live tuple. We avoid
that by just not freezing such tuples. They can be frozen eventually,
when the xmax of the preceding tuple is < OldestXmin as well.
XXX doesn't the above completely break the anti-wraparound guarantees?
And why couldn't we avoid the problem by pruning the previous tuple,
which is surely dead?
We preserve the anti-wraparound guarantees by not letting relfrozenxid
advance past the minimum of the cut-off xid and the xmin/xmax of any
not-yet-frozen tuples. Given that this is only required to address the
corner case of a DEAD tuple following a RECENTLY_DEAD (RD) tuple, the
final relfrozenxid would be very close to the cut-off xid. Isn't it?
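As a tiny illustration of that clamp (variable names are hypothetical;
only the xid comparison primitives are standard):

    /*
     * Illustrative sketch: whenever we leave a tuple unfrozen, make
     * sure relfrozenxid cannot advance past its xmin, and likewise its
     * xmax if that is a normal xid.
     */
    if (TransactionIdPrecedes(tuple_xmin, new_relfrozenxid))
        new_relfrozenxid = tuple_xmin;      /* clamp to unfrozen xmin */
    if (TransactionIdIsNormal(tuple_xmax) &&
        TransactionIdPrecedes(tuple_xmax, new_relfrozenxid))
        new_relfrozenxid = tuple_xmax;      /* clamp to unfrozen xmax */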
We could have actually pruned the preceding RD tuples (as we do in
vacuum full), but we were worried about missing some corner case where
someone may still want to follow the chain from the RD tuple. We don't
have any such concern with vacuum full because we run with an exclusive
lock on the table. But if we agree that there is no problem with pruning
an RD tuple preceding a DEAD tuple, we can actually prune that tuple as
well.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On 9/12/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
VACUUM FULL
-----------
To make vacuum full work, any DEAD tuples in the middle of an update
chain need to be removed (see comments at the top of
heap_prune_hotchain_hard() for details). Vacuum full performs a more
aggressive pruning that not only removes dead tuples at the beginning of
an update chain, but scans the whole chain and removes any intermediate
dead tuples as well. It also removes redirected line pointers by making
them directly point to the first tuple in the HOT chain. This causes
a user-visible change in the tuple's CTID, but since VACUUM FULL has
always moved tuple CTIDs, that should not break anything.
XXX any extra complexity here needs justification --- a lot of it.
We hard prune the chains and also clean up redirected line pointers
because doing so is safe within VACUUM FULL and it reduces additional
complexity in the actual VACUUM FULL work.
When we move tuples and tuple chains, we don't try to preserve their
HOT properties. So when tuples in a HOT chain are moved, we reset their
HEAP_ONLY_TUPLE and HEAP_HOT_UPDATED flags and each tuple gets its own
index entry. This requires some more bookkeeping in terms of the number
of indexed tuples expected, etc., because those counts are checked at
the end of the index scan.
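A small sketch of the flag reset just described; the two clear-macros
are assumed names for clearing the HEAP_ONLY_TUPLE and HEAP_HOT_UPDATED
bits, and may not match the patch's exact spelling:

    /*
     * Illustrative sketch: when VACUUM FULL moves a member of a HOT
     * chain, strip its HOT markings so the moved copy behaves as an
     * ordinary tuple with its own index entry.
     */
    static void
    reset_hot_flags(HeapTuple tuple)
    {
        HeapTupleClearHeapOnly(tuple);    /* reachable via its own TID again */
        HeapTupleClearHotUpdated(tuple);  /* successor gets its own index entry */
    }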
Statistics
----------
XXX: How do HOT-updates affect statistics? How often do we need to run
autovacuum and autoanalyze?
Autovacuum needs to run much less frequently with HOT. This is because
defragmentation reclaims dead space in a page, thus reducing the total
dead space in a table. Right now we don't update the FSM information
for the page after defragmenting it, so an UPDATE on a different page
can still cause relation extension even though there is free space on
some other page. The rationale for not updating the FSM is to let
subsequent UPDATEs on the page use the freed-up space. But one can
argue that we should let the free space be used for other
UPDATEs/INSERTs after leaving fillfactor worth of space.
Another significant change regarding autovacuum is that we now track
the total dead space in the table instead of the number of dead tuples.
This seems like a better approach because it takes varying tuple sizes
into account. The tracked dead space is increased whenever we update or
delete a tuple (or an insert is aborted) and reduced when a page is
defragmented. autovacuum_vacuum_scale_factor considers the percentage
of dead space relative to the size of the relation, whereas
autovacuum_vacuum_threshold considers the absolute amount of dead space
in terms of blocks.
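Spelled out as a sketch (parameter names are assumptions and the
patch's actual accounting may differ), the trigger condition looks
like:

    /*
     * Illustrative sketch of the dead-space based autovacuum trigger:
     * vacuum when tracked dead space exceeds a fixed number of blocks
     * plus a fraction of the relation size.
     */
    static bool
    relation_needs_vacuum(BlockNumber rel_blocks, double dead_space,
                          int vac_threshold_blocks, double vac_scale_factor)
    {
        double      rel_size  = (double) rel_blocks * BLCKSZ;
        double      threshold = (double) vac_threshold_blocks * BLCKSZ +
                                vac_scale_factor * rel_size;

        return dead_space > threshold;
    }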
Every UPDATE (HOT or COLD) contributes to the autoanalyze stats, and
defragmentation/pruning has no effect on autoanalyze. IOW, autoanalyze
would work just the way it does today. One change that is worth
mentioning and discussing is that we don't follow HOT chains while
fetching tuples during autoanalyze, so autoanalyze would consider all
such tuples as DEAD. In the worst case, when all the tuples in the
table are reachable via redirected line pointers, this would confuse
autoanalyze since it would consider every tuple in the table DEAD.
I think we should change this to follow HOT chains in analyze. Since we
fetch using SnapshotNow, if there is a live tuple at the end of the
chain, analyze would use that. Otherwise the tuple is considered DEAD.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > On 9/12/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> XXX the above being the case, why should we bother with all this extra >> mechanism? Seems we could dispense with "not ready for inserts". > Without this additional step, while we are building the index in the first > phase > using a snapshot, its possible that a broken HOT chain may get created > forwarding to the tuple that we index. For example: > A table with two columns a and b. Say, we already have an index on 'a' > and creating a new index on 'b'. Say, there is a tuple (0, 1): (1, 11) in > the table where (0, 1) in the line pointer of the tuple. > This is the visible tuple when we took snapshot in the first phase and so > we shall index this tuple with key 11 and TID (0, 1). But before the first > phase > finishes, the tuple is updated to (0, 2): (1, 22). This would be a HOT > update > resulting in a broken chain. The root line pointer for the new tuple is > still (0, 1). Got it. I'll put something about this into the README. regards, tom lane
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > On 9/12/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> XXX doesn't the above completely break the anti-wraparound guarantees? >> And why couldn't we avoid the problem by pruning the previous tuple, >> which is surely dead? > We preserve anti-wraparound guarantees by not letting the relfrozenxid > advance past the minimum of cut-off xid and xmin/xmax of not-yet-frozen > tuples. Given that this is required to address corner case of DEAD tuple > following a RD tuple, the final relfrozenxid would be very close to the > cut-off xid. Isn't it ? > We could have actually pruned the preceding RD tuples (as we do in > vacuum full), but we were worried about missing some corner case > where someone may still want to follow the chain from the RD tuple. This seems all wrong to me. We'd only be considering freezing a tuple whose xmin precedes the global xmin. If it has a predecessor, that predecessor has xmax equal to this one's xmin, therefore also preceding global xmin, therefore it would be seen as DEAD not RECENTLY_DEAD. So we should never need to freeze a tuple that isn't the start of its HOT chain. Also, if you find a DEAD tuple after a RECENTLY_DEAD one, you can certainly prune both, because what this tells you is that both tuples are in fact dead to all observers. regards, tom lane
On 9/12/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This seems all wrong to me. We'd only be considering freezing a tuple
whose xmin precedes the global xmin. If it has a predecessor, that
predecessor has xmax equal to this one's xmin, therefore also preceding
global xmin, therefore it would be seen as DEAD not RECENTLY_DEAD.
So we should never need to freeze a tuple that isn't the start of its
HOT chain.
Hmm, what you are saying is right. I fail to recollect any other
scenario that forced me to think freezing is a problem with HOT.
Also, if you find a DEAD tuple after a RECENTLY_DEAD one, you can
certainly prune both, because what this tells you is that both tuples
are in fact dead to all observers.
I agree. I ran a long test with this change and there doesn't seem to
be any issue. So let's prune everything, including the latest DEAD
tuple. That would let us take out the changes related to freezing. I
also think that should let us remove the DEAD_CHAIN concept, but let me
check.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On 9/12/07, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
One change that is worth mentioning
and discussing is that we don't follow HOT chains while fetching tuples during
autoanalyze and autoanalyze would consider all such tuples as DEAD.
In the worst case when all the tuples in the table are reachable via
redirected line pointers, this would confuse autoanalyze since it would
consider all tuples in the table as DEAD.
This is all crap. I was under the impression that heap_release_fetch()
would never fetch a HOT tuple directly, but that's not true: ANALYZE
fetches all tuples in the chosen block, so it would ultimately fetch
the visible tuple. A tuple is counted DEAD only if its t_data is set to
non-NULL. For a redirected line pointer, heap_release_fetch() will set
t_data to NULL and hence these line pointers are (rightly) not counted
as DEAD. This is the right thing to do because lazy vacuum cannot
remove redirected line pointers.
I think we should change this to follow HOT chains in analyze.
We need not follow the chain, but we should check for redirect-dead line
pointers and count them as dead rows.
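A sketch of that adjustment; the test macro for a redirect-dead stub is
an assumed name, since the patch may spell its item-pointer markings
differently:

    /*
     * Illustrative sketch: while scanning a block, also count
     * redirect-dead line pointer stubs as dead rows, since they still
     * hold index entries that vacuum must eventually reclaim.
     */
    static int
    count_redirect_dead(Page page)
    {
        OffsetNumber    offnum;
        OffsetNumber    maxoff = PageGetMaxOffsetNumber(page);
        int             deadrows = 0;

        for (offnum = FirstOffsetNumber; offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            ItemId      itemid = PageGetItemId(page, offnum);

            if (ItemIdIsRedirectDead(itemid))   /* assumed macro name */
                deadrows++;                     /* stub counts as a dead row */
        }
        return deadrows;
    }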
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com