Thread: HOT synced with HEAD

HOT synced with HEAD

From
Tom Lane
Date:
Attached is as far as I've gotten with reviewing the HOT patch; I see
that Pavan is still fooling with it so we'd better re-sync.

* Applies against CVS HEAD, passes regression tests except for 2
  expected failures (CREATE INDEX CONCURRENTLY failure behavior change
  and pgstat views changes)

* pg_index rearranged to use xmin as the index age column, avoiding
  need for a special hack to freeze a custom column

* Many API cleanups, in particular I didn't like what had been done
  to heap_fetch and thought it better to split out the chain-following
  logic

* Corner-case bug fixes, comment improvements, etc

There's still a lot of code in it that I haven't read ...

            regards, tom lane


Attachment

Re: HOT synced with HEAD

From
"Pavan Deolasee"
Date:


On 9/16/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Attached is as far as I've gotten with reviewing the HOT patch; I see
that Pavan is still fooling with it so we'd better re-sync.


I am done with most of the items on my plate. I was not sure how
far you had gone with the patch, so I was trying to finish as many
items as possible myself. Now that I know you have started refactoring
the code, I will try not to change it unless it's urgent. And when it is
required, I will send an add-on patch just the way I did earlier.

Let me know if you want me to focus on something at this point.
 


* Many API cleanups, in particular I didn't like what had been done
  to heap_fetch and thought it better to split out the chain-following
  logic


I liked those changes. I am wondering if we could have avoided
duplicating the chain-following code in heap_hot_search_buffer and
index_getnext. But I trust your judgment.

I also liked the way you reverted the API changes to various index build
methods.

I will test the patch in detail tomorrow.

Thanks,
Pavan


--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT synced with HEAD

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> I liked those changes. I am wondering if we could have avoided
> duplicating the chain following code in heap_hot_search_buffer and
> index_getnext.

I wasn't totally thrilled with duplicating that code, but the callers of
heap_hot_search are only interested in finding a single visible tuple,
whereas index_getnext has to find them all on successive visits.
Exposing a capability to do that in heap_hot_search seemed to make its
API ugly enough that duplicated code was better.  (I really didn't like
the original setup where it was supposed to find the next tuple after
the one passed in, thereby forcing the caller to handle the first
chain member...)
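
For illustration, the single-visible-tuple flavor of that loop has roughly
the shape below.  This is a simplified sketch using the usual backend macros
(the function name is made up, and locking/error details are omitted); it is
not the patch's actual code:

    static bool
    hot_chain_find_visible(Page page, OffsetNumber offnum,
                           Snapshot snapshot, Buffer buffer,
                           HeapTuple heapTuple)
    {
        TransactionId prev_xmax = InvalidTransactionId;

        for (;;)
        {
            ItemId          lp = PageGetItemId(page, offnum);
            HeapTupleHeader htup;

            /* a pruned chain root becomes a redirect; just follow it */
            if (ItemIdIsRedirected(lp))
            {
                offnum = ItemIdGetRedirect(lp);
                continue;
            }
            if (!ItemIdIsUsed(lp))
                return false;       /* chain member already reaped */

            htup = (HeapTupleHeader) PageGetItem(page, lp);

            /* chain sanity check: xmin must match the prior member's xmax */
            if (TransactionIdIsValid(prev_xmax) &&
                !TransactionIdEquals(prev_xmax, HeapTupleHeaderGetXmin(htup)))
                return false;

            heapTuple->t_data = htup;
            if (HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer))
                return true;        /* caller only wants one visible tuple */

            if (!HeapTupleHeaderIsHotUpdated(htup))
                return false;       /* end of the HOT chain */

            prev_xmax = HeapTupleHeaderGetXmax(htup);
            offnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
        }
    }

index_getnext, by contrast, has to remember where it stopped and resume from
that point on its next call, which is what makes a shared helper awkward.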

BTW, I'm in process of taking out the separate HEAPTUPLE_DEAD_CHAIN
return value from HeapTupleSatisfiesVacuum.  After looking through
all the callers I concluded that this complicated life more than it
helped.  The places that actually want to make the distinction can
check HeapTupleIsHotUpdated easily enough for themselves, and in
several of the places that don't want to make the distinction it's
notationally cumbersome to deal with it.  For instance, in the patch
as I posted it last night, both heap_hot_search and index_getnext
probably fail to detect all-dead HOT chains, because some of the
chain members are likely to be HEAPTUPLE_DEAD_CHAIN rather than plain
DEAD, and they're not coping.
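
To make that concrete, a caller that formerly looked for HEAPTUPLE_DEAD_CHAIN
can make the distinction itself along these lines (a sketch only; OldestXmin
stands in for whatever xmin horizon the caller already has):

    HTSV_Result res;

    res = HeapTupleSatisfiesVacuum(tuple->t_data, OldestXmin, buffer);
    if (res == HEAPTUPLE_DEAD)
    {
        if (HeapTupleIsHotUpdated(tuple))
        {
            /* dead, but it was HOT-updated: later chain members may still
             * be live, so don't treat it as independently removable */
        }
        else
        {
            /* plain dead tuple, handled the same as before */
        }
    }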

            regards, tom lane

Re: HOT synced with HEAD

From
"Pavan Deolasee"
Date:


On 9/16/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:


BTW, I'm in process of taking out the separate HEAPTUPLE_DEAD_CHAIN
return value from HeapTupleSatisfiesVacuum.


I agree. I myself suggested doing so earlier in the discussion. (I actually
even removed this before I sent out the add-on patch last night, but then
reverted it because I realized it is required in at least one place.)
The place where I thought it's required is to avoid marking an index tuple dead
even though the corresponding root tuple is dead and the root tuple was
HOT updated. But it seems you have already put in a different mechanism
to handle that. So we should be able to get rid of HEAPTUPLE_DEAD_CHAIN.

Thanks,
Pavan




--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT synced with HEAD

From
Tom Lane
Date:
BTW, I am still looking for a reason for the hard-prune logic to live.
It seems to complicate matters far more than it's worth --- in
particular the way that the WAL replay representation is set up seems
confusing and fragile.  (If prune_hard is set, the "redirect" entries
mean something completely different.)  There was some suggestion that
VACUUM FULL has to have it, but unless I see proof of that I'm thinking
of taking it out.

            regards, tom lane

Re: HOT synced with HEAD

From
"Pavan Deolasee"
Date:


On 9/16/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
BTW, I am still looking for a reason for the hard-prune logic to live.
It seems to complicate matters far more than it's worth --- in
particular the way that the WAL replay representation is set up seems
confusing and fragile.  (If prune_hard is set, the "redirect" entries
mean something completely different.)  There was some suggestion that
VACUUM FULL has to have it, but unless I see proof of that I'm thinking
of taking it out.


The notion of hard prune has changed since we recently decided to always
prune the chain up to and including the latest DEAD tuple (which includes
the preceding RECENTLY_DEAD tuples). Earlier, hard_prune meant pruning
up to the latest DEAD tuple. The only extra thing hard_prune now does is
clear the redirected line pointers (I hope I have fixed the comments to
reflect this change; my apologies if I haven't).

But that seems like a worthwhile thing to do to me: one, because it's the
cheapest (and maybe the easiest) way of getting rid of redirected line
pointers, and two, because it helps with the book-keeping in VACUUM
FULL. You are already finding that complex; with the redirected line
pointers it might be even more complex.

If you are worried about the WAL part, maybe we can fix that some
other way.

Thanks,
Pavan


--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT synced with HEAD

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On 9/16/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, I'm in process of taking out the separate HEAPTUPLE_DEAD_CHAIN
>> return value from HeapTupleSatisfiesVacuum.
>>
> I agree. I myself suggested doing so earlier in the discussion. (I actually
> even removed this before I sent out the add-on patch last night, but then
> reverted it because I realized it is required in at least one place.)
> The place where I thought it's required is to avoid marking an index tuple
> dead even though the corresponding root tuple is dead and the root tuple was
> HOT updated. But it seems you have already put in a different mechanism
> to handle that. So we should be able to get rid of HEAPTUPLE_DEAD_CHAIN.

Yeah, actually this depends in part on having HeapTupleIsHotUpdated
include a check on XMIN_INVALID; otherwise testing that wouldn't be a
full substitute for what tqual.c had been doing.
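
Roughly, the macro then takes the shape below (a sketch of the idea, not
necessarily the exact final definition):

    #define HeapTupleHeaderIsHotUpdated(tup) \
    ( \
        ((tup)->t_infomask2 & HEAP_HOT_UPDATED) != 0 && \
        ((tup)->t_infomask & HEAP_XMIN_INVALID) == 0 \
    )

so a tuple whose inserter is known to have aborted is never reported as
HOT-updated, and chain following stops at it.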

Something else I was just looking at: in the pruning logic, SetRedirect
and SetDead operations are done at the same time that we record them for
the eventual WAL record creation, but SetUnused operations are
postponed and only done at the end.  This seems a bit odd/nonorthogonal.
Furthermore it's costing extra code complexity: if you were to SetUnused
immediately then you wouldn't need that bitmap thingy to prevent you
from doing it twice.  I think that the reason it's like that is probably
because of the problem of potential multiple visits to a DEAD heap-only
tuple, but it looks to me like that's not really an issue given the
current form of the testing for aborted tuples, which I have as

        if (HeapTupleSatisfiesVacuum(tuple->t_data, global_xmin, buffer)
            == HEAPTUPLE_DEAD && !HeapTupleIsHotUpdated(tuple))
            heap_prune_record_unused(nowunused, nunused, unusedbitmap,
                                     rootoffnum);

ISTM that if HeapTupleIsHotUpdated is false, we could simply SetUnused
immediately, because any potential chain members after this one must
be dead too, and they'll get reaped by this same bit of code --- there
is no need to preserve the chain.  (The only case we're really covering
here is XMIN_INVALID, and any later chain members must certainly be
XMIN_INVALID as well.)  When the HOT-chain is later followed, we'll
detect chain break anyway, so I see no need to postpone clearing the
item pointer.
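
In other words, the quoted test could clear the item on the spot, something
like the sketch below (the bookkeeping variables are schematic, matching the
snippet above rather than any final code):

        if (HeapTupleSatisfiesVacuum(tuple->t_data, global_xmin, buffer)
            == HEAPTUPLE_DEAD && !HeapTupleIsHotUpdated(tuple))
        {
            ItemId      lp = PageGetItemId(page, rootoffnum);

            ItemIdSetUnused(lp);                /* immediate SetUnused */
            nowunused[nunused++] = rootoffnum;  /* still recorded for WAL */
        }

and the separate unusedbitmap goes away entirely.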

            regards, tom lane

Re: HOT synced with HEAD

From
"Pavan Deolasee"
Date:


On 9/16/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Something else I was just looking at: in the pruning logic, SetRedirect
and SetDead operations are done at the same time that we record them for
the eventual WAL record creation, but SetUnused operations are
postponed and only done at the end.  This seems a bit odd/nonorthogonal.
Furthermore it's costing extra code complexity: if you were to SetUnused
immediately then you wouldn't need that bitmap thingy to prevent you
from doing it twice.  I think that the reason it's like that is probably
because of the problem of potential multiple visits to a DEAD heap-only
tuple, but it looks to me like that's not really an issue given the
current form of the testing for aborted tuples, which I have as

        if (HeapTupleSatisfiesVacuum(tuple->t_data, global_xmin, buffer)
            == HEAPTUPLE_DEAD && !HeapTupleIsHotUpdated(tuple))
            heap_prune_record_unused(nowunused, nunused, unusedbitmap,
                                     rootoffnum);

ISTM that if HeapTupleIsHotUpdated is false, we could simply SetUnused
immediately, because any potential chain members after this one must
be dead too, and they'll get reaped by this same bit of code --- there
is no need to preserve the chain.  (The only case we're really covering
here is XMIN_INVALID, and any later chain members must certainly be
XMIN_INVALID as well.)  When the HOT-chain is later followed, we'll
detect chain break anyway, so I see no need to postpone clearing the
item pointer.



So are you suggesting we go back to the earlier way of handling
aborted tuples separately? But then we cannot do that simply by
checking for !HeapTupleIsHotUpdated: there could be several aborted
tuples at the end of the chain, of which all but one are marked HotUpdated.
Or are you suggesting we also check for XMIN_INVALID to detect
aborted tuples?

If we handle aborted tuples differently, then we don't need that extra
bitmap and can in fact set the line pointer unused immediately. Otherwise,
as you rightly pointed out, we might break a chain before we prune
it.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT synced with HEAD

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> So are you suggesting we go back to the earlier way of handling
> aborted tuples separately? But then we cannot do that simply by
> checking for !HeapTupleIsHotUpdated: there could be several aborted
> tuples at the end of the chain, of which all but one are marked HotUpdated.
> Or are you suggesting we also check for XMIN_INVALID to detect
> aborted tuples?

Yeah.  As the code stands, anything that's XMIN_INVALID will be
considered not-HotUpdated (look at the macro...).  So far I've seen no
place where there is any value in following a HOT chain past such a
tuple --- do you see any?  Every descendant tuple must be XMIN_INVALID
as well ...

            regards, tom lane

Re: HOT synced with HEAD

From
Tom Lane
Date:
Hmm ... so all that logic to prune just one tuple chain is dead code,
because heap_page_prune_defrag() ignores its pruneoff argument and
always passes InvalidOffsetNumber down to heap_page_prune().

While this is certainly a fairly trivial bug, it makes me wonder whether
the prune-one-chain logic has ever been active and whether there is any
real evidence for having it at all.  Was this error introduced in some
recent refactoring, or has it always been like that?  Given the way that
the logic works now, in particular that we always insist on doing
defrag, what point is there in not pruning all the chains when we can?

            regards, tom lane

Re: HOT synced with HEAD

From
"Pavan Deolasee"
Date:


On 9/17/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hmm ... so all that logic to prune just one tuple chain is dead code,
because heap_page_prune_defrag() ignores its pruneoff argument and
always passes InvalidOffsetNumber down to heap_page_prune().

While this is certainly a fairly trivial bug, it makes me wonder whether
the prune-one-chain logic has ever been active and whether there is any
real evidence for having it at all.  Was this error introduced in some
recent refactoring, or has it always been like that?


This was introduced in the recent refactoring (the whole discussion of
pruning/defragmentation started after that post). Earlier we used to prune
a single chain during index scans. In fact, I intentionally left the code that
way because we are still debating the issue and I wanted to keep the
infrastructure there even if there are no users of it right now.
 

Given the way that
the logic works now, in particular that we always insist on doing
defrag, what point is there in not pruning all the chains when we can?


Yeah, unless we separate pruning and defragmentation, there is
no real value in single-chain pruning. Defragmentation is a costly
operation and we would want to prune as many chains as possible before
calling it.
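
Schematically, the flow we want is the one sketched below (illustrative only,
not the patch's code; PageRepairFragmentation stands in for whatever
defragmentation step we settle on): visit every root line pointer, prune its
chain, and pay the defragmentation cost once at the end:

    OffsetNumber offnum,
                 maxoff = PageGetMaxOffsetNumber(page);

    for (offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      lp = PageGetItemId(page, offnum);

        if (ItemIdIsUsed(lp))
        {
            /* prune the HOT chain rooted at this line pointer */
        }
    }

    /* one pass of defragmentation for the whole page */
    PageRepairFragmentation(page);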

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT synced with HEAD

From
"Pavan Deolasee"
Date:


On 9/17/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah.  As the code stands, anything that's XMIN_INVALID will be
considered not-HotUpdated (look at the macro...).  So far I've seen no
place where there is any value in following a HOT chain past such a
tuple --- do you see any? 

No, I don't think we would ever need to follow a HOT chain past
an aborted tuple. The only thing that worries me about this setup, though,
is the dependency on hint bits being set properly. But the places
where this would be used right now for detecting aborted dead tuples
apply HeapTupleSatisfiesVacuum on the tuple before checking
HeapTupleIsHotUpdated, so we are fine. Or should we just check
for XMIN_INVALID explicitly at those places?


Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: HOT synced with HEAD

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> No, I don't think we would ever need to follow a HOT chain past
> an aborted tuple. The only thing that worries me about this setup, though,
> is the dependency on hint bits being set properly. But the places
> where this would be used right now for detecting aborted dead tuples
> apply HeapTupleSatisfiesVacuum on the tuple before checking
> HeapTupleIsHotUpdated, so we are fine.

Practically all the places that check that have just done a tqual.c
test, so they can count on the INVALID bits to be up-to-date.  If not,
it's still OK, it just means that they might uselessly advance to the
next (former) chain member.  There is always a race condition in these
sorts of things: for instance, a tuple could go from INSERT_IN_PROGRESS
to DEAD at any instant, if its inserting transaction rolls back.  So you
have to have adequate defenses in place anyway, like the xmin/xmax
comparison.

> Or should we just check for XMIN_INVALID explicitly at those places ?

I went back and forth on that, but on balance a single macro seems better.

Meanwhile I've started looking at the vacuum code, and it seems that v16
has made that part of the patch significantly worse.  VACUUM will fail
to count tuples that are removed by pruning, which seems like something
it should report somehow.  And you've introduced a race condition: as
I just mentioned, it's perfectly possible that the second call of
HeapTupleSatisfiesVacuum gets a different answer than what the prune
code saw, especially in lazy VACUUM (in VACUUM FULL it'd suggest that
someone released lock early ... but we do have to cope with that).

The comments you added seem to envision a more invasive patch that gets
rid of the second HeapTupleSatisfiesVacuum pass altogether, but I'm not
sure how practical that is, and am not real inclined to try to do it
right now anyway ...

            regards, tom lane

Re: HOT synced with HEAD

From
"Pavan Deolasee"
Date:


On 9/17/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:


Meanwhile I've started looking at the vacuum code, and it seems that v16
has made that part of the patch significantly worse.  VACUUM will fail
to count tuples that are removed by pruning, which seems like something
it should report somehow.

I understand. I did not give real weight to it because I thought we
remove tuples elsewhere during pruning anyway. But I agree
that heap_page_prune_defrag in the VACUUM code path
is doing so on behalf of VACUUM, and hence we should credit
that to VACUUM.


And you've introduced a race condition: as
I just mentioned, it's perfectly possible that the second call of
HeapTupleSatisfiesVacuum gets a different answer than what the prune
code saw, especially in lazy VACUUM (in VACUUM FULL it'd suggest that
someone released lock early ... but we do have to cope with that).


Hmm.. you are right. Those extra notices I added are completely
unnecessary and wrong.
 

The comments you added seem to envision a more invasive patch that gets
rid of the second HeapTupleSatisfiesVacuum pass altogether, but I'm not
sure how practical that is, and am not real inclined to try to do it
right now anyway ...

I agree. I just wanted to leave a hint there that such a possibility exists
if someone really wants to optimize, now or later.

Thanks,
Pavan
 

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com