Thread: HOT WIP Patch - version 2

HOT WIP Patch - version 2

From
"Pavan Deolasee"
Date:

Reposting - looks like the message did not get through in the first
attempt. My apologies if multiple copies are received.


This is the next version of the HOT WIP patch. Since the last patch that
I sent out, I have implemented the HOT-update chain pruning mechanism.

When following a HOT-update chain from an index fetch, if we notice
that the root tuple is dead and HOT-updated, we try to prune the
chain to the smallest possible length. To do that, the share lock is
upgraded to an exclusive lock and the tuple chain is followed until
we find a live or recently-dead tuple. At that point, the root t_ctid
is made to point to that tuple. To preserve the xmax/xmin matching of
the chain, the xmax of the root tuple is also set to the xmin of the
found tuple. Since this xmax is still < RecentGlobalXmin and refers
to a committed transaction, the visibility of the root tuple remains
unchanged.

The intermediate heap-only tuples are removed from the HOT-update
chain. The HOT-updated status of these tuples is cleared and their
respective t_ctid pointers are made to point to themselves. These
tuples are no longer reachable and are ready for vacuuming. This
entire action is logged in a single WAL record.
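For concreteness, here is a rough sketch of the pruning step. This is
illustrative only, not the patch code verbatim: HEAP_HOT_UPDATED
stands for whichever infomask bit the patch uses, and the WAL logging
and hint-bit bookkeeping are omitted.

#include "postgres.h"
#include "access/htup.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/tqual.h"

/*
 * Sketch: prune the HOT-update chain rooted at rootoff down to its
 * first live/recently-dead member.  The caller has already upgraded
 * to an exclusive lock on the buffer and checked that the root is
 * DEAD and HOT-updated.
 */
static void
prune_hot_chain(Buffer buffer, OffsetNumber rootoff,
                TransactionId OldestXmin)
{
    Page            page = BufferGetPage(buffer);
    BlockNumber     blkno = BufferGetBlockNumber(buffer);
    HeapTupleHeader root;
    HeapTupleHeader tup;
    OffsetNumber    curoff = rootoff;

    root = (HeapTupleHeader) PageGetItem(page,
                                         PageGetItemId(page, rootoff));
    tup = root;

    /* Walk past every DEAD, HOT-updated member of the chain. */
    while ((tup->t_infomask & HEAP_HOT_UPDATED) != 0 &&
           HeapTupleSatisfiesVacuum(tup, OldestXmin,
                                    buffer) == HEAPTUPLE_DEAD)
    {
        OffsetNumber nextoff = ItemPointerGetOffsetNumber(&tup->t_ctid);

        if (curoff != rootoff)
        {
            /*
             * Intermediate heap-only tuple: clear its HOT-updated
             * status and make its t_ctid point to itself, taking it
             * out of the chain for good.
             */
            tup->t_infomask &= ~HEAP_HOT_UPDATED;
            ItemPointerSet(&tup->t_ctid, blkno, curoff);
        }
        curoff = nextoff;
        tup = (HeapTupleHeader) PageGetItem(page,
                                            PageGetItemId(page, curoff));
    }

    /*
     * tup is now the first live/recently-dead tuple.  Re-point the
     * root at it and carry its xmin up as the root's xmax, so the
     * xmax/xmin matching along the chain is preserved.  The new xmax
     * is still < RecentGlobalXmin and committed, so the root's
     * visibility does not change.
     */
    ItemPointerSet(&root->t_ctid, blkno, curoff);
    HeapTupleHeaderSetXmax(root, HeapTupleHeaderGetXmin(tup));

    /* A single WAL record describing the whole action goes here. */
    MarkBufferDirty(buffer);
}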

During vacuuming, we keep track of the number of root tuples
vacuumed. If this count is zero, the index cleanup step is skipped.
This avoids unnecessary index scans whenever possible.
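In lazy-vacuum terms this is just a guard in front of the index pass;
roughly (a sketch: "num_root_tuples_removed" is an illustrative field
name and the signatures are simplified):

/*
 * Sketch: the guard before the index-cleanup pass, as it might
 * appear inside lazy vacuum's heap scan.  Dead heap-only tuples have
 * no index entries, so when no root tuples were removed there is
 * nothing to scan the indexes for.
 */
if (vacrelstats->num_root_tuples_removed > 0)
{
    for (i = 0; i < nindexes; i++)
        lazy_vacuum_index(Irel[i], &indstats[i], vacrelstats);
}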

This patch should apply cleanly to current CVS HEAD and pass all
regression tests. I am still looking for review comments on the first
WIP patch. If anyone has already looked through it and is interested
in the incremental changes, please let me know and I can post them.

What's Next?
-----------------

ISTM that basic HOT updates and the ability to prune the HOT-update
chain should help us reduce index bloat, limit the overhead of
ctid-following in index fetches, and efficiently vacuum heap-only
tuples. IMO the next important, but less troublesome, thing to tackle
is reusing space within a block without a complete vacuum of the
table. This would let us do many more HOT updates and thus further
reduce index/heap bloat.

I am thinking of reusing the DEAD heap-only tuples which get removed
from the HOT-update chain as part of the pruning operation. Since
these tuples, once removed from the chain, are neither reachable nor
referenced from any index, they can readily be reused for storing
newer versions of the same or other rows in the block. How about
setting LP_DELETE on these tuples as part of the prune operation?
LP_DELETE is unused for heap tuples, if I am not mistaken. Other
information, like length and offset, is maintained as-is.
When we run out of space for an update-within-the-block, we traverse
all the line pointers looking for LP_DELETEd items. If any of these
items has space large enough to store the new tuple, that item is
reused. Does anyone see any issue with doing this? Also, any
suggestions for doing it a better way?
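The traversal I have in mind is roughly this (a sketch assuming the
existing line-pointer layout, where lp_flags carries LP_USED/LP_DELETE
and lp_len the stored length; find_reusable_slot is a made-up name):

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"
#include "storage/off.h"

/*
 * Sketch: scan the page's line pointers for an LP_DELETEd slot big
 * enough to hold a new tuple of length newlen.
 */
static OffsetNumber
find_reusable_slot(Page page, Size newlen)
{
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    OffsetNumber offnum;

    for (offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId  lp = PageGetItemId(page, offnum);

        /* LP_DELETE was set at prune time; lp_len was kept as-is */
        if (ItemIdDeleted(lp) && ItemIdGetLength(lp) >= newlen)
            return offnum;      /* reuse this slot */
    }

    return InvalidOffsetNumber; /* no luck; fall back to normal path */
}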

If the page gets really fragmented, we can try to grab a
VACUUM-strength lock on the page and defragment it. The lock is taken
conditionally to avoid any deadlocks. This is done in the
heap_update() code path, so it would add some overhead, but may still
prove better than putting the tuple in a different block and doing
the corresponding index inserts. Also, since we are more concerned
about large tables, the chances of being able to upgrade the
exclusive lock to a vacuum-strength lock are high. Comments?
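Roughly, in heap_update() (a sketch: ConditionalLockBufferForCleanup()
is the conditional variant of LockBufferForCleanup() that this would
need to add, and the PageRepairFragmentation() call is simplified):

/*
 * Sketch: no room left for an in-page update, so conditionally try
 * a vacuum-strength lock and defragment.  A conditional acquire
 * returns false instead of waiting, so it cannot deadlock.
 */
if (ConditionalLockBufferForCleanup(buffer))
{
    /* nobody else has the page pinned: safe to move tuples around */
    PageRepairFragmentation(BufferGetPage(buffer));
    MarkBufferDirty(buffer);
    /* ... retry placing the new tuple version on this page ... */
}
else
{
    /*
     * Could not get the cleanup lock without waiting: fall back to
     * a COLD update into another block, with the usual index
     * inserts.
     */
}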

If there are no objections, I am planning to work on the first part
while Nikhil would take up the second task of block-level
retail-vacuum. Your comments on these issues and the patch are really
appreciated.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: [HACKERS] HOT WIP Patch - version 2

From
Hannu Krosing
Date:
On Tuesday, 2007-02-20 at 12:08, Pavan Deolasee wrote:
> Reposting - looks like the message did not get through in the first
> attempt. My apologies if multiple copies are received.
>
> This is the next version of the HOT WIP patch. Since the last patch
> that I sent out, I have implemented the HOT-update chain pruning
> mechanism.
>
> When following a HOT-update chain from an index fetch, if we notice
> that the root tuple is dead and HOT-updated, we try to prune the
> chain to the smallest possible length. To do that, the share lock is
> upgraded to an exclusive lock and the tuple chain is followed until
> we find a live or recently-dead tuple. At that point, the root
> t_ctid is made to point to that tuple. To preserve the xmax/xmin
> matching of the chain, the xmax of the root tuple is also set to the
> xmin of the found tuple. Since this xmax is still < RecentGlobalXmin
> and refers to a committed transaction, the visibility of the root
> tuple remains unchanged.

What do you do if there are no live tuples on the page? Will this
un-HOTify the root and free all the other tuples in the HOT chain?

> The intermediate heap-only tuples are removed from the HOT-update
> chain. The HOT-updated status of these tuples is cleared and their
> respective t_ctid pointers are made to point to themselves. These
> tuples are no longer reachable and are ready for vacuuming.

Does this mean that they are now indistinguishable from ordinary
tuples?

Maybe they could be freed right away instead of changing the
HOT-updated status and ctid?

> This entire action is logged in a single WAL record.
>
> During vacuuming, we keep track of the number of root tuples
> vacuumed. If this count is zero, the index cleanup step is skipped.
> This avoids unnecessary index scans whenever possible.
>
> This patch should apply cleanly to current CVS HEAD and pass all
> regression tests. I am still looking for review comments on the
> first WIP patch. If anyone has already looked through it and is
> interested in the incremental changes, please let me know and I can
> post them.
>
> What's Next?
> -----------------
>
> ISTM that basic HOT updates and the ability to prune the HOT-update
> chain should help us reduce index bloat, limit the overhead of
> ctid-following in index fetches, and efficiently vacuum heap-only
> tuples. IMO the next important, but less troublesome, thing to
> tackle is reusing space within a block without a complete vacuum of
> the table. This would let us do many more HOT updates and thus
> further reduce index/heap bloat.
>
> I am thinking of reusing the DEAD heap-only tuples which get removed
> from the HOT-update chain as part of the pruning operation. Since
> these tuples, once removed from the chain, are neither reachable nor
> referenced from any index, they can readily be reused for storing
> newer versions of the same or other rows in the block. How about
> setting LP_DELETE on these tuples as part of the prune operation?
> LP_DELETE is unused for heap tuples, if I am not mistaken. Other
> information, like length and offset, is maintained as-is.

Seems like a good idea.

> When we run out of space for an update-within-the-block, we traverse
> all the line pointers looking for LP_DELETEd items. If any of these
> items has space large enough to store the new tuple, that item is
> reused. Does anyone see any issue with doing this? Also, any
> suggestions for doing it a better way?

IIRC the size is determined by the next tuple pointer, so you can
store new data without changing the tuple pointer only if it is
exactly the same size.

> If the page gets really fragmented, we can try to grab a
> VACUUM-strength lock on the page and defragment it. The lock is
> taken conditionally to avoid any deadlocks. This is done in the
> heap_update() code path, so it would add some overhead, but may
> still prove better than putting the tuple in a different block and
> doing the corresponding index inserts. Also, since we are more
> concerned about large tables, the chances of being able to upgrade
> the exclusive lock to a vacuum-strength lock are high. Comments?

I'm not sure about the "we are more concerned about the large tables"
part. I see it more as a device for high-update tables. This may not
always be the same as "large", so there should be some fallback for
the case where you can't get the lock. Maybe just give up and move to
another page?

> If there are no objections, I am planning to work on the first part
> while Nikhil would take up the second task of block-level
> retail-vacuum. Your comments on these issues and the patch are
> really appreciated.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



Re: [HACKERS] HOT WIP Patch - version 2

From
"Pavan Deolasee"
Date:

On 2/20/07, Hannu Krosing <hannu@skype.net> wrote:
> On Tuesday, 2007-02-20 at 12:08, Pavan Deolasee wrote:
>
> What do you do if there are no live tuples on the page? Will this
> un-HOTify the root and free all the other tuples in the HOT chain?

Yes. The HOT-updated status of the root and all intermediate tuples
is cleared and their respective ctid pointers are made to point to
themselves. The index entry will be marked LP_DELETE as in the normal
case. VACUUM can subsequently reclaim these tuples, along with the
index entry.
 
> > The intermediate heap-only tuples are removed from the HOT-update
> > chain. The HOT-updated status of these tuples is cleared and their
> > respective t_ctid pointers are made to point to themselves. These
> > tuples are no longer reachable and are ready for vacuuming.
>
> Does this mean that they are now indistinguishable from ordinary
> tuples?

No. The HEAP_ONLY_TUPLE flag is still set on these tuples, so you can
distinguish them.

> Maybe they could be freed right away instead of changing the
> HOT-updated status and ctid?

Yeah, that's a good idea. I am thinking of setting the LP_DELETE flag
on them while pruning. The tuple can then be reused for the next
in-page HOT update.



> > When we run out of space for an update-within-the-block, we
> > traverse all the line pointers looking for LP_DELETEd items. If
> > any of these items has space large enough to store the new tuple,
> > that item is reused. Does anyone see any issue with doing this?
> > Also, any suggestions for doing it a better way?
>
> IIRC the size is determined by the next tuple pointer, so you can
> store new data without changing the tuple pointer only if it is
> exactly the same size.

There is an lp_len field in the line pointer to store the length of
the tuple. ISTM that we can reduce that while reusing the line
pointer, but that would create a permanent hole in the page.
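Roughly (a sketch against the current ItemIdData layout, with lp
standing for the reused line pointer and newlen for the new tuple's
length):

Assert(newlen <= ItemIdGetLength(lp));
lp->lp_len = newlen;    /* the (oldlen - newlen) bytes beyond the new
                         * tuple stay stranded until the page is
                         * defragmented */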


> > Also, since we are more concerned about large tables, the chances
> > of being able to upgrade the exclusive lock to a vacuum-strength
> > lock are high. Comments?
>
> I'm not sure about the "we are more concerned about the large
> tables" part. I see it more as a device for high-update tables. This
> may not always be the same as "large", so there should be some
> fallback for the case where you can't get the lock. Maybe just give
> up and move to another page?


Oh, yes, I agree. The fallback option of doing a COLD update always
exists.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: [HACKERS] HOT WIP Patch - version 2

From
Bruce Momjian
Date:
Pavan Deolasee wrote:
> When following a HOT-update chain from an index fetch, if we notice
> that the root tuple is dead and HOT-updated, we try to prune the
> chain to the smallest possible length. To do that, the share lock is
> upgraded to an exclusive lock and the tuple chain is followed until
> we find a live or recently-dead tuple. At that point, the root
> t_ctid is made to point to that tuple.

I assume you meant recently-dead here, rather than live/recently-dead,
because we aren't going to change live ctids, right?

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [HACKERS] HOT WIP Patch - version 2

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> ... Yes. The HOT-updated status of the root and all intermediate
> tuples is cleared and their respective ctid pointers are made
> point to themselves.

Doesn't that destroy the knowledge that they form a tuple chain?
While it might be that no one cares any longer, it would seem more
reasonable to leave 'em chained together.

            regards, tom lane

Re: [HACKERS] HOT WIP Patch - version 2

From
"Pavan Deolasee"
Date:

On 2/20/07, Bruce Momjian <bruce@momjian.us> wrote:
> Pavan Deolasee wrote:
> > When following a HOT-update chain from an index fetch, if we
> > notice that the root tuple is dead and HOT-updated, we try to
> > prune the chain to the smallest possible length. To do that, the
> > share lock is upgraded to an exclusive lock and the tuple chain is
> > followed until we find a live or recently-dead tuple. At that
> > point, the root t_ctid is made to point to that tuple.
>
> I assume you meant recently-dead here, rather than
> live/recently-dead, because we aren't going to change live ctids,
> right?

No, I meant live or recently-dead (in fact, anything other than
HEAPTUPLE_DEAD or HEAPTUPLE_DEAD_CHAIN).

We are not changing the tids here, only pruning the HOT-update chain.
After pruning, root->t_ctid points to the oldest tuple that might be
visible to any backend. The live tuples are still identified by their
original tids and are still reachable from the index through the root
tuple.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: [HACKERS] HOT WIP Patch - version 2

From
Bruce Momjian
Date:
Pavan Deolasee wrote:
> On 2/20/07, Bruce Momjian <bruce@momjian.us> wrote:
> > Pavan Deolasee wrote:
> > > When following a HOT-update chain from an index fetch, if we
> > > notice that the root tuple is dead and HOT-updated, we try to
> > > prune the chain to the smallest possible length. To do that, the
> > > share lock is upgraded to an exclusive lock and the tuple chain
> > > is followed until we find a live or recently-dead tuple. At that
> > > point, the root t_ctid is made to point to that tuple.
> >
> > I assume you meant recently-dead here, rather than
> > live/recently-dead, because we aren't going to change live ctids,
> > right?
>
> No, I meant live or recently-dead (in fact, anything other than
> HEAPTUPLE_DEAD or HEAPTUPLE_DEAD_CHAIN).
>
> We are not changing the tids here, only pruning the HOT-update
> chain. After pruning, root->t_ctid points to the oldest tuple that
> might be visible to any backend. The live tuples are still
> identified by their original tids and are still reachable from the
> index through the root tuple.

I am confused.  Where is the root->t_ctid stored?

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [HACKERS] HOT WIP Patch - version 2

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Pavan Deolasee wrote:
>> When following a HOT-update chain from an index fetch, if we notice
>> that the root tuple is dead and HOT-updated, we try to prune the
>> chain to the smallest possible length. To do that, the share lock
>> is upgraded to an exclusive lock and the tuple chain is followed
>> until we find a live or recently-dead tuple. At that point, the
>> root t_ctid is made to point to that tuple.

> I assume you meant recently-dead here, rather than
> live/recently-dead, because we aren't going to change live ctids,
> right?

"Recently dead" means "still live to somebody", so those tids better not
change either.  But I don't think that's what he meant.  I'm more
worried about the deadlock possibilities inherent in trying to upgrade a
buffer lock.  We do not have deadlock detection for LWLocks.

            regards, tom lane

Re: [HACKERS] HOT WIP Patch - version 2

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Pavan Deolasee wrote:
> >> When following a HOT-update chain from an index fetch, if we
> >> notice that the root tuple is dead and HOT-updated, we try to
> >> prune the chain to the smallest possible length. To do that, the
> >> share lock is upgraded to an exclusive lock and the tuple chain
> >> is followed until we find a live or recently-dead tuple. At that
> >> point, the root t_ctid is made to point to that tuple.
>
> > I assume you meant recently-dead here, rather than
> > live/recently-dead, because we aren't going to change live ctids,
> > right?
>
> "Recently dead" means "still live to somebody", so those tids better
> not change either.  But I don't think that's what he meant.  I'm
> more worried about the deadlock possibilities inherent in trying to
> upgrade a buffer lock.  We do not have deadlock detection for
> LWLocks.

I am guessing he is going to have to release the lock, then ask for an
exclusive one.

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [HACKERS] HOT WIP Patch - version 2

From
"Pavan Deolasee"
Date:

On 2/20/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> ... Yes. The HOT-updated status of the root and all intermediate
> tuples is cleared and their respective ctid pointers are made
> point to themselves.

Doesn't that destroy the knowledge that they form a tuple chain?
While it might be that no one cares any longer, it would seem more
reasonable to leave 'em chained together.


I see your point, but as you mentioned, do we really care? The chain
needs to be broken so that the intermediate DEAD tuples can be
vacuumed. We can't vacuum them normally because they could be part of
a live HOT-update chain. Resetting the HOT-updated status of the root
tuple helps mark the index entry LP_DELETE once the entire HOT-update
chain is dead.

Also, if we decide to reuse the heap-only tuples without even
vacuuming them, breaking the chain is the better option since we then
guarantee there are no references to the DEAD heap-only tuples.

Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: [HACKERS] HOT WIP Patch - version 2

From
"Pavan Deolasee"
Date:

On 2/20/07, Bruce Momjian <bruce@momjian.us> wrote:
> Tom Lane wrote:
> > "Recently dead" means "still live to somebody", so those tids
> > better not change either.  But I don't think that's what he meant.
> > I'm more worried about the deadlock possibilities inherent in
> > trying to upgrade a buffer lock.  We do not have deadlock
> > detection for LWLocks.
>
> I am guessing he is going to have to release the lock, then ask for
> an exclusive one.

Yes, that's what is done. Since we try to prune the HOT-update chain
even in the SELECT path, we upgrade the lock only if we are sure that
there is at least one tuple that can be removed from the chain or the
root needs to be fixed (a broken ctid chain for some reason).
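In code terms the "upgrade" is a drop-and-reacquire, with the pruning
precondition rechecked once the exclusive lock is held (a sketch;
chain_needs_pruning() and prune_hot_chain() are illustrative names):

/*
 * Sketch: LWLocks cannot be upgraded in place, so the share lock is
 * released and the exclusive lock acquired afresh.  The page can
 * change in that window, so the pruning precondition is rechecked
 * under the exclusive lock before doing anything.
 */
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

if (chain_needs_pruning(buffer, rootoff))
    prune_hot_chain(buffer, rootoff, OldestXmin);

/* drop back to a share lock and resume the index fetch */
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
LockBuffer(buffer, BUFFER_LOCK_SHARE);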
 
Thanks,
Pavan

--

EnterpriseDB     http://www.enterprisedb.com

Re: [HACKERS] HOT WIP Patch - version 2

From
mark@mark.mielke.cc
Date:
On Tue, Feb 20, 2007 at 08:31:45PM +0530, Pavan Deolasee wrote:
> I see your point, but as you mentioned, do we really care? The chain
> needs to be broken so that the intermediate DEAD tuples can be
> vacuumed. We can't vacuum them normally because they could be part
> of a live HOT-update chain. Resetting the HOT-updated status of the
> root tuple helps mark the index entry LP_DELETE once the entire
> HOT-update chain is dead.
> ...

For some reason this paragraph raised a query in my mind. Will we
be able to toggle this new "hot update" code at configure time, so
that we can measure what sort of effect this change has once it is
complete?

Even if only during the early testing cycles for the next release, I
think it would be useful.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   |
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/


Re: [HACKERS] HOT WIP Patch - version 2

From
"Simon Riggs"
Date:
On Tue, 2007-02-20 at 09:48 +0200, Hannu Krosing wrote:

> I'm not sure about the "we are more concerned about the large
> tables" part. I see it more as a device for high-update tables. This
> may not always be the same as "large", so there should be some
> fallback for the case where you can't get the lock. Maybe just give
> up and move to another page?

Every design favours one or other of the various use cases.

The worst performance is caused by large tables with random updates,
because that causes substantially more I/O than a smaller table
would.

A table with substantially more updaters than rows will be unlikely
to ever yield a vacuum-level lock on the block, so the table will
inevitably grow. But because it's fairly small, it won't grow that
much before an autovacuum is triggered to clean it up. The index
entries will still be minimised in this case.

The case of a small number of rows being very heavily updated in an
otherwise very large table will not be well optimised by this simplified
version of HOT. However, that case can still benefit from a Dead Space
Map approach. In view of other work on DSM it was felt that simplifying
HOT was the right thing to do. So DSM is still required.

If no other DSM approaches happen, it should be possible to implement
an 80/20 version of DSM by simply running a VACUUM using the current
FSM implementation as the input blockids. In many cases that will
yield a good proportion of the benefits of a full VACUUM. I hope we
can agree on that if there is no agreement on a full DSM solution; it
would be a pity to ignore such a low-complexity approach.

Note also that Alvaro's multi-vacuum solution will be required to
allow VACUUMs to be effective against heavily updated, yet smaller,
tables.

So the comment about being "more concerned with large tables" is
really a trade-off made to allow a simpler solution, in an area that
minimises the performance disadvantages.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] HOT WIP Patch - version 2

From
"Jim C. Nasby"
Date:
There's a fair amount of added work to be done when updating tuples.
Will it be possible to postpone some of that to the bgwriter in a later
version? I realize that sometimes you'll still want to do the work up
front, like if it means we can stay on the same page instead of going
cold...

On Tue, Feb 20, 2007 at 12:08:14PM +0530, Pavan Deolasee wrote:
> Reposting - looks like the message did not get through in the first
> attempt. My apologies if multiple copies are received.
>
> This is the next version of the HOT WIP patch. Since the last patch
> that I sent out, I have implemented the HOT-update chain pruning
> mechanism.
> [...]


--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)