Thread: HOT WIP Patch - version 3.2
Please see the attached WIP HOT patch - version 3.2. It now
implements the logic for reusing heap-only dead tuples. When a
HOT-update chain is pruned, the heap-only tuples are marked
LP_DELETE. The lp_offset and lp_len fields in the line pointer are
maintained.
When a backend runs out of free space in a page when doing an
UPDATE, it searches the line pointers to find a slot
which is marked LP_DELETEd and has enough space to accommodate
the new tuple. If such a slot is found, it's reused. We might
waste some space if the slot is larger than the tuple, but
that gets reclaimed at VACUUM time.
During VACUUM, tuple-chains are pruned, one page at a time.
This is necessary so that the chain is pruned even if the tuple
is not accessed after the update (normally, pruning happens
when the tuple is fetched via an index).
This patch also implements the logic to convert a dead root
tuple into a stub. This is done by changing the lp_len of the
root tuple's slot to sizeof(HeapTupleHeaderData).
The space released can only be reclaimed at the time of VACUUM.
Open Issues:
-----------------
There are quite a few:
- What do we do with the LP_DELETEd tuples at VACUUM time?
In this patch, we are collecting them and vacuuming them like
any other dead tuples. But is that the best thing to do?
- While searching for an LP_DELETEd tuple, we start from the
first offset and return the first slot which is big enough
to store the tuple. Is there a better search algorithm
(sorting/randomizing)? Should we go for best-fit instead
of first-fit?
- Should we have metadata on the heap page to track the
number of LP_DELETEd tuples, the number of HOT-update chains in the
page, and any other information that can help us optimize
search/prune operations?
- There are some interesting issues in the VACUUMing area. How
do we count the dead tuples? Should we count HOT-updated
tuples in the dead count? If we do, I noticed that
VACUUM gets triggered on very small tables like "tellers"
in pgbench and takes several minutes to finish, because
it waits a very long time for a VACUUM-strength lock on the
index pages. The index is just a page or two in this case.
What's Next?
------------------
ISTM that the next important item to do is handling of dead root
tuples.
There are several designs proposed to handle this, but I don't
think we have consensus on any one design. I am thinking of going
ahead with my proposal of line-pointer indirection.
I am not passing judgement on the other proposed designs, but for
lack of consensus I am choosing the one that seems simplest right
now, and it looks good to me :)
Are there any serious objections to this approach? This would waste
4 bytes per HOT-updated tuple chain for the lifetime of the
chain, but would provide an excellent opportunity to reuse the
dead root tuple just like any other heap-only dead tuple.
The "stub" approach would waste 4 bytes of line pointer + 24 bytes
of header, and it requires an index swing to remove the stub.
The cyclic tuple chain would temporarily waste space up to the size of the
root tuple until it gets reused by another update of the same row. In
addition, there are concerns about the additional header field for the back
pointer and the complexity involved in locating the visible tuple.
Based on these observations, I am inclined towards the line-pointer
redirection approach. I haven't heard any comments for or against
it yet. Please let me know if there are objections before I
start working on it.
Thanks,
Pavan
--
EnterpriseDB http://www.enterprisedb.com
Pavan Deolasee wrote:
> - What do we do with the LP_DELETEd tuples at the VACUUM time ?
> In this patch, we are collecting them and vacuuming like
> any other dead tuples. But is that the best thing to do ?

Since they don't need index cleanups, it's a waste of
maintenance_work_mem to keep track of them in the dead tuples array.
Let's remove them in the 1st phase. That means trading the shared lock
for a vacuum-level lock on pages with LP_DELETEd tuples. Or if we want
to get fancy, we could skip LP_DELETEd tuples in the 1st phase for pages
that had dead tuples on them, and scan and remove them in the 2nd phase
when we have to acquire the vacuum-level lock anyway.

> - While searching for a LP_DELETEd tuple, we start from the
> first offset and return the first slot which is big enough
> to store the tuple. Is there a better search algorithm
> (sorting/randomizing) ? Should we go for best-fit instead
> of first-fit ?

Best-fit seems better to me. It's pretty cheap to scan for LP_DELETEd
line pointers, but wasting space can lead to cold updates and get much
more expensive.

You could also prune the chains on the page to make room for the update,
and if you can get a vacuum lock you can also defrag the page.

> - Should we have metadata on the heap page to track the
> number of LP_DELETEd tuples, number of HOT-update chains in the
> page and any other information that can help us optimize
> search/prune operations ?

I don't think the CPU overhead is that significant; we only need to do
the search/prune when a page gets full. We can add flags later if we
feel like it, but let's keep it simple for now.

> - There are some interesting issues in the VACUUMing area. How
> do we count the dead tuples ? Should we count HOT-updated
> tuples in the dead count ? If we do so, I noticed that
> VACUUM gets triggered on very small tables like "tellers"
> in pgbench and takes several minutes to finish because
> it waits very very long for VACUUM-strength lock on the
> index pages. Index is just a page or two in this case.

Yeah, that's not good. HOT updates shouldn't increase the n_dead_tuples
pgstat counter.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 2/27/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
I liked the idea of not collecting the LP_DELETEd tuples in the first
pass. We also prune the HOT-update chains in the page in the first
pass; maybe that can also be moved to the second pass. We need to
carefully work on the race conditions involved in the VACUUM, pruning
and tuple reuse though.
Ok. I will give it a shot once the basic things are ready.
Yes, that's a good suggestion as well. I am already doing that in the
patch I am working on right now.
I am making good progress with the line-pointer redirection stuff.
It's showing tremendous value in keeping the table and index size
under control. But we need to check the CPU overhead as well,
and optimize there if required.
Thanks,
Pavan