Re: UNDO and in-place update - Mailing list pgsql-hackers

From Robert Haas
Subject Re: UNDO and in-place update
Date
Msg-id CA+TgmoZ0gt8XXng1eDstdJo17N_gvfrmdBVpOqVu6CjFCx0whQ@mail.gmail.com
Whole thread Raw
In response to Re: UNDO and in-place update  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: UNDO and in-place update
Re: UNDO and in-place update
List pgsql-hackers
On Tue, Nov 29, 2016 at 12:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think we need to avoid putting the visibility information in the
>> index because that will make the index much bigger.
>
> I agree that there will be an increase in index size, but it shouldn't
> be much if we have transaction information (and that too only active
> transactions) at page level instead of at tuple level.

It depends on the workload.  If you have something where the number of
transactions per page is small, like a bulk load, that scheme may work
out great.  But if you have something like a pgbench workload with an
extra index on the abalance column, you could have a different
transaction for almost every tuple.  Of course the question of how
many of those will be recent is a good one; if it's not too many,
maybe it's not that bad.

> I think the
> number of concurrent writes on the index will be lesser as compared to
> the heap.  There are a couple of benefits of having visibility
> information in the index.
>
> a. Heap and index could be independently cleaned up and in most cases
> by foreground transactions only.  The work for vacuum will be quite
> less as compared to now.  I think this will help us in avoiding the
> bloat to a great degree.
>
> b. Improved index-only scans, because now we don't need visibility map
> of the heap to check tuple visibility.
>
> c. Some speed improvements for index scans can also be expected
> because with this we don't need to perform heap fetch for invisible
> index tuples.

I see what you're going for, but I'm not sure it's worth it.  I mean,
say you just have one bit per index tuple.  If it's set, the heap
tuple is definitely all-visible; if it's clear, then the tuple may or
may not be visible to your snapshot.  You insert tuples with the bit
clear, and clear the bit again when the tuple is deleted or the
indexed column is updated.  You set the bit when you follow the index
pointer and discover that the tuple is all-visible, so that next time
you don't need to follow it again.  Of the advantages you mention
above, this still allows (a), but less efficiently.  It allows (b).
It does not allow (c).  So, it's not as good.  But, it saves you the
overhead of storing xmin and/or xmax for each tuple in the page.  Even
if you have an optimization to list the XIDs once per page and then
refer to them from the tuples, you are going to need a byte in each
tuple for the xmin-index and a byte for the xmax-index, or something
like that, plus the space for the XID list itself.  That's not a ton
of space, maybe, but it's definitely more.  It's also a significantly
bigger change to the tuple format, I think.  What I'm proposing could
be done by repurposing the LP_* bits available in the line pointer.
So I think the questions are (1) will the extra space consumption hurt
us more than the benefit we will get? and (2) is it really worth the
work to change the page format to that degree?

As you can probably tell, I'm leaning toward "no".  I think a bit
per-tuple -- whether we call it delete-mark or definitely-all-visible
-- should be enough to get most of the benefit that is available here
for substantially less work and no extra space.

> I think with proposed design there is a good chance that for many of
> the index scans, there is a need for the extra hop to decide
> visibility of index tuple. I think as of today, during index scan if
> the corresponding heap page is not-all-visible, then we check the
> corresponding heap tuple to decide about the visibility of index
> tuple, whereas with proposed design, I think it first needs to check
> heap page and then TPD, if there is more than one transaction
> operating on page.

There's a couple of possible designs here, but there is the
possibility for extra hops in some cases.  But there are things we can
do to mitigate that.

1. If each tuple has a definitely-all-visible bit, you can check that
first; if it's set, the tuple is definitely all-visible you don't need
to do anything further.

2. If we still maintain a visibility map, you can check that next.  If
the bit is set for the target page, then the tuple is definitely
all-visible and you don't need to do anything further.

3. Otherwise, you need to look at the heap page.  But if the heap
page's UNDO pointer is invalid, then the whole page is all-visible, so
you're done.  (You can also set the visibility map bit and/or the
index tuple's definitely-all-visible bit to avoid having to recheck
the heap page in the future.)

4. Each tuple will have a bit indicating whether it's been recently
modified; that is, whether the transaction whose UNDO log is
referenced by the page's UNDO pointer did anything to that tuple; or
if the page points to a TPD entry, whether any of the transactions in
the TPD modified that tuple.  If that bit is not set for that
particular tuple, it's all-visible and you're done.

5. Otherwise you have to look at the TPD entry (if any) and UNDO logs.

If we put in the visibility information in the index as you are
proposing, then we can always resolve the visibility of the index
tuple at step 1; we just test xmin and xmax against the snapshot.  For
index-only scans, that's great, because we never need to consult the
heap at all.  For regular index scans, it's not nearly as great,
because we're still going to need to consult the TPD and UNDO logs to
determine which version of the tuple is visible to that snapshot.  You
can avoid looking at the heap in the case where no version of the
tuple is visible to the scan, but that's probably mostly going to
happen only in cases where the transaction is still in flight so in
most cases the heap will be in shared_buffers anyway and you won't
save very much.

> Such a design will work, but I think this is more work to mark the
> tuple as Dead as compared to current system.

Anything that allows update-in-place requires touching the index
entries when they are invalidated, right?  That's the main cost.

>>>>  That's OK, but we're in real trouble if
>>>> step-3 inserted an additional index tuple pointing to that chain
>>>> rather than simply noting that one already exists.  If it adds an
>>>> additional one, then we'll visit two index tuples for that TID.  Each
>>>> time, the visibility information in UNDO will cause us to return the
>>>> correct tuple, but we've erred by returning it twice (once per index
>>>> entry) rather than just once.
>>>
>>> Why will scan traverse the UNDO chain if it starts after step-3 commit?
>>
>> Why wouldn't it?  I think that if an index scan hits a page with an
>> UNDO chain, it always need to traverse the whole thing.
>
> If you notice the scan has started after all the writes have
> committed, so what is the guarantee that there will be a UNDO chain?
> Yes, it could be there if there is some live transaction which started
> before step, otherwise, I am not sure if we can guarantee that UNDO
> chain will be present.

The UNDO chain has to remain until all transactions that modified the
page are all-visible, not just until the transactions are committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Vladimir Rusinov
Date:
Subject: Re: pg_xlogdump follow into the future
Next
From: Robert Haas
Date:
Subject: Re: Hash Indexes