Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date
Msg-id CAH2-WzkLEHUEPfRU+u3_DcYL7mYgUgpR8z=o-3RdLvVWQoPr1A@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Thu, Jul 11, 2019 at 10:42 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > I think unique indexes may benefit from deduplication not only because
> > of NULL values.  Non-HOT updates produce duplicates of non-NULL values
> > in unique indexes.  And those duplicates can take significant space.
>
> I agree that we should definitely have an open mind about unique
> indexes, even with non-NULL values. If we can prevent a page split by
> deduplicating the contents of a unique index page, then we'll probably
> win. Why not try? This will need to be tested.

I thought about this some more. I believe that the LP_DEAD bit setting
within _bt_check_unique() is generally more important than the more
complicated kill_prior_tuple mechanism for setting LP_DEAD bits, even
though the _bt_check_unique() thing can only be used with unique
indexes. Also, I have often thought that we don't do enough to take
advantage of the special characteristics of unique indexes -- they
really are quite different. I believe that other database systems do
this in various ways. Maybe we should too.

Unique indexes are special because there can only ever be zero or one
tuples of the same value that are visible to any possible MVCC
snapshot. Within the index AM, there is little difference between an
UPDATE by a transaction and a DELETE + INSERT of the same value by a
transaction. If there are 3 or 5 duplicates within a unique index,
then there is a strong chance that VACUUM could reclaim some of them,
given the chance. It is worth going to a little effort to find out.

In a traditional serial/bigserial primary key, the key space that is
typically "owned" by the left half of a rightmost page split describes
a range of about ~366 items, with few or no gaps for other values that
didn't exist at the time of the split (i.e. the two pivot tuples on
each side cover a range that is equal to the number of items itself).
If the page ever splits again, the chances of it being due to non-HOT
updates is perhaps 100%. Maybe VACUUM just didn't get around to the
index in time, or maybe there is a long running xact, or whatever. If
we can delay page splits in indexes like this, then we could easily
prevent them from *ever* happening.

Our first line of defense against page splits within unique indexes
will probably always be LP_DEAD bits set within _bt_check_unique(),
because it costs so little -- same as today. We could also add a
second line of defense: deduplication -- same as with non-unique
indexes with the patch. But we can even add a third line of defense on
top of those two: more aggressive reclaiming of posting list space, by
going to the heap to check the visibility status of earlier posting
list entries. We can do this optimistically when there is no LP_DEAD
bit set, based on heuristics.

The high level principle here is that we can justify going to a small
amount of extra effort for the chance to avoid a page split, and maybe
even more than a small amount. Our chances of reversing the split by
merging pages later on are almost zero. The two halves of the split
will probably each get dirtied again and again in the future if we
cannot avoid it, plus we have to dirty the parent page, and the old
sibling page (to update its left link). In general, a page split is
already really expensive. We could do something like amortize the cost
of accessing the heap a second time for tuples that we won't have
considered setting the LP_DEAD bit on within _bt_check_unique() by
trying the *same* heap page a *second* time where possible (distinct
values are likely to be nearby on the same page). I think that an
approach like this could work quite well for many workloads. You only
pay a cost (visiting the heap an extra time) when it looks like you'll
get a benefit (not splitting the page).

As you know, Andres already changed nbtree to get an XID for conflict
purposes on the primary by visiting the heap a second time (see commit
558a9165e08), when we need to actually reclaim LP_DEAD space. I
anticipated that we could extend this to do more clever/eager/lazy
cleanup of additional items before that went in, which is a closely
related idea. See:


https://www.postgresql.org/message-id/flat/CAH2-Wznx8ZEuXu7BMr6cVpJ26G8OSqdVo6Lx_e3HSOOAU86YoQ%40mail.gmail.com#46ffd6f32a60e086042a117f2bfd7df7

I know that this is a bit hand-wavy; the details certainly need to be
worked out. However, it is not so different to the "ghost bit" design
that other systems use with their non-unique indexes (though this idea
applies specifically to unique indexes in our case). The main
difference is that we're going to the heap rather than to UNDO,
because that's where we store our visibility information. That doesn't
seem like such a big difference -- we are also reasonably confident
that we'll find that the TID is dead, even without LP_DEAD bits being
set, because we only do the extra stuff with unique indexes. And, we
do it lazily.

--
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Brad DeJong
Date:
Subject: Re: Comment typo in tableam.h
Next
From: Bruce Momjian
Date:
Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and KeyManagement Service (KMS)