Re: Enabling B-Tree deduplication by default - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Enabling B-Tree deduplication by default
Date
Msg-id CAH2-Wz=0wvCrSpMOrHJpQXTFPPA9uh7pVpV2F5tZoiJAvy_LJQ@mail.gmail.com
Whole thread Raw
In response to Re: Enabling B-Tree deduplication by default  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Enabling B-Tree deduplication by default  (Peter Geoghegan <pg@bowt.ie>)
Re: Enabling B-Tree deduplication by default  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Thu, Jan 16, 2020 at 10:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 15, 2020 at 6:38 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > There are some outstanding questions about how B-Tree deduplication
> > [1] should be configured, and whether or not it should be enabled by
> > default. I'm starting this new thread in the hopes of generating
> > discussion on these high level questions.
>
> It seems like the issue here is that you're pretty confident that
> deduplication will be a win for unique indexes, but not so confident
> that this will be true for non-unique indexes.

Right.

> I don't know that I understand why.

The main reason that I am confident about unique indexes is that we
only do a deduplication pass in a unique index when we observe that
the incoming tuple (the one that might end up splitting the page) is a
duplicate of some existing tuple. Checking that much is virtually
free, since we already have the information close at hand today (we
cache the _bt_check_unique() binary search bounds for reuse within
_bt_findinsertloc() today). This seems to be an excellent heuristic,
since we really only want to target unique index leaf pages where all
or almost all insertions must be duplicates caused by non-HOT updates
-- this category includes all the pgbench indexes, and includes all of
the unique indexes in TPC-C. Whereas with non-unique indexes, we
aren't specifically targeting version churn (though it will help with
that too).

Granted, the fact that the incoming tuple happens to be a duplicate is
not a sure sign that the index is in this informal "suitable for
deduplication" category of mine. The incoming duplicate could just be
a once off. Even still, it's extremely unlikely to matter -- a failed
deduplication pass really isn't that expensive anyway, since it takes
place just before we split the page (we'll need the page in L1 cache
anyway). If we consistently attempt deduplication in a unique index,
then we're virtually guaranteed to consistently benefit from it.

In general, the way that deduplication is only considered at the point
where we'd otherwise have to split the page buys *a lot*. The idea of
delaying page splits by doing something like load balancing or
compression in a lazy fashion has a long history -- it was not my
idea. I'm not talking about the LP_DEAD bit set deletion stuff here --
this goes back to the 1970s.

> It does seem odd to me to treat them differently, but it's possible
> that this is a reflection of my own lack of understanding. What do
> other database systems do?

Other database systems treat unique indexes very differently, albeit
in a way that we're not really in a position to take too much away
from -- other than the general fact that unique indexes can be thought
of as very different things.

In general, the unique indexes in other systems are expected to be
unique in every sense, even during an "UPDATE foo SET unique_key =
unique_key + 1" style query. Index tuples are slightly smaller in a
unique index compared to an equivalent non-unique index in the case of
one such system. Also, that same system has something called a "unique
index scan" that can only be used with a unique index (and only when
all columns appear in the query qual).

> I wonder whether we could avoid the downside of having only one
> LP_DEAD bit per line pointer by having a bit per TID within the
> compressed tuples themselves. I assume you already thought about that,
> though.

So far, this lack of LP_DEAD bit granularity issue is only a
theoretical problem. I haven't been able to demonstrate in any
meaningful way. Setting LP_DEAD bits is bound to be correlated, and we
only deduplicate to avoid a page split.

Just last night I tried a variant pgbench workload with a tiny
accounts table, an extremely skewed Zipf distribution, and lots of
clients relative to the size of the machine. I used a non-unique index
instead of a unique index, since that is likely to be where the patch
was weakest (no _bt_check_unique() LP_DEAD bit setting that way). The
patch still came out ahead of the master branch by about 3%. It's very
hard to prove that there is no real downside to having only one
LP_DEAD bit per posting list tuple, since absence of evidence isn't
evidence of absence. I believe that it's much easier to make the
argument that it's okay to one have one LP_DEAD bit per posting list
within unique indexes specifically, though (because we understand that
there can be no duplicates in the long run there).

Throughout this work, and the v12 B-Tree work, I consistently made
conservative decisions about space accounting in code like
nbtsplitloc.c (the new nbtdedup.c code has to think about space in
about the same way). My intuition is that space accounting is one area
where we really ought to be conservative, since it's so hard to test.
That's the main reason why I find the idea of having LP_DEAD bits
within posting list tuples unappealing, whatever the benefits may be
-- it adds complexity in the one area that I really don't want to add
complexity.

> What are the characteristics of this system if you have an index that
> is not declared as UNIQUE but actually happens to be UNIQUE?

I believe that the only interesting characteristic is that it is
append-only, and has no reads. The variant of the insert benchmark
that does updates and deletes will still come out ahead, because then
version churn comes in to play -- just like with the pgbench unique
indexes (even though these aren't unique indexes).

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: SlabCheck leaks memory into TopMemoryContext
Next
From: David Steele
Date:
Subject: Re: making the backend's json parser work in frontend code