Thread: pgsql: Add deduplication to nbtree.

pgsql: Add deduplication to nbtree.

From
Peter Geoghegan
Date:
Add deduplication to nbtree.

Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method.  The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs.  Deduplication is only applied at
the point where a leaf page split would otherwise be required.  New
posting list tuples are formed by merging together existing duplicate
tuples.  The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed.  Even unique indexes make use of
deduplication as a way of controlling bloat from duplicates whose TIDs
point to different versions of the same logical table row.

The lazy approach taken by nbtree has significant advantages over a GIN
style eager approach.  Most individual inserts of index tuples have
exactly the same overhead as before.  The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits.  The key space of indexes works in the same way as it has
since commit dd299df8 (the commit that made heap TID a tiebreaker
column).

Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key).  The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective.  This can significantly improve
transaction throughput, and significantly reduce the cost of vacuuming
indexes.

A new index storage parameter (deduplicate_items) controls the use of
deduplication.  The default setting is 'on', so all new B-Tree indexes
automatically use deduplication where possible.  This decision will be
reviewed at the end of the Postgres 13 beta period.

There is a regression of approximately 2% of transaction throughput with
synthetic workloads that consist of append-only inserts into a table
with several non-unique indexes, where all indexes have few or no
repeated values.  The underlying issue is that cycles are wasted on
unsuccessful attempts at deduplicating items in non-unique indexes.
There doesn't seem to be a way around it short of disabling
deduplication entirely.  Note that deduplication of items in unique
indexes is fairly well targeted in general, which avoids the problem
there (we can use a special heuristic to trigger deduplication passes in
unique indexes, since we're specifically targeting "version bloat").

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

No bump in BTREE_VERSION, since the representation of posting list
tuples works in a way that's backwards compatible with version 4 indexes
(i.e. indexes built on PostgreSQL 12).  However, users must still
REINDEX a pg_upgrade'd index to use deduplication, regardless of the
Postgres version they've upgraded from.  This is the only way to set the
new nbtree metapage flag indicating that deduplication is generally
safe.

Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion:
    https://postgr.es/m/55E4051B.7020209@postgrespro.ru
    https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/0d861bbb702f8aa05c2a4e3f1650e7e8df8c8c27

Modified Files
--------------
contrib/amcheck/verify_nbtree.c           | 231 +++++++--
doc/src/sgml/btree.sgml                   | 201 +++++++-
doc/src/sgml/charset.sgml                 |   9 +-
doc/src/sgml/citext.sgml                  |   7 +-
doc/src/sgml/func.sgml                    |   9 +-
doc/src/sgml/ref/create_index.sgml        |  44 +-
src/backend/access/common/reloptions.c    |  10 +
src/backend/access/index/genam.c          |   4 +
src/backend/access/nbtree/Makefile        |   1 +
src/backend/access/nbtree/README          | 133 ++++-
src/backend/access/nbtree/nbtdedup.c      | 830 ++++++++++++++++++++++++++++++
src/backend/access/nbtree/nbtinsert.c     | 388 ++++++++++++--
src/backend/access/nbtree/nbtpage.c       | 246 ++++++++-
src/backend/access/nbtree/nbtree.c        | 171 +++++-
src/backend/access/nbtree/nbtsearch.c     | 272 +++++++++-
src/backend/access/nbtree/nbtsort.c       | 193 ++++++-
src/backend/access/nbtree/nbtsplitloc.c   |  39 +-
src/backend/access/nbtree/nbtutils.c      | 196 +++++--
src/backend/access/nbtree/nbtxlog.c       | 268 +++++++++-
src/backend/access/rmgrdesc/nbtdesc.c     |  22 +-
src/backend/storage/page/bufpage.c        |  11 +-
src/bin/psql/tab-complete.c               |   4 +-
src/include/access/nbtree.h               | 433 +++++++++++++---
src/include/access/nbtxlog.h              | 117 ++++-
src/include/access/rmgrlist.h             |   2 +-
src/include/access/xlog_internal.h        |   2 +-
src/test/regress/expected/btree_index.out |  20 +-
src/test/regress/sql/btree_index.sql      |  22 +-
28 files changed, 3553 insertions(+), 332 deletions(-)


Re: pgsql: Add deduplication to nbtree.

From
Laurenz Albe
Date:
On Wed, 2020-02-26 at 21:06 +0000, Peter Geoghegan wrote:
> Add deduplication to nbtree.

This is great! Thanks!

Are no changes to the "pageinspect" contrib required?

Yours,
Laurenz Albe




Re: pgsql: Add deduplication to nbtree.

From
Peter Geoghegan
Date:
On Thu, Feb 27, 2020 at 9:25 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> This is great! Thanks!

Thanks!

> Are no changes to the "pageinspect" contrib required?

There are. Those will be pushed either today or tomorrow.

-- 
Peter Geoghegan



Re: pgsql: Add deduplication to nbtree.

From
Peter Geoghegan
Date:
On Thu, Feb 27, 2020 at 9:26 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > Are no changes to the "pageinspect" contrib required?
>
> There are. Those will be pushed either today or tomorrow.

Attached is a draft patch for this, with updated documentation. I'd
like to push this tomorrow (Saturday), but if you could take a look
over it first, that would be great.

Thanks
-- 
Peter Geoghegan

Attachment

Re: pgsql: Add deduplication to nbtree.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> Add deduplication to nbtree.

Coverity isn't very happy with the coding in _bt_update_posting():

*** CID 1460433:  Memory - corruptions  (ARRAY_VS_SINGLETON)
/srv/coverity/git/pgsql-git/postgresql/src/backend/access/nbtree/nbtdedup.c: 723 in _bt_update_posting()
717         {
718             if (d < vacposting->ndeletedtids && vacposting->deletetids[d] == i)
719             {
720                 d++;
721                 continue;
722             }
>>>     CID 1460433:  Memory - corruptions  (ARRAY_VS_SINGLETON)
>>>     Using "htids" as an array.  This might corrupt or misinterpret adjacent memory locations.
723             htids[ui++] = *BTreeTupleGetPostingN(origtuple, i);
724         }
725         Assert(ui == nhtids);
726         Assert(d == vacposting->ndeletedtids);
727         Assert(nhtids == 1 || _bt_posting_valid(itup));

I can see its point: asserting after the fact that you didn't clobber
memory isn't a terribly safe coding method, especially in a production
build where you won't even have the asserts.  Not sure if there's a
better way though.

            regards, tom lane



Re: pgsql: Add deduplication to nbtree.

From
Peter Geoghegan
Date:
On Sun, Mar 1, 2020 at 10:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I can see its point: asserting after the fact that you didn't clobber
> memory isn't a terribly safe coding method, especially in a production
> build where you won't even have the asserts.  Not sure if there's a
> better way though.

I found it slightly more elegant to treat itup->t_tid as a degenerate
1 element posting list here, but I'm not particularly attached to that
approach. The loop is only truly necessary when dealing with a posting
list tuple.

Do you think that _bt_update_posting() should avoid this loop when
itup is just a plain tuple, that lacks a posting list? I can do it
that way if you prefer.

-- 
Peter Geoghegan



Re: pgsql: Add deduplication to nbtree.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Sun, Mar 1, 2020 at 10:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I can see its point: asserting after the fact that you didn't clobber
>> memory isn't a terribly safe coding method, especially in a production
>> build where you won't even have the asserts.  Not sure if there's a
>> better way though.

> I found it slightly more elegant to treat itup->t_tid as a degenerate
> 1 element posting list here, but I'm not particularly attached to that
> approach. The loop is only truly necessary when dealing with a posting
> list tuple.
> Do you think that _bt_update_posting() should avoid this loop when
> itup is just a plain tuple, that lacks a posting list? I can do it
> that way if you prefer.

Hm.  That would probably be enough to shut up Coverity, but I'm unsure
whether it'd really be an improvement from the legibility and safety
viewpoints.  Do you want to try coding it that way and see what it
comes out like?

Note that we do have the ability to just dismiss the Coverity complaint,
if we decide that there's no better way to write the function.  But
it's usually worth looking to see if it could be written more cleanly.

            regards, tom lane



Re: pgsql: Add deduplication to nbtree.

From
Peter Geoghegan
Date:
On Sun, Mar 1, 2020 at 11:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm.  That would probably be enough to shut up Coverity, but I'm unsure
> whether it'd really be an improvement from the legibility and safety
> viewpoints.

I noticed that _bt_update_posting() behaves as if the origtuple might
not be a posting list tuple at the point that keysize is calculated,
despite generally depending on it being a posting list tuple (which it
asserts by way of its "_bt_posting_valid(origtuple)" assertion). The
final tuple might not be a posting list, but the original one must be
(if it isn't, then nbtree VACUUM should be deleting it outright in the
traditional way, rather than updating it). I should fix that, either
way.

> Do you want to try coding it that way and see what it
> comes out like?

Sure.

-- 
Peter Geoghegan



Re: pgsql: Add deduplication to nbtree.

From
Peter Geoghegan
Date:
On Sun, Mar 1, 2020 at 11:42 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > Do you want to try coding it that way and see what it
> > comes out like?
>
> Sure.

Attached patch shows how this could work. I prefer my original
approach, but I can see the argument for doing it this way.

If we keep my original approach, we should still add a new
"ItemPointerIsValid(&itup->t_tid)" assertion that covers the plain
tupe case in a way that mirrors the current "_bt_posting_valid(itup)"
assert.

-- 
Peter Geoghegan

Attachment

Re: pgsql: Add deduplication to nbtree.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> Attached patch shows how this could work. I prefer my original
> approach, but I can see the argument for doing it this way.

This does seem a bit duplicative ... and shouldn't both code paths
include a final "Assert(d == vacposting->ndeletedtids)"?  So maybe
we're better off just rejecting the Coverity complaint.

> If we keep my original approach, we should still add a new
> "ItemPointerIsValid(&itup->t_tid)" assertion that covers the plain
> tupe case in a way that mirrors the current "_bt_posting_valid(itup)"
> assert.

Another thing that maybe bears closer scrutiny is the size calculation.
It seems a bit confused as to whether the offset of the posting list
within the tuple, or the total tuple size, or both, needs to be
MAXALIGN'd.

            regards, tom lane



Re: pgsql: Add deduplication to nbtree.

From
Peter Geoghegan
Date:
On Sun, Mar 1, 2020 at 2:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Attached patch shows how this could work. I prefer my original
> > approach, but I can see the argument for doing it this way.
>
> This does seem a bit duplicative ... and shouldn't both code paths
> include a final "Assert(d == vacposting->ndeletedtids)"?

No, because the "nhtids == 1" loop has a "break" when the first and
only TID that we need to keep around is hit.

> Another thing that maybe bears closer scrutiny is the size calculation.
> It seems a bit confused as to whether the offset of the posting list
> within the tuple, or the total tuple size, or both, needs to be
> MAXALIGN'd.

That's not quite true in my opinion. BTreeTupleSetPosting() has an
Assert() of its own about alignment. ISTM that it's reasonable for
_bt_update_posting() to assume that BTreeTupleGetPostingOffset() will
return a MAXALIGN()'d offset. Note also that _bt_form_posting() is
explicit about what it expects around alignment. This is
_bt_update_posting()'s sibling function. The _bt_update_posting()
calculation is explicitly documented as being derived from the one in
_bt_form_posting().

I am happy to add parallel-to-_bt_form_posting() assertions about
alignment to _bt_form_posting(), to nail it down completely. Plus I'll
add the assertion I suggested already. Once I commit a patch with
these two new assertions, I think that we can consider the matter
resolved. Does that seem reasonable to you?

-- 
Peter Geoghegan



Re: pgsql: Add deduplication to nbtree.

From
Peter Geoghegan
Date:
On Sun, Mar 1, 2020 at 3:01 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I am happy to add parallel-to-_bt_form_posting() assertions about
> alignment to _bt_form_posting(), to nail it down completely. Plus I'll
> add the assertion I suggested already. Once I commit a patch with
> these two new assertions, I think that we can consider the matter
> resolved. Does that seem reasonable to you?

I was thinking of the approach taken in the attached patch. It more or
less copies over the assertions from _bt_form_posting() that did not
appear in  _bt_update_posting().

-- 
Peter Geoghegan

Attachment

Re: pgsql: Add deduplication to nbtree.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Sun, Mar 1, 2020 at 3:01 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> I am happy to add parallel-to-_bt_form_posting() assertions about
>> alignment to _bt_form_posting(), to nail it down completely. Plus I'll
>> add the assertion I suggested already. Once I commit a patch with
>> these two new assertions, I think that we can consider the matter
>> resolved. Does that seem reasonable to you?

> I was thinking of the approach taken in the attached patch. It more or
> less copies over the assertions from _bt_form_posting() that did not
> appear in  _bt_update_posting().

OK by me.

            regards, tom lane



Re: pgsql: Add deduplication to nbtree.

From
Peter Geoghegan
Date:
On Sun, Mar 1, 2020 at 8:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I was thinking of the approach taken in the attached patch. It more or
> > less copies over the assertions from _bt_form_posting() that did not
> > appear in  _bt_update_posting().
>
> OK by me.

Pushed. Thanks.

-- 
Peter Geoghegan



Re: pgsql: Add deduplication to nbtree.

From
Tom Lane
Date:
Another issue I just noticed is that src/tools/pginclude/cpluspluscheck
complains thusly:

./src/include/access/nbtree.h: In function 'void BTreeTupleSetPosting(IndexTupleData*, int, int)':
./src/include/access/nbtree.h:384: warning: comparison between signed and unsigned integer expressions

I suppose this can be silenced with an appropriate cast, and doing so
would seem like a good idea.

            regards, tom lane



Re: pgsql: Add deduplication to nbtree.

From
Peter Geoghegan
Date:
On Mon, Mar 2, 2020 at 9:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I suppose this can be silenced with an appropriate cast, and doing so
> would seem like a good idea.

I pushed a commit that silenced the cpluspluscheck warning just now.

Thanks
-- 
Peter Geoghegan



Re: pgsql: Add deduplication to nbtree.

From
Andres Freund
Date:
Hi,

On 2020-03-01 16:09:37 -0800, Peter Geoghegan wrote:
> On Sun, Mar 1, 2020 at 3:01 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I am happy to add parallel-to-_bt_form_posting() assertions about
> > alignment to _bt_form_posting(), to nail it down completely. Plus I'll
> > add the assertion I suggested already. Once I commit a patch with
> > these two new assertions, I think that we can consider the matter
> > resolved. Does that seem reasonable to you?
> 
> I was thinking of the approach taken in the attached patch. It more or
> less copies over the assertions from _bt_form_posting() that did not
> appear in  _bt_update_posting().

It seems, based on discussion on this thread and private inquiry to PG
that we're going to silence the warning, as there's not a good
resolution? I've for now marked the issue as a false positive, and added
a reference to this thread. But I think we should just mark it as
ignored in that case?

Is it perhaps possible to silence the warnign with somethign along the
lines of
Assert(nhtids + vacposting->ndeletedtids == BTreeTupleGetNPosting(origtuple))
I don't know this code, but it looks like that'd have to be true?
Perhaps that'd be enough to silence coverity too?

- Andres



Re: pgsql: Add deduplication to nbtree.

From
Peter Geoghegan
Date:
On Sun, Mar 29, 2020 at 3:15 PM Andres Freund <andres@anarazel.de> wrote:
> Is it perhaps possible to silence the warnign with somethign along the
> lines of
> Assert(nhtids + vacposting->ndeletedtids == BTreeTupleGetNPosting(origtuple))
> I don't know this code, but it looks like that'd have to be true?
> Perhaps that'd be enough to silence coverity too?

It would have to be true. It's a tautology. That is, the value of
nhtids comes from "vacposting->ndeletedtids" and
"BTreeTupleGetNPosting(origtuple)" anyway, and we don't mutate any of
that state in _bt_update_posting().

Wouldn't it at least be necessary to Assert() something about the
final tuple, and/or other work state?

-- 
Peter Geoghegan



Re: pgsql: Add deduplication to nbtree.

From
Andres Freund
Date:
Hi,

On 2020-03-29 15:19:50 -0700, Peter Geoghegan wrote:
> On Sun, Mar 29, 2020 at 3:15 PM Andres Freund <andres@anarazel.de> wrote:
> > Is it perhaps possible to silence the warnign with somethign along the
> > lines of
> > Assert(nhtids + vacposting->ndeletedtids == BTreeTupleGetNPosting(origtuple))
> > I don't know this code, but it looks like that'd have to be true?
> > Perhaps that'd be enough to silence coverity too?
> 
> It would have to be true. It's a tautology. That is, the value of
> nhtids comes from "vacposting->ndeletedtids" and
> "BTreeTupleGetNPosting(origtuple)" anyway, and we don't mutate any of
> that state in _bt_update_posting().

Right. It doesn't look like coverity understands that currently though.


> Wouldn't it at least be necessary to Assert() something about the
> final tuple, and/or other work state?

I don't see why? The assert might not help, but if it does, I don't
think it'd need more than that check? It depends a bit on what
coverity's precise logic here is. If its ARRAY_VS_SINGLETON checker
allows index (i.e. ui) 0 but not 1, then I think the suggested assert
could help it recognize that that's unreachable.

After staring more at the coverity trace, it looks however like
ARRAY_VS_SINGLETON might not allow any array like access, not even 0?
Seems like it assumes that BTreeTupleGetNPosting(origtuple) is at least
two, and that in the first loop d < vacposting->ndeletedtids and
vacposting->deletetids[d] == i are true, but in the second iteration
assumes d < vacposting->ndeletedtids is true but
vacposting->deletetids[d] == i. Since it doesn't record a third
iteration, it seems like it ought to see ui == 0 at that point.

I'll mark it as ignored.

Greetings,

Andres Freund



Re: pgsql: Add deduplication to nbtree.

From
Peter Eisentraut
Date:
On 2020-02-26 22:06, Peter Geoghegan wrote:
> Add deduplication to nbtree.

AddressSanitizer has a use-after-scope complaint related to this patch.

The following change fixes it:

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index a70b64d964..8246ab4f18 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1289,6 +1289,7 @@ _bt_insertonpg(Relation rel,
             xl_btree_metadata xlmeta;
             uint8        xlinfo;
             XLogRecPtr    recptr;
+            uint16        upostingoff;
 
             xlrec.offnum = newitemoff;
 
@@ -1354,7 +1355,7 @@ _bt_insertonpg(Relation rel,
                  * must reconstruct final itup (as well as nposting) using
                  * _bt_swap_posting().
                  */
-                uint16        upostingoff = postingoff;
+                upostingoff = postingoff;
 
                 XLogRegisterBufData(0, (char *) &upostingoff, sizeof(uint16));
                 XLogRegisterBufData(0, (char *) origitup,

Please check.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Add deduplication to nbtree.

From
Peter Geoghegan
Date:
On Thu, Apr 30, 2020 at 11:33 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> AddressSanitizer has a use-after-scope complaint related to this patch.
>
> The following change fixes it:

Your fix looks good to me. Please go ahead and commit it.

Thanks!
-- 
Peter Geoghegan



Re: pgsql: Add deduplication to nbtree.

From
Peter Geoghegan
Date:
On Thu, Apr 30, 2020 at 11:37 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > The following change fixes it:
>
> Your fix looks good to me. Please go ahead and commit it.

Actually, never mind. I just pushed your fix myself a moment ago.

Thanks again
-- 
Peter Geoghegan