Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date
Msg-id CAH2-Wzn-2TogDRMSojmVa2biqUGznHASD8mhWDbgVhrvtxAPuQ@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I haven't measured how these changes affect WAL size yet.
> > Do you have any suggestions on how to automate testing of new WAL records?
> > Is there any suitable place in regression tests?
>
> I don't know about the regression tests (I doubt that there is a
> natural place for such a test), but I came up with a rough test case.
> I more or less copied the approach that you took with the index build
> WAL reduction patches, though I also figured out a way of subtracting
> heapam WAL overhead to get a real figure. I attach the test case --
> note that you'll need to use the "land" database with this. (This test
> case might need to be improved, but it's a good start.)

I used a test script similar to the "nbtree_wal_test.sql" test script
I posted on September 11th today. I am concerned about the WAL
overhead for cases that don't benefit from the patch (usually because
they turn off deduplication altogether). The details of the index
tested were different this time, though. I used an index that had the
smallest possible tuple size: 16 bytes (this is the smallest possible
size on 64-bit systems, but that's what almost everybody uses these
days). So any index with one or two int4 columns (or one int8 column)
will generally have 16 byte IndexTuples, at least when there are no
NULLs in the index. In general, 16 byte wide tuples are very, very
common.

What I saw suggests that we will need to remove the new "postingoff"
field from xl_btree_insert. (We can create a new XLog record for leaf
page inserts that also need to split a posting list, without changing
much else.)

The way that *alignment* of WAL records affects these common 16 byte
IndexTuple cases is the real problem. Adding "postingoff" to
xl_btree_insert increases the WAL required for INSERT_LEAF records by
two bytes (sizeof(OffsetNumber)), as you'd expect -- pg_waldump output
shows that they're 66 bytes, whereas they're only 64 bytes on the
master branch. That doesn't sound that bad, but once you consider the
alignment of whole records, it's really an extra 8 bytes. That is
totally unacceptable. The vast majority of nbtree WAL records are
bound to be INSERT_LEAF records, so as things stand we have added
(almost) 12.5% space overhead to nbtree for these common cases, that
don't benefit.

I haven't really looked into other types of WAL record just yet. The
real world overhead that we're adding to xl_btree_vacuum records is
something that I will have to look into separately. I'm already pretty
sure that adding two bytes to xl_btree_split is okay, though, because
they're far less numerous than xl_btree_insert records, and aren't
affected by alignment in the same way (they're already several hundred
bytes in almost all cases).

I also noticed something positive: The overhead of xl_btree_dedup WAL
records seems to be very low with indexes that have hundreds of
logical tuples for each distinct integer value. We don't seem to have
a problem with "deduplication thrashing".

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Partial path row estimates
Next
From: Maciek Sakrejda
Date:
Subject: Re: JIT performance bug/regression & JIT EXPLAIN