Thread: [PROPOSAL] Effective storage of duplicates in B-tree index.
Hi, hackers!

I'm going to begin work on effective storage of duplicate keys in B-tree indexes.
The main idea is to implement posting lists and posting trees for B-tree index pages, as is already done for GIN.

In a nutshell, effective storage of duplicates in GIN is organised as follows.
The index stores a single index tuple for each unique key. That index tuple points to a posting list which contains pointers to heap tuples (TIDs). If too many rows have the same key, multiple pages are allocated for the TIDs, and these constitute a so-called posting tree.
You can find wonderful, detailed descriptions in the gin readme
(https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README)
and articles (http://www.cybertec.at/gin-just-an-index-type/).
It also makes it possible to apply a compression algorithm to the posting list/tree and significantly decrease index size. Read more in the presentation (part 1)
(http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf).

Now a new B-tree index tuple must be inserted for each table row that we index.
This can cause page splits. Because of MVCC, even a unique index can contain duplicates.
Storing duplicates in a posting list/tree helps to avoid superfluous splits.

So it seems to be a very useful improvement. Of course it requires a lot of changes in the B-tree implementation, so I need approval from the community.

1. Compatibility.
It's important to preserve compatibility with older index versions.
I'm going to change BTREE_VERSION to 3,
and use the new (posting) features for v3, keeping the old implementation for v2.
Any objections?

2. There are several tricks to handle non-unique keys in B-tree.
More info in the btree readme
(https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/README)
(chapter "Differences to the Lehman & Yao algorithm").
In the new version they'll become useless. Am I right?

3. Microvacuum.
Killed items are marked LP_DEAD and can be deleted from a page at the time of insertion.
Now that's fine, because each item corresponds to a separate TID. But the posting list implementation requires another approach. I've got two ideas:
The first is to mark LP_DEAD only those tuples where all TIDs are not visible.
The second is to add an LP_DEAD flag to each TID in the posting list (tree). This requires a bit more space, but allows microvacuum of the posting list/tree.
Which one is better?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
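[Editor's note: to make the target workload concrete, here is a minimal, purely illustrative setup; the table, column, and index names are invented. On an unpatched btree, the index stores the key once per row, so its size grows with the row count rather than with the number of distinct keys.]

-- illustrative only: a duplicate-heavy btree index on current PostgreSQL
create table events (id serial primary key, status int);        -- hypothetical table
insert into events (status) select i % 10 from generate_series(1, 1000000) i;
create index events_status_idx on events (status);              -- only 10 distinct keys
select pg_size_pretty(pg_relation_size('events_status_idx'));   -- key repeated in every leaf tuple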
Hi,

On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote:
...

In general, index size is often a serious issue - cases where indexes need
more space than the tables themselves are not uncommon in my experience. So I
think efforts to lower the space requirements of indexes are good.

But if we introduce posting lists into btree indexes, how different are they
from GIN? It seems to me that if I create a GIN index (using btree_gin), I get
mostly the same thing you propose, no?

Sure, there are differences - GIN indexes don't handle UNIQUE indexes - but
the compression can only be effective when there are duplicate rows. So either
the index is not UNIQUE (so the b-tree feature is not needed), or there are
many updates.

Which brings me to the other benefit of btree indexes - they are designed for
high concurrency. How much is this going to be affected by introducing the
posting lists?

kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote:
>> ...
>
> In general, index size is often a serious issue - cases where indexes need
> more space than the tables themselves are not uncommon in my experience. So I
> think efforts to lower the space requirements of indexes are good.
>
> But if we introduce posting lists into btree indexes, how different are they
> from GIN? It seems to me that if I create a GIN index (using btree_gin), I get
> mostly the same thing you propose, no?
Yes. In general, GIN is a btree with effective duplicate handling plus support for splitting single datums into multiple keys.
This proposal is mostly about porting the duplicate handling from GIN to btree.
> Sure, there are differences - GIN indexes don't handle UNIQUE indexes,
The difference between btree_gin and btree is not only the UNIQUE feature.
1) There is no gingettuple in GIN. GIN supports only bitmap scans, and it's not feasible to add gingettuple to GIN, at least not with the same semantics as in btree.
2) GIN doesn't support multicolumn indexes the way btree does. A multicolumn GIN is more like a set of separate single-column GINs: it doesn't have composite keys.
3) btree_gin can't handle range searches effectively. "a < x < b" would be handled as "a < x" intersected with "x < b", which is extremely inefficient (see the example below). It is possible to fix, but there is no clear proposal yet for how to fit this case into the GIN interface.
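[Editor's note: a rough illustration of point 3; btree_gin is a real contrib extension, but the table and index names below are invented and the exact plan chosen depends on the planner.]

-- illustrative: a btree_gin index evaluating a range predicate
create extension if not exists btree_gin;
create table t (x int);                                   -- hypothetical table
insert into t select i % 1000 from generate_series(1, 1000000) i;
create index t_x_gin on t using gin (x);                  -- btree_gin opclass for int4
-- the range is decomposed into two one-sided conditions that are intersected,
-- rather than a single descent to the lower bound as a plain btree would do
explain analyze select count(*) from t where x > 100 and x < 200;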
> but the compression can only be effective when there are duplicate rows. So
> either the index is not UNIQUE (so the b-tree feature is not needed), or there
> are many updates.
From my observations, users can use btree_gin only in limited cases. They like the compression, but mostly can't use btree_gin because of #1.
> Which brings me to the other benefit of btree indexes - they are designed for
> high concurrency. How much is this going to be affected by introducing the
> posting lists?
I'd note that the current handling of duplicates in PostgreSQL is a hack on top of the original btree design. It is specific to the btree access method in PostgreSQL, not to btrees in general.
Posting lists shouldn't change concurrency much. Currently, in btree you have to lock one page exclusively when inserting a new value.
When a posting list is small and fits in one page, you do a similar thing: exclusively lock one page to insert the new value.
When you have a posting tree, you have to take an exclusive lock on one page of the posting tree.
One could say that concurrency would get worse because the index becomes smaller and has fewer pages, so backends are more likely to contend for the same page. But that argument can be used against any compression and in favour of any bloat.
--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 09/01/2015 11:31 AM, Alexander Korotkov wrote: ... > > Yes, In general GIN is a btree with effective duplicates handling + > support of splitting single datums into multiple keys. > This proposal is mostly porting duplicates handling from GIN to btree. > > Sure, there are differences - GIN indexes don't handle UNIQUE indexes, > > > The difference between btree_gin and btree is not only UNIQUE feature. > 1) There is no gingettuple in GIN. GIN supports only bitmap scans. And > it's not feasible to add gingettuple to GIN. At least with same > semantics as it is in btree. > 2) GIN doesn't support multicolumn indexes in the way btree does. > Multicolumn GIN is more like set of separate singlecolumn GINs: it > doesn't have composite keys. > 3) btree_gin can't effectively handle range searches. "a < x < b" would > be hangle as "a < x" intersect "x < b". That is extremely inefficient. > It is possible to fix. However, there is no clear proposal how to fit > this case into GIN interface, yet. > > but the compression can only be effective when there are duplicate > rows. So either the index is not UNIQUE (so the b-tree feature is > not needed), or there are many updates. > > From my observations users can use btree_gin only in some cases. They > like compression, but can't use btree_gin mostly because of #1. Thanks for the explanation! I'm not that familiar with GIN internals, but this mostly matches my understanding. I have only mentioned UNIQUE because the lack of gettuple() method seems obvious - and it works fine when GIN indexes are used as "bitmap indexes". But you're right - we can't do index only scans on GIN indexes, which is a huge benefit of btree indexes. > > Which brings me to the other benefit of btree indexes - they are > designed for high concurrency. How much is this going to be affected > by introducing the posting lists? > > > I'd notice that current duplicates handling in PostgreSQL is hack over > original btree. It is designed so in btree access method in PostgreSQL, > not btree in general. > Posting lists shouldn't change concurrency much. Currently, in btree you > have to lock one page exclusively when you're inserting new value. > When posting list is small and fits one page you have to do similar > thing: exclusive lock of one page to insert new value. > When you have posting tree, you have to do exclusive lock on one page of > posting tree. OK. > > One can say that concurrency would became worse because index would > become smaller and number of pages would became smaller too. Since > number of pages would be smaller, backends are more likely concur for > the same page. But this argument can be user against any compression and > for any bloat. Which might be a problem for some use cases, but I assume we could add an option disabling this per-index. Probably having it "off" by default, and only enabling the compression explicitly. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Aug 31, 2015 at 12:41 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Now a new B-tree index tuple must be inserted for each table row that we
> index.
> This can cause page splits. Because of MVCC, even a unique index can contain
> duplicates.
> Storing duplicates in a posting list/tree helps to avoid superfluous splits.

I'm glad someone is thinking about this, because it is certainly needed. I
thought about working on it myself, but there is always something else to do.
I should be able to assist with review, though.

> 1. Compatibility.
> It's important to preserve compatibility with older index versions.
> I'm going to change BTREE_VERSION to 3,
> and use the new (posting) features for v3, keeping the old implementation
> for v2.
> Any objections?

It might be better to just have a flag bit for pages that are compressed --
there are IIRC 8 free bits in the B-Tree page special area flags variable. But
no real opinion on this from me, yet. You have plenty of bitspace to work with
to mark B-Tree pages, in any case.

> 2. There are several tricks to handle non-unique keys in B-tree.
> More info in the btree readme (chapter "Differences to the Lehman & Yao
> algorithm").
> In the new version they'll become useless. Am I right?

I think that the L&Y algorithm makes assumptions for the sake of simplicity,
rather than because they really believed that there were real problems. For
example, they say that deletion can occur offline or something along those
lines, even though that's clearly impractical. They say that because they
didn't want to write a paper about deletion within B-Trees, I suppose.

See also my opinion of how they claim to not need read locks [1]. Also, note
that despite the fact that the GIN README mentions "Lehman & Yao style right
links", it doesn't actually do the L&Y trick of avoiding lock coupling -- the
whole point of L&Y -- so that remark is misleading. This must be why B-Tree
has much better concurrency than GIN in practice.

Anyway, the way that I always imagined this would work is a layer "below" the
current implementation. In other words, you could easily have prefix
compression with a prefix that could end at a point within a reference
IndexTuple. It could be any arbitrary point in the second or subsequent
attribute, and would not "care" about the structure of the IndexTuple when it
comes to where attributes begin and end, etc. (although in reality it probably
would end up caring, because of the complexity -- not caring is the ideal
only, at least to me). As Alexander pointed out, GIN does not care about
composite keys. That seems quite different to a GIN posting list (something
that I know way less about, FYI). So I'm really talking about a slightly
different thing -- prefix compression, rather than handling duplicates.

Whether or not you should do prefix compression instead of deduplication is
certainly not clear to me, but it should be considered. Also, I always
imagined that prefix compression would use the highkey as the thing that is
offset for each "real" IndexTuple, because it's there anyway, and that's
simple. However, I suppose that means that duplicate handling can't really
work in a way that gives duplicates a fixed cost, which may be a particularly
important property to you.

> 3. Microvacuum.
> Killed items are marked LP_DEAD and can be deleted from a page at the time
> of insertion.
> Now that's fine, because each item corresponds to a separate TID. But the
> posting list implementation requires another approach. I've got two ideas:
> The first is to mark LP_DEAD only those tuples where all TIDs are not
> visible.
> The second is to add an LP_DEAD flag to each TID in the posting list (tree).
> This requires a bit more space, but allows microvacuum of the posting
> list/tree.

No real opinion on this point, except that I agree that doing something is
necessary.

A couple of further thoughts on this general topic:

* Currently, B-Tree must be able to store at least 3 items on each page, for
the benefit of the L&Y algorithm. You need room for 1 "highkey", plus 2
downlink IndexTuples. Obviously an internal B-Tree page is redundant if you
cannot get to any child page based on the scanKey value differing one way or
the other (so 2 downlinks are a sensible minimum), plus a highkey is usually
needed (just not on the rightmost page). As you probably know, we enforce this
by making sure every IndexTuple is no more than 1/3 of the size that will fit.
You should start thinking about how to deal with this in a world where the
physical size could actually be quite variable. The solution is probably to
simply pretend that every IndexTuple is its original size. This applies to
both prefix compression and duplicate suppression, I suppose.

* Since everything is aligned within B-Tree, it's probably worth considering
the alignment boundaries when doing prefix compression, if you want to go that
way. We can probably imagine a world where alignment is not required for
B-Tree, which would work on x86 machines, but I can't see it happening soon.
It isn't worth compressing unless it compresses enough to cross an "alignment
boundary", where we're not actually obliged to store as much data on disk.
This point may be obvious, not sure.

[1] http://www.postgresql.org/message-id/flat/CAM3SWZT-T9o_dchK8E4_YbKQ+LPJTpd89E6dtPwhXnBV_5NE3Q@mail.gmail.com#CAM3SWZT-T9o_dchK8E4_YbKQ+LPJTpd89E6dtPwhXnBV_5NE3Q@mail.gmail.com

--
Peter Geoghegan
01.09.2015 21:23, Peter Geoghegan:
> I'm glad someone is thinking about this, because it is certainly needed. I
> thought about working on it myself, but there is always something else to do.
> I should be able to assist with review, though.

Thank you)

> It might be better to just have a flag bit for pages that are compressed --
> there are IIRC 8 free bits in the B-Tree page special area flags variable.

Hmm.. If we are talking about storing duplicates in posting lists (and trees)
as in GIN, I don't see a way to apply it to some pages while not applying it
to others. See the notes below.

> I think that the L&Y algorithm makes assumptions for the sake of simplicity,
> rather than because they really believed that there were real problems.
> ...

Yes, thanks for the extensive explanation. I mean such tricks as moving right
in _bt_findinsertloc(), for example:

    /*----------
     * If we will need to split the page to put the item on this page,
     * check whether we can put the tuple somewhere to the right,
     * instead.  Keep scanning right until we
     *      (a) find a page with enough free space,
     *      (b) reach the last page where the tuple can legally go, or
     *      (c) get tired of searching.
     * (c) is not flippant; it is important because if there are many
     * pages' worth of equal keys, it's better to split one of the early
     * pages than to scan all the way to the end of the run of equal keys
     * on every insert.  We implement "get tired" as a random choice,
     * since stopping after scanning a fixed number of pages wouldn't work
     * well (we'd never reach the right-hand side of previously split
     * pages).  Currently the probability of moving right is set at 0.99,
     * which may seem too high to change the behavior much, but it does an
     * excellent job of preventing O(N^2) behavior with many equal keys.
     *----------
     */

If there are no multiple tuples with the same key, we shouldn't care about
this at all. It would be possible to skip these steps in the "effective B-tree
implementation". That's why I want to change btree_version.

> So I'm really talking about a slightly different thing -- prefix compression,
> rather than handling duplicates. Whether or not you should do prefix
> compression instead of deduplication is certainly not clear to me, but it
> should be considered.

You're right, those are two different techniques.

1. Effective storage of duplicates, which I propose, works with equal keys and
allows us to avoid repeating them. Index tuples are stored like this:

IndexTupleData + Attrs (key) | IndexTupleData + Attrs (key) | IndexTupleData + Attrs (key)

If all Attrs are equal, it seems reasonable not to repeat them, so we can
store them in the following structure:

MetaData + Attrs (key) | IndexTupleData | IndexTupleData | IndexTupleData

This is a posting list. It doesn't require significant changes to the index
page layout, because we can use an ordinary IndexTupleData for the meta
information. Each IndexTupleData has a fixed size, so it's easy to handle the
posting list as an array.

2. Prefix compression handles different keys and somehow compresses them. I
think that it will require non-trivial changes to the btree index tuple
representation. Furthermore, any compression leads to extra computation. For
now, I don't have a clear idea of how to implement this technique.

> * Currently, B-Tree must be able to store at least 3 items on each page, for
> the benefit of the L&Y algorithm. ... As you probably know, we enforce this
> by making sure every IndexTuple is no more than 1/3 of the size that will
> fit.

That is the point where a too-big posting list turns into a posting tree. But
I think that in the first patch I'll do it another way: just by splitting a
long posting list into 2 lists of appropriate length.

> * Since everything is aligned within B-Tree, it's probably worth considering
> the alignment boundaries when doing prefix compression, if you want to go
> that way. ... It isn't worth compressing unless it compresses enough to cross
> an "alignment boundary", where we're not actually obliged to store as much
> data on disk.

That is another reason why I doubt prefix compression, whereas effective
duplicate storage doesn't have this problem.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
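[Editor's note: the "before" layout described above, where every leaf item carries its own copy of the key, can be seen with pageinspect on any duplicate-heavy index; the index name here is the hypothetical one from the earlier example.]

-- illustrative: each leaf item of an unpatched btree repeats the key in 'data'
create extension if not exists pageinspect;
select itemoffset, ctid, itemlen, data
from bt_page_items('events_status_idx', 1)   -- block 0 is the metapage; block 1 is typically the first leaf
limit 10;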
On Thu, Sep 3, 2015 at 8:35 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
>> * Since everything is aligned within B-Tree, it's probably worth considering
>> the alignment boundaries when doing prefix compression, if you want to go
>> that way. ...
>
> That is another reason why I doubt prefix compression, whereas effective
> duplicate storage doesn't have this problem.

Okay. That sounds reasonable. I think duplicate handling is a good project.

A good learning tool for Postgres B-Trees -- or at least one of the better
ones -- is my amcheck tool. See:

https://github.com/petergeoghegan/postgres/tree/amcheck

This is a tool for verifying that B-Tree invariants hold, which is loosely
based on pageinspect. It checks that certain conditions hold for B-Trees. A
simple example is that all items on each page be in the correct, logical
order. Some invariants checked are far more complicated, though, and span
multiple pages or multiple levels. See the source code for exact details.

This tool works well when running the regression tests (see stress.sql -- I
used it with pgbench), with no problems reported last I checked. It often only
needs light locks on relations, and single shared locks on buffers. (Buffers
are copied to local memory for the tool to operate on, much like
contrib/pageinspect.)

While I have yet to formally submit amcheck to a CF (I once asked for input on
the goals for the project on -hackers), the comments are fairly comprehensive,
and it wouldn't be too hard to adapt this to guide your work on duplicate
handling. Maybe it'll happen for 9.6. Feedback appreciated.

The tool calls _bt_compare() for many things currently, but doesn't care about
many lower-level details, which is (very roughly speaking) the level that
duplicate handling will work at. You aren't actually proposing to change
anything about the fundamental structure that B-Tree indexes have, so the tool
could be quite useful and low-effort for debugging your code during
development.

Debugging this stuff is sometimes like keyhole surgery. If you could just see
at/get to the structure that you care about, it would be 10 times easier.
Hopefully this tool makes it easier to identify problems.

--
Peter Geoghegan
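[Editor's note: a hedged usage sketch only; the exact function names exposed by the linked branch are not shown in the thread, so bt_index_check(regclass) here is an assumption, and the index name is the hypothetical one from the earlier example.]

-- hypothetical invocation of the btree checker described above
create extension amcheck;
-- verify btree invariants (item ordering, page relationships) for one index,
-- taking only light relation locks and shared buffer locks
select bt_index_check('events_status_idx'::regclass);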
On Sun, Sep 27, 2015 at 4:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Debugging this stuff is sometimes like keyhole surgery. If you could just see
> at/get to the structure that you care about, it would be 10 times easier.
> Hopefully this tool makes it easier to identify problems.

I should add that the way that the L&Y technique works, and the way that
Postgres code is generally very robust/defensive, can make direct testing a
difficult thing. I have seen cases where a completely messed up B-Tree still
gave correct results most of the time, and was just slower. That can happen,
for example, because the "move right" thing results in a degenerate linear
scan of the entire index. The comparisons in the internal pages were totally
messed up, but it "didn't matter" once a scan could get to leaf pages and
could move right and find the value that way.

I wrote amcheck because I thought it was scary how B-Tree indexes could be
*completely* messed up without it being obvious; what hope is there of a test
finding a subtle problem in their structure, then? Testing the invariants
directly seemed like the only way to have a chance of not introducing bugs
when adding new stuff to the B-Tree code. I believe that adding optimizations
to the B-Tree code will be important in the next couple of years, and there is
no other way to approach it IMV.

--
Peter Geoghegan
31.08.2015 10:41, Anastasia Lubennikova:
> ...
I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists the same way as GIN does it.
Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
It seems that backward compatibility works well without any changes. But I haven't tested it properly yet.
Here are some test results. They were obtained with the test functions test_btbuild and test_ginbuild, which you can find in the attached sql file.
i - number of distinct values in the index. So i=1 means that all rows have the same key, and i=10000000 means that all keys are different.
The other columns contain the index size (MB).
        i | B-tree Old | B-tree New |         GIN
        1 | 214,234375 | 87,7109375 |  10,2109375
       10 | 214,234375 | 87,7109375 |    10,71875
      100 | 214,234375 |    87,4375 |   15,640625
     1000 | 214,234375 | 86,2578125 |   31,296875
    10000 | 214,234375 |  78,421875 | 104,3046875
   100000 | 214,234375 |  65,359375 |   49,078125
  1000000 | 214,234375 |  90,140625 | 106,8203125
 10000000 | 214,234375 | 214,234375 |    534,0625
You can note that the last row contains the same index sizes for B-tree, which is quite logical - there is no compression if all the keys are distinct.
The other cases look really nice to me.
Next thing to say is that I haven't implemented posting list compression yet, so there is still potential to decrease the size of the compressed btree further.
I'm almost sure there are still some tiny bugs and missing functions, but on the whole the patch is ready for testing.
I'd like to get feedback from testing the patch on some real datasets. Any bug reports and suggestions are welcome.
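[Editor's note: the attached test file is not reproduced here; a rough stand-alone equivalent of one build test, with an invented table name and i = 1000 distinct values over 10M rows, might look like this.]

-- build a 10M-row table with 1000 distinct key values and measure the index
create table btbuild_test (val int);
insert into btbuild_test select n % 1000 from generate_series(1, 10000000) n;
create index btbuild_test_idx on btbuild_test using btree (val);
select pg_size_pretty(pg_relation_size('btbuild_test_idx'));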
Here are a couple of useful queries to inspect the data inside the index pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral bt_page_items('idx', n) as bt;
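[Editor's note: the queries above look only at page 1; a sketch of the same idea extended to every page of the index follows. 'idx' is a placeholder name, and page 0 (the metapage) is skipped.]

-- per-page stats for the whole index
select n, bt.type, bt.live_items, bt.avg_item_size, bt.free_size
from generate_series(1, pg_relation_size('idx') / current_setting('block_size')::int - 1) as n,
     lateral bt_page_stats('idx', n::int) as bt;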
And finally, the list of items I'm going to complete in the near future:
1. Add a storage parameter 'enable_compression' for the btree access method, which specifies whether the index compresses duplicates. Default is 'off' (see the sketch below).
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with the compressed btree, which is obviously not what we want.
4. Clean up the code and comments, add related documentation.
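[Editor's note: purely a sketch of how item 1 could look from the user's side; this syntax does not exist yet, and the parameter name is simply the one proposed above.]

-- hypothetical: per-index opt-in to duplicate compression via a reloption
create index events_status_idx on events (status) with (enable_compression = on);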
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 28 January 2016 at 14:06, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
...
This doesn't apply cleanly against current git head. Have you caught up past commit 65c5fcd35?
Thom
28.01.2016 18:12, Thom Brown:
> This doesn't apply cleanly against current git head. Have you caught up past
> commit 65c5fcd35?

Thank you for the notice. New patch is attached.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 28 January 2016 at 16:12, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> Thank you for the notice. New patch is attached.
Thanks for the quick rebase.
Okay, a quick check with pgbench:
CREATE INDEX ON pgbench_accounts(bid);
Timing
Scale: master / patch
100: 10657ms / 13555ms (rechecked and got 9745ms)
500: 56909ms / 56985ms
Size
Scale: master / patch
100: 214MB / 87MB (40.7%)
500: 1071MB / 437MB (40.8%)
No performance issues from what I can tell.
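[Editor's note: for reference, a minimal way to reproduce the size comparison above, assuming a database initialised with pgbench at scale 100 (10 million rows in pgbench_accounts, 100 distinct bid values); the index name is PostgreSQL's default.]

-- build the index used in the timing test and check its on-disk size
create index on pgbench_accounts (bid);
select pg_size_pretty(pg_relation_size('pgbench_accounts_bid_idx'));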
I'm surprised that efficiencies can't be realised beyond this point. Your results show a sweet spot at around 1000 / 10000000, with it getting slightly worse beyond that. I kind of expected a lot of efficiency where all the values are the same, but perhaps that's due to my lack of understanding regarding the way they're being stored.
Thom
On Thu, Jan 28, 2016 at 9:03 AM, Thom Brown <thom@linux.com> wrote:
> I'm surprised that efficiencies can't be realised beyond this point. Your
> results show a sweet spot at around 1000 / 10000000, with it getting slightly
> worse beyond that. I kind of expected a lot of efficiency where all the
> values are the same, but perhaps that's due to my lack of understanding
> regarding the way they're being stored.

I think that you'd need an I/O bound workload to see significant benefits.
That seems unsurprising. I believe that random I/O from index writes is a big
problem for us.

--
Peter Geoghegan
On 28 January 2016 at 17:09, Peter Geoghegan <pg@heroku.com> wrote:
> I think that you'd need an I/O bound workload to see significant benefits.
> That seems unsurprising. I believe that random I/O from index writes is a big
> problem for us.

I was thinking more from the point of view of the index size. An index
containing 10 million duplicate values is around 40% of the size of an index
with 10 million unique values.

Thom
On 28 January 2016 at 17:03, Thom Brown <thom@linux.com> wrote:
...
Okay, now for some badness. I've restored a database containing 2 tables, one 318MB, another 24kB. The 318MB table contains 5 million rows with a sequential id column. I get a problem if I try to delete many rows from it:
# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
The query completes, but I get this message a lot before it does.
This happens even if I drop the primary key and foreign key constraints, so somehow the memory usage has massively increased with this patch.
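[Editor's note: a rough reconstruction of the failing case based only on the description above; the column layout and filler width are guesses, while the table name and row count come from the report.]

-- approximate reproduction: ~5M rows with a sequential id, then a bulk delete
create table contacts (id serial primary key, payload text);
insert into contacts (payload) select repeat('x', 40) from generate_series(1, 5000000);
delete from contacts where id % 3 != 0;   -- the statement that produced the "out of shared memory" warnings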
Thom
28.01.2016 20:03, Thom Brown:
> I'm surprised that efficiencies can't be realised beyond this point. Your
> results show a sweet spot at around 1000 / 10000000, with it getting slightly
> worse beyond that. I kind of expected a lot of efficiency where all the
> values are the same, but perhaps that's due to my lack of understanding
> regarding the way they're being stored.
Thank you for the prompt reply. I see what you're confused about. I'll try to clarify it.
First of all, what is implemented in the patch is not actually compression. It's more about index page layout changes to compact ItemPointers (TIDs).
Instead of TID + key, TID + key, ... we now store META + key + a list of TIDs (also known as a posting list).
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
TID (N item pointers, posting list offset) - this is the meta information. So, we have to store this meta information in addition to useful data.
The next point is the requirement of having a minimum of three tuples per page: we need at least two tuples to point to the children, plus the highkey.
This requirement leads to the limit on the maximum index tuple size (one third of the page).
That's why we have to store more meta information than meets the eye.
For example, say we have 100000 duplicates of the same key. It seems the compression should be really significant.
Something like 1 meta + 1 key instead of 100000 keys --> 6 bytes (the size of the meta TID) + key size instead of 600000.
But we have to split one huge posting list into smaller ones so that each fits into the index page.
It depends on the key size, of course. As I can see from pageinspect, an index on a single integer key has to split the tuples into pieces of size 2704 bytes, containing 447 TIDs per posting list.
So we have 1 meta + 1 key per 447 keys. As you can see, that is much less impressive than expected.
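[Editor's note: a back-of-envelope check of those numbers; it assumes an 8-byte IndexTupleData header, a 4-byte integer key, and 6-byte item pointers, and ignores alignment, so the exact overhead in the patch may differ slightly.]

-- how many item pointers fit in one 2704-byte chunk, roughly
select (2704 - 8 - 4) / 6 as approx_tids_per_chunk;   -- gives 448, close to the 447 reported by pageinspect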
There is also the idea of posting trees in GIN: the key is stored just once, and a posting list which doesn't fit into a page becomes a tree.
You can find a great article about it here: http://www.cybertec.at/2013/03/gin-just-an-index-type/
But I think that it's not the best way for the btree AM, because a posting tree isn't designed to handle concurrent insertions.
As I mentioned before, I'm going to implement prefix compression of the posting lists, which should be efficient and quite simple, since it's already implemented in GIN. You can find the presentation about it here: https://www.pgcon.org/2014/schedule/events/698.en.html
On 28 January 2016 at 16:12, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:28.01.2016 18:12, Thom Brown:Thank you for the notice. New patch is attached.On 28 January 2016 at 14:06, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:31.08.2015 10:41, Anastasia Lubennikova:Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in B-tree index.
The main idea is to implement posting lists and posting trees for B-tree index pages as it's already done for GIN.
In a nutshell, effective storing of duplicates in GIN is organised as follows.
Index stores single index tuple for each unique key. That index tuple points to posting list which contains pointers to heap tuples (TIDs). If too many rows having the same key, multiple pages are allocated for the TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme and articles.
It also makes possible to apply compression algorithm to posting list/tree and significantly decrease index size. Read more in presentation (part 1).
Now new B-tree index tuple must be inserted for each table row that we index.
It can possibly cause page split. Because of MVCC even unique index could contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.
I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists the same way as GIN does it.
Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
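To make the layout description above a bit more concrete, here is a rough, self-contained C sketch of how a posting tuple could look on a leaf page. The struct and field names are invented for illustration only; they are not the patch's actual definitions.

#include <stdint.h>

/*
 * Illustration only: a "meta" item pointer whose 6 bytes are reused to
 * store the number of heap TIDs and the offset of the posting list,
 * instead of a (block, offset) pair pointing to a single heap tuple.
 */
typedef struct SketchPostingMeta
{
    uint16_t    n_tids;         /* how many heap TIDs follow the key */
    uint16_t    unused;
    uint16_t    posting_offset; /* byte offset of the posting list
                                 * from the start of the index tuple */
} SketchPostingMeta;

/*
 * Conceptual layout of one compressed leaf tuple:
 *
 *   [ SketchPostingMeta | key value | TID, TID, TID, ... (n_tids items) ]
 *
 * whereas an uncompressed tuple is simply [ heap TID | key value ],
 * repeated once per duplicate.
 */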
It seems that backward compatibility works well without any changes. But I haven't tested it properly yet.
Here are some test results. They are obtained by test functions test_btbuild and test_ginbuild, which you can find in attached sql file.
i - number of distinct values in the index. So i=1 means that all rows have the same key, and i=10000000 means that all keys are different.
The other columns contain the index size (MB).
i          B-tree Old    B-tree New    GIN
1          214.234375    87.7109375    10.2109375
10         214.234375    87.7109375    10.71875
100        214.234375    87.4375       15.640625
1000       214.234375    86.2578125    31.296875
10000      214.234375    78.421875     104.3046875
100000     214.234375    65.359375     49.078125
1000000    214.234375    90.140625     106.8203125
10000000   214.234375    214.234375    534.0625
You can note that the last row contains the same index sizes for old and new B-tree, which is quite logical - there is no compression if all the keys are distinct.
The other cases look really nice to me.
The next thing to say is that I haven't implemented posting list compression yet, so there is still potential to decrease the compressed btree's size further.
I'm almost sure there are still some tiny bugs and missing functions, but on the whole the patch is ready for testing.
I'd like to get a feedback about the patch testing on some real datasets. Any bug reports and suggestions are welcome.
Here are a couple of useful queries to inspect the data inside the index pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral bt_page_items('idx', n) as bt;
And at last, the list of items I'm going to complete in the near future:
1. Add a storage parameter 'enable_compression' for the btree access method, which specifies whether the index compresses duplicates. The default is 'off'.
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with the compressed btree, which is obviously not what we want.
4. Clean the code and comments, add related documentation.

This doesn't apply cleanly against current git head. Have you caught up past commit 65c5fcd35?

Thanks for the quick rebase. Okay, a quick check with pgbench:

CREATE INDEX ON pgbench_accounts(bid);

Timing
Scale: master / patch
100: 10657ms / 13555ms (rechecked and got 9745ms)
500: 56909ms / 56985ms

Size
Scale: master / patch
100: 214MB / 87MB (40.7%)
500: 1071MB / 437MB (40.8%)

No performance issues from what I can tell.

I'm surprised that efficiencies can't be realised beyond this point. Your results show a sweet spot at around 1000 / 10000000, with it getting slightly worse beyond that. I kind of expected a lot of efficiency where all the values are the same, but perhaps that's due to my lack of understanding regarding the way they're being stored.
Thank you for the prompt reply. I see what you're confused about. I'll try to clarify it.
First of all, what is implemented in the patch is not actually compression. It's more about index page layout changes to compact ItemPointers (TIDs).
Instead of TID+key, TID+key, ..., we now store META + key + list of TIDs (also known as a posting list).
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
TID (N item pointers, posting list offset) - this is the meta information. So, we have to store this meta information in addition to useful data.
The next point is the requirement to fit at least three tuples on a page: we need at least two tuples to point to the children, plus the high key.
This requirement leads to the limit on the maximum index tuple size.
/*
 * Maximum size of a btree index entry, including its tuple header.
 *
 * We actually need to be able to fit three items on every page,
 * so restrict any one item to 1/3 the per-page available space.
 */
#define BTMaxItemSize(page) \
	MAXALIGN_DOWN((PageGetPageSize(page) - \
				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)

Although, I thought just now that this size could be increased for compressed tuples, at least for leaf pages.
That's the reason why we have to store more meta information than meets the eye.
For example, suppose we have 100000 duplicates of the same key. It seems that compression should be really significant:
something like 1 meta + 1 key instead of 100000 keys --> 6 bytes (the size of the meta TID) + keysize instead of 600000.
But we have to split one huge posting list into smaller ones so that each fits into an index page.
It depends on the key size, of course. As I can see from pageinspect, the index on a single integer key has to split the tuples into pieces of 2704 bytes, each containing 447 TIDs in its posting list.
So we have 1 meta + 1 key per 447 keys. As you can see, that is really less impressive than expected.
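As a sanity check of the 2704-byte / 447-TID figures above, here is a tiny standalone calculation. The 16-byte overhead value is an assumption of mine (index tuple header plus an integer key plus alignment padding), not a number taken from the patch.

#include <stdio.h>

int main(void)
{
    const int max_item = 2704;  /* BTMaxItemSize for an 8 kB page            */
    const int tid_size = 6;     /* sizeof(ItemPointerData)                   */
    const int overhead = 16;    /* assumed: tuple header + int key + padding */

    /* how many 6-byte TIDs fit in one max-sized posting tuple */
    printf("~%d TIDs fit in one posting tuple\n",
           (max_item - overhead) / tid_size);   /* prints ~448 */
    return 0;
}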
There is the idea of posting trees in GIN: the key is stored just once, and a posting list that doesn't fit into the page becomes a tree.
You can find an excellent article about it here: http://www.cybertec.at/2013/03/gin-just-an-index-type/
But I don't think that it's the best way for the btree am, because a posting tree is not designed to handle concurrent insertions.
As I mentioned before, I'm going to implement prefix compression of the posting list, which should be efficient and quite simple, since it's already implemented in GIN. You can find the presentation about it here: https://www.pgcon.org/2014/schedule/events/698.en.html
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
I tested this patch on x64 and ARM servers for a few hours today. The only problem I could find is that INSERT works considerably slower after applying the patch. Besides that, everything looks fine - no crashes, tests pass, memory doesn't seem to leak, etc.

> Okay, now for some badness. I've restored a database containing 2
> tables, one 318MB, another 24kB. The 318MB table contains 5 million
> rows with a sequential id column. I get a problem if I try to delete
> many rows from it:
> # delete from contacts where id % 3 != 0 ;
> WARNING: out of shared memory
> WARNING: out of shared memory
> WARNING: out of shared memory

I didn't manage to reproduce this. Thom, could you describe the exact steps to reproduce this issue, please?
On 29 January 2016 at 15:47, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: > I tested this patch on x64 and ARM servers for a few hours today. The > only problem I could find is that INSERT works considerably slower after > applying a patch. Beside that everything looks fine - no crashes, tests > pass, memory doesn't seem to leak, etc. > >> Okay, now for some badness. I've restored a database containing 2 >> tables, one 318MB, another 24kB. The 318MB table contains 5 million >> rows with a sequential id column. I get a problem if I try to delete >> many rows from it: >> # delete from contacts where id % 3 != 0 ; >> WARNING: out of shared memory >> WARNING: out of shared memory >> WARNING: out of shared memory > > I didn't manage to reproduce this. Thom, could you describe exact steps > to reproduce this issue please? Sure, I used my pg_rep_test tool to create a primary (pg_rep_test -r0), which creates an instance with a custom config, which is as follows: shared_buffers = 8MB max_connections = 7 wal_level = 'hot_standby' cluster_name = 'primary' max_wal_senders = 3 wal_keep_segments = 6 Then create a pgbench data set (I didn't originally use pgbench, but you can get the same results with it): createdb -p 5530 pgbench pgbench -p 5530 -i -s 100 pgbench And delete some stuff: thom@swift:~/Development/test$ psql -p 5530 pgbench Timing is on. psql (9.6devel) Type "help" for help. ➤ psql://thom@[local]:5530/pgbench # DELETE FROM pgbench_accounts WHERE aid % 3 != 0; WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory ... WARNING: out of shared memory WARNING: out of shared memory DELETE 6666667 Time: 22218.804 ms There were 358 lines of that warning message. I don't get these messages without the patch. Thom
29.01.2016 19:01, Thom Brown: > On 29 January 2016 at 15:47, Aleksander Alekseev > <a.alekseev@postgrespro.ru> wrote: >> I tested this patch on x64 and ARM servers for a few hours today. The >> only problem I could find is that INSERT works considerably slower after >> applying a patch. Beside that everything looks fine - no crashes, tests >> pass, memory doesn't seem to leak, etc. Thank you for testing. I rechecked that, and insertions are really very very very slow. It seems like a bug. >>> Okay, now for some badness. I've restored a database containing 2 >>> tables, one 318MB, another 24kB. The 318MB table contains 5 million >>> rows with a sequential id column. I get a problem if I try to delete >>> many rows from it: >>> # delete from contacts where id % 3 != 0 ; >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> WARNING: out of shared memory >> I didn't manage to reproduce this. Thom, could you describe exact steps >> to reproduce this issue please? > Sure, I used my pg_rep_test tool to create a primary (pg_rep_test > -r0), which creates an instance with a custom config, which is as > follows: > > shared_buffers = 8MB > max_connections = 7 > wal_level = 'hot_standby' > cluster_name = 'primary' > max_wal_senders = 3 > wal_keep_segments = 6 > > Then create a pgbench data set (I didn't originally use pgbench, but > you can get the same results with it): > > createdb -p 5530 pgbench > pgbench -p 5530 -i -s 100 pgbench > > And delete some stuff: > > thom@swift:~/Development/test$ psql -p 5530 pgbench > Timing is on. > psql (9.6devel) > Type "help" for help. > > > ➤ psql://thom@[local]:5530/pgbench > > # DELETE FROM pgbench_accounts WHERE aid % 3 != 0; > WARNING: out of shared memory > WARNING: out of shared memory > WARNING: out of shared memory > WARNING: out of shared memory > WARNING: out of shared memory > WARNING: out of shared memory > WARNING: out of shared memory > ... > WARNING: out of shared memory > WARNING: out of shared memory > DELETE 6666667 > Time: 22218.804 ms > > There were 358 lines of that warning message. I don't get these > messages without the patch. > > Thom Thank you for this report. I tried to reproduce it, but I couldn't. Debug will be much easier now. I hope I'll fix these issueswithin the next few days. BTW, I found a dummy mistake, the previous patch contains some unrelated changes. I fixed it in the new version (attached). -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 29 January 2016 at 16:50, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > 29.01.2016 19:01, Thom Brown: >> >> On 29 January 2016 at 15:47, Aleksander Alekseev >> <a.alekseev@postgrespro.ru> wrote: >>> >>> I tested this patch on x64 and ARM servers for a few hours today. The >>> only problem I could find is that INSERT works considerably slower after >>> applying a patch. Beside that everything looks fine - no crashes, tests >>> pass, memory doesn't seem to leak, etc. > > Thank you for testing. I rechecked that, and insertions are really very very > very slow. It seems like a bug. > >>>> Okay, now for some badness. I've restored a database containing 2 >>>> tables, one 318MB, another 24kB. The 318MB table contains 5 million >>>> rows with a sequential id column. I get a problem if I try to delete >>>> many rows from it: >>>> # delete from contacts where id % 3 != 0 ; >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>> >>> I didn't manage to reproduce this. Thom, could you describe exact steps >>> to reproduce this issue please? >> >> Sure, I used my pg_rep_test tool to create a primary (pg_rep_test >> -r0), which creates an instance with a custom config, which is as >> follows: >> >> shared_buffers = 8MB >> max_connections = 7 >> wal_level = 'hot_standby' >> cluster_name = 'primary' >> max_wal_senders = 3 >> wal_keep_segments = 6 >> >> Then create a pgbench data set (I didn't originally use pgbench, but >> you can get the same results with it): >> >> createdb -p 5530 pgbench >> pgbench -p 5530 -i -s 100 pgbench >> >> And delete some stuff: >> >> thom@swift:~/Development/test$ psql -p 5530 pgbench >> Timing is on. >> psql (9.6devel) >> Type "help" for help. >> >> >> ➤ psql://thom@[local]:5530/pgbench >> >> # DELETE FROM pgbench_accounts WHERE aid % 3 != 0; >> WARNING: out of shared memory >> WARNING: out of shared memory >> WARNING: out of shared memory >> WARNING: out of shared memory >> WARNING: out of shared memory >> WARNING: out of shared memory >> WARNING: out of shared memory >> ... >> WARNING: out of shared memory >> WARNING: out of shared memory >> DELETE 6666667 >> Time: 22218.804 ms >> >> There were 358 lines of that warning message. I don't get these >> messages without the patch. >> >> Thom > > > Thank you for this report. > I tried to reproduce it, but I couldn't. Debug will be much easier now. > > I hope I'll fix these issueswithin the next few days. > > BTW, I found a dummy mistake, the previous patch contains some unrelated > changes. I fixed it in the new version (attached). Thanks. Well I've tested this latest patch, and the warnings are no longer generated. However, the index sizes show that the patch doesn't seem to be doing its job, so I'm wondering if you removed too much from it. Thom
29.01.2016 20:43, Thom Brown: > On 29 January 2016 at 16:50, Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: >> 29.01.2016 19:01, Thom Brown: >>> On 29 January 2016 at 15:47, Aleksander Alekseev >>> <a.alekseev@postgrespro.ru> wrote: >>>> I tested this patch on x64 and ARM servers for a few hours today. The >>>> only problem I could find is that INSERT works considerably slower after >>>> applying a patch. Beside that everything looks fine - no crashes, tests >>>> pass, memory doesn't seem to leak, etc. >> Thank you for testing. I rechecked that, and insertions are really very very >> very slow. It seems like a bug. >> >>>>> Okay, now for some badness. I've restored a database containing 2 >>>>> tables, one 318MB, another 24kB. The 318MB table contains 5 million >>>>> rows with a sequential id column. I get a problem if I try to delete >>>>> many rows from it: >>>>> # delete from contacts where id % 3 != 0 ; >>>>> WARNING: out of shared memory >>>>> WARNING: out of shared memory >>>>> WARNING: out of shared memory >>>> I didn't manage to reproduce this. Thom, could you describe exact steps >>>> to reproduce this issue please? >>> Sure, I used my pg_rep_test tool to create a primary (pg_rep_test >>> -r0), which creates an instance with a custom config, which is as >>> follows: >>> >>> shared_buffers = 8MB >>> max_connections = 7 >>> wal_level = 'hot_standby' >>> cluster_name = 'primary' >>> max_wal_senders = 3 >>> wal_keep_segments = 6 >>> >>> Then create a pgbench data set (I didn't originally use pgbench, but >>> you can get the same results with it): >>> >>> createdb -p 5530 pgbench >>> pgbench -p 5530 -i -s 100 pgbench >>> >>> And delete some stuff: >>> >>> thom@swift:~/Development/test$ psql -p 5530 pgbench >>> Timing is on. >>> psql (9.6devel) >>> Type "help" for help. >>> >>> >>> ➤ psql://thom@[local]:5530/pgbench >>> >>> # DELETE FROM pgbench_accounts WHERE aid % 3 != 0; >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> ... >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> DELETE 6666667 >>> Time: 22218.804 ms >>> >>> There were 358 lines of that warning message. I don't get these >>> messages without the patch. >>> >>> Thom >> Thank you for this report. >> I tried to reproduce it, but I couldn't. Debug will be much easier now. >> >> I hope I'll fix these issueswithin the next few days. >> >> BTW, I found a dummy mistake, the previous patch contains some unrelated >> changes. I fixed it in the new version (attached). > Thanks. Well I've tested this latest patch, and the warnings are no > longer generated. However, the index sizes show that the patch > doesn't seem to be doing its job, so I'm wondering if you removed too > much from it. Huh, this patch seems to be enchanted) It works fine for me. Did you perform "make distclean"? Anyway, I'll send a new version soon. I just write here to say that I do not disappear and I do remember about the issue. I even almost fixed the insert speed problem. But I'm very very busy this week. I'll send an updated patch next week as soon as possible. Thank you for attention to this work. -- Anastasia Lubennikova Postgres Professional:http://www.postgrespro.com The Russian Postgres Company
On 2 February 2016 at 11:47, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > > > 29.01.2016 20:43, Thom Brown: > >> On 29 January 2016 at 16:50, Anastasia Lubennikova >> <a.lubennikova@postgrespro.ru> wrote: >>> >>> 29.01.2016 19:01, Thom Brown: >>>> >>>> On 29 January 2016 at 15:47, Aleksander Alekseev >>>> <a.alekseev@postgrespro.ru> wrote: >>>>> >>>>> I tested this patch on x64 and ARM servers for a few hours today. The >>>>> only problem I could find is that INSERT works considerably slower >>>>> after >>>>> applying a patch. Beside that everything looks fine - no crashes, tests >>>>> pass, memory doesn't seem to leak, etc. >>> >>> Thank you for testing. I rechecked that, and insertions are really very >>> very >>> very slow. It seems like a bug. >>> >>>>>> Okay, now for some badness. I've restored a database containing 2 >>>>>> tables, one 318MB, another 24kB. The 318MB table contains 5 million >>>>>> rows with a sequential id column. I get a problem if I try to delete >>>>>> many rows from it: >>>>>> # delete from contacts where id % 3 != 0 ; >>>>>> WARNING: out of shared memory >>>>>> WARNING: out of shared memory >>>>>> WARNING: out of shared memory >>>>> >>>>> I didn't manage to reproduce this. Thom, could you describe exact steps >>>>> to reproduce this issue please? >>>> >>>> Sure, I used my pg_rep_test tool to create a primary (pg_rep_test >>>> -r0), which creates an instance with a custom config, which is as >>>> follows: >>>> >>>> shared_buffers = 8MB >>>> max_connections = 7 >>>> wal_level = 'hot_standby' >>>> cluster_name = 'primary' >>>> max_wal_senders = 3 >>>> wal_keep_segments = 6 >>>> >>>> Then create a pgbench data set (I didn't originally use pgbench, but >>>> you can get the same results with it): >>>> >>>> createdb -p 5530 pgbench >>>> pgbench -p 5530 -i -s 100 pgbench >>>> >>>> And delete some stuff: >>>> >>>> thom@swift:~/Development/test$ psql -p 5530 pgbench >>>> Timing is on. >>>> psql (9.6devel) >>>> Type "help" for help. >>>> >>>> >>>> ➤ psql://thom@[local]:5530/pgbench >>>> >>>> # DELETE FROM pgbench_accounts WHERE aid % 3 != 0; >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> ... >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> DELETE 6666667 >>>> Time: 22218.804 ms >>>> >>>> There were 358 lines of that warning message. I don't get these >>>> messages without the patch. >>>> >>>> Thom >>> >>> Thank you for this report. >>> I tried to reproduce it, but I couldn't. Debug will be much easier now. >>> >>> I hope I'll fix these issueswithin the next few days. >>> >>> BTW, I found a dummy mistake, the previous patch contains some unrelated >>> changes. I fixed it in the new version (attached). >> >> Thanks. Well I've tested this latest patch, and the warnings are no >> longer generated. However, the index sizes show that the patch >> doesn't seem to be doing its job, so I'm wondering if you removed too >> much from it. > > > Huh, this patch seems to be enchanted) It works fine for me. Did you perform > "make distclean"? Yes. Just tried it again: git clean -fd git stash make distclean patch -p1 < ~/Downloads/btree_compression_2.0.patch ../dopg.sh (script I've always used to build with) pg_ctl start createdb pgbench pgbench -i -s 100 pgbench $ psql pgbench Timing is on. psql (9.6devel) Type "help" for help. 
➤ psql://thom@[local]:5488/pgbench

# \di+
                                  List of relations
 Schema |         Name          | Type  | Owner |      Table       |  Size  | Description
--------+-----------------------+-------+-------+------------------+--------+-------------
 public | pgbench_accounts_pkey | index | thom  | pgbench_accounts | 214 MB |
 public | pgbench_branches_pkey | index | thom  | pgbench_branches | 24 kB  |
 public | pgbench_tellers_pkey  | index | thom  | pgbench_tellers  | 48 kB  |
(3 rows)

Previously, this would show an index size of 87MB for pgbench_accounts_pkey.

> Anyway, I'll send a new version soon.
> I just write here to say that I do not disappear and I do remember about the
> issue.
> I even almost fixed the insert speed problem. But I'm very very busy this
> week.
> I'll send an updated patch next week as soon as possible.

Thanks.

> Thank you for attention to this work.

Thanks for your awesome patches.

Thom
On Tue, Feb 2, 2016 at 3:59 AM, Thom Brown <thom@linux.com> wrote: > public | pgbench_accounts_pkey | index | thom | pgbench_accounts | 214 MB | > public | pgbench_branches_pkey | index | thom | pgbench_branches | 24 kB | > public | pgbench_tellers_pkey | index | thom | pgbench_tellers | 48 kB | I see the same. I use my regular SQL query to see the breakdown of leaf/internal/root pages: postgres=# with tots as ( SELECT count(*) c, avg(live_items) avg_live_items, avg(dead_items) avg_dead_items, u.type, r.oidfrom (select c.oid, c.relpages, generate_series(1, c.relpages - 1) i from pg_index i join pg_opclass op on i.indclass[0] = op.oid join pg_am am on op.opcmethod = am.oid join pg_class c on i.indexrelid= c.oid where am.amname = 'btree') r, lateral (select * from bt_page_stats(r.oid::regclass::text,i)) u group by r.oid, type) select ct.relname table_name, tots.oid::regclass::text index_name, (select relpages - 1 from pg_class c where c.oid = tots.oid)non_meta_pages, upper(type) page_type, c npages, to_char(avg_live_items, '990.999'), to_char(avg_dead_items, '990.999'),to_char(c/sum(c) over(partition by tots.oid) * 100, '990.999') || ' %' as prop_of_index from tots join pg_index i on i.indexrelid = tots.oid join pg_class ct on ct.oid = i.indrelid where tots.oid= 'pgbench_accounts_pkey'::regclass order by ct.relnamespace, table_name, index_name, npages, type; table_name │ index_name │ non_meta_pages │ page_type │ npages │ to_char │ to_char │ prop_of_index ──────────────────┼───────────────────────┼────────────────┼───────────┼────────┼──────────┼──────────┼───────────────pgbench_accounts │pgbench_accounts_pkey │ 27,421 │ R │ 1 │ 97.000 │ 0.000 │ 0.004 %pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ I │ 97 │ 282.670 │ 0.000 │ 0.354 %pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ L │ 27,323 │ 366.992 │ 0.000 │ 99.643 % (3 rows) But this looks healthy -- I see the same with master. And since the accounts table is listed as 1281 MB, this looks like a plausible ratio in the size of the table to its primary index (which I would not say is true of an 87MB primary key index). Are you sure you have the details right, Thom? -- Peter Geoghegan
On 4 February 2016 at 15:07, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Feb 2, 2016 at 3:59 AM, Thom Brown <thom@linux.com> wrote: >> public | pgbench_accounts_pkey | index | thom | pgbench_accounts | 214 MB | >> public | pgbench_branches_pkey | index | thom | pgbench_branches | 24 kB | >> public | pgbench_tellers_pkey | index | thom | pgbench_tellers | 48 kB | > > I see the same. > > I use my regular SQL query to see the breakdown of leaf/internal/root pages: > > postgres=# with tots as ( > SELECT count(*) c, > avg(live_items) avg_live_items, > avg(dead_items) avg_dead_items, > u.type, > r.oid > from (select c.oid, > c.relpages, > generate_series(1, c.relpages - 1) i > from pg_index i > join pg_opclass op on i.indclass[0] = op.oid > join pg_am am on op.opcmethod = am.oid > join pg_class c on i.indexrelid = c.oid > where am.amname = 'btree') r, > lateral (select * from bt_page_stats(r.oid::regclass::text, i)) u > group by r.oid, type) > select ct.relname table_name, > tots.oid::regclass::text index_name, > (select relpages - 1 from pg_class c where c.oid = tots.oid) non_meta_pages, > upper(type) page_type, > c npages, > to_char(avg_live_items, '990.999'), > to_char(avg_dead_items, '990.999'), > to_char(c/sum(c) over(partition by tots.oid) * 100, '990.999') || ' > %' as prop_of_index > from tots > join pg_index i on i.indexrelid = tots.oid > join pg_class ct on ct.oid = i.indrelid > where tots.oid = 'pgbench_accounts_pkey'::regclass > order by ct.relnamespace, table_name, index_name, npages, type; > table_name │ index_name │ non_meta_pages │ page_type > │ npages │ to_char │ to_char │ prop_of_index > ──────────────────┼───────────────────────┼────────────────┼───────────┼────────┼──────────┼──────────┼─────────────── > pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ R > │ 1 │ 97.000 │ 0.000 │ 0.004 % > pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ I > │ 97 │ 282.670 │ 0.000 │ 0.354 % > pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ L > │ 27,323 │ 366.992 │ 0.000 │ 99.643 % > (3 rows) > > But this looks healthy -- I see the same with master. And since the > accounts table is listed as 1281 MB, this looks like a plausible ratio > in the size of the table to its primary index (which I would not say > is true of an 87MB primary key index). > > Are you sure you have the details right, Thom? *facepalm* No, I'm not. I've just realised that all I've been checking is the primary key expecting it to change in size, which is, of course, nonsense. I should have been creating an index on the bid field of pgbench_accounts and reviewing the size of that. Now I've checked it with the latest patch, and can see it working fine. Apologies for the confusion. Thom
On Thu, Feb 4, 2016 at 8:25 AM, Thom Brown <thom@linux.com> wrote: > > No, I'm not. I've just realised that all I've been checking is the > primary key expecting it to change in size, which is, of course, > nonsense. I should have been creating an index on the bid field of > pgbench_accounts and reviewing the size of that. Right. Because, apart from everything else, unique indexes are not currently supported. -- Peter Geoghegan
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > I fixed it in the new version (attached). Some quick remarks on your V2.0: * Seems unnecessary that _bt_binsrch() is passed a real pointer by all callers. Maybe the one current posting list caller _bt_findinsertloc(), or its caller, _bt_doinsert(), should do this work itself: @@ -373,7 +377,17 @@ _bt_binsrch(Relation rel, * scan key), which could be the last slot + 1. */ if (P_ISLEAF(opaque)) + { + if (low <= PageGetMaxOffsetNumber(page)) + { + IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low)); + /* one excessive check of equality. for possible posting tuple update or creation */ + if ((_bt_compare(rel, keysz, scankey, page, low) == 0) + && (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page))) + *updposing = true; + } return low; + } * ISTM that you should not use _bt_compare() above, in any case. Consider this: postgres=# select 5.0 = 5.000;?column? ──────────t (1 row) B-Tree operator class indicates equality here. And yet, users will expect to see the original value in an index-only scan, including the trailing zeroes as they were originally input. So this should be a bit closer to HeapSatisfiesHOTandKeyUpdate() (actually, heap_tuple_attr_equals()), which looks for strict binary equality for similar reasons. * Is this correct?: @@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup) * it off the old page, not the new one, in case we are not at leaf * level. */ - state->btps_minkey = CopyIndexTuple(oitup); + ItemId iihk = PageGetItemId(opage, P_HIKEY); + IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk); + state->btps_minkey = CopyIndexTuple(hikey); How this code has changed from the master branch is not clear to me. I understand that this code in incomplete/draft: +#define MaxPackedIndexTuplesPerPage \ + ((int) ((BLCKSZ - SizeOfPageHeaderData) / \ + (sizeof(ItemPointerData)))) But why is it different to the old (actually unchanged) MaxIndexTuplesPerPage? I would like to see comments explaining your understanding, even if they are quite rough. Why did GIN never require this change to a generic header (itup.h)? Should such a change live in that generic header file, and not another one more localized to nbtree? * More explanation of the design would be nice. I suggest modifying the nbtree README file, so it's easy to tell what the current design is. It's hard to follow this from the thread. When I reviewed Heikki's B-Tree patches from a couple of years ago, we spent ~75% of the time on design, and only ~25% on code. * I have a paranoid feeling that the deletion locking protocol (VACUUMing index tuples concurrently and safely) may need special consideration here. Basically, with the B-Tree code, there are several complicated locking protocols, like for page splits, page deletion, and interlocking with vacuum ("super exclusive lock" stuff). These are why the B-Tree code is complicated in general, and it's very important to pin down exactly how we deal with each. Ideally, you'd have an explanation for why your code was correct in each of these existing cases (especially deletion). With very complicated and important code like this, it's often wise to be very clear about when we are talking about your design, and when we are talking about your code. It's generally too hard to review both at the same time. 
Ideally, when you talk about your design, you'll be able to say things like "it's clear that this existing thing is correct; at least we have no complaints from the field. Therefore, it must be true that my new technique is also correct, because it makes that general situation no worse". Obviously that kind of rigor is just something we aspire to, and still fall short of at times. Still, it would be nice to specifically see a reason why the new code isn't special from the point of view of the super-exclusive lock thing (which is what I mean by deletion locking protocol + special consideration). Or why it is special, but that's okay, or whatever. This style of review is normal when writing B-Tree code. Some other things don't need this rigor, or have no invariants that need to be respected/used. Maybe this is obvious to you already, but it isn't obvious to me. It's okay if you don't know why, but knowing that you don't have a strong opinion about something is itself useful information. * I see you disabled the LP_DEAD thing; why? Just because that made bugs go away? * Have you done much stress testing? Using pgbench with many concurrent VACUUM FREEZE operations would be a good idea, if you haven't already, because that is insistent about getting super exclusive locks, unlike regular VACUUM. * Are you keeping the restriction of 1/3 of a buffer page, but that just includes the posting list now? That's the kind of detail I'd like to see in the README now. * Why not support unique indexes? The obvious answer is that it isn't worth it, but why? How useful would that be (a bit, just not enough)? What's the trade-off? Anyway, this is really cool work; I have often thought that we don't have nearly enough people thinking about how to optimize B-Tree indexing. It is hard, but so is anything worthwhile. That's all I have for now. Just a quick review focused on code and correctness (and not on the benefits). I want to do more on this, especially the benefits, because it deserves more attention. -- Peter Geoghegan
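Peter's `select 5.0 = 5.000` point above is easy to reproduce outside the server. The toy program below only illustrates the distinction between operator-class equality and binary (image) equality; it is plain C, not PostgreSQL code.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *a = "5.0", *b = "5.000";    /* two numeric input texts */

    int op_equal  = (atof(a) == atof(b));   /* "operator class" equality     */
    int img_equal = (strcmp(a, b) == 0);    /* binary/image equality         */

    /* prints: operator equal: 1, image equal: 0 -- equal keys can still
     * have different stored representations, so posting-list merging must
     * be stricter than _bt_compare()-style equality. */
    printf("operator equal: %d, image equal: %d\n", op_equal, img_equal);
    return 0;
}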
04.02.2016 20:16, Peter Geoghegan:
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
I fixed it in the new version (attached).
Thank you for the review.
At last, there is a new patch version 3.0. After some refactoring it looks much better.
I described all details of the compression in this document https://goo.gl/50O8Q0 (the same text without pictures is attached in btc_readme_1.0.txt).
Consider it as a rough copy of readme. It contains some notes about tricky moments of implementation and questions about future work.
Please don't hesitate to comment it.
Thank you for the notice. Fixed.

Some quick remarks on your V2.0:

* Seems unnecessary that _bt_binsrch() is passed a real pointer by all callers. Maybe the one current posting list caller _bt_findinsertloc(), or its caller, _bt_doinsert(), should do this work itself:

@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
	 * scan key), which could be the last slot + 1.
	 */
	if (P_ISLEAF(opaque))
+	{
+		if (low <= PageGetMaxOffsetNumber(page))
+		{
+			IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low));
+			/* one excessive check of equality. for possible posting tuple update or creation */
+			if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+				&& (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page)))
+				*updposing = true;
+		}
		return low;
+	}

* ISTM that you should not use _bt_compare() above, in any case. Consider this:

postgres=# select 5.0 = 5.000;
 ?column?
──────────
 t
(1 row)

B-Tree operator class indicates equality here. And yet, users will expect to see the original value in an index-only scan, including the trailing zeroes as they were originally input. So this should be a bit closer to HeapSatisfiesHOTandKeyUpdate() (actually, heap_tuple_attr_equals()), which looks for strict binary equality for similar reasons.
Yes, it is. I completed the comment above.

* Is this correct?:

@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
	 * it off the old page, not the new one, in case we are not at leaf
	 * level.
	 */
-	state->btps_minkey = CopyIndexTuple(oitup);
+	ItemId iihk = PageGetItemId(opage, P_HIKEY);
+	IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+	state->btps_minkey = CopyIndexTuple(hikey);

How this code has changed from the master branch is not clear to me.
I agree.

I understand that this code is incomplete/draft:

+#define MaxPackedIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+			(sizeof(ItemPointerData))))

But why is it different to the old (actually unchanged) MaxIndexTuplesPerPage? I would like to see comments explaining your understanding, even if they are quite rough. Why did GIN never require this change to a generic header (itup.h)? Should such a change live in that generic header file, and not another one more localized to nbtree?
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
18.02.2016 20:18, Anastasia Lubennikova:
04.02.2016 20:16, Peter Geoghegan:

Sorry, previous patch was dirty. Hotfix is attached.

On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
I fixed it in the new version (attached).
Thank you for the review.
At last, there is a new patch version 3.0. After some refactoring it looks much better.
I described all details of the compression in this document https://goo.gl/50O8Q0 (the same text without pictures is attached in btc_readme_1.0.txt).
Consider it as a rough copy of readme. It contains some notes about tricky moments of implementation and questions about future work.
Please don't hesitate to comment it.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hi Anastasia, On 2/18/16 12:29 PM, Anastasia Lubennikova wrote: > 18.02.2016 20:18, Anastasia Lubennikova: >> 04.02.2016 20:16, Peter Geoghegan: >>> On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova >>> <a.lubennikova@postgrespro.ru> wrote: >>>> I fixed it in the new version (attached). >> >> Thank you for the review. >> At last, there is a new patch version 3.0. After some refactoring it >> looks much better. >> I described all details of the compression in this document >> https://goo.gl/50O8Q0 (the same text without pictures is attached in >> btc_readme_1.0.txt). >> Consider it as a rough copy of readme. It contains some notes about >> tricky moments of implementation and questions about future work. >> Please don't hesitate to comment it. >> > Sorry, previous patch was dirty. Hotfix is attached. This looks like an extremely valuable optimization for btree indexes but unfortunately it is not getting a lot of attention. It still applies cleanly for anyone interested in reviewing. It's not clear to me that you answered all of Peter's questions in [1]. I understand that you've provided a README but itmay not be clear if the answers are in there (and where). Also, at the end of the README it says: 13. Xlog. TODO. Does that mean the patch is not yet complete? Thanks, -- -David david@pgmasters.net [1] http://www.postgresql.org/message-id/CAM3SWZQ3_PLQCH4w7uQ8q_f2t4HEseKTr2n0rQ5pxA18OeRTJw@mail.gmail.com
14.03.2016 16:02, David Steele: > Hi Anastasia, > > On 2/18/16 12:29 PM, Anastasia Lubennikova wrote: >> 18.02.2016 20:18, Anastasia Lubennikova: >>> 04.02.2016 20:16, Peter Geoghegan: >>>> On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova >>>> <a.lubennikova@postgrespro.ru> wrote: >>>>> I fixed it in the new version (attached). >>> >>> Thank you for the review. >>> At last, there is a new patch version 3.0. After some refactoring it >>> looks much better. >>> I described all details of the compression in this document >>> https://goo.gl/50O8Q0 (the same text without pictures is attached in >>> btc_readme_1.0.txt). >>> Consider it as a rough copy of readme. It contains some notes about >>> tricky moments of implementation and questions about future work. >>> Please don't hesitate to comment it. >>> >> Sorry, previous patch was dirty. Hotfix is attached. > > This looks like an extremely valuable optimization for btree indexes > but unfortunately it is not getting a lot of attention. It still > applies cleanly for anyone interested in reviewing. > Thank you for attention. I would be indebted to all reviewers, who can just try this patch on real data and workload (except WAL for now). B-tree needs very much testing. > It's not clear to me that you answered all of Peter's questions in > [1]. I understand that you've provided a README but it may not be > clear if the answers are in there (and where). I described in README all the points Peter asked. But I see that it'd be better to answer directly. Thanks for reminding, I'll do it tomorrow. > Also, at the end of the README it says: > > 13. Xlog. TODO. > > Does that mean the patch is not yet complete? Yes, you're right. Frankly speaking, I supposed that someone will help me with that stuff, but now I almost completed it. I'll send updated patch in the next letter. I'm still doubtful about some patch details. I mentioned them in readme (bold type). But they are mostly about future improvements. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Please, find the new version of the patch attached. Now it has WAL functionality. Detailed description of the feature you can find in README draft https://goo.gl/50O8Q0 This patch is pretty complicated, so I ask everyone, who interested in this feature, to help with reviewing and testing it. I will be grateful for any feedback. But please, don't complain about code style, it is still work in progress. Next things I'm going to do: 1. More debugging and testing. I'm going to attach in next message couple of sql scripts for testing. 2. Fix NULLs processing 3. Add a flag into pg_index, that allows to enable/disable compression for each particular index. 4. Recheck locking considerations. I tried to write code as less invasive as possible, but we need to make sure that algorithm is still correct. 5. Change BTMaxItemSize 6. Bring back microvacuum functionality. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 18.03.2016 20:19, Anastasia Lubennikova wrote:
> Please, find the new version of the patch attached. Now it has WAL functionality.
>
> Detailed description of the feature you can find in README draft https://goo.gl/50O8Q0
>
> This patch is pretty complicated, so I ask everyone, who interested in this feature,
> to help with reviewing and testing it. I will be grateful for any feedback.
> But please, don't complain about code style, it is still work in progress.
>
> Next things I'm going to do:
> 1. More debugging and testing. I'm going to attach in next message couple of sql scripts for testing.
> 2. Fix NULLs processing
> 3. Add a flag into pg_index, that allows to enable/disable compression for each particular index.
> 4. Recheck locking considerations. I tried to write code as less invasive as possible, but we need to make sure that algorithm is still correct.
> 5. Change BTMaxItemSize
> 6. Bring back microvacuum functionality.

Hi, hackers.

It's my first review, so do not be strict to me.

I have tested this patch on the following table:

create table message
(
    id serial,
    usr_id integer,
    text text
);
CREATE INDEX message_usr_id ON message (usr_id);

The table has 10000000 records.

I found the following: the fewer unique keys, the smaller the index.

The next two tables demonstrate it.

New B-tree
Count of unique keys (usr_id), index's size, time of creation
10000000   214 MB   00:00:34.193441
3333333    214 MB   00:00:45.731173
2000000    129 MB   00:00:41.445876
1000000    129 MB   00:00:38.455616
100000     86 MB    00:00:40.887626
10000      79 MB    00:00:47.199774

Old B-tree
Count of unique keys (usr_id), index's size, time of creation
10000000   214 MB   00:00:35.043677
3333333    286 MB   00:00:40.922845
2000000    300 MB   00:00:46.454846
1000000    278 MB   00:00:42.323525
100000     287 MB   00:00:47.438132
10000      280 MB   00:01:00.307873

I inserted data randomly and sequentially; it did not influence the index's size.
The time of select, insert and update of random rows is not changed. That is great, but it certainly needs some more detailed study.

Alexander Popov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Mar 18, 2016 at 1:19 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Please, find the new version of the patch attached. Now it has WAL > functionality. > > Detailed description of the feature you can find in README draft > https://goo.gl/50O8Q0 > > This patch is pretty complicated, so I ask everyone, who interested in this > feature, > to help with reviewing and testing it. I will be grateful for any feedback. > But please, don't complain about code style, it is still work in progress. > > Next things I'm going to do: > 1. More debugging and testing. I'm going to attach in next message couple of > sql scripts for testing. > 2. Fix NULLs processing > 3. Add a flag into pg_index, that allows to enable/disable compression for > each particular index. > 4. Recheck locking considerations. I tried to write code as less invasive as > possible, but we need to make sure that algorithm is still correct. > 5. Change BTMaxItemSize > 6. Bring back microvacuum functionality. I really like this idea, and the performance results seem impressive, but I think we should push this out to 9.7. A btree patch that didn't have WAL support until two and a half weeks into the final CommitFest just doesn't seem to me like a good candidate. First, as a general matter, if a patch isn't code-complete at the start of a CommitFest, it's reasonable to say that it should be reviewed but not necessarily committed in that CommitFest. This patch has had some review, but I'm not sure how deep that review is, and I think it's had no code review at all of the WAL logging changes, which were submitted only a week ago, well after the CF deadline. Second, the btree AM is a particularly poor place to introduce possibly destabilizing changes. Everybody depends on it, all the time, for everything. And despite new tools like amcheck, it's not a particularly easy thing to debug. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 24, 2016 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Mar 18, 2016 at 1:19 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Please, find the new version of the patch attached. Now it has WAL
> functionality.
>
> Detailed description of the feature you can find in README draft
> https://goo.gl/50O8Q0
>
> This patch is pretty complicated, so I ask everyone, who interested in this
> feature,
> to help with reviewing and testing it. I will be grateful for any feedback.
> But please, don't complain about code style, it is still work in progress.
>
> Next things I'm going to do:
> 1. More debugging and testing. I'm going to attach in next message couple of
> sql scripts for testing.
> 2. Fix NULLs processing
> 3. Add a flag into pg_index, that allows to enable/disable compression for
> each particular index.
> 4. Recheck locking considerations. I tried to write code as less invasive as
> possible, but we need to make sure that algorithm is still correct.
> 5. Change BTMaxItemSize
> 6. Bring back microvacuum functionality.
I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7. A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate. First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest. This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline. Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything. And despite
new tools like amcheck, it's not a particularly easy thing to debug.
It's all true. But:
1) It's a great feature many users dream about.
2) Patch is not very big.
3) The patch doesn't introduce significant infrastructural changes. It just changes some well-isolated places.
Let's give it a chance. I've signed up as an additional reviewer and I'll do my best to spot all possible issues in this patch.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I really like this idea, and the performance results seem impressive, > but I think we should push this out to 9.7. A btree patch that didn't > have WAL support until two and a half weeks into the final CommitFest > just doesn't seem to me like a good candidate. First, as a general > matter, if a patch isn't code-complete at the start of a CommitFest, > it's reasonable to say that it should be reviewed but not necessarily > committed in that CommitFest. This patch has had some review, but I'm > not sure how deep that review is, and I think it's had no code review > at all of the WAL logging changes, which were submitted only a week > ago, well after the CF deadline. Second, the btree AM is a > particularly poor place to introduce possibly destabilizing changes. > Everybody depends on it, all the time, for everything. And despite > new tools like amcheck, it's not a particularly easy thing to debug. Regrettably, I must agree. I don't see a plausible path to commit for this patch in the ongoing CF. I think that Anastasia did an excellent job here, and I wish I could have been of greater help sooner. Nevertheless, it would be unwise to commit this given the maturity of the code. There have been very few instances of performance improvements to the B-Tree code for as long as I've been interested, because it's so hard, and the standard is so high. The only example I can think of from the last few years is Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d both of which were far less invasive, and Simon's commit c7111d11b1, which we just outright reverted from 9.5 due to subtle bugs (and even that was significantly less invasive than this patch). Improving nbtree is something that requires several rounds of expert review, and that's something that's in short supply for the B-Tree code in particular. I think that a new testing strategy is needed to make this easier, and I hope to get that going with amcheck. I need help with formalizing a "testing first" approach for improving the B-Tree code, because I think it's the only way that we can move forward with projects like this. It's *incredibly* hard to push forward patches like this given our current, limited testing strategy. -- Peter Geoghegan
On 3/24/16 10:21 AM, Alexander Korotkov wrote: > 1) It's a great feature many users dream about. Doesn't matter if it starts eating their data... > 2) Patch is not very big. > 3) Patch doesn't introduce significant infrastructural changes. It just > change some well-isolated placed. It doesn't really matter how big the patch is, it's a question of "What did the patch fail to consider?". With something as complicated as the btree code, there's ample opportunities for missing things. (And FWIW, I'd argue that a 51kB patch is certainly not small, and a patch that is doing things in critical sections isn't terribly isolated). I do think this will be a great addition, but it's just too late to be adding this to 9.6. (BTW, I'm getting bounces from a.lebedev@postgrespro.ru, as well as postmaster@. I emailed info@postgrespro.ru about this but never heard back.) -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
25.03.2016 01:12, Peter Geoghegan: > On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I really like this idea, and the performance results seem impressive, >> but I think we should push this out to 9.7. A btree patch that didn't >> have WAL support until two and a half weeks into the final CommitFest >> just doesn't seem to me like a good candidate. First, as a general >> matter, if a patch isn't code-complete at the start of a CommitFest, >> it's reasonable to say that it should be reviewed but not necessarily >> committed in that CommitFest. You're right. Frankly, I thought that someone will help me with the path, but I had to finish it myself. *off-topic* I wonder, if we can add new flag to commitfest. Something like "Needs assistance", which will be used to mark big and complicated patches in progress. While "Needs review" means that the patch is almost ready and only requires the final review. >> This patch has had some review, but I'm >> not sure how deep that review is, and I think it's had no code review >> at all of the WAL logging changes, which were submitted only a week >> ago, well after the CF deadline. Second, the btree AM is a >> particularly poor place to introduce possibly destabilizing changes. >> Everybody depends on it, all the time, for everything. And despite >> new tools like amcheck, it's not a particularly easy thing to debug. > Regrettably, I must agree. I don't see a plausible path to commit for > this patch in the ongoing CF. > > I think that Anastasia did an excellent job here, and I wish I could > have been of greater help sooner. Nevertheless, it would be unwise to > commit this given the maturity of the code. There have been very few > instances of performance improvements to the B-Tree code for as long > as I've been interested, because it's so hard, and the standard is so > high. The only example I can think of from the last few years is > Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d both of which > were far less invasive, and Simon's commit c7111d11b1, which we just > outright reverted from 9.5 due to subtle bugs (and even that was > significantly less invasive than this patch). Improving nbtree is > something that requires several rounds of expert review, and that's > something that's in short supply for the B-Tree code in particular. I > think that a new testing strategy is needed to make this easier, and I > hope to get that going with amcheck. I need help with formalizing a > "testing first" approach for improving the B-Tree code, because I > think it's the only way that we can move forward with projects like > this. It's *incredibly* hard to push forward patches like this given > our current, limited testing strategy. Unfortunately, I must agree. This patch seems to be far from final version until the feature freeze. I'll move it to the future commitfest. Anyway it means, that now we have more time to improve the patch. If you have any ideas related to this patch like prefix/suffix compression, I'll be glad to discuss them. Same for any other ideas of B-tree optimization. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Thu, Mar 24, 2016 at 7:12 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I really like this idea, and the performance results seem impressive, >> but I think we should push this out to 9.7. A btree patch that didn't >> have WAL support until two and a half weeks into the final CommitFest >> just doesn't seem to me like a good candidate. First, as a general >> matter, if a patch isn't code-complete at the start of a CommitFest, >> it's reasonable to say that it should be reviewed but not necessarily >> committed in that CommitFest. This patch has had some review, but I'm >> not sure how deep that review is, and I think it's had no code review >> at all of the WAL logging changes, which were submitted only a week >> ago, well after the CF deadline. Second, the btree AM is a >> particularly poor place to introduce possibly destabilizing changes. >> Everybody depends on it, all the time, for everything. And despite >> new tools like amcheck, it's not a particularly easy thing to debug. > > Regrettably, I must agree. I don't see a plausible path to commit for > this patch in the ongoing CF. > > I think that Anastasia did an excellent job here, and I wish I could > have been of greater help sooner. Nevertheless, it would be unwise to > commit this given the maturity of the code. There have been very few > instances of performance improvements to the B-Tree code for as long > as I've been interested, because it's so hard, and the standard is so > high. The only example I can think of from the last few years is > Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d both of which > were far less invasive, and Simon's commit c7111d11b1, which we just > outright reverted from 9.5 due to subtle bugs (and even that was > significantly less invasive than this patch). Improving nbtree is > something that requires several rounds of expert review, and that's > something that's in short supply for the B-Tree code in particular. I > think that a new testing strategy is needed to make this easier, and I > hope to get that going with amcheck. I need help with formalizing a > "testing first" approach for improving the B-Tree code, because I > think it's the only way that we can move forward with projects like > this. It's *incredibly* hard to push forward patches like this given > our current, limited testing strategy. I've been toying (having gotten nowhere concrete really) with prefix compression myself, I agree that messing with btree code is quite harder than it ought to be. Perhaps trying experimental format changes in a separate experimental am wouldn't be all that bad (say, nxbtree?). People could opt-in to those, by creating the indexes with nxbtree instead of plain btree (say in development environments) and get some testing going without risking much. Normally the same effect should be achievable with mere flags, but since format changes to btree tend to be rather invasive, ensuring the patch doesn't change behavior with the flag off is hard as well, hence the wholly separate am idea.
On 18/03/16 19:19, Anastasia Lubennikova wrote:
> Please, find the new version of the patch attached. Now it has WAL
> functionality.
>
> Detailed description of the feature you can find in README draft
> https://goo.gl/50O8Q0
>
> This patch is pretty complicated, so I ask everyone, who interested in
> this feature,
> to help with reviewing and testing it. I will be grateful for any feedback.
> But please, don't complain about code style, it is still work in progress.
>
> Next things I'm going to do:
> 1. More debugging and testing. I'm going to attach in next message
> couple of sql scripts for testing.
> 2. Fix NULLs processing
> 3. Add a flag into pg_index, that allows to enable/disable compression
> for each particular index.
> 4. Recheck locking considerations. I tried to write code as less
> invasive as possible, but we need to make sure that algorithm is still
> correct.
> 5. Change BTMaxItemSize
> 6. Bring back microvacuum functionality.

I think we should pack the TIDs more tightly, like GIN does with the varbyte encoding. It's tempting to commit this without it for now, and add the compression later, but I'd like to avoid having to deal with multiple binary-format upgrades, so let's figure out the final on-disk format that we want, right from the beginning.

It would be nice to reuse the varbyte encoding code from GIN, but we might not want to use that exact scheme for B-tree. Firstly, an important criterion when we designed GIN's encoding scheme was to avoid expanding on-disk size for any data set, which meant that a TID had to always be encoded in 6 bytes or less. We don't have that limitation with B-tree, because in B-tree, each item is currently stored as a separate IndexTuple, which is much larger. So we are free to choose an encoding scheme that's better at packing some values, at the expense of using more bytes for other values, if we want to. Some analysis on what we want would be nice. (It's still important that removing a TID from the list never makes the list larger, for VACUUM.)

Secondly, to be able to just always enable this feature, without a GUC or reloption, we might need something that's faster for random access than GIN's posting lists. Or perhaps we can just add the setting, but it would be nice to have some more analysis of the worst-case performance before we decide on that.

I find the macros in nbtree.h in the patch quite confusing. They're similar to what we did in GIN, but again we might want to choose differently here. So some discussion on the desired IndexTuple layout is in order. (One clear bug is that using the high bit of BlockNumber for the BT_POSTING flag will fail for a table larger than 2^31 blocks.)

- Heikki
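To make the varbyte idea above concrete, here is a minimal sketch of delta-plus-varbyte encoding for a sorted TID list. It only illustrates the general technique under discussion -- it is not GIN's actual ginpostinglist.c format -- and the helper names are made up:

    #include "postgres.h"
    #include "storage/itemptr.h"

    /* Collapse a heap TID into one 48-bit integer: (block << 16) | offset. */
    static uint64
    tid_to_uint64(ItemPointer tid)
    {
        return ((uint64) ItemPointerGetBlockNumber(tid) << 16) |
            ItemPointerGetOffsetNumber(tid);
    }

    /* Append one unsigned value, 7 data bits per byte, high bit = "more". */
    static unsigned char *
    encode_varbyte(uint64 val, unsigned char *ptr)
    {
        while (val >= 0x80)
        {
            *ptr++ = (unsigned char) ((val & 0x7F) | 0x80);
            val >>= 7;
        }
        *ptr++ = (unsigned char) val;
        return ptr;
    }

    /*
     * Encode 'ntids' sorted TIDs into 'buf', storing each TID after the
     * first as the delta from its predecessor.  Duplicates tend to point
     * at nearby heap pages, so most deltas fit in one or two bytes.
     * Returns the number of bytes used.
     */
    static Size
    encode_tid_list(ItemPointer tids, int ntids, unsigned char *buf)
    {
        unsigned char *ptr = buf;
        uint64      prev = 0;
        int         i;

        for (i = 0; i < ntids; i++)
        {
            uint64      cur = tid_to_uint64(&tids[i]);

            ptr = encode_varbyte(cur - prev, ptr);
            prev = cur;
        }
        return ptr - buf;
    }

Decoding just reverses the process by accumulating the deltas. Note Heikki's point above: whatever scheme is ultimately chosen, removing a TID from the list must never make the encoded list larger.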
On Mon, Jul 4, 2016 at 2:30 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I think we should pack the TIDs more tightly, like GIN does with the varbyte > encoding. It's tempting to commit this without it for now, and add the > compression later, but I'd like to avoid having to deal with multiple > binary-format upgrades, so let's figure out the final on-disk format that we > want, right from the beginning. While the idea of duplicate storage is pretty obviously compelling, there could be other, non-obvious benefits. I think that it could bring further benefits if we could use duplicate storage to change this property of nbtree (this is from the README): """ Lehman and Yao assume that the key range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page. This does not work for nonunique keys (for example, if we have enough equal keys to spread across several leaf pages, there *must* be some equal bounding keys in the first level up). Therefore we assume Ki <= v <= Ki+1 instead. A search that finds exact equality to a bounding key in an upper tree level must descend to the left of that key to ensure it finds any equal keys in the preceding page. An insertion that sees the high key of its target page is equal to the key to be inserted has a choice whether or not to move right, since the new key could go on either page. (Currently, we try to find a page where there is room for the new key without a split.) """ If we could *guarantee* that all keys in the index are unique, then we could maintain the keyspace as L&Y originally described. The practical benefits to this would be: * We wouldn't need to take the extra step described above -- finding a bounding key/separator key that's fully equal to our scankey would no longer necessitate a probably-useless descent to the left of that key. (BTW, I wonder if we could get away with not inserting a downlink into parent when a leaf page split finds an identical IndexTuple in parent, *without* changing the keyspace invariant I mention -- if we're always going to go to the left of an equal-to-scankey key in an internal page, why even have more than one?) * This would make suffix truncation of internal index tuples easier, and that's important. The traditional reason why suffix truncation is important is that it can keep the tree a lot shorter than it would otherwise be. These days, that might not seem that important, because even if you have twice the number of internal pages than strictly necessary, that still isn't that many relative to typical main memory size (and even CPU cache sizes, perhaps). The reason I think it's important these days is that not having suffix truncation makes our "separator keys" overly prescriptive about what part of the keyspace is owned by each internal page. With a pristine index (following REINDEX), this doesn't matter much. But, I think that we get much bigger problems with index bloat due to the poor fan-out that we sometimes see due to not having suffix truncation, *combined* with the page deletion algorithms restriction on deleting internal pages (it can only be done for internal pages with *no* children). Adding another level or two to the B-Tree makes it so that your workload's "sparse deletion patterns" really don't need to be that sparse in order to bloat the B-Tree badly, necessitating a REINDEX to get back to acceptable performance (VACUUM won't do it). 
To avoid this, we should make the internal pages represent the key space in the least restrictive way possible, by applying suffix truncation so that it's much more likely that things will *stay* balanced as churn occurs. This is probably a really bad problem with things like composite indexes over text columns, or indexes with many NULL values. -- Peter Geoghegan
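As a rough sketch of what suffix truncation amounts to here: the separator key placed in the parent only needs as many leading attributes as it takes to distinguish the last tuple on the left half of a split from the first tuple on the right half. The helper index_attrs_equal() below is hypothetical, not actual nbtree code:

    #include "postgres.h"
    #include "access/itup.h"
    #include "utils/rel.h"

    /*
     * Sketch only: how many leading key attributes does a new separator key
     * need in order to separate the two halves of a leaf page split?
     * index_attrs_equal() is a hypothetical helper that compares a single
     * attribute of two index tuples using the opclass comparator.
     */
    static int
    separator_natts_needed(Relation rel, IndexTuple lastleft, IndexTuple firstright)
    {
        int         nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
        int         attnum;

        for (attnum = 1; attnum <= nkeyatts; attnum++)
        {
            if (!index_attrs_equal(rel, lastleft, firstright, attnum))
                return attnum;  /* first attribute that differs is enough */
        }

        /*
         * All key attributes are equal.  With guaranteed-unique keys (as
         * discussed above) this cannot happen; otherwise a heap TID
         * tiebreaker would have to be kept in the separator as well.
         */
        return nkeyatts;
    }

Everything after the first distinguishing attribute can be dropped from the separator, which is what keeps the parent level from being overly prescriptive about which leaf page owns which part of the keyspace.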
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From: Anastasia Lubennikova
The new version of the patch is attached. This version is even simpler than the previous one, thanks to the recent btree design changes and all the feedback I received. I consider it ready for review and testing.

[feature overview]

This patch implements deduplication of btree non-pivot tuples on leaf pages, in a manner similar to GIN index "posting lists".

A non-pivot posting tuple has the following format:

t_tid | t_info | key values | posting_list[]

where the t_tid and t_info fields are used to store meta information about the tuple's posting list. The posting list itself is an array of ItemPointerData.

Currently, compression is applied to all indexes except system indexes, unique indexes, and indexes with included columns.

On insertion, compression is applied not to each tuple, but to the whole page before a split: if the target page is full, we try to compress it.

[benchmark results]

idx ON tbl(c1); the index contains 10000000 integer values. i is the number of distinct values in the index, so i=1 means that all rows have the same key, and i=10000000 means that all keys are different.

i           old size (MB)   new size (MB)
1                     215              88
1000                  215              90
100000                215              71
10000000              214             214

For more, see the attached diagram with the test results.

[future work]

Many things can be improved in this feature. Personally, I'd prefer to keep this patch as small as possible and work on other improvements after the basic part is committed, though I understand that some of these may be considered essential for this patch to be approved.

1. Implement a split of the posting tuples on a page split.
2. Implement microvacuum of posting tuples.
3. Add a flag into pg_index which allows enabling/disabling compression for a particular index.
4. Implement posting list compression.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
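For illustration, the tuple format described above can be pictured roughly like this. It is only a sketch: the real patch reuses the ordinary IndexTupleData header plus a flag combination in t_info rather than a dedicated struct, and the struct name here is made up.

    #include "postgres.h"
    #include "storage/itemptr.h"

    /*
     * Sketch of a non-pivot posting tuple:
     *     t_tid | t_info | key values | posting_list[]
     */
    typedef struct BTPostingTupleSketch
    {
        ItemPointerData t_tid;      /* reused to store metadata about the
                                     * tuple's posting list rather than a
                                     * single heap TID */
        unsigned short  t_info;     /* tuple length + flag bits, including a
                                     * flag combination marking the tuple as
                                     * a posting tuple */
        /* ... key values, in the usual index-attribute format ... */
        /* ... posting_list[]: sorted array of ItemPointerData (heap TIDs) ... */
    } BTPostingTupleSketch;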
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> i - number of distinct values in the index.
> So i=1 means that all rows have the same key,
> and i=10000000 means that all keys are different.
>
> i / old size (MB) / new size (MB)
> 1 215 88
> 1000 215 90
> 100000 215 71
> 10000000 214 214
>
> For more, see the attached diagram with test results.

I tried this on my own "UK land registry" test data [1], which was originally used for the v12 nbtree work. My test case has a low cardinality, multi-column text index. I chose this test case because it was convenient for me.

On v12/master, the index is 1100MB. Whereas with your patch, it ends up being 196MB -- over 5.5x smaller!

I also tried it out with the "Mouse genome informatics" database [2], which was already improved considerably by the v12 work on duplicates. This is helped tremendously by your patch. It's not quite 5.5x across the board, of course. There are 187 indexes (on 28 tables), and almost all of the indexes are smaller. Actually, *most* of the indexes are *much* smaller. Very often 50% smaller.

I don't have time to do an in-depth analysis of these results today, but clearly the patch is very effective on real world data. I think that we tend to underestimate just how common indexes with a huge number of duplicates are.

[1] https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
[2] http://www.informatics.jax.org/software.shtml
--
Peter Geoghegan
On Thu, Jul 4, 2019 at 10:38 AM Peter Geoghegan <pg@bowt.ie> wrote: > I tried this on my own "UK land registry" test data [1], which was > originally used for the v12 nbtree work. My test case has a low > cardinality, multi-column text index. I chose this test case because > it was convenient for me. > > On v12/master, the index is 1100MB. Whereas with your patch, it ends > up being 196MB -- over 5.5x smaller! I also see a huge and consistent space saving for TPC-H. All 9 indexes are significantly smaller. The lineitem orderkey index is "just" 1/3 smaller, which is the smallest improvement among TPC-H indexes in my index bloat test case. The two largest indexes after the initial bulk load are *much* smaller: the lineitem parts supplier index is ~2.7x smaller, while the lineitem ship date index is a massive ~4.2x smaller. Also, the orders customer key index is ~2.8x smaller, and the order date index is ~2.43x smaller. Note that the test involved retail insertions, not CREATE INDEX. I haven't seen any regression in the size of any index so far, including when the number of internal pages is all that we measure. Actually, there seems to be cases where there is a noticeably larger reduction in internal pages than in leaf pages, probably because of interactions with suffix truncation. This result is very impressive. We'll need to revisit what the right trade-off is for the compression scheme, which Heikki had some thoughts on when we left off 3 years ago, but that should be a lot easier now. I am very encouraged by the fact that this relatively simple approach already works quite nicely. It's also great to see that bulk insertions with lots of compression are very clearly faster with this latest revision of your patch, unlike earlier versions from 2016 that made those cases slower (though I haven't tested indexes that don't really use compression). I think that this is because you now do the compression lazily, at the point where it looks like we may need to split the page. Previous versions of the patch had to perform compression eagerly, just like GIN, which is not really appropriate for nbtree. -- Peter Geoghegan
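To show the shape of the lazy approach described above, here is a heavily simplified sketch of a single deduplication pass over a full leaf page, of the kind performed just before the page would otherwise have to split. All function names other than the core PostgreSQL page routines (PageAddItem() and friends) are hypothetical placeholders, and the sketch pretends that no tuple on the page is already a posting tuple:

    #include "postgres.h"
    #include "access/itup.h"
    #include "access/nbtree.h"
    #include "storage/bufpage.h"

    typedef struct DedupRun
    {
        IndexTuple      base;       /* first tuple of the current run of equal keys */
        ItemPointerData tids[MaxIndexTuplesPerPage];
        int             ntids;
    } DedupRun;

    /* Emit the current run onto the rebuilt page, merged if it has > 1 TID. */
    static void
    flush_run(DedupRun *run, Page newpage)
    {
        IndexTuple  out;

        if (run->ntids == 1)
            out = run->base;    /* nothing to merge */
        else
            out = form_posting_tuple(run->base, run->tids, run->ntids); /* hypothetical */

        if (PageAddItem(newpage, (Item) out, IndexTupleSize(out),
                        InvalidOffsetNumber, false, false) == InvalidOffsetNumber)
            elog(ERROR, "failed to add tuple while deduplicating page");
    }

    /* One lazy deduplication pass: merge runs of equal keys into posting tuples. */
    static void
    dedup_page_sketch(Page page, Page newpage)
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
        OffsetNumber off;
        DedupRun    run = {0};

        for (off = P_FIRSTDATAKEY(opaque); off <= maxoff; off = OffsetNumberNext(off))
        {
            IndexTuple  itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, off));

            if (run.ntids > 0 && keys_equal(run.base, itup))    /* hypothetical comparison */
                run.tids[run.ntids++] = itup->t_tid;    /* same key: remember the heap TID */
            else
            {
                if (run.ntids > 0)
                    flush_run(&run, newpage);
                run.base = itup;
                run.tids[0] = itup->t_tid;
                run.ntids = 1;
            }
        }
        if (run.ntids > 0)
            flush_run(&run, newpage);
    }

Because the pass runs only when the page is about to split, pages that never fill up never pay for it, which is presumably why the retail-insertion benchmarks above improved relative to the eager, GIN-style versions of the patch from 2016.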
On Thu, Jul 4, 2019 at 05:06:09PM -0700, Peter Geoghegan wrote: > This result is very impressive. We'll need to revisit what the right > trade-off is for the compression scheme, which Heikki had some > thoughts on when we left off 3 years ago, but that should be a lot > easier now. I am very encouraged by the fact that this relatively > simple approach already works quite nicely. It's also great to see > that bulk insertions with lots of compression are very clearly faster > with this latest revision of your patch, unlike earlier versions from > 2016 that made those cases slower (though I haven't tested indexes > that don't really use compression). I think that this is because you > now do the compression lazily, at the point where it looks like we may > need to split the page. Previous versions of the patch had to perform > compression eagerly, just like GIN, which is not really appropriate > for nbtree. I am also encouraged and am happy we can finally move this duplicate optimization forward. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > The new version of the patch is attached. > This version is even simpler than the previous one, > thanks to the recent btree design changes and all the feedback I received. > I consider it ready for review and testing. I took a closer look at this patch, and have some general thoughts on its design, and specific feedback on the implementation. Preserving the *logical contents* of B-Tree indexes that use compression is very important -- that should not change in a way that outside code can notice. The heap TID itself should count as logical contents here, since we want to be able to implement retail index tuple deletion in the future. Even without retail index tuple deletion, amcheck's "rootdescend" verification assumes that it can find one specific tuple (which could now just be one specific "logical tuple") using specific key values from the heap, including the heap tuple's heap TID. This requirement makes things a bit harder for your patch, because you have to deal with one or two edge-cases that you currently don't handle: insertion of new duplicates that fall inside the min/max range of some existing posting list. That should be rare enough in practice, so the performance penalty won't be too bad. This probably means that code within _bt_findinsertloc() and/or _bt_binsrch_insert() will need to think about a logical tuple as a distinct thing from a physical tuple, though that won't be necessary in most places. The need to "preserve the logical contents" also means that the patch will need to recognize when indexes are not safe as a target for compression/deduplication (maybe we should call this feature deduplilcation, so it's clear how it differs from TOAST?). For example, if we have a case-insensitive ICU collation, then it is not okay to treat an opclass-equal pair of text strings that use the collation as having the same value when considering merging the two into one. You don't actually do that in the patch, but you also don't try to deal with the fact that such a pair of strings are equal, and so must have their final positions determined by the heap TID column (deduplication within _bt_compress_one_page() must respect that). Possibly equal-but-distinct values seems like a problem that's not worth truly fixing, but it will be necessary to store metadata about whether or not we're willing to do deduplication in the meta page, based on operator class and collation details. That seems like a restriction that we're just going to have to accept, though I'm not too worried about exactly what that will look like right now. We can work it out later. I think that the need to be careful about the logical contents of indexes already causes bugs, even with "safe for compression" indexes. For example, I can sometimes see an assertion failure within_bt_truncate(), at the point where we check if heap TID values are safe: /* * Lehman and Yao require that the downlink to the right page, which is to * be inserted into the parent page in the second phase of a page split be * a strict lower bound on items on the right page, and a non-strict upper * bound for items on the left page. Assert that heap TIDs follow these * invariants, since a heap TID value is apparently needed as a * tiebreaker. */ #ifndef DEBUG_NO_TRUNCATE Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft), BTreeTupleGetMinTID(firstright)) < 0); ... 
This bug is not that easy to see, but it will happen with a big index, even without updates or deletes. I think that this happens because compression can allow the "logical tuples" to be in the wrong heap TID order when there are multiple posting lists for the same value. As I said, I think that it's necessary to see a posting list as being comprised of multiple logical tuples in the context of inserting new tuples, even when you're not performing compression or splitting the page. I also see that amcheck's bt_index_parent_check() function fails, though bt_index_check() does not fail when I don't use any of its extra verification options. (You haven't updated amcheck, but I don't think that you need to update it for these basic checks to continue to work.) Other feedback on specific things: * A good way to assess whether or not the "logical tuple versus physical tuple" thing works is to make sure that amcheck's "rootdescend" verification works with a variety of indexes. As I said, it has the same requirements for nbtree as retail index tuple deletion will. * _bt_findinsertloc() should not call _bt_compress_one_page() for !heapkeyspace (version 3) indexes -- the second call to _bt_compress_one_page() should be removed. * Why can't compression be used on system catalog indexes? I understand that they are not a compelling case, but we tend to do things the same way with catalog tables and indexes unless there is a very good reason not to (e.g. HOT, suffix truncation). I see that the tests fail when that restriction is removed, but I don't think that that has anything to do with system catalogs. I think that that's due to a bug somewhere else. Why have this restriction at all? * It looks like we could be less conservative in nbtsplitloc.c to good effect. We know for sure that a posting list will be truncated down to one heap TID even in the worst case, so we can safely assume that the new high key will be a lot smaller than the firstright tuple that it is based on when it has a posting list. We only have to keep one TID. This will allow us to leave more tuples on the left half of the page in certain cases, further improving space utilization. * Don't you need to update nbtdesc.c? * Maybe we could do compression with unique indexes when inserting values with NULLs? Note that we now treat an insertion of a tuple with NULLs into a unique index as if it wasn't even a unique index -- see the "checkingunique" optimization at the beginning of _bt_doinsert(). Having many NULL values in a unique index is probably fairly common. * It looks like amcheck's heapallindexed verification needs to have normalization added, to avoid false positives. This situation is specifically anticipated by existing comments above bt_normalize_tuple(). Again, being careful about "logical versus physical tuple" seems necessary. * Doesn't the nbtsearch.c/_bt_readpage() code that deals with backwards scans need to return posting lists backwards, not forwards? It seems like a good idea to try to "preserve the logical contents" here too, just to be conservative. Within nbtsort.c: * Is the new code in _bt_buildadd() actually needed? If so, why? * insert_itupprev_to_page_buildadd() is only called within nbtsort.c, and so should be static. The name also seems very long. * add_item_to_posting() is called within both nbtsort.c and nbtinsert.c, and so should remain non-static, but have less generic (and shorter) name. (Use the usual _bt_* style instead.) * Is nbtsort.c the right place for these functions, anyway? 
(Maybe, but maybe not, IMV.) I ran pgindent on the patch, and made some small manual whitespace adjustments, which is attached. There are no real changes, but some of the formatting in the original version you posted was hard to read. Please work off this for your next revision. -- Peter Geoghegan
On Sat, Jul 6, 2019 at 4:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I took a closer look at this patch, and have some general thoughts on
> its design, and specific feedback on the implementation.

I have some high level concerns about how the patch might increase contention, which could make queries slower. Apparently that is a real problem in other systems that use MVCC when their bitmap index feature is used -- they are never really supposed to be used with OLTP apps. This patch makes nbtree behave rather a lot like a bitmap index. That's not exactly true, because you're not storing a bitmap or compressing the TID lists, but they're definitely quite similar. It's easy to imagine a hybrid approach that starts with a B-Tree with deduplication/TID lists, and eventually becomes a bitmap index as more duplicates are added [1].

It doesn't seem like it would be practical for these other MVCC database systems to have standard B-Tree secondary indexes that compress duplicates gracefully in the way that you propose to with this patch, because lock contention would presumably be a big problem for the same reason as it is with their bitmap indexes (whatever the true reason actually is). Is it really possible to have something that's adaptive, offering the best of both worlds? Having dug into it some more, I think that the answer for us might actually be "yes, we can have it both ways".

Other database systems that are also based on MVCC still probably use a limited form of index locking, even in READ COMMITTED mode, though this isn't very widely known. They need this for unique indexes, but they also need it for transaction rollback, to remove old entries from the index when the transaction must abort. The section "6.7 Standard Practice" from the paper "Architecture of a Database System" [2] goes into this, saying:

"All production databases today support ACID transactions. As a rule, they use write-ahead logging for durability, and two-phase locking for concurrency control. An exception is PostgreSQL, which uses multiversion concurrency control throughout."

I suggest reading "6.7 Standard Practice" in full.

Anyway, I think that *hundreds* or even *thousands* of rows are effectively locked all at once when a bitmap index needs to be updated in these other systems -- and I mean a heavyweight lock that lasts until the xact commits or aborts, like a Postgres row lock. As I said, this is necessary simply because the transaction might need to roll back. Of course, your patch never needs to do anything like that -- the only risk is that buffer lock contention will be increased. Maybe VACUUM isn't so bad after all!

Doing deduplication adaptively and automatically in nbtree seems like it might play to the strengths of Postgres, while also ameliorating its weaknesses. As the same paper goes on to say, it's actually quite unusual that PostgreSQL has *transactional* full text search built in (using GIN), and offers transactional, high concurrency spatial indexing (using GiST). Actually, this is an additional advantage of our "pure" approach to MVCC -- we can add new high concurrency, transactional access methods relatively easily.

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.3159&rep=rep1&type=pdf
[2] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
--
Peter Geoghegan
On Wed, Jul 10, 2019 at 09:53:04PM -0700, Peter Geoghegan wrote: > Anyway, I think that *hundreds* or even *thousands* of rows are > effectively locked all at once when a bitmap index needs to be updated > in these other systems -- and I mean a heavyweight lock that lasts > until the xact commits or aborts, like a Postgres row lock. As I said, > this is necessary simply because the transaction might need to roll > back. Of course, your patch never needs to do anything like that -- > the only risk is that buffer lock contention will be increased. Maybe > VACUUM isn't so bad after all! > > Doing deduplication adaptively and automatically in nbtree seems like > it might play to the strengths of Postgres, while also ameliorating > its weaknesses. As the same paper goes on to say, it's actually quite > unusual that PostgreSQL has *transactional* full text search built in > (using GIN), and offers transactional, high concurrency spatial > indexing (using GiST). Actually, this is an additional advantages of > our "pure" approach to MVCC -- we can add new high concurrency, > transactional access methods relatively easily. Wow, I never thought of that. The only things I know we lock until transaction end are rows we update (against concurrent updates), and additions to unique indexes. By definition, indexes with many duplicates are not unique, so that doesn't apply. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Hi Peter, Thank you very much for your attention to this patch. Let me comment some points of your message. On Sun, Jul 7, 2019 at 2:09 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: > > The new version of the patch is attached. > > This version is even simpler than the previous one, > > thanks to the recent btree design changes and all the feedback I received. > > I consider it ready for review and testing. > > I took a closer look at this patch, and have some general thoughts on > its design, and specific feedback on the implementation. > > Preserving the *logical contents* of B-Tree indexes that use > compression is very important -- that should not change in a way that > outside code can notice. The heap TID itself should count as logical > contents here, since we want to be able to implement retail index > tuple deletion in the future. Even without retail index tuple > deletion, amcheck's "rootdescend" verification assumes that it can > find one specific tuple (which could now just be one specific "logical > tuple") using specific key values from the heap, including the heap > tuple's heap TID. This requirement makes things a bit harder for your > patch, because you have to deal with one or two edge-cases that you > currently don't handle: insertion of new duplicates that fall inside > the min/max range of some existing posting list. That should be rare > enough in practice, so the performance penalty won't be too bad. This > probably means that code within _bt_findinsertloc() and/or > _bt_binsrch_insert() will need to think about a logical tuple as a > distinct thing from a physical tuple, though that won't be necessary > in most places. Could you please elaborate more on preserving the logical contents? I can understand it as following: "B-Tree should have the same structure and invariants as if each TID in posting list be a separate tuple". So, if we imagine each TID to become separate tuple it would be the same B-tree, which just can magically sometimes store more tuples in page. Is my understanding correct? But outside code will still notice changes as soon as it directly accesses B-tree pages (like contrib/amcheck does). Do you mean we need an API for accessing logical B-tree tuples or something? > The need to "preserve the logical contents" also means that the patch > will need to recognize when indexes are not safe as a target for > compression/deduplication (maybe we should call this feature > deduplilcation, so it's clear how it differs from TOAST?). For > example, if we have a case-insensitive ICU collation, then it is not > okay to treat an opclass-equal pair of text strings that use the > collation as having the same value when considering merging the two > into one. You don't actually do that in the patch, but you also don't > try to deal with the fact that such a pair of strings are equal, and > so must have their final positions determined by the heap TID column > (deduplication within _bt_compress_one_page() must respect that). > Possibly equal-but-distinct values seems like a problem that's not > worth truly fixing, but it will be necessary to store metadata about > whether or not we're willing to do deduplication in the meta page, > based on operator class and collation details. That seems like a > restriction that we're just going to have to accept, though I'm not > too worried about exactly what that will look like right now. We can > work it out later. 
I think in order to deduplicate "equal but distinct" values we need at least to give up with index only scans. Because we have no restriction that equal according to B-tree opclass values are same for other operations and/or user output. > I think that the need to be careful about the logical contents of > indexes already causes bugs, even with "safe for compression" indexes. > For example, I can sometimes see an assertion failure > within_bt_truncate(), at the point where we check if heap TID values > are safe: > > /* > * Lehman and Yao require that the downlink to the right page, which is to > * be inserted into the parent page in the second phase of a page split be > * a strict lower bound on items on the right page, and a non-strict upper > * bound for items on the left page. Assert that heap TIDs follow these > * invariants, since a heap TID value is apparently needed as a > * tiebreaker. > */ > #ifndef DEBUG_NO_TRUNCATE > Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft), > BTreeTupleGetMinTID(firstright)) < 0); > ... > > This bug is not that easy to see, but it will happen with a big index, > even without updates or deletes. I think that this happens because > compression can allow the "logical tuples" to be in the wrong heap TID > order when there are multiple posting lists for the same value. As I > said, I think that it's necessary to see a posting list as being > comprised of multiple logical tuples in the context of inserting new > tuples, even when you're not performing compression or splitting the > page. I also see that amcheck's bt_index_parent_check() function > fails, though bt_index_check() does not fail when I don't use any of > its extra verification options. (You haven't updated amcheck, but I > don't think that you need to update it for these basic checks to > continue to work.) Do I understand correctly that current patch may produce posting lists of the same value with overlapping ranges of TIDs? If so, it's definitely wrong. > * Maybe we could do compression with unique indexes when inserting > values with NULLs? Note that we now treat an insertion of a tuple with > NULLs into a unique index as if it wasn't even a unique index -- see > the "checkingunique" optimization at the beginning of _bt_doinsert(). > Having many NULL values in a unique index is probably fairly common. I think unique indexes may benefit from deduplication not only because of NULL values. Non-HOT updates produce duplicates of non-NULL values in unique indexes. And those duplicates can take significant space. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Thu, Jul 11, 2019 at 7:53 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Anyway, I think that *hundreds* or even *thousands* of rows are
> effectively locked all at once when a bitmap index needs to be updated
> in these other systems -- and I mean a heavyweight lock that lasts
> until the xact commits or aborts, like a Postgres row lock. As I said,
> this is necessary simply because the transaction might need to roll
> back. Of course, your patch never needs to do anything like that --
> the only risk is that buffer lock contention will be increased. Maybe
> VACUUM isn't so bad after all!
>
> Doing deduplication adaptively and automatically in nbtree seems like
> it might play to the strengths of Postgres, while also ameliorating
> its weaknesses. As the same paper goes on to say, it's actually quite
> unusual that PostgreSQL has *transactional* full text search built in
> (using GIN), and offers transactional, high concurrency spatial
> indexing (using GiST). Actually, this is an additional advantages of
> our "pure" approach to MVCC -- we can add new high concurrency,
> transactional access methods relatively easily.

Good finding, thank you!

BTW, I think deduplication could cause some small performance degradation in some particular cases, because page-level locks become more coarse-grained once pages hold more tuples. However, this doesn't seem like something we should care much about. Providing an option to turn deduplication off seems like enough to me.

Regarding bitmap indexes themselves, I think our BRIN could provide them. However, it would be useful to have opclass parameters to make them tunable.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Sun, 7 Jul 2019 at 01:08, Peter Geoghegan <pg@bowt.ie> wrote:
> * Maybe we could do compression with unique indexes when inserting
> values with NULLs? Note that we now treat an insertion of a tuple with

+1

I tried this patch and found the improvements impressive. However, when I tried it with multi-column indexes it wasn't giving any improvement; is this a known limitation of the patch?

I am surprised to find that such a patch has been on the radar for quite some years now and is not yet committed.

Going through the patch, here are a few comments from me.

/* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+

and other such DEBUG4 statements are meant to be removed, right...? I didn't find any other such statements in this API, and there are many in this patch, so I am not sure how much they are needed.

/*
 * If we have only 10 uncompressed items on the full page, it probably
 * won't worth to compress them.
 */
if (maxoff - n_posting_on_page < 10)
    return;

Is this a magic number...?

/*
 * We do not expect to meet any DEAD items, since this function is
 * called right after _bt_vacuum_one_page(). If for some reason we
 * found dead item, don't compress it, to allow upcoming microvacuum
 * or vacuum clean it up.
 */
if (ItemIdIsDead(itemId))
    continue;

This makes me wonder about those 'some' reasons.

Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects

This can be re-framed to make the caller more gender neutral.

Other than that, I am curious about the plans for its backward compatibility.

--
Regards,
Rafia Sabih
On Thu, Jul 11, 2019 at 7:30 AM Bruce Momjian <bruce@momjian.us> wrote: > Wow, I never thought of that. The only things I know we lock until > transaction end are rows we update (against concurrent updates), and > additions to unique indexes. By definition, indexes with many > duplicates are not unique, so that doesn't apply. Right. Another advantage of their approach is that you can make queries like this work: UPDATE tab SET unique_col = unique_col + 1 This will not throw a unique violation error on most/all other DB systems when the updated column (in this case "unique_col") has a unique constraint/is the primary key. This behavior is actually required by the SQL standard. An SQL statement is supposed to be all-or-nothing, which Postgres doesn't quite manage here. The section "6.6 Interdependencies of Transactional Storage" from the paper "Architecture of a Database System" provides additional background information (I should have suggested reading both 6.6 and 6.7 together). -- Peter Geoghegan
On Thu, Jul 11, 2019 at 8:02 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Could you please elaborate more on preserving the logical contents? I > can understand it as following: "B-Tree should have the same structure > and invariants as if each TID in posting list be a separate tuple". That's exactly what I mean. > So, if we imagine each TID to become separate tuple it would be the > same B-tree, which just can magically sometimes store more tuples in > page. Is my understanding correct? Yes. > But outside code will still > notice changes as soon as it directly accesses B-tree pages (like > contrib/amcheck does). Do you mean we need an API for accessing > logical B-tree tuples or something? Well, contrib/amcheck isn't really outside code. But amcheck's "rootdescend" option will still need to be able to supply a heap TID as just another column, and get back zero or one logical tuples from the index. This is important because retail index tuple deletion needs to be able to think about logical tuples in the same way. I also think that it might be useful for the planner to expect to get back duplicates in heap TID order in the future (or in reverse order in the case of a backwards scan). Query execution and VACUUM code outside of nbtree should be able to pretend that there is no such thing as a posting list. The main thing that the patch is missing that is needed to "preserve logical contents" is the ability to update/expand an *existing* posting list due to a retail insertion of a new duplicate that happens to be within the range of that existing posting list. This will usually be a non-HOT update that doesn't change the value for the row in the index -- that must change the posting list, even when there is available space on the page without recompressing. We must still occasionally be eager, like GIN always is, though in practice we'll almost always add to posting lists in a lazy fashion, when it looks like we might have to split the page -- the lazy approach seems to perform best. > I think in order to deduplicate "equal but distinct" values we need at > least to give up with index only scans. Because we have no > restriction that equal according to B-tree opclass values are same for > other operations and/or user output. We can either prevent index-only scans in the case of affected indexes, or prevent compression, or give the user a choice. I'm not too worried about how that will work for users just yet. > Do I understand correctly that current patch may produce posting lists > of the same value with overlapping ranges of TIDs? If so, it's > definitely wrong. Yes, it can, since the assertion fails. It looks like the assertion itself was changed to match what I expect, so I assume that this bug will be fixed in the next version of the patch. It fails with a fairly big index on text for me. > > * Maybe we could do compression with unique indexes when inserting > > values with NULLs? Note that we now treat an insertion of a tuple with > > NULLs into a unique index as if it wasn't even a unique index -- see > > the "checkingunique" optimization at the beginning of _bt_doinsert(). > > Having many NULL values in a unique index is probably fairly common. > > I think unique indexes may benefit from deduplication not only because > of NULL values. Non-HOT updates produce duplicates of non-NULL values > in unique indexes. And those duplicates can take significant space. I agree that we should definitely have an open mind about unique indexes, even with non-NULL values. 
If we can prevent a page split by deduplicating the contents of a unique index page, then we'll probably win. Why not try? This will need to be tested. -- Peter Geoghegan
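To make the missing piece concrete: when a new duplicate's heap TID falls inside an existing posting list's min/max range, the list itself has to grow while keeping its TIDs sorted. A minimal, heavily simplified sketch of just that step follows (in reality the tuple has to be rebuilt on the page and WAL-logged, and may no longer fit, which is where the page-split handling comes in):

    #include "postgres.h"
    #include "storage/itemptr.h"

    /*
     * Sketch only: insert 'newtid' into a sorted in-memory array of heap
     * TIDs, preserving TID order.  'tids' must have room for one more entry.
     */
    static void
    posting_list_add_tid(ItemPointerData *tids, int *ntids, ItemPointer newtid)
    {
        int         lo = 0;
        int         hi = *ntids;

        /* binary search for the insertion position */
        while (lo < hi)
        {
            int         mid = lo + (hi - lo) / 2;

            if (ItemPointerCompare(&tids[mid], newtid) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }

        /* shift the tail right and place the new TID */
        memmove(&tids[lo + 1], &tids[lo],
                (*ntids - lo) * sizeof(ItemPointerData));
        tids[lo] = *newtid;
        (*ntids)++;
    }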
On Thu, Jul 11, 2019 at 8:09 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > BTW, I think deduplication could cause some small performance > degradation in some particular cases, because page-level locks became > more coarse grained once pages hold more tuples. However, this > doesn't seem like something we should much care about. Providing an > option to turn deduplication off looks enough for me. There was an issue like this with my v12 work on nbtree, with the TPC-C indexes. They were always ~40% smaller, but there was a regression when TPC-C was used with a small number of warehouses, when the data could easily fit in memory (which is not allowed by the TPC-C spec, in effect). TPC-C is very write-heavy, which combined with everything else causes this problem. I wasn't doing anything too fancy there -- the regression seemed to happen simply because the index was smaller, not because of the overhead of doing page splits differently or anything like that (there were far fewer splits). I expect there to be some regression for workloads like this. I am willing to accept that provided it's not too noticeable, and doesn't have an impact on other workloads. I am optimistic about it. > Regarding bitmap indexes itself, I think our BRIN could provide them. > However, it would be useful to have opclass parameters to make them > tunable. I thought that we might implement them in nbtree myself. But we don't need to decide now. -- Peter Geoghegan
On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> I tried this patch and found the improvements impressive. However,
> when I tried with multi-column indexes it wasn't giving any
> improvement, is it the known limitation of the patch?

It'll only deduplicate full duplicates. It works with multi-column indexes, provided the entire set of values is duplicated -- not just a prefix. Prefix compression is possible, but it's more complicated. It seems to generally require the DBA to specify a prefix length, expressed as a number of prefix columns.

> I am surprised to find that such a patch is on radar since quite some
> years now and not yet committed.

The v12 work on nbtree (making heap TID a tiebreaker column) seems to have made the general approach a lot more effective. Compression is performed lazily, not eagerly, which seems to work a lot better.

> + elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
> IndexTupleSize %zu free %zu",
> + compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
> +
> and other such DEBUG4 statements are meant to be removed, right...?

I hope so too.

> /*
> * If we have only 10 uncompressed items on the full page, it probably
> * won't worth to compress them.
> */
> if (maxoff - n_posting_on_page < 10)
> return;
>
> Is this a magic number...?

I think that this should be a constant or something.

> /*
> * We do not expect to meet any DEAD items, since this function is
> * called right after _bt_vacuum_one_page(). If for some reason we
> * found dead item, don't compress it, to allow upcoming microvacuum
> * or vacuum clean it up.
> */
> if (ItemIdIsDead(itemId))
> continue;
>
> This makes me wonder about those 'some' reasons.

I think that this is just defensive. Note that _bt_vacuum_one_page() is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag is set for the page.

> Caller is responsible for checking BTreeTupleIsPosting to ensure that
> + * he will get what he expects
>
> This can be re-framed to make the caller more gender neutral.

Agreed. I also don't like anthropomorphizing code like this.

> Other than that, I am curious about the plans for its backward compatibility.

Me too. There is something about a new version 5 in comments in nbtree.h, but the version number isn't changed. I think that we may be able to get away with not increasing the B-Tree version from 4 to 5, actually. Deduplication is performed lazily when it looks like we might have to split the page, so there isn't any expectation that tuples will either be compressed or uncompressed in any context.

--
Peter Geoghegan
On Thu, Jul 11, 2019 at 10:42 AM Peter Geoghegan <pg@bowt.ie> wrote: > > I think unique indexes may benefit from deduplication not only because > > of NULL values. Non-HOT updates produce duplicates of non-NULL values > > in unique indexes. And those duplicates can take significant space. > > I agree that we should definitely have an open mind about unique > indexes, even with non-NULL values. If we can prevent a page split by > deduplicating the contents of a unique index page, then we'll probably > win. Why not try? This will need to be tested. I thought about this some more. I believe that the LP_DEAD bit setting within _bt_check_unique() is generally more important than the more complicated kill_prior_tuple mechanism for setting LP_DEAD bits, even though the _bt_check_unique() thing can only be used with unique indexes. Also, I have often thought that we don't do enough to take advantage of the special characteristics of unique indexes -- they really are quite different. I believe that other database systems do this in various ways. Maybe we should too. Unique indexes are special because there can only ever be zero or one tuples of the same value that are visible to any possible MVCC snapshot. Within the index AM, there is little difference between an UPDATE by a transaction and a DELETE + INSERT of the same value by a transaction. If there are 3 or 5 duplicates within a unique index, then there is a strong chance that VACUUM could reclaim some of them, given the chance. It is worth going to a little effort to find out. In a traditional serial/bigserial primary key, the key space that is typically "owned" by the left half of a rightmost page split describes a range of about ~366 items, with few or no gaps for other values that didn't exist at the time of the split (i.e. the two pivot tuples on each side cover a range that is equal to the number of items itself). If the page ever splits again, the chances of it being due to non-HOT updates is perhaps 100%. Maybe VACUUM just didn't get around to the index in time, or maybe there is a long running xact, or whatever. If we can delay page splits in indexes like this, then we could easily prevent them from *ever* happening. Our first line of defense against page splits within unique indexes will probably always be LP_DEAD bits set within _bt_check_unique(), because it costs so little -- same as today. We could also add a second line of defense: deduplication -- same as with non-unique indexes with the patch. But we can even add a third line of defense on top of those two: more aggressive reclaiming of posting list space, by going to the heap to check the visibility status of earlier posting list entries. We can do this optimistically when there is no LP_DEAD bit set, based on heuristics. The high level principle here is that we can justify going to a small amount of extra effort for the chance to avoid a page split, and maybe even more than a small amount. Our chances of reversing the split by merging pages later on are almost zero. The two halves of the split will probably each get dirtied again and again in the future if we cannot avoid it, plus we have to dirty the parent page, and the old sibling page (to update its left link). In general, a page split is already really expensive. 
We could do something like amortize the cost of accessing the heap a second time for tuples that we won't have considered setting the LP_DEAD bit on within _bt_check_unique() by trying the *same* heap page a *second* time where possible (distinct values are likely to be nearby on the same page). I think that an approach like this could work quite well for many workloads. You only pay a cost (visiting the heap an extra time) when it looks like you'll get a benefit (not splitting the page). As you know, Andres already changed nbtree to get an XID for conflict purposes on the primary by visiting the heap a second time (see commit 558a9165e08), when we need to actually reclaim LP_DEAD space. I anticipated that we could extend this to do more clever/eager/lazy cleanup of additional items before that went in, which is a closely related idea. See: https://www.postgresql.org/message-id/flat/CAH2-Wznx8ZEuXu7BMr6cVpJ26G8OSqdVo6Lx_e3HSOOAU86YoQ%40mail.gmail.com#46ffd6f32a60e086042a117f2bfd7df7 I know that this is a bit hand-wavy; the details certainly need to be worked out. However, it is not so different to the "ghost bit" design that other systems use with their non-unique indexes (though this idea applies specifically to unique indexes in our case). The main difference is that we're going to the heap rather than to UNDO, because that's where we store our visibility information. That doesn't seem like such a big difference -- we are also reasonably confident that we'll find that the TID is dead, even without LP_DEAD bits being set, because we only do the extra stuff with unique indexes. And, we do it lazily. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From: Anastasia Lubennikova
11.07.2019 21:19, Peter Geoghegan wrote: > On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote: Hi, Peter, Rafia, thanks for the review. New version is attached. >> + elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d >> IndexTupleSize %zu free %zu", >> + compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page)); >> + >> and other such DEBUG4 statements are meant to be removed, right...? > I hope so too. Yes, these messages are only for debugging. I haven't delete them since this is still work in progress and it's handy to be able to print inner details. Maybe I should also write a patch for pageinspect. >> /* >> * If we have only 10 uncompressed items on the full page, it probably >> * won't worth to compress them. >> */ >> if (maxoff - n_posting_on_page < 10) >> return; >> >> Is this a magic number...? > I think that this should be a constant or something. Fixed. Now this is a constant in nbtree.h. I'm not 100% sure about the value. When the code will stabilize we can benchmark it and find optimal value. >> /* >> * We do not expect to meet any DEAD items, since this function is >> * called right after _bt_vacuum_one_page(). If for some reason we >> * found dead item, don't compress it, to allow upcoming microvacuum >> * or vacuum clean it up. >> */ >> if (ItemIdIsDead(itemId)) >> continue; >> >> This makes me wonder about those 'some' reasons. > I think that this is just defensive. Note that _bt_vacuum_one_page() > is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag > is set for the page. You are right, now it is impossible to meet dead items in this function. Though it can change in the future if, for example, _bt_vacuum_one_page will behave lazily. So this is just a sanity check. Maybe it's worth to move it to Assert. > >> Caller is responsible for checking BTreeTupleIsPosting to ensure that >> + * he will get what he expects >> >> This can be re-framed to make the caller more gender neutral. > Agreed. I also don't like anthropomorphizing code like this. Fixed. >> Other than that, I am curious about the plans for its backward compatibility. > Me too. There is something about a new version 5 in comments in > nbtree.h, but the version number isn't changed. I think that we may be > able to get away with not increasing the B-Tree version from 4 to 5, > actually. Deduplication is performed lazily when it looks like we > might have to split the page, so there isn't any expectation that > tuples will either be compressed or uncompressed in any context. Current implementation is backward compatible. To distinguish posting tuples, it only adds one new flag combination. This combination was never possible before. Comment about version 5 is deleted. I also added a patch for amcheck. There is one major issue left - preserving TID order in posting lists. For a start, I added a sort into BTreeFormPostingTuple function. It turned out to be not very helpful, because we cannot check this invariant lazily. Now I work on patching _bt_binsrch_insert() and _bt_insertonpg() to implement insertion into the middle of the posting list. I'll send a new version this week. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From: Anastasia Lubennikova
17.07.2019 19:36, Anastasia Lubennikova:
>
> There is one major issue left - preserving TID order in posting lists.
> For a start, I added a sort into BTreeFormPostingTuple function.
> It turned out to be not very helpful, because we cannot check this
> invariant lazily.
>
> Now I work on patching _bt_binsrch_insert() and _bt_insertonpg() to
> implement
> insertion into the middle of the posting list. I'll send a new version
> this week.

Patch 0002 (which must be applied on top of 0001) implements preservation of correct TID order inside the posting list when inserting new tuples. This version passes all regression tests, including the amcheck test.

I also used the following script to test insertion into the posting list:

set client_min_messages to debug4;
drop table tbl;
create table tbl (i1 int, i2 int);
insert into tbl select 1, i from generate_series(0,1000) as i;
insert into tbl select 1, i from generate_series(0,1000) as i;
create index idx on tbl (i1);
delete from tbl where i2 < 500;
vacuum tbl;
insert into tbl select 1, i from generate_series(1001, 1500) as i;

The last insert triggers several insertions that can be seen in the debug messages.

I suppose this is not the final version of the patch yet, so I left some debug messages and TODO comments in to ease review.

Please, in your review, pay particular attention to the usage of BTreeTupleGetHeapTID. For posting tuples it returns the first TID from the posting list, like BTreeTupleGetMinTID, but maybe some callers are not ready for that and want BTreeTupleGetMaxTID instead. Incorrect usage of these macros may cause subtle bugs which are probably not covered by tests, so please double-check it.

Next week I'm going to check performance, try to find specific scenarios where this feature can lead to degradation, and measure it, to understand whether we need to make deduplication optional.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
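As an illustration of the min/max distinction mentioned above, the two accessors would behave roughly like this; is_posting_tuple(), posting_tuple_tids() and posting_tuple_ntids() are stand-ins for whatever the patch's nbtree.h macros are actually called:

    #include "postgres.h"
    #include "access/itup.h"

    /* Sketch only: smallest heap TID covered by a leaf tuple. */
    static ItemPointer
    tuple_min_heap_tid(IndexTuple itup)
    {
        if (is_posting_tuple(itup))
            return posting_tuple_tids(itup);    /* TIDs are sorted: first is the minimum */

        return &itup->t_tid;                    /* plain tuple: its single heap TID */
    }

    /* Sketch only: largest heap TID covered by a leaf tuple. */
    static ItemPointer
    tuple_max_heap_tid(IndexTuple itup)
    {
        if (is_posting_tuple(itup))
            return posting_tuple_tids(itup) + (posting_tuple_ntids(itup) - 1);

        return &itup->t_tid;
    }

Callers that really need an upper bound, such as the heap TID invariant asserted in _bt_truncate() quoted earlier in the thread, must use the max variant; quietly getting the minimum TID instead is exactly the kind of subtle bug being warned about here.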
On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Patch 0002 (must be applied on top of 0001) implements preserving of > correct TID order > inside posting list when inserting new tuples. > This version passes all regression tests including amcheck test. > I also used following script to test insertion into the posting list: Nice! > I suppose it is not the final version of the patch yet, > so I left some debug messages and TODO comments to ease review. I'm fine with leaving them in. I have sometimes distributed a separate patch with debug messages, but now that I think about it, that probably wasn't a good use of time. You will probably want to remove at least some of the debug messages during performance testing. I'm thinking of code that appears in very tight inner loops, such as the _bt_compare() code. > Please, in your review, pay particular attention to usage of > BTreeTupleGetHeapTID. > For posting tuples it returns the first tid from posting list like > BTreeTupleGetMinTID, > but maybe some callers are not ready for that and want > BTreeTupleGetMaxTID instead. > Incorrect usage of these macros may cause some subtle bugs, > which are probably not covered by tests. So, please double-check it. One testing strategy that I plan to use for the patch is to deliberately corrupt a compressed index in a subtle way using pg_hexedit, and then see if amcheck detects the problem. For example, I may swap the order of two TIDs in the middle of a posting list, which is something that is unlikely to produce wrong answers to queries, and won't even be detected by the "heapallindexed" check, but is still wrong. If we can detect very subtle, adversarial corruption like this, then we can detect any real-world problem. Once we have confidence in amcheck's ability to detect problems with posting lists in general, we can use it in many different contexts without much thought. For example, we'll probably need to do long running benchmarks to validate the performance of the patch. It's easy to add amcheck testing at the end of each run. Every benchmark is now also a correctness/stress test, for free. > Next week I'm going to check performance and try to find specific > scenarios where this > feature can lead to degradation and measure it, to understand if we need > to make this deduplication optional. Sounds good, though I think it might be a bit too early to decide whether or not it needs to be enabled by default. For one thing, the approach to WAL-logging within _bt_compress_one_page() is probably fairly inefficient, which may be a problem for certain workloads. It's okay to leave it that way for now, because it is not relevant to the core design of the patch. I'm sure that _bt_compress_one_page() can be carefully optimized when the time comes. My current focus is not on the raw performance itself. For now, I am focussed on making sure that the compression works well, and that the resulting indexes "look nice" in general. FWIW, the first few versions of my v12 work on nbtree didn't actually make *anything* go faster. It took a couple of months to fix the more important regressions, and a few more months to fix all of them. I think that the work on this patch may develop in a similar way. I am willing to accept regressions in the unoptimized code during development because it seems likely that you have the right idea about the data structure itself, which is the one thing that I *really* care about. 
Once you get that right, the remaining problems are very likely to either be fixable with further work on optimizing specific code, or a price that users will mostly be happy to pay to get the benefits. -- Peter Geoghegan
On Fri, Jul 19, 2019 at 12:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
> > Patch 0002 (must be applied on top of 0001) implements preserving of
> > correct TID order inside posting list when inserting new tuples.
> > This version passes all regression tests including amcheck test.
> > I also used following script to test insertion into the posting list:
>
> Nice!

Hmm. So, the attached test case fails amcheck verification for me with the latest version of the patch:

$ psql -f amcheck-compress-test.sql
DROP TABLE
CREATE TABLE
CREATE INDEX
CREATE EXTENSION
INSERT 0 2001
psql:amcheck-compress-test.sql:6: ERROR: down-link lower bound invariant violated for index "idx_desc_nl"
DETAIL: Parent block=3 child index tid=(2,2) parent page lsn=10/F87A3438.

Note that this test only has an INSERT statement. You have to use bt_index_parent_check() to see the problem -- bt_index_check() will not detect the problem.

-- Peter Geoghegan
Attachment
On Fri, Jul 19, 2019 at 7:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > Hmm. So, the attached test case fails amcheck verification for me with > the latest version of the patch: Attached is a revised version of your v2 that fixes this issue -- I'll call this v3. In general, my goal for the revision was to make sure that all of my old tests from the v12 work passed, and to make sure that amcheck can detect almost any possible problem. I tested the amcheck changes by corrupting random state in a test index using pg_hexedit, then making sure that amcheck actually complained in each case. I also fixed one or two bugs in passing, including the bug that caused an assertion failure in _bt_truncate(). That was down to a subtle off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't make that many changes to your v2. There are probably some things about the patch that I still don't understand, or things that I have misunderstood. Other changes: * We now support system catalog indexes. There is no reason not to support them. * Removed unnecessary code from _bt_buildadd(). * Added my own new DEBUG4 trace to _bt_insertonpg_in_posting(), which I used to fix that bug I mentioned. I agree that we should keep the DEBUG4 traces around until the overall design settles down. I found the ones that you added helpful, too. * Added quite a few new assertions. For example, we need to still support !heapkeyspace (pre Postgres 12) nbtree indexes, but we cannot let them use compression -- new defensive assertions were added to make this break loudly. * Changed the custom binary search code within _bt_compare_posting() to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know of any reason not to do it that way? * Added quite a few "FIXME"/"XXX" comments at various points, to indicate where I have general concerns that need more discussion. * Included my own pageinspect hack to visualize the minimum TIDs in posting lists. It's broken out into a separate patch file. The code is very rough, but it might help someone else, so I thought I'd include it. I also have some new concerns about the code in the patch that I will point out now (though only as something to think about a solution on -- I am unsure myself): * It's a bad sign that compression involves calls to PageAddItem() that are allowed to fail (we just give up on compression when that happens). For one thing, all existing calls to PageAddItem() in Postgres are never expected to fail -- if they do fail we get a "can't happen" error that suggests corruption. It was a good idea to take this approach to get the patch to work, and to prove the general idea, but we now need to fully work out all the details about the use of space. This includes complicated new questions around how alignment is supposed to work. Alignment in nbtree is already complicated today -- you're supposed to MAXALIGN() everything in nbtree, so that the MAXALIGN() within bufpage.c routines cannot be different to the lp_len/IndexTupleSize() length (note that heapam can have tuples whose lp_len isn't aligned, so nbtree could do it differently if it proved useful). Code within nbtsplitloc.c fully understands the space requirements for the bufpage.c routines, and is very careful about it. (The bufpage.c details are supposed to be totally hidden from code like nbtsplitloc.c, but I guess that that ideal isn't quite possible in reality. Code comments don't really explain the situation today.) 
I'm not sure what it would look like for this patch to be as precise about free space as nbtsplitloc.c already is, even though that seems desirable (I just know that it would mean you would trust PageAddItem() to work in all cases). The patch is different to what we already have today in that it tries to add *less than* a single MAXALIGN() quantum at a time in some places (when a posting list needs to grow by one item). The devil is in the details. * As you know, the current approach to WAL logging is very inefficient. It's okay for now, but we'll need a fine-grained approach for the patch to be commitable. I think that this is subtly related to the last item (i.e. the one about alignment). I have done basic performance tests using unlogged tables. The patch seems to either make big INSERT queries run as fast or faster than before when inserting into unlogged tables, which is a very good start. * Since we can now split a posting list in two, we may also have to reconsider BTMaxItemSize, or some similar mechanism that worries about extreme cases where it becomes impossible to split because even two pages are not enough to fit everything. Think of what happens when there is a tuple with a single large datum, that gets split in two (the tuple is split, not the page), with each half receiving its own copy of the datum. I haven't proven to myself that this is broken, but that may just be because I haven't spent any time on it. OTOH, maybe you already have it right, in which case it seems like it should be explained somewhere. Possibly in nbtree.h. This is tricky stuff. * I agree with all of your existing TODO items -- most of them seem very important to me. * Do we really need to keep BTreeTupleGetHeapTID(), now that we have BTreeTupleGetMinTID()? Can't we combine the two macros into one, so that callers don't need to think about the pivot vs posting list thing themselves? See the new code added to _bt_mkscankey() by v3, for example. It now handles both cases/macros at once, in order to keep its amcheck caller happy. amcheck's verify_nbtree.c received similar ugly code in v3. * We should at least experiment with applying compression when inserting into unique indexes. Like Alexander, I think that compression in unique indexes might work well, given how they must work in Postgres. My next steps will be to study the design of the _bt_insertonpg_in_posting() stuff some more. It seems like you already have the right general idea there, but I would like to come up with a way of making _bt_insertonpg_in_posting() understand how to work with space on the page with total certainty, much like nbtsplitloc.c does today. This should allow us to make WAL-logging more precise/incremental. -- Peter Geoghegan
Attachment
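To make the alignment concern above concrete, here is a small standalone sketch (not patch code) showing how growing a posting list by one 6-byte TID sometimes costs zero extra bytes on the page and sometimes a whole MAXALIGN() quantum. It assumes MAXIMUM_ALIGNOF is 8, as on most 64-bit platforms, and uses a made-up fixed tuple overhead:

#include <stdio.h>
#include <stddef.h>

#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
	(((size_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

#define TID_SIZE		6		/* sizeof(ItemPointerData) */

int
main(void)
{
	size_t		header_and_key = 16;	/* hypothetical fixed part of a posting tuple */

	for (int ntids = 1; ntids <= 6; ntids++)
	{
		size_t		raw = header_and_key + ntids * TID_SIZE;
		size_t		onpage = MAXALIGN(raw);
		size_t		grows = MAXALIGN(raw + TID_SIZE) - onpage;

		/* "grows" alternates between 0 and 8 bytes as the quantum boundary moves */
		printf("%d TIDs: raw=%zu, MAXALIGN'd=%zu, one more TID grows the tuple by %zu bytes\n",
			   ntids, raw, onpage, grows);
	}
	return 0;
}

This is exactly the "less than a single MAXALIGN() quantum at a time" situation: the space a posting-list insertion actually needs on the page depends on where the current length falls within the alignment quantum.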
On Tue, Jul 23, 2019 at 6:22 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is a revised version of your v2 that fixes this issue -- I'll > call this v3. Remember that index that I said was 5.5x smaller with the patch applied, following retail insertions (a single big INSERT ... SELECT ...)? Well, it's 6.5x faster with this small additional patch applied on top of the v3 I posted yesterday. Many of the indexes in my test suite are about ~20% smaller __in addition to__ very big size reductions. Some are even ~30% smaller than they were with v3 of the patch. For example, the fair use implementation of TPC-H that my test data comes from has an index on the "orders" o_orderdate column, named idx_orders_orderdate, which is made ~30% smaller by the addition of this simple patch (once again, this is following a single big INSERT ... SELECT ...). This change makes idx_orders_orderdate ~3.3x smaller than it is with master/Postgres 12, in case you were wondering. This new patch teaches nbtsplitloc.c to subtract posting list overhead when sizing the new high key for the left half of a candidate split point, since we know for sure that _bt_truncate() will at least manage to truncate away that much from the new high key, even in the worst case. Since posting lists are often very large, this can make a big difference. This is actually just a bugfix, not a new idea -- I merely made nbtsplitloc.c understand how truncation works with posting lists. There seems to be a kind of "synergy" between the nbtsplitloc.c handling of pages that have lots of duplicates and posting list compression. It seems as if the former mechanism "sets up the bowling pins", while the latter mechanism "knocks them down", which is really cool. We should try to gain a better understanding of how that works, because it's possible that it could be even more effective in some cases. -- Peter Geoghegan
Attachment
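The accounting idea can be sketched in a few lines of standalone C. The helper name and the sizes below are hypothetical, and the real logic (including alignment) lives in nbtsplitloc.c and _bt_truncate(); this only illustrates the claim that the eventual high key can keep at most one heap TID out of a posting list, so the split-point logic may subtract the rest when sizing it:

#include <stdio.h>

#define TID_SIZE 6				/* sizeof(ItemPointerData) */

/* rough worst-case size of the new left-page high key derived from lastleft */
static int
estimate_highkey_size(int lastleft_size, int lastleft_ntids)
{
	if (lastleft_ntids > 1)
		return lastleft_size - (lastleft_ntids - 1) * TID_SIZE;
	return lastleft_size;
}

int
main(void)
{
	/* a 400-byte posting tuple holding 60 TIDs vs. a plain 40-byte tuple */
	printf("posting tuple high key estimate: %d bytes\n", estimate_highkey_size(400, 60));
	printf("plain tuple high key estimate:   %d bytes\n", estimate_highkey_size(40, 1));
	return 0;
}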
On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan <pg@bowt.ie> wrote: > There seems to be a kind of "synergy" between the nbtsplitloc.c > handling of pages that have lots of duplicates and posting list > compression. It seems as if the former mechanism "sets up the bowling > pins", while the latter mechanism "knocks them down", which is really > cool. We should try to gain a better understanding of how that works, > because it's possible that it could be even more effective in some > cases. I found another important way in which this synergy can fail to take place, which I can fix. By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes from my test suite become much smaller, while most are not affected. These indexes were not helped too much by the patch before. For example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full of duplicates of a single value (that's how it appears after an initial TPC-E bulk load), as are a couple of other TPC-E indexes. TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its idx_lineitem_orderkey index becomes ~15% smaller. I believe that this happened because rightmost page splits were an inefficient case for compression. But rightmost page split heavy indexes with lots of duplicates are not that uncommon. Think of any index with many NULL values, for example. I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm not sure what the idea is behind it. My sense is that we're likely to benefit by delaying page splits, no matter what. Though I am still looking at it purely from a space utilization point of view, at least for now. -- Peter Geoghegan
On Thu, 25 Jul 2019 at 05:49, Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan <pg@bowt.ie> wrote: > > There seems to be a kind of "synergy" between the nbtsplitloc.c > > handling of pages that have lots of duplicates and posting list > > compression. It seems as if the former mechanism "sets up the bowling > > pins", while the latter mechanism "knocks them down", which is really > > cool. We should try to gain a better understanding of how that works, > > because it's possible that it could be even more effective in some > > cases. > > I found another important way in which this synergy can fail to take > place, which I can fix. > > By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes > from my test suite become much smaller, while most are not affected. > These indexes were not helped too much by the patch before. For > example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full > of duplicates of a single value (that's how it appears after an > initial TPC-E bulk load), as are a couple of other TPC-E indexes. > TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its > idx_lineitem_orderkey index becomes ~15% smaller. > > I believe that this happened because rightmost page splits were an > inefficient case for compression. But rightmost page split heavy > indexes with lots of duplicates are not that uncommon. Think of any > index with many NULL values, for example. > > I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm > not sure what the idea is behind it. My sense is that we're likely to > benefit by delaying page splits, no matter what. Though I am still > looking at it purely from a space utilization point of view, at least > for now. > Minor comment fix, pointes-->pointer, plus, are we really doing the half, or is it just splitting into two. /* + * Split posting tuple into two halves. + * + * Left tuple contains all item pointes less than the new one and + * right tuple contains new item pointer and all to the right. + * + * TODO Probably we can come up with more clever algorithm. + */ Some remains of 'he'. +/* + * If tuple is posting, t_tid.ip_blkid contains offset of the posting list. + * Caller is responsible for checking BTreeTupleIsPosting to ensure that + * it will get what he expects + */ Everything reads just fine without 'us'. /* + * This field helps us to find beginning of the remaining tuples from + * postings which follow array of offset numbers. + */ -- Regards, Rafia Sabih
24.07.2019 4:22, Peter Geoghegan wrote: > > Attached is a revised version of your v2 that fixes this issue -- I'll > call this v3. In general, my goal for the revision was to make sure > that all of my old tests from the v12 work passed, and to make sure > that amcheck can detect almost any possible problem. I tested the > amcheck changes by corrupting random state in a test index using > pg_hexedit, then making sure that amcheck actually complained in each > case. > > I also fixed one or two bugs in passing, including the bug that caused > an assertion failure in _bt_truncate(). That was down to a subtle > off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't > make that many changes to your v2. There are probably some things > about the patch that I still don't understand, or things that I have > misunderstood. > Thank you for this review and fixes. > * Changed the custom binary search code within _bt_compare_posting() > to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know > of any reason not to do it that way? It's ok to update it. There was no particular reason, just my habit. > * Added quite a few "FIXME"/"XXX" comments at various points, to > indicate where I have general concerns that need more discussion. + * FIXME: The calls to BTreeGetNthTupleOfPosting() allocate memory, If we only need to check TIDs, we don't need BTreeGetNthTupleOfPosting(), we can use BTreeTupleGetPostingN() instead and iterate over TIDs, not tuples. Fixed in version 4. > * Included my own pageinspect hack to visualize the minimum TIDs in > posting lists. It's broken out into a separate patch file. The code is > very rough, but it might help someone else, so I thought I'd include > it. Cool, I think we should add it to the final patchset, probably, as separate function by analogy with tuple_data_split. > I also have some new concerns about the code in the patch that I will > point out now (though only as something to think about a solution on > -- I am unsure myself): > > * It's a bad sign that compression involves calls to PageAddItem() > that are allowed to fail (we just give up on compression when that > happens). For one thing, all existing calls to PageAddItem() in > Postgres are never expected to fail -- if they do fail we get a "can't > happen" error that suggests corruption. It was a good idea to take > this approach to get the patch to work, and to prove the general idea, > but we now need to fully work out all the details about the use of > space. This includes complicated new questions around how alignment is > supposed to work. The main reason to implement this gentle error handling is the fact that deduplication could cause storage overhead, which leads to running out of space on the page. First of all, it is a legacy of the previous versions where BTreeFormPostingTuple was not able to form non-posting tuple even in case where a number of posting items is 1. Another case that was in my mind is the situation where we have 2 tuples: t_tid | t_info | key + t_tid | t_info | key and compressed result is: t_tid | t_info | key | t_tid | t_tid If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple can be larger. It may happen if keysize <= 4 byte. In this situation original tuples must have been aligned to size 16 bytes each, and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is also safe. I changed DEBUG message to ERROR in v4 and it passes all regression tests. 
I doubt that it covers all corner cases, so I'll try to add more special tests. > Alignment in nbtree is already complicated today -- you're supposed to > MAXALIGN() everything in nbtree, so that the MAXALIGN() within > bufpage.c routines cannot be different to the lp_len/IndexTupleSize() > length (note that heapam can have tuples whose lp_len isn't aligned, > so nbtree could do it differently if it proved useful). Code within > nbtsplitloc.c fully understands the space requirements for the > bufpage.c routines, and is very careful about it. (The bufpage.c > details are supposed to be totally hidden from code like > nbtsplitloc.c, but I guess that that ideal isn't quite possible in > reality. Code comments don't really explain the situation today.) > > I'm not sure what it would look like for this patch to be as precise > about free space as nbtsplitloc.c already is, even though that seems > desirable (I just know that it would mean you would trust > PageAddItem() to work in all cases). The patch is different to what we > already have today in that it tries to add *less than* a single > MAXALIGN() quantum at a time in some places (when a posting list needs > to grow by one item). The devil is in the details. > > * As you know, the current approach to WAL logging is very > inefficient. It's okay for now, but we'll need a fine-grained approach > for the patch to be commitable. I think that this is subtly related to > the last item (i.e. the one about alignment). I have done basic > performance tests using unlogged tables. The patch seems to either > make big INSERT queries run as fast or faster than before when > inserting into unlogged tables, which is a very good start. > > * Since we can now split a posting list in two, we may also have to > reconsider BTMaxItemSize, or some similar mechanism that worries about > extreme cases where it becomes impossible to split because even two > pages are not enough to fit everything. Think of what happens when > there is a tuple with a single large datum, that gets split in two > (the tuple is split, not the page), with each half receiving its own > copy of the datum. I haven't proven to myself that this is broken, but > that may just be because I haven't spent any time on it. OTOH, maybe > you already have it right, in which case it seems like it should be > explained somewhere. Possibly in nbtree.h. This is tricky stuff. Hmm, I can't get the problem. In current implementation each posting tuple is smaller than BTMaxItemSize, so no split can lead to having tuple of larger size. > * I agree with all of your existing TODO items -- most of them seem > very important to me. > > * Do we really need to keep BTreeTupleGetHeapTID(), now that we have > BTreeTupleGetMinTID()? Can't we combine the two macros into one, so > that callers don't need to think about the pivot vs posting list thing > themselves? See the new code added to _bt_mkscankey() by v3, for > example. It now handles both cases/macros at once, in order to keep > its amcheck caller happy. amcheck's verify_nbtree.c received similar > ugly code in v3. No, we don't need them both. I don't mind combining them into one macro. Actually, we never needed BTreeTupleGetMinTID(), since its functionality is covered by BTreeTupleGetHeapTID. On the other hand, in some cases BTreeTupleGetMinTID() looks more readable. 
For example here:

> Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup), BTreeTupleGetMinTID(righttup)) < 0);

> * We should at least experiment with applying compression when
> inserting into unique indexes. Like Alexander, I think that
> compression in unique indexes might work well, given how they must
> work in Postgres.

The main reason why I decided to avoid applying compression to unique indexes is the performance of microvacuum. It is not applied to items inside a posting tuple, and I expect it to be important for unique indexes, which ideally contain only a few live values.

One more thing I want to discuss:

/*
 * We do not expect to meet any DEAD items, since this function is
 * called right after _bt_vacuum_one_page(). If for some reason we
 * found dead item, don't compress it, to allow upcoming microvacuum
 * or vacuum clean it up.
 */
if (ItemIdIsDead(itemId))
    continue;

In the previous review Rafia asked about "some reason". Trying to figure out whether this situation is possible, I changed this line to Assert(!ItemIdIsDead(itemId)) in our test version, and it failed in a performance test. Unfortunately, I was not able to reproduce it. The explanation I see is that the page had DEAD items, but for some reason BTP_HAS_GARBAGE was not set, so _bt_vacuum_one_page() was not called. I find it difficult to understand what could lead to this situation, so we probably need to inspect it more closely to rule out the possibility of a bug.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
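A quick standalone check of the arithmetic above, assuming 8-byte maximum alignment; the sizes mirror the example in the mail (6-byte t_tid, 2-byte t_info, 4-byte key):

#include <stdio.h>

#define MAXALIGN(LEN)	(((LEN) + 7) & ~7)

int
main(void)
{
	int			t_tid = 6,
				t_info = 2,
				key = 4;
	int			plain = MAXALIGN(t_tid + t_info + key);	/* one original tuple: 16 */
	int			posting = MAXALIGN(t_tid + t_info + key + 2 * t_tid);	/* merged: 24 */

	printf("two plain tuples:               %d bytes on page\n", 2 * plain);
	printf("one posting tuple with 2 TIDs:  %d bytes on page\n", posting);
	return 0;
}

So even with the smallest possible key, the merged posting tuple (24 bytes) takes less page space than the two MAXALIGN'd originals (32 bytes), which is the "also safe" conclusion above.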
On Wed, Jul 31, 2019 at 9:23 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > > * Included my own pageinspect hack to visualize the minimum TIDs in > > posting lists. It's broken out into a separate patch file. The code is > > very rough, but it might help someone else, so I thought I'd include > > it. > Cool, I think we should add it to the final patchset, > probably, as separate function by analogy with tuple_data_split. Good idea. Attached is v5, which is based on your v4. The three main differences between this and v4 are: * Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my July 24 e-mail. We can always add something like this back during performance validation of the patch. Right now, having no BT_COMPRESS_THRESHOLD limit definitely improves space utilization for certain important cases, which seems more important than the uncertain/speculative downside. * We now have experimental support for unique indexes. This is broken out into its own patch. * We now handle LP_DEAD items in a special way within _bt_insertonpg_in_posting(). As you pointed out already, we do need to think about LP_DEAD items directly, rather than assuming that they cannot be on the page that _bt_insertonpg_in_posting() must process. More on that later. > If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple > can be > larger. It may happen if keysize <= 4 byte. > In this situation original tuples must have been aligned to size 16 > bytes each, > and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is > also safe. I still need to think about the exact details of alignment within _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I could be wrong. > I changed DEBUG message to ERROR in v4 and it passes all regression tests. > I doubt that it covers all corner cases, so I'll try to add more special > tests. It also passes my tests, FWIW. > Hmm, I can't get the problem. > In current implementation each posting tuple is smaller than BTMaxItemSize, > so no split can lead to having tuple of larger size. That sounds correct, then. > No, we don't need them both. I don't mind combining them into one macro. > Actually, we never needed BTreeTupleGetMinTID(), > since its functionality is covered by BTreeTupleGetHeapTID. I've removed BTreeTupleGetMinTID() in v5. I think it's fine to just have a comment next to BTreeTupleGetHeapTID(), and another comment next to BTreeTupleGetMaxTID(). > The main reason why I decided to avoid applying compression to unique > indexes > is the performance of microvacuum. It is not applied to items inside a > posting > tuple. And I expect it to be important for unique indexes, which ideally > contain only a few live values. I found that the performance of my experimental patch with unique index was significantly worse. It looks like this is a bad idea, as you predicted, though we may still want to do deduplication/compression with NULL values in unique indexes. I did learn a few things from implementing unique index support, though. BTW, there is a subtle bug in how my unique index patch does WAL-logging -- see my comments within index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if replication isn't used. I don't think that we're going to use this experimental patch at all, so I didn't bother fixing the bug. > if (ItemIdIsDead(itemId)) > continue; > > In the previous review Rafia asked about "some reason". 
> Trying to figure out if this situation possible, I changed this line to > Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in a > performance > test. Unfortunately, I was not able to reproduce it. I found it easy enough to see LP_DEAD items within _bt_insertonpg_in_posting() when running pgbench with the extra unique index patch. To give you a simple example of how this can happen, consider the comments about BTP_HAS_GARBAGE within _bt_delitems_vacuum(). That probably isn't the only way it can happen, either. ISTM that we need to be prepared for LP_DEAD items during deduplication, rather than trying to prevent deduplication from ever having to see an LP_DEAD item. v5 makes _bt_insertonpg_in_posting() prepared to overwrite an existing item if it's an LP_DEAD item that falls in the same TID range (that's _bt_compare()-wise "equal" to an existing tuple, which may or may not be a posting list tuple already). I haven't made this code do something like call index_compute_xid_horizon_for_tuples(), even though that's needed for correctness (i.e. this new code is currently broken in the same way that I mentioned unique index support is broken). I also added a nearby FIXME comment to _bt_insertonpg_in_posting() -- I don't think think that the code for splitting a posting list in two is currently crash-safe. How do you feel about officially calling this deduplication, not compression? I think that it's a more accurate name for the technique. -- Peter Geoghegan
Attachment
06.08.2019 4:28, Peter Geoghegan wrote: > Attached is v5, which is based on your v4. The three main differences > between this and v4 are: > > * Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my > July 24 e-mail. We can always add something like this back during > performance validation of the patch. Right now, having no > BT_COMPRESS_THRESHOLD limit definitely improves space utilization for > certain important cases, which seems more important than the > uncertain/speculative downside. Fair enough. I think we can measure performance and make a decision, when patch will stabilize. > * We now have experimental support for unique indexes. This is broken > out into its own patch. > > * We now handle LP_DEAD items in a special way within > _bt_insertonpg_in_posting(). > > As you pointed out already, we do need to think about LP_DEAD items > directly, rather than assuming that they cannot be on the page that > _bt_insertonpg_in_posting() must process. More on that later. > >> If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple >> can be >> larger. It may happen if keysize <= 4 byte. >> In this situation original tuples must have been aligned to size 16 >> bytes each, >> and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is >> also safe. > I still need to think about the exact details of alignment within > _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I > could be wrong. Could you explain more about these cases? Now I don't understand the problem. >> The main reason why I decided to avoid applying compression to unique >> indexes >> is the performance of microvacuum. It is not applied to items inside a >> posting >> tuple. And I expect it to be important for unique indexes, which ideally >> contain only a few live values. > I found that the performance of my experimental patch with unique > index was significantly worse. It looks like this is a bad idea, as > you predicted, though we may still want to do > deduplication/compression with NULL values in unique indexes. I did > learn a few things from implementing unique index support, though. > > BTW, there is a subtle bug in how my unique index patch does > WAL-logging -- see my comments within > index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if > replication isn't used. I don't think that we're going to use this > experimental patch at all, so I didn't bother fixing the bug. Thank you for the patch. Still, I'd suggest to leave it as a possible future improvement, so that it doesn't distract us from the original feature. >> if (ItemIdIsDead(itemId)) >> continue; >> >> In the previous review Rafia asked about "some reason". >> Trying to figure out if this situation possible, I changed this line to >> Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in a >> performance >> test. Unfortunately, I was not able to reproduce it. > I found it easy enough to see LP_DEAD items within > _bt_insertonpg_in_posting() when running pgbench with the extra unique > index patch. To give you a simple example of how this can happen, > consider the comments about BTP_HAS_GARBAGE within > _bt_delitems_vacuum(). That probably isn't the only way it can happen, > either. ISTM that we need to be prepared for LP_DEAD items during > deduplication, rather than trying to prevent deduplication from ever > having to see an LP_DEAD item. I added to v6 another related fix for _bt_compress_one_page(). 
Previous code was implicitly deleted DEAD items without calling index_compute_xid_horizon_for_tuples(). New code has a check whether DEAD items on the page exist and remove them if any. Another possible solution is to copy dead items as is from old page to the new one, but I think it's good to remove dead tuples as fast as possible. > v5 makes _bt_insertonpg_in_posting() prepared to overwrite an > existing item if it's an LP_DEAD item that falls in the same TID range > (that's _bt_compare()-wise "equal" to an existing tuple, which may or > may not be a posting list tuple already). I haven't made this code do > something like call index_compute_xid_horizon_for_tuples(), even > though that's needed for correctness (i.e. this new code is currently > broken in the same way that I mentioned unique index support is > broken). Is it possible that DEAD tuple to delete was smaller than itup? > I also added a nearby FIXME comment to > _bt_insertonpg_in_posting() -- I don't think think that the code for > splitting a posting list in two is currently crash-safe. > Good catch. It seems, that I need to rearrange the code. I'll send updated patch this week. > How do you feel about officially calling this deduplication, not > compression? I think that it's a more accurate name for the technique. I agree. Should I rename all related names of functions and variables in the patch? -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
13.08.2019 18:45, Anastasia Lubennikova wrote:
>> I also added a nearby FIXME comment to
>> _bt_insertonpg_in_posting() -- I don't think that the code for
>> splitting a posting list in two is currently crash-safe.
> Good catch. It seems that I need to rearrange the code.
> I'll send updated patch this week.
Attached is v7.
In this version of the patch, I heavily refactored the code for insertion into
a posting tuple. The _bt_split() logic is quite complex, so I omitted a couple of
optimizations; they are mentioned in TODO comments.
Now the algorithm is the following (a small standalone sketch of the decision logic follows the list):
- If _bt_findinsertloc() finds that the new tuple belongs to an existing posting tuple's
TID interval, it sets the 'in_posting_offset' variable and passes it to
_bt_insertonpg().
- If 'in_posting_offset' is valid and origtup is valid,
merge our itup into origtup.
This can result in one tuple, neworigtup, that must replace origtup, or in two tuples,
neworigtup and newrighttup, if the result exceeds BTMaxItemSize.
- If the new tuple(s) fit into the old page, we're lucky:
call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to
atomically replace oldtup with the new tuple(s) and generate an xlog record.
- If a page split is needed, pass both tuples to _bt_split().
_bt_findsplitloc() is now aware of the upcoming replacement of origtup with
neworigtup, so it uses the correct item size where needed.
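Here is the promised standalone sketch of the decision logic. All names and sizes are illustrative stand-ins (the BTMaxItemSize value, page free space, and tuple overhead are just plausible numbers), not the patch's actual code:

#include <stdio.h>

#define TID_SIZE	  6			/* sizeof(ItemPointerData) */
#define MAX_ITEM_SIZE 2700		/* stand-in for BTMaxItemSize() */
#define TUPLE_HEADER  16		/* hypothetical fixed overhead of a second tuple */

static void
insert_into_posting(int origtup_size, int page_free_space)
{
	int			merged = origtup_size + TID_SIZE;	/* origtup plus the new TID */
	int			ntuples = (merged <= MAX_ITEM_SIZE) ? 1 : 2;
	int			total = (ntuples == 1) ? merged : merged + TUPLE_HEADER;

	if (total - origtup_size <= page_free_space)
		printf("origtup=%d free=%d -> replace in place with %d new tuple(s)\n",
			   origtup_size, page_free_space, ntuples);
	else
		printf("origtup=%d free=%d -> page split, hand %d new tuple(s) to _bt_split()\n",
			   origtup_size, page_free_space, ntuples);
}

int
main(void)
{
	insert_into_posting(600, 200);	/* fits: a single neworigtup */
	insert_into_posting(2698, 200);	/* exceeds the max item size: neworigtup + newrighttup */
	insert_into_posting(2698, 8);	/* two tuples, but no room on the page: page split */
	return 0;
}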
It seems that now all replace operations are crash-safe. The new patch passes
all regression tests, so I think it's ready for review again.
In the meantime, I'll run more stress-tests.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Tue, Aug 13, 2019 at 8:45 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > > I still need to think about the exact details of alignment within > > _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I > > could be wrong. > Could you explain more about these cases? > Now I don't understand the problem. Maybe there is no problem. > Thank you for the patch. > Still, I'd suggest to leave it as a possible future improvement, so that > it doesn't > distract us from the original feature. I don't even think that it's useful work for the future. It's just nice to be sure that we could support unique index deduplication if it made sense. Which it doesn't. If I didn't write the patch that implements deduplication for unique indexes, I might still not realize that we need the index_compute_xid_horizon_for_tuples() stuff in certain other places. I'm not serious about it at all, except as a learning exercise/experiment. > I added to v6 another related fix for _bt_compress_one_page(). > Previous code was implicitly deleted DEAD items without > calling index_compute_xid_horizon_for_tuples(). > New code has a check whether DEAD items on the page exist and remove > them if any. > Another possible solution is to copy dead items as is from old page to > the new one, > but I think it's good to remove dead tuples as fast as possible. I think that what you've done in v7 is probably the best way to do it. It's certainly simple, which is appropriate given that we're not really expecting to see LP_DEAD items within _bt_compress_one_page() (we just need to be prepared for them). > > v5 makes _bt_insertonpg_in_posting() prepared to overwrite an > > existing item if it's an LP_DEAD item that falls in the same TID range > > (that's _bt_compare()-wise "equal" to an existing tuple, which may or > > may not be a posting list tuple already). I haven't made this code do > > something like call index_compute_xid_horizon_for_tuples(), even > > though that's needed for correctness (i.e. this new code is currently > > broken in the same way that I mentioned unique index support is > > broken). > Is it possible that DEAD tuple to delete was smaller than itup? I'm not sure what you mean by this. I suppose that it doesn't matter, since we both prefer the alternative that you came up with anyway. > > How do you feel about officially calling this deduplication, not > > compression? I think that it's a more accurate name for the technique. > I agree. > Should I rename all related names of functions and variables in the patch? Please rename them when convenient. -- Peter Geoghegan
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Now the algorithm is the following: > > - If bt_findinsertloc() found out that tuple belongs to existing posting tuple's > TID interval, it sets 'in_posting_offset' variable and passes it to > _bt_insertonpg() > > - If 'in_posting_offset' is valid and origtup is valid, > merge our itup into origtup. > > It can result in one tuple neworigtup, that must replace origtup; or two tuples: > neworigtup and newrighttup, if the result exceeds BTMaxItemSize, That sounds like the right way to do it. > - If two new tuple(s) fit into the old page, we're lucky. > call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to > atomically replace oldtup with new tuple(s) and generate xlog record. > > - In case page split is needed, pass both tuples to _bt_split(). > _bt_findsplitloc() is now aware of upcoming replacement of origtup with > neworigtup, so it uses correct item size where needed. That makes sense, since _bt_split() is responsible for both splitting the page, and inserting the new item on either the left or right page, as part of the first phase of a page split. In other words, if you're adding something new to _bt_insertonpg(), you probably also need to add something new to _bt_split(). So that's what you did. > It seems that now all replace operations are crash-safe. The new patch passes > all regression tests, so I think it's ready for review again. I'm looking at it now. I'm going to spend a significant amount of time on this tomorrow. I think that we should start to think about efficient WAL-logging now. > In the meantime, I'll run more stress-tests. As you probably realize, wal_consistency_checking is a good thing to use with your tests here. -- Peter Geoghegan
20.08.2019 4:04, Peter Geoghegan wrote:
> On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> It seems that now all replace operations are crash-safe. The new patch passes
>> all regression tests, so I think it's ready for review again.
> I'm looking at it now. I'm going to spend a significant amount of time
> on this tomorrow.
>
> I think that we should start to think about efficient WAL-logging now.

Thank you for the review. The new version, v8, is attached.
Compared to the previous version, this patch includes updated btree_xlog_insert() and btree_xlog_split(), so that WAL records now only contain data about the updated posting tuple and don't require full-page writes.
I haven't updated pg_waldump yet; that is postponed until we agree on the nbtxlog changes.
Also in this patch I renamed all 'compress' keywords to 'deduplicate' and did a minor cleanup of outdated comments.
I'm going to look through the patch once more to update the nbtxlog comments where needed, and to answer your remarks that are still left in the comments.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> I'm going to look through the patch once more to update nbtxlog
> comments, where needed and
> answer to your remarks that are still left in the comments.

Have you been using amcheck's rootdescend verification? I see this problem with v8, with the TPC-H test data:

DEBUG: finished verifying presence of 1500000 tuples from table "customer" with bitset 51.09% set
ERROR: could not find tuple using search from root page in index "idx_customer_nationkey2"

I've been running my standard amcheck query with these databases, which is:

SELECT bt_index_parent_check(index => c.oid, heapallindexed => true, rootdescend => true),
       c.relname,
       c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
  AND c.relpersistence != 't'
  AND c.relkind = 'i'
  AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

There were many large indexes that amcheck didn't detect a problem with. I don't yet understand what the problem is, or why we only see the problem for a small number of indexes. Note that all of these indexes passed verification with v5, so this is some kind of regression.

I also noticed that there were some regressions in the size of indexes -- indexes were not nearly as small as they were in v5 in some cases. The overall picture was a clear regression in how effective deduplication is.

I think that it would save time if you had direct access to my test data, even though it's a bit cumbersome. You'll have to download about 10GB of dumps, which require plenty of disk space when restored:

regression=# \l+
                                              List of databases
    Name    | Owner | Encoding |  Collate   |   Ctype    | Access privileges |  Size   | Tablespace |         Description
------------+-------+----------+------------+------------+-------------------+---------+------------+--------------------------------------------
 land       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 6425 MB | pg_default |
 mgd        | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 61 GB   | pg_default |
 postgres   | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 7753 kB | pg_default | default administrative connection database
 regression | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 886 MB  | pg_default |
 template0  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | unmodifiable empty database
            |       |          |            |            | pg=CTc/pg         |         |            |
 template1  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | default template for new databases
            |       |          |            |            | pg=CTc/pg         |         |            |
 tpcc       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 10 GB   | pg_default |
 tpce       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 26 GB   | pg_default |
 tpch       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 32 GB   | pg_default |
(9 rows)

I have found it very valuable to use this test data when changing nbtsplitloc.c, or anything that could affect where page splits make free space available. If this is too much data to handle conveniently, then you could skip "mgd" and almost have as much test coverage.

There really does seem to be a benefit to using diverse test cases like this, because sometimes regressions only affect a small number of specific indexes for specific reasons. For example, only TPC-H has a small number of indexes that have tuples that are inserted in order, but also have many duplicates. Removing the BT_COMPRESS_THRESHOLD stuff really helped with those indexes.
Want me to send this data and the associated tests script over to you? -- Peter Geoghegan
23.08.2019 7:33, Peter Geoghegan wrote: > On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: >> I'm going to look through the patch once more to update nbtxlog >> comments, where needed and >> answer to your remarks that are still left in the comments. > Have you been using amcheck's rootdescend verification? No, I haven't checked it with the latest version yet. > There were many large indexes that amcheck didn't detect a problem > with. I don't yet understand what the problem is, or why we only see > the problem for a small number of indexes. Note that all of these > indexes passed verification with v5, so this is some kind of > regression. > > I also noticed that there were some regressions in the size of indexes > -- indexes were not nearly as small as they were in v5 in some cases. > The overall picture was a clear regression in how effective > deduplication is. Do these indexes have something in common? Maybe some specific workload? Are there any error messages in log? I'd like to specify what caused the problem. There were several major changes between v5 and v8: - dead tuples handling added in v6; - _bt_split changes for posting tuples in v7; - WAL logging of posting tuple changes in v8. I don't think the last one could break regular indexes on master. Do you see the same regression in v6, v7? > I think that it would save time if you had direct access to my test > data, even though it's a bit cumbersome. You'll have to download about > 10GB of dumps, which require plenty of disk space when restored: > > > Want me to send this data and the associated tests script over to you? > Yes, I think it will help me to debug the patch faster. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Now the algorithm is the following: > - In case page split is needed, pass both tuples to _bt_split(). > _bt_findsplitloc() is now aware of upcoming replacement of origtup with > neworigtup, so it uses correct item size where needed. > > It seems that now all replace operations are crash-safe. The new patch passes > all regression tests, so I think it's ready for review again. I think that the way this works within nbtsplitloc.c is too complicated. In v5, the only thing that nbtsplitloc.c knew about deduplication was that it could be sure that suffix truncation would at least make a posting list into a single heap TID in the worst case. This consideration was mostly about suffix truncation, not deduplication, which seemed like a good thing to me. _bt_split() and _bt_findsplitloc() should know as little as possible about posting lists. Obviously it will sometimes be necessary to deal with the case where a posting list is about to become too big (i.e. it's about to go over BTMaxItemSize()), and so must be split. Less often, a page split will be needed because of one of these posting list splits. These are two complicated areas (posting list splits and page splits), and it would be a good idea to find a way to separate them as much as possible. Remember, nbtsplitloc.c works by pretending that the new item that cannot fit on the page is already on its own imaginary version of the page that *can* fit the new item, along with everything else from the original/actual page. That gets *way* too complicated when it has to deal with the fact that the new item is being merged with an existing item. Perhaps nbtsplitloc.c could also "pretend" that the new item is always a plain tuple, without knowing anything about posting lists. Almost like how it worked in v5. We always want posting lists to be as close to the BTMaxItemSize() size as possible, because that helps with space utilization. In v5 of the patch, this was what happened, because, in effect, we didn't try to do anything complicated with the new item. This worked well, apart from the crash safety issue. Maybe we can simulate the v5 approach, giving us the best of all worlds (good space utilization, simplicity, and crash safety). Something like this: * Posting list splits should always result in one posting list that is at or just under BTMaxItemSize() in size, plus one plain tuple to its immediate right on the page. This is similar to the more common case where we cannot add additional tuples to a posting list due to the BTMaxItemSize() restriction, and so end up with a single tuple (or a smaller posting list with the same value) to the right of a BTMaxItemSize()-sized posting list tuple. I don't see a reason to split a posting list in the middle -- we should always split to the right, leaving the posting list as large as possible. * When there is a simple posting list split, with no page split, the logic required is fairly straightforward: We rewrite the posting list in-place so that our new item goes wherever it belongs in the existing posting list on the page (we memmove() the posting list to make space for the new TID, basically). The old last/rightmost TID in the original posting list becomes a new, plain tuple. We may need a new WAL record for this, but it's not that different to a regular leaf page insert. 
* When this happens to result in a page split, we then have a "fake" new item -- the right half of the posting list that we split, which is always a plain item. Obviously we need to be a bit careful with the WAL logging, but the space accounting within _bt_split() and _bt_findsplitloc() can work just the same as now. nbtsplitloc.c can work like it did in v5, when the only thing it knew about posting lists was that _bt_truncate() always removes them, maybe leaving a single TID behind in the new high key. (Note also that it's not okay to remove the conservative assumption about at least having space for one heap TID within _bt_recsplitloc() -- that needs to be restored to its v5 state in the next version of the patch.) Because deduplication is lazy, there is little value in doing deduplication of the new item (which may or may not be the fake new item). The nbtsplitloc.c logic will "trap" duplicates on the same page today, so we can just let deduplication of the new item happen at a later time. _bt_split() can almost pretend that posting lists don't exist, and nbtsplitloc.c needs to know nothing about posting lists (apart from the way that _bt_truncate() behaves with posting lists). We "lie" to _bt_findsplitloc(), and tell it that the new item is our fake new item -- it doesn't do anything that will be broken by that lie, because it doesn't care about the actual content of posting lists. And, we can fix the "fake new item is not actually real new item" issue at one point within _bt_split(), just as we're about to WAL log. What do you think of that approach? -- Peter Geoghegan
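A minimal standalone illustration of the "split to the right" idea described above, using plain integers as stand-ins for heap TIDs (not the patch's code): the incoming TID is memmove()'d into place, and the displaced rightmost TID becomes the plain item to the immediate right of the posting list.

#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Insert newtid into tids[0..ntids-1] (sorted, already at capacity) and
 * return the displaced rightmost TID, which becomes a plain tuple placed
 * to the immediate right of the posting list.
 */
static int
posting_list_split_right(int *tids, int ntids, int newtid)
{
	int			displaced = tids[ntids - 1];
	int			pos = ntids - 1;

	assert(newtid < displaced);	/* otherwise newtid itself is the plain item */
	while (pos > 0 && tids[pos - 1] > newtid)
		pos--;
	memmove(&tids[pos + 1], &tids[pos], (ntids - 1 - pos) * sizeof(int));
	tids[pos] = newtid;
	return displaced;
}

int
main(void)
{
	int			tids[5] = {10, 20, 30, 40, 50};	/* posting list at capacity */
	int			plain = posting_list_split_right(tids, 5, 25);

	for (int i = 0; i < 5; i++)
		printf("%d ", tids[i]);		/* 10 20 25 30 40 */
	printf("\nplain item to the right: %d\n", plain);	/* 50 */
	return 0;
}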
28.08.2019 6:19, Peter Geoghegan wrote: > On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: >> Now the algorithm is the following: >> - In case page split is needed, pass both tuples to _bt_split(). >> _bt_findsplitloc() is now aware of upcoming replacement of origtup with >> neworigtup, so it uses correct item size where needed. >> >> It seems that now all replace operations are crash-safe. The new patch passes >> all regression tests, so I think it's ready for review again. > I think that the way this works within nbtsplitloc.c is too > complicated. In v5, the only thing that nbtsplitloc.c knew about > deduplication was that it could be sure that suffix truncation would > at least make a posting list into a single heap TID in the worst case. > This consideration was mostly about suffix truncation, not > deduplication, which seemed like a good thing to me. _bt_split() and > _bt_findsplitloc() should know as little as possible about posting > lists. > > Obviously it will sometimes be necessary to deal with the case where a > posting list is about to become too big (i.e. it's about to go over > BTMaxItemSize()), and so must be split. Less often, a page split will > be needed because of one of these posting list splits. These are two > complicated areas (posting list splits and page splits), and it would > be a good idea to find a way to separate them as much as possible. > Remember, nbtsplitloc.c works by pretending that the new item that > cannot fit on the page is already on its own imaginary version of the > page that *can* fit the new item, along with everything else from the > original/actual page. That gets *way* too complicated when it has to > deal with the fact that the new item is being merged with an existing > item. Perhaps nbtsplitloc.c could also "pretend" that the new item is > always a plain tuple, without knowing anything about posting lists. > Almost like how it worked in v5. > > We always want posting lists to be as close to the BTMaxItemSize() > size as possible, because that helps with space utilization. In v5 of > the patch, this was what happened, because, in effect, we didn't try > to do anything complicated with the new item. This worked well, apart > from the crash safety issue. Maybe we can simulate the v5 approach, > giving us the best of all worlds (good space utilization, simplicity, > and crash safety). Something like this: > > * Posting list splits should always result in one posting list that is > at or just under BTMaxItemSize() in size, plus one plain tuple to its > immediate right on the page. This is similar to the more common case > where we cannot add additional tuples to a posting list due to the > BTMaxItemSize() restriction, and so end up with a single tuple (or a > smaller posting list with the same value) to the right of a > BTMaxItemSize()-sized posting list tuple. I don't see a reason to > split a posting list in the middle -- we should always split to the > right, leaving the posting list as large as possible. > > * When there is a simple posting list split, with no page split, the > logic required is fairly straightforward: We rewrite the posting list > in-place so that our new item goes wherever it belongs in the existing > posting list on the page (we memmove() the posting list to make space > for the new TID, basically). The old last/rightmost TID in the > original posting list becomes a new, plain tuple. 
We may need a new > WAL record for this, but it's not that different to a regular leaf > page insert. > > * When this happens to result in a page split, we then have a "fake" > new item -- the right half of the posting list that we split, which is > always a plain item. Obviously we need to be a bit careful with the > WAL logging, but the space accounting within _bt_split() and > _bt_findsplitloc() can work just the same as now. nbtsplitloc.c can > work like it did in v5, when the only thing it knew about posting > lists was that _bt_truncate() always removes them, maybe leaving a > single TID behind in the new high key. (Note also that it's not okay > to remove the conservative assumption about at least having space for > one heap TID within _bt_recsplitloc() -- that needs to be restored to > its v5 state in the next version of the patch.) > > Because deduplication is lazy, there is little value in doing > deduplication of the new item (which may or may not be the fake new > item). The nbtsplitloc.c logic will "trap" duplicates on the same page > today, so we can just let deduplication of the new item happen at a > later time. _bt_split() can almost pretend that posting lists don't > exist, and nbtsplitloc.c needs to know nothing about posting lists > (apart from the way that _bt_truncate() behaves with posting lists). > We "lie" to _bt_findsplitloc(), and tell it that the new item is our > fake new item -- it doesn't do anything that will be broken by that > lie, because it doesn't care about the actual content of posting > lists. And, we can fix the "fake new item is not actually real new > item" issue at one point within _bt_split(), just as we're about to > WAL log. > > What do you think of that approach? I think it's a good idea. Thank you for such a detailed description of various cases. I already started to simplify this code, while debugging amcheck error in v8. At first, I rewrote it to split posting tuple into a posting and a regular tuple instead of two posting tuples. Your explanation helped me to understand that this approach can be extended to the case of insertion into posting list, that doesn't trigger posting split, and that nbtsplitloc indeed doesn't need to know about posting tuples specific. The code is much cleaner now. The new version is attached. It passes regression tests. I also run land and tpch test. They pass amcheck rootdescend and if I interpreted results correctly, the new version shows slightly better compression. \l+ tpch | anastasia | UTF8 | ru_RU.UTF-8 | ru_RU.UTF-8 | | 31 GB | pg_default | land | anastasia | UTF8 | ru_RU.UTF-8 | ru_RU.UTF-8 | | 6380 MB | pg_default | Some individual indexes are larger, some are smaller compared to the expected output. This patch is based on v6, so it again contains "compression" instead of "deduplication" in variable names and comments. I will rename them when code becomes more stable. -- Anastasia Lubennikova Postgres Professional:http://www.postgrespro.com The Russian Postgres Company
Attachment
On Thu, Aug 29, 2019 at 5:13 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Your explanation helped me to understand that this approach can be > extended to > the case of insertion into posting list, that doesn't trigger posting > split, > and that nbtsplitloc indeed doesn't need to know about posting tuples > specific. > The code is much cleaner now. Fantastic! > Some individual indexes are larger, some are smaller compared to the > expected output. I agree that v9 might be ever so slightly more space efficient than v5 was, on balance. In any case v9 completely fixes the regression that I saw in the last version. I have pushed the changes to the test output for the serial tests that I privately maintain, that I gave you access to. The MGD test output also looks perfect. We may find that deduplication is a little too effective, in the sense that it packs so many tuples on to leaf pages that *concurrent* inserters will tend to get excessive page splits. We may find that it makes sense to aim for posting lists that are maybe 96% of BTMaxItemSize() -- note that BTREE_SINGLEVAL_FILLFACTOR is 96 for this reason. Concurrent inserters will tend to have heap TIDs that are slightly out of order, so we want to at least have enough space remaining on the left half of a "single value mode" split. We may end up with a design where deduplication anticipates what will be useful for nbtsplitloc.c. I still think that it's too early to start worrying about problems like this one -- I feel it will be useful to continue to focus on the code and the space utilization of the serial test cases for now. We can look at it at the same time that we think about adding back something like BT_COMPRESS_THRESHOLD. I am mentioning it now because it's probably a good time for you to start thinking about it, if you haven't already (actually, maybe I'm just describing what BT_COMPRESS_THRESHOLD was supposed to do in the first place). We'll need to have a good benchmark to assess these questions, and it's not obvious what that will be. Two possible candidates are TPC-H and TPC-E. (Of course, I mean running them for real -- not using their indexes to make sure that the nbtsplitloc.c stuff works well in isolation.) Any thoughts on a conventional benchmark that allows us to understand the patch's impact on both throughput and latency? BTW, I notice that we often have indexes that are quite a lot smaller when they were created with retail insertions rather than with CREATE INDEX/REINDEX. This is not new, but the difference is much larger than it typically is without the patch. For example, the TPC-E index on trade.t_ca_id (which is named "i_t_ca_id" or "i_t_ca_id2" in my test) is 162 MB with CREATE INDEX/REINDEX, and 121 MB with retail insertions (assuming the insertions use the actual order from the test). I'm not sure what to do about this, if anything. I mean, the reason that the retail insertions do better is that they have the nbtsplitloc.c stuff, and because we don't split the page until it's 100% full and until deduplication stops helping -- we could apply several rounds of deduplication before we actually have to split the cage. So the difference that we see here is both logical and surprising. How do you feel about this CREATE INDEX index-size-is-larger business? -- Peter Geoghegan
On Thu, Aug 29, 2019 at 5:07 PM Peter Geoghegan <pg@bowt.ie> wrote: > I agree that v9 might be ever so slightly more space efficient than v5 > was, on balance. I see some Valgrind errors on v9, all of which look like the following two sample errors I go into below. First one: ==11193== VALGRINDERROR-BEGIN ==11193== Unaddressable byte(s) found during client check request ==11193== at 0x4C0E03: PageAddItemExtended (bufpage.c:332) ==11193== by 0x20F6C3: _bt_split (nbtinsert.c:1643) ==11193== by 0x20F6C3: _bt_insertonpg (nbtinsert.c:1206) ==11193== by 0x21239B: _bt_doinsert (nbtinsert.c:306) ==11193== by 0x2150EE: btinsert (nbtree.c:207) ==11193== by 0x20D63A: index_insert (indexam.c:186) ==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393) ==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593) ==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219) ==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445) ==11193== by 0x36C738: ExecProcNode (executor.h:240) ==11193== by 0x36C738: ExecutePlan (execMain.c:1648) ==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365) ==11193== by 0x36C7DD: ExecutorRun (execMain.c:309) ==11193== by 0x4CC41A: ProcessQuery (pquery.c:161) ==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283) ==11193== by 0x4CD31C: PortalRun (pquery.c:796) ==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231) ==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256) ==11193== by 0x453650: BackendRun (postmaster.c:4446) ==11193== by 0x453650: BackendStartup (postmaster.c:4137) ==11193== by 0x453650: ServerLoop (postmaster.c:1704) ==11193== by 0x454CAC: PostmasterMain (postmaster.c:1377) ==11193== by 0x3B85A1: main (main.c:210) ==11193== Address 0x9c11350 is 0 bytes after a recently re-allocated block of size 8,192 alloc'd ==11193== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==11193== by 0x61085A: AllocSetAlloc (aset.c:914) ==11193== by 0x617AD8: palloc (mcxt.c:938) ==11193== by 0x21A829: _bt_mkscankey (nbtutils.c:107) ==11193== by 0x2118F3: _bt_doinsert (nbtinsert.c:93) ==11193== by 0x2150EE: btinsert (nbtree.c:207) ==11193== by 0x20D63A: index_insert (indexam.c:186) ==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393) ==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593) ==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219) ==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445) ==11193== by 0x36C738: ExecProcNode (executor.h:240) ==11193== by 0x36C738: ExecutePlan (execMain.c:1648) ==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365) ==11193== by 0x36C7DD: ExecutorRun (execMain.c:309) ==11193== by 0x4CC41A: ProcessQuery (pquery.c:161) ==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283) ==11193== by 0x4CD31C: PortalRun (pquery.c:796) ==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231) ==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256) ==11193== by 0x453650: BackendRun (postmaster.c:4446) ==11193== by 0x453650: BackendStartup (postmaster.c:4137) ==11193== by 0x453650: ServerLoop (postmaster.c:1704) ==11193== by 0x454CAC: PostmasterMain (postmaster.c:1377) ==11193== ==11193== VALGRINDERROR-END { <insert_a_suppression_name_here> Memcheck:User fun:PageAddItemExtended fun:_bt_split fun:_bt_insertonpg fun:_bt_doinsert fun:btinsert fun:index_insert fun:ExecInsertIndexTuples fun:ExecInsert fun:ExecModifyTable fun:ExecProcNodeFirst fun:ExecProcNode fun:ExecutePlan fun:standard_ExecutorRun fun:ExecutorRun fun:ProcessQuery fun:PortalRunMulti fun:PortalRun 
fun:exec_simple_query fun:PostgresMain fun:BackendRun fun:BackendStartup fun:ServerLoop fun:PostmasterMain fun:main } nbtinsert.c:1643 is the first PageAddItem() in _bt_split() -- the lefthikey call. Second one: ==11193== VALGRINDERROR-BEGIN ==11193== Invalid read of size 2 ==11193== at 0x20FDF5: _bt_insertonpg (nbtinsert.c:1126) ==11193== by 0x21239B: _bt_doinsert (nbtinsert.c:306) ==11193== by 0x2150EE: btinsert (nbtree.c:207) ==11193== by 0x20D63A: index_insert (indexam.c:186) ==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393) ==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593) ==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219) ==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445) ==11193== by 0x36C738: ExecProcNode (executor.h:240) ==11193== by 0x36C738: ExecutePlan (execMain.c:1648) ==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365) ==11193== by 0x36C7DD: ExecutorRun (execMain.c:309) ==11193== by 0x4CC41A: ProcessQuery (pquery.c:161) ==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283) ==11193== by 0x4CD31C: PortalRun (pquery.c:796) ==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231) ==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256) ==11193== by 0x453650: BackendRun (postmaster.c:4446) ==11193== by 0x453650: BackendStartup (postmaster.c:4137) ==11193== by 0x453650: ServerLoop (postmaster.c:1704) ==11193== by 0x454CAC: PostmasterMain (postmaster.c:1377) ==11193== by 0x3B85A1: main (main.c:210) ==11193== Address 0x9905b90 is 11,088 bytes inside a recently re-allocated block of size 524,288 alloc'd ==11193== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==11193== by 0x61085A: AllocSetAlloc (aset.c:914) ==11193== by 0x617AD8: palloc (mcxt.c:938) ==11193== by 0x1C5677: CopyIndexTuple (indextuple.c:508) ==11193== by 0x20E887: _bt_compress_one_page (nbtinsert.c:2751) ==11193== by 0x21241E: _bt_findinsertloc (nbtinsert.c:773) ==11193== by 0x21241E: _bt_doinsert (nbtinsert.c:303) ==11193== by 0x2150EE: btinsert (nbtree.c:207) ==11193== by 0x20D63A: index_insert (indexam.c:186) ==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393) ==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593) ==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219) ==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445) ==11193== by 0x36C738: ExecProcNode (executor.h:240) ==11193== by 0x36C738: ExecutePlan (execMain.c:1648) ==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365) ==11193== by 0x36C7DD: ExecutorRun (execMain.c:309) ==11193== by 0x4CC41A: ProcessQuery (pquery.c:161) ==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283) ==11193== by 0x4CD31C: PortalRun (pquery.c:796) ==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231) ==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256) ==11193== by 0x453650: BackendRun (postmaster.c:4446) ==11193== by 0x453650: BackendStartup (postmaster.c:4137) ==11193== by 0x453650: ServerLoop (postmaster.c:1704) ==11193== ==11193== VALGRINDERROR-END { <insert_a_suppression_name_here> Memcheck:Addr2 fun:_bt_insertonpg fun:_bt_doinsert fun:btinsert fun:index_insert fun:ExecInsertIndexTuples fun:ExecInsert fun:ExecModifyTable fun:ExecProcNodeFirst fun:ExecProcNode fun:ExecutePlan fun:standard_ExecutorRun fun:ExecutorRun fun:ProcessQuery fun:PortalRunMulti fun:PortalRun fun:exec_simple_query fun:PostgresMain fun:BackendRun fun:BackendStartup fun:ServerLoop fun:PostmasterMain fun:main } nbtinsert.c:1126 is this code from _bt_insertonpg(): elog(DEBUG4, 
"dest before (%u,%u)", ItemPointerGetBlockNumberNoCheck((ItemPointer) dest), ItemPointerGetOffsetNumberNoCheck((ItemPointer) dest)); This is probably harmless, but it needs to be fixed. -- Peter Geoghegan
On Thu, Aug 29, 2019 at 10:10 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I see some Valgrind errors on v9, all of which look like the following two sample errors I go into below.

I've found a fix for these Valgrind issues. It's a matter of making sure that _bt_truncate() sizes new pivot tuples properly, which is quite subtle:

--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2155,8 +2155,11 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
     {
         BTreeTupleClearBtIsPosting(pivot);
         BTreeTupleSetNAtts(pivot, keepnatts);
-        pivot->t_info &= ~INDEX_SIZE_MASK;
-        pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+        if (keepnatts == natts)
+        {
+            pivot->t_info &= ~INDEX_SIZE_MASK;
+            pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+        }
     }

I'm varying how the new pivot tuple is sized here according to whether or not index_truncate_tuple() just does a CopyIndexTuple(). This very slightly changes the behavior of the nbtsplitloc.c stuff, but that's not a concern for me.

I will post a patch with this and other tweaks next week.

-- Peter Geoghegan
On Sat, Aug 31, 2019 at 1:04 AM Peter Geoghegan <pg@bowt.ie> wrote: > I've found a fix for these Valgrind issues. Attach is v10, which fixes the Valgrind issue. Other changes: * The code now fully embraces the idea that posting list splits involve "changing the incoming item" in a way that "avoids" having the new/incoming item overlap with an existing posting list tuple. This allowed me to cut down on the changes required within nbtinsert.c considerably. * Streamlined a lot of the code in nbtsearch.c. I was able to significantly simplify _bt_compare() and _bt_binsrch_insert(). * Removed the DEBUG4 traces. A lot of these had to go when I refactored nbtsearch.c code, so I thought I might as well removed the remaining ones. I hope that you don't mind (go ahead and add them back where that makes sense). * A backwards scan will return "logical tuples" in descending order now. We should do this on general principle, and also because of the possibility of future external code that expects and takes advantage of consistent heap TID order. This change might even have a small performance benefit today, though: Index scans that visit multiple heap pages but only match on a single key will only pin each heap page visited once. Visiting the heap pages in descending order within a B-Tree page full of duplicates, but ascending order within individual posting lists could result in unnecessary extra pinning. * Standardized terminology. We consistently call what the patch adds "deduplication" rather than "compression". * Added a new section on the design to the nbtree README. This is fairly high level, and talks about dynamics that we can't really talk about anywhere else, such as how nbtsplitloc.c "cooperates" with deduplication, producing an effect that is greater than the sum of its parts. * I also made some changes to the WAL logging for leaf page insertions and page splits. I didn't add the optimization that you anticipated in your nbtxlog.h comments (i.e. only WAL-log a rewritten posting list when it will go on the left half of the split, just like the new/incoming item thing we have already). I agree that that's a good idea, and should be added soon. Actually, I think the whole "new item vs. rewritten posting list item" thing makes the WAL logging confusing, so this is not really about performance. Maybe the easiest way to do this is also the way that performs best. I'm thinking of this: maybe we could completely avoid WAL-logging the entire rewritten/split posting list. After all, the contents of the rewritten posting list are derived from the existing/original posting list, as well as the new/incoming item. We can make the WAL record much smaller on average by making standbys repeat a little bit of the work performed on the primary. Maybe we could WAL-log "in_posting_offset" itself, and an ItemPointerData (obviously the new item offset number tells us the offset number of the posting list that must be replaced/memmoved()'d). Then have the standby repeat some of the work performed on the primary -- at least the work of swapping a heap TID could be repeated on standbys, since it's very little extra work for standbys, but could really reduce the WAL volume. This might actually be simpler. The WAL logging that I didn't touch in v10 is the most important thing to improve. I am talking about the WAL-logging that is performed as part of deduplicating all items on a page, to avoid a page split (i.e. the WAL-logging within _bt_dedup_one_page()). 
That still just does a log_newpage_buffer() in v10, which is pretty inefficient. Much like the posting list split WAL logging stuff, WAL logging in _bt_dedup_one_page() can probably be made more efficient by describing deduplication in terms of logical changes. For example, the WAL records should consist of metadata that could be read by a human as "merge the tuples from offset number 15 until offset number 27". Perhaps this could also share code with the posting list split stuff. What do you think? Once we make the WAL-logging within _bt_dedup_one_page() more efficient, that also makes it fairly easy to make the deduplication that it performs occur incrementally, maybe even very incrementally. I can imagine the _bt_dedup_one_page() caller specifying "my new tuple is 32 bytes, and I'd really like to not have to split the page, so please at least do enough deduplication to make it fit". Delaying deduplication increases the amount of time that we have to set the LP_DEAD bit for remaining items on the page, which might be important. Also, spreading out the volume of WAL produced by deduplication over time might be important with certain workloads. We would still probably do somewhat more work than strictly necessary to avoid a page split if we were to make _bt_dedup_one_page() incremental like this, though not by a huge amount. OTOH, maybe I am completely wrong about "incremental deduplication" being a good idea. It seems worth experimenting with, though. It's not that much more work on top of making the _bt_dedup_one_page() WAL-logging efficient, which seems like the thing we should focus on now. Thoughts? -- Peter Geoghegan
Attachment
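The "merge the tuples from offset number 15 until offset number 27" style of record can be modelled as a list of (offset, count) intervals, as in the standalone sketch below. The struct and function names are invented for illustration and are not the patch's actual WAL record layout; the point is that redo can rebuild each posting list from the items already on the page, so no tuple images need to travel in the record.

/*
 * Toy model of interval-based deduplication WAL logging.  A "page" is
 * just an array of keys with a TID count per item; "redo" collapses
 * each interval into a single posting item, guided only by
 * (start offset, number of items) pairs.
 */
#include <stdio.h>

typedef struct DedupInterval
{
    int         baseoff;        /* offset of the first item in the group */
    int         nitems;         /* number of items merged into one */
} DedupInterval;

typedef struct ToyPage
{
    int         keys[32];
    int         ntids[32];      /* heap TIDs represented by each item */
    int         nitems;
} ToyPage;

static void
dedup_redo(ToyPage *page, const DedupInterval *ivals, int nivals)
{
    ToyPage     out = {.nitems = 0};
    int         i = 0;

    for (int off = 0; off < page->nitems;)
    {
        if (i < nivals && off == ivals[i].baseoff)
        {
            int         merged = 0;

            for (int j = 0; j < ivals[i].nitems; j++)
                merged += page->ntids[off + j];
            out.keys[out.nitems] = page->keys[off];
            out.ntids[out.nitems++] = merged;
            off += ivals[i++].nitems;
        }
        else
        {
            out.keys[out.nitems] = page->keys[off];
            out.ntids[out.nitems++] = page->ntids[off];
            off++;
        }
    }
    *page = out;
}

int
main(void)
{
    ToyPage     page = {
        .keys = {1, 1, 1, 2, 3, 3},
        .ntids = {1, 1, 1, 1, 1, 1},
        .nitems = 6
    };
    /* the record says: merge 3 items at offset 0, and 2 items at offset 4 */
    DedupInterval ivals[] = {{0, 3}, {4, 2}};

    dedup_redo(&page, ivals, 2);
    for (int k = 0; k < page.nitems; k++)
        printf("key=%d ntids=%d\n", page.keys[k], page.ntids[k]);
    return 0;
}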
On Mon, Sep 2, 2019 at 6:53 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attach is v10, which fixes the Valgrind issue. Attached is v11, which makes the kill_prior_tuple optimization work with posting list tuples. The only catch is that it can only work when all "logical tuples" within a posting list are known-dead, since of course there is only one LP_DEAD bit available for each posting list. The hardest part of this kill_prior_tuple work was writing the new _bt_killitems() code, which I'm still not 100% happy with. Still, it seems to work well -- new pageinspect LP_DEAD status info was added to the second patch to verify that we're setting LP_DEAD bits as needed for posting list tuples. I also had to add a new nbtree-specific, posting-list-aware version of index_compute_xid_horizon_for_tuples() -- _bt_compute_xid_horizon_for_tuples(). Finally, it was necessary to avoid splitting a posting list with the LP_DEAD bit set. I took a naive approach to avoiding that problem, adding code to _bt_findinsertloc() to prevent it. Posting list splits are generally assumed to be rare, so the fact that this is slightly inefficient should be fine IMV. I also refactored deduplication itself in anticipation of making the WAL logging more efficient, and incremental. So, the structure of the code within _bt_dedup_one_page() was simplified, without really changing it very much (I think). I also fixed a bug in _bt_dedup_one_page(). The check for dead items was broken in previous versions, because the loop examined the high key tuple in every iteration. Making _bt_dedup_one_page() more efficient and incremental is still the most important open item for the patch. -- Peter Geoghegan
Attachment
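The LP_DEAD constraint described above boils down to a simple rule: a posting list's single LP_DEAD bit may be set only if every heap TID it contains is known to be dead. A minimal standalone model of that check follows (invented names and a linear lookup for brevity -- not the patch's _bt_killitems() code, which also has to worry about things like the kill array overflowing after a scan direction change).

/*
 * Model of the kill_prior_tuple rule for posting lists: the single
 * LP_DEAD bit may only be set if *every* heap TID in the posting list
 * is among the TIDs the scan saw as dead.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct TID
{
    unsigned int block;
    unsigned short offset;
} TID;

static bool
tid_eq(TID a, TID b)
{
    return a.block == b.block && a.offset == b.offset;
}

static bool
tid_is_killed(TID t, const TID *killed, int nkilled)
{
    for (int i = 0; i < nkilled; i++)
        if (tid_eq(t, killed[i]))
            return true;
    return false;
}

/* may we set the posting list's one-and-only LP_DEAD bit? */
static bool
can_mark_posting_dead(const TID *posting, int ntids,
                      const TID *killed, int nkilled)
{
    for (int i = 0; i < ntids; i++)
        if (!tid_is_killed(posting[i], killed, nkilled))
            return false;       /* at least one TID may still be visible */
    return true;
}

int
main(void)
{
    TID         posting[] = {{3, 1}, {3, 2}, {7, 5}};
    TID         partial[] = {{3, 1}, {7, 5}};
    TID         full[] = {{3, 1}, {3, 2}, {7, 5}};

    printf("partial kill -> LP_DEAD allowed? %d\n",
           can_mark_posting_dead(posting, 3, partial, 2));
    printf("full kill    -> LP_DEAD allowed? %d\n",
           can_mark_posting_dead(posting, 3, full, 3));
    return 0;
}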
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From
Anastasia Lubennikova
Date:
09.09.2019 22:54, Peter Geoghegan wrote:
> Attached is v11, which makes the kill_prior_tuple optimization work with posting list tuples. The only catch is that it can only work when all "logical tuples" within a posting list are known-dead, since of course there is only one LP_DEAD bit available for each posting list.
>
> The hardest part of this kill_prior_tuple work was writing the new _bt_killitems() code, which I'm still not 100% happy with. Still, it seems to work well -- new pageinspect LP_DEAD status info was added to the second patch to verify that we're setting LP_DEAD bits as needed for posting list tuples. I also had to add a new nbtree-specific, posting-list-aware version of index_compute_xid_horizon_for_tuples() -- _bt_compute_xid_horizon_for_tuples(). Finally, it was necessary to avoid splitting a posting list with the LP_DEAD bit set. I took a naive approach to avoiding that problem, adding code to _bt_findinsertloc() to prevent it. Posting list splits are generally assumed to be rare, so the fact that this is slightly inefficient should be fine IMV.
>
> I also refactored deduplication itself in anticipation of making the WAL logging more efficient, and incremental. So, the structure of the code within _bt_dedup_one_page() was simplified, without really changing it very much (I think). I also fixed a bug in _bt_dedup_one_page(). The check for dead items was broken in previous versions, because the loop examined the high key tuple in every iteration.
>
> Making _bt_dedup_one_page() more efficient and incremental is still the most important open item for the patch.

Hi, thank you for the fixes and improvements. I reviewed them and everything looks good except the idea of not splitting dead posting tuples. According to the comments about scan->ignore_killed_tuples in genam.c:107, it may lead to incorrect tuple order on a replica. I'm not sure whether it leads to any real problem, though, or whether it will be resolved by subsequent visibility checks. Anyway, it's worth adding more comments in _bt_killitems() explaining why it's safe.

Attached is v12, which contains WAL optimizations for the posting split and page deduplication. Changes relative to the prior version:

* The xl_btree_split record doesn't contain the posting tuple anymore; instead it keeps the 'in_posting offset' and repeats the logic of _bt_insertonpg(), as you proposed upthread.

* I introduced a new xlog record, XLOG_BTREE_DEDUP_PAGE, which contains info about the groups of tuples deduplicated into posting tuples. In principle, it is possible to fit it into some existing record, but I preferred to keep things clear. I haven't measured how these changes affect WAL size yet. Do you have any suggestions on how to automate testing of new WAL records? Is there any suitable place in the regression tests?

* I also noticed that _bt_dedup_one_page() can be optimized to return early when no tuples were deduplicated.

I wonder if we can introduce internal statistics to tune deduplication? That is returning to the idea of BT_COMPRESS_THRESHOLD, which can help to avoid extra work for pages that have very few duplicates or pages that are already full of posting lists. To be honest, I don't believe that incremental deduplication can really improve anything, because no matter how many items are compressed we still rewrite all items from the original page to the new one, so why not do our best? What do we save with this incremental approach?

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Wed, Sep 11, 2019 at 5:38 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > I reviewed them and everything looks good except the idea of not > splitting dead posting tuples. > According to comments to scan->ignore_killed_tuples in genam.c:107, > it may lead to incorrect tuple order on a replica. > I don't sure, if it leads to any real problem, though, or it will be > resolved > by subsequent visibility checks. Fair enough, but I didn't do that because it's compelling on its own -- it isn't. I did it because it seemed like the best way to handle posting list splits in a version of the patch where LP_DEAD bits can be set on posting list tuples. I think that we have 3 high level options here: 1. We don't support kill_prior_tuple/LP_DEAD bit setting with posting lists at all. This is clearly the easiest approach. 2. We do what I did in v11 of the patch -- we make it so that _bt_insertonpg() and _bt_split() never have to deal with LP_DEAD posting lists that they must split in passing. 3. We add additional code to _bt_insertonpg() and _bt_split() to deal with the rare case where they must split an LP_DEAD posting list, probably by unsetting the bit or something like that. Obviously it would be wrong to leave the LP_DEAD bit set for the newly inserted heap tuples TID that must go in a posting list that had its LP_DEAD bit set -- that would make it dead to index scans even after its xact successfully committed. I think that you already agree that we want to have the kill_prior_tuple optimizations with posting lists, so #1 isn't really an option. That just leaves #2 and #3. Since posting list splits are already assumed to be quite rare, it seemed far simpler to take the conservative approach of forcing clean-up that removes LP_DEAD bits so that _bt_insertonpg() and _bt_split() don't have to think about it. Obviously I think it's important that we make as few changes as possible to _bt_insertonpg() and _bt_split(), in general. I don't understand what you mean about visibility checks. There is nothing truly special about the way in which _bt_findinsertloc() will sometimes have to kill LP_DEAD items so that _bt_insertonpg() and _bt_split() don't have to think about LP_DEAD posting lists. As far as recovery is concerned, it is just another XLOG_BTREE_DELETE record, like any other. Note that there is a second call to _bt_binsrch_insert() within _bt_findinsertloc() when it has to generate a new XLOG_BTREE_DELETE record (by calling _bt_dedup_one_page(), which calls _bt_delitems_delete() in a way that isn't dependent on the BTP_HAS_GARBAGE status bit being set). > Anyway, it's worth to add more comments in > _bt_killitems() explaining why it's safe. There is no question that the little snippet of code I added to _bt_killitems() in v11 is still too complicated. We also have to consider cases where the array overflows because the scan direction was changed (see the kill_prior_tuple comment block in btgetuple()). Yeah, it's messy. > Attached is v12, which contains WAL optimizations for posting split and > page > deduplication. Cool. > * xl_btree_split record doesn't contain posting tuple anymore, instead > it keeps > 'in_posting offset' and repeats the logic of _bt_insertonpg() as you > proposed > upthread. That looks good. > * I introduced new xlog record XLOG_BTREE_DEDUP_PAGE, which contains > info about > groups of tuples deduplicated into posting tuples. In principle, it is > possible > to fit it into some existing record, but I preferred to keep things clear. 
I definitely think that inventing a new WAL record was the right thing to do. > I haven't measured how these changes affect WAL size yet. > Do you have any suggestions on how to automate testing of new WAL records? > Is there any suitable place in regression tests? I don't know about the regression tests (I doubt that there is a natural place for such a test), but I came up with a rough test case. I more or less copied the approach that you took with the index build WAL reduction patches, though I also figured out a way of subtracting heapam WAL overhead to get a real figure. I attach the test case -- note that you'll need to use the "land" database with this. (This test case might need to be improved, but it's a good start.) > * I also noticed that _bt_dedup_one_page() can be optimized to return early > when none tuples were deduplicated. I wonder if we can introduce inner > statistic to tune deduplication? That is returning to the idea of > BT_COMPRESS_THRESHOLD, which can help to avoid extra work for pages that > have > very few duplicates or pages that are already full of posting lists. I think that the BT_COMPRESS_THRESHOLD idea is closely related to making _bt_dedup_one_page() behave incrementally. On my machine, v12 of the patch actually uses slightly more WAL than v11 did with the nbtree_wal_test.sql test case -- it's 6510 MB of nbtree WAL in v12 vs. 6502 MB in v11 (note that v11 benefits from WAL compression, so if I turned that off v12 would probably win by a small amount). Both numbers are wildly excessive, though. The master branch figure is only 2011 MB, which is only about 1.8x the size of the index on the master branch. And this is for a test case that makes the index 6.5x smaller, so the gap between total index size and total WAL volume is huge here -- the volume of WAL is nearly 40x greater than the index size! You are right to wonder what the result would be if we put BT_COMPRESS_THRESHOLD back in. It would probably significantly reduce the volume of WAL, because _bt_dedup_one_page() would no longer "thrash". However, I strongly suspect that that wouldn't be good enough at reducing the WAL volume down to something acceptable. That will require an approach to WAL-logging that is much more logical than physical. The nbtree_wal_test.sql test case involves a case where page splits mostly don't WAL-log things that were previously WAL-logged by simple inserts, because nbtsplitloc.c has us split in a right-heavy fashion when there are lots of duplicates. In other words, the _bt_split() optimization to WAL volume naturally works very well with the test case, or really any case with lots of duplicates, so the "write amplification" to the total volume of WAL is relatively small on the master branch. I think that the new WAL record has to be created once per posting list that is generated, not once per page that is deduplicated -- that's the only way that I can see that avoids a huge increase in total WAL volume. Even if we assume that I am wrong about there being value in making deduplication incremental, it is still necessary to make the WAL-logging behave incrementally. Otherwise you end up needlessly rewriting things that didn't actually change way too often. That's definitely not okay. Why worry about bringing 40x down to 20x, or even 10x? It needs to be comparable to the master branch. 
> To be honest, I don't believe that incremental deduplication can really > improve > something, because no matter how many items were compressed we still > rewrite > all items from the original page to the new one, so, why not do our best. > What do we save by this incremental approach? The point of being incremental is not to save work in cases where a page split is inevitable anyway. Rather, the idea is that we can be even more lazy, and avoid doing work that will never be needed -- maybe delaying page splits actually means preventing them entirely. Or, we can spread out the work over time, so that the amount of WAL per checkpoint is smoother than what we would get with a batch approach. My mental model of page splits is that there are sometimes many of them on the same page again and again in a very short time period, but more often the chances of any individual page being split is low. Even the rightmost page of a serial PK index isn't truly an exception, because a new rightmost page isn't "the same page" as the original rightmost page -- it is its new right sibling. Since we're going to have to optimize the WAL logging anyway, it will be relatively easy to experiment with incremental deduplication within _bt_dedup_one_page(). The WAL logging is the the hard part, so let's focus on that rather than worrying too much about whether or not incrementally doing all the work (not just the WAL logging) makes sense. It's still too early to be sure about whether or not that's a good idea. -- Peter Geoghegan
Attachment
On Wed, Sep 11, 2019 at 5:38 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Attached is v12, which contains WAL optimizations for posting split and > page > deduplication. Hmm. So v12 seems to have some problems with the WAL logging for posting list splits. With wal_debug = on and wal_consistency_checking='all', I can get a replica to fail consistency checking very quickly when "make installcheck" is run on the primary: 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30423A0; LSN 0/30425A0: prev 0/3041C78; xid 506; len 3; blkref #0: rel 1663/16385/2608, blk 56 FPW - Heap/INSERT: off 20 flags 0x00 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30425A0; LSN 0/3042F78: prev 0/30423A0; xid 506; len 4; blkref #0: rel 1663/16385/2673, blk 13 FPW - Btree/INSERT_LEAF: off 138; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3042F78; LSN 0/3043788: prev 0/30425A0; xid 506; len 4; blkref #0: rel 1663/16385/2674, blk 37 FPW - Btree/INSERT_LEAF: off 68; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043788; LSN 0/30437C0: prev 0/3042F78; xid 506; len 28 - Transaction/ABORT: 2019-09-11 15:01:06.291717-07; rels: pg_tblspc/16388/PG_13_201909071/16385/16399 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30437C0; LSN 0/3043A30: prev 0/3043788; xid 507; len 3; blkref #0: rel 1663/16385/1247, blk 9 FPW - Heap/INSERT: off 9 flags 0x00 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043A30; LSN 0/3043D08: prev 0/30437C0; xid 507; len 4; blkref #0: rel 1663/16385/2703, blk 2 FPW - Btree/INSERT_LEAF: off 51; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043D08; LSN 0/3044948: prev 0/3043A30; xid 507; len 4; blkref #0: rel 1663/16385/2704, blk 1 FPW - Btree/INSERT_LEAF: off 169; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3044948; LSN 0/3044B58: prev 0/3043D08; xid 507; len 3; blkref #0: rel 1663/16385/2608, blk 56 FPW - Heap/INSERT: off 21 flags 0x00 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3044B58; LSN 0/30454A0: prev 0/3044948; xid 507; len 4; blkref #0: rel 1663/16385/2673, blk 8 FPW - Btree/INSERT_LEAF: off 156; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30454A0; LSN 0/3045CC0: prev 0/3044B58; xid 507; len 4; blkref #0: rel 1663/16385/2674, blk 37 FPW - Btree/INSERT_LEAF: off 71; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3045CC0; LSN 0/3045F48: prev 0/30454A0; xid 507; len 3; blkref #0: rel 1663/16385/1247, blk 9 FPW - Heap/INSERT: off 10 flags 0x00 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3045F48; LSN 0/3046240: prev 0/3045CC0; xid 507; len 4; blkref #0: rel 1663/16385/2703, blk 2 FPW - Btree/INSERT_LEAF: off 51; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3046240; LSN 0/3046E70: prev 0/3045F48; xid 507; len 4; blkref #0: rel 1663/16385/2704, blk 1 FPW - Btree/INSERT_LEAF: off 44; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3046E70; LSN 0/3047090: prev 0/3046240; xid 507; len 3; blkref #0: rel 1663/16385/2608, blk 56 FPW - Heap/INSERT: off 22 flags 0x00 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3047090; LSN 0/30479E0: prev 0/3046E70; xid 507; len 4; blkref #0: rel 1663/16385/2673, blk 8 FPW - Btree/INSERT_LEAF: off 156; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30479E0; LSN 0/3048420: prev 0/3047090; xid 507; len 4; blkref #0: rel 1663/16385/2674, blk 38 FPW - Btree/INSERT_LEAF: off 10; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3048420; LSN 0/30486B0: prev 0/30479E0; xid 507; len 3; blkref #0: rel 1663/16385/1259, 
blk 0 FPW - Heap/INSERT: off 6 flags 0x00 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30486B0; LSN 0/3048C30: prev 0/3048420; xid 507; len 4; blkref #0: rel 1663/16385/2662, blk 2 FPW - Btree/INSERT_LEAF: off 119; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3048C30; LSN 0/3049668: prev 0/30486B0; xid 507; len 4; blkref #0: rel 1663/16385/2663, blk 1 FPW - Btree/INSERT_LEAF: off 42; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3049668; LSN 0/304A550: prev 0/3048C30; xid 507; len 4; blkref #0: rel 1663/16385/3455, blk 1 FPW - Btree/INSERT_LEAF: off 2; in_posting_offset 1 4448/2019-09-11 15:01:06 PDT FATAL: inconsistent page found, rel 1663/16385/3455, forknum 0, blkno 1 4448/2019-09-11 15:01:06 PDT CONTEXT: WAL redo at 0/3049668 for Btree/INSERT_LEAF: off 2; in_posting_offset 1 4447/2019-09-11 15:01:06 PDT LOG: startup process (PID 4448) exited with exit code 1 4447/2019-09-11 15:01:06 PDT LOG: terminating any other active server processes 4447/2019-09-11 15:01:06 PDT LOG: database system is shut down I regularly use this test case for the patch -- I think that I fixed a similar problem in v11, when I changed the same WAL logging, but I didn't mention it until now. I will debug this myself in a few days, though you may prefer to do it before then. -- Peter Geoghegan
On Wed, Sep 11, 2019 at 3:09 PM Peter Geoghegan <pg@bowt.ie> wrote: > Hmm. So v12 seems to have some problems with the WAL logging for > posting list splits. With wal_debug = on and > wal_consistency_checking='all', I can get a replica to fail > consistency checking very quickly when "make installcheck" is run on > the primary I see the bug here. The problem is that we WAL-log a version of the new item that already has its heap TID changed. On the primary, the call to _bt_form_newposting() has a new item with the original heap TID, which is then rewritten before being inserted -- that's correct. But during recovery, we *start out with* a version of the new item that *already* had its heap TID swapped. So we have nowhere to get the original heap TID from during recovery. Attached patch fixes the problem in a hacky way -- it WAL-logs the original heap TID, just in case. Obviously this fix isn't usable, but it should make the problem clearer. Can you come up with a proper fix, please? I can think of one way of doing it, but I'll leave the details to you. The same issue exists in _bt_split(), so the tests will still fail with wal_consistency_checking -- it just takes a lot longer to reach a point where an inconsistent page is found, because posting list splits that occur at the same point that we need to split a page are much rarer than posting list splits that occur when we simply need to insert, without splitting the page. I suggest using wal_consistency_checking to test the fix that you come up with. As I mentioned, I regularly use it. Also note that there are further subtleties to doing this within _bt_split() -- see the FIXME comments there. Thanks -- Peter Geoghegan
Attachment
On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that the new WAL record has to be created once per posting > list that is generated, not once per page that is deduplicated -- > that's the only way that I can see that avoids a huge increase in > total WAL volume. Even if we assume that I am wrong about there being > value in making deduplication incremental, it is still necessary to > make the WAL-logging behave incrementally. Attached is v13 of the patch, which shows what I mean. You could say that v13 makes _bt_dedup_one_page() do a few extra things that are kind of similar to the things that nbtsplitloc.c does for _bt_split(). More specifically, the v13-0001-* patch includes code that makes _bt_dedup_one_page() "goal orientated" -- it calculates how much space will be freed when _bt_dedup_one_page() goes on to deduplicate those items on the page that it has already "decided to deduplicate". The v13-0002-* patch makes _bt_dedup_one_page() actually use this ability -- it makes _bt_dedup_one_page() give up on deduplication when it is clear that the items that are already "pending deduplication" will free enough space for its caller to at least avoid a page split. This revision of the patch doesn't truly make deduplication incremental. It is only a proof of concept that shows how _bt_dedup_one_page() can *decide* that it will free "enough" space, whatever that may mean, so that it can finish early. The task of making _bt_dedup_one_page() actually avoid lots of work when it finishes early remains. As I said yesterday, I'm not asking you to accept that v13-0002-* is an improvement. At least not yet. In fact, "finishes early" due to the v13-0002-* logic clearly makes everything a lot slower, since _bt_dedup_one_page() will "thrash" even more than earlier versions of the patch. This is especially problematic with WAL-logged relations -- the test case that I shared yesterday goes from about 6GB to 10GB with v13-0002-* applied. But we need to fundamentally rethink the approach to the rewriting + WAL-logging by _bt_dedup_one_page() anyway. (Note that total index space utilization is barely affected by the v13-0002-* patch, so clearly that much works well.) Other changes: * Small tweaks to amcheck (nothing interesting, really). * Small tweaks to the _bt_killitems() stuff. * Moved all of the deduplication helper functions to nbtinsert.c. This is where deduplication gets complicated, so I think that it should all live there. (i.e. nbtsort.c will call nbtinsert.c code, never the other way around.) Note that I haven't merged any of the changes from v12 of the patch from yesterday. I didn't merge the posting list WAL logging changes because of the bug I reported, but I would have were it not for that. The WAL logging for _bt_dedup_one_page() added to v12 didn't appear to be more efficient than your original approach (i.e. calling log_newpage_buffer()), so I have stuck with your original approach. It would be good to hear your thoughts on this _bt_dedup_one_page() WAL volume/"write amplification" issue. -- Peter Geoghegan
Attachment
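The "goal orientated" accounting can be pictured with the toy model below: walk the page's items, tally how much each run of duplicates would free if merged, and stop planning further merges once the projected savings cover what the caller needs to avoid a page split. The per-tuple overhead and TID size are assumed constants, and the code only illustrates the control flow, not the real _bt_dedup_one_page().

/*
 * Toy model of goal-oriented deduplication planning.  Merging a run of
 * n equal keys frees roughly the per-tuple overhead of n - 1 of them
 * (their TID bytes still have to be stored).  Planning stops as soon as
 * the projected savings reach the space the caller asked for, leaving
 * any remaining duplicates for a later pass.
 */
#include <stdio.h>

#define TUPLE_OVERHEAD  16      /* assumed per-tuple header + line pointer */
#define TID_SIZE        6       /* assumed heap TID width */

static int
plan_dedup(const int *keys, int nkeys, int space_needed, int *nmerges)
{
    int         freed = 0;

    *nmerges = 0;
    for (int i = 0; i < nkeys && freed < space_needed;)
    {
        int         runlen = 1;

        while (i + runlen < nkeys && keys[i + runlen] == keys[i])
            runlen++;
        if (runlen > 1)
        {
            freed += (runlen - 1) * (TUPLE_OVERHEAD - TID_SIZE);
            (*nmerges)++;
        }
        i += runlen;
    }
    return freed;
}

int
main(void)
{
    /* the trailing run of 5s is left alone once the goal has been met */
    int         keys[] = {1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 5, 5};
    int         nmerges;
    int         freed = plan_dedup(keys, 12, 40, &nmerges);

    printf("need 40 bytes: plan frees %d bytes using %d merge group(s)\n",
           freed, nmerges);
    return 0;
}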
On Tue, Sep 1, 2015 at 12:33 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > > Hi, Tomas! > > On Mon, Aug 31, 2015 at 6:26 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote: >>> >>> I'm going to begin work on effective storage of duplicate keys in B-tree >>> index. >>> The main idea is to implement posting lists and posting trees for B-tree >>> index pages as it's already done for GIN. >>> >>> In a nutshell, effective storing of duplicates in GIN is organised as >>> follows. >>> Index stores single index tuple for each unique key. That index tuple >>> points to posting list which contains pointers to heap tuples (TIDs). If >>> too many rows having the same key, multiple pages are allocated for the >>> TIDs and these constitute so called posting tree. >>> You can find wonderful detailed descriptions in gin readme >>> <https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README> >>> and articles <http://www.cybertec.at/gin-just-an-index-type/>. >>> It also makes possible to apply compression algorithm to posting >>> list/tree and significantly decrease index size. Read more in >>> presentation (part 1) >>> <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>. >>> >>> Now new B-tree index tuple must be inserted for each table row that we >>> index. >>> It can possibly cause page split. Because of MVCC even unique index >>> could contain duplicates. >>> Storing duplicates in posting list/tree helps to avoid superfluous splits. >>> >>> So it seems to be very useful improvement. Of course it requires a lot >>> of changes in B-tree implementation, so I need approval from community. >> >> >> In general, index size is often a serious issue - cases where indexes need more space than tables are not quite uncommonin my experience. So I think the efforts to lower space requirements for indexes are good. >> >> But if we introduce posting lists into btree indexes, how different are they from GIN? It seems to me that if I createa GIN index (using btree_gin), I do get mostly the same thing you propose, no? > > > Yes, In general GIN is a btree with effective duplicates handling + support of splitting single datums into multiple keys. > This proposal is mostly porting duplicates handling from GIN to btree. Is it worth to make a provision to add an ability to control how duplicates are sorted ? If we speak about GIN, why not take into account our experiments with RUM (https://github.com/postgrespro/rum) ? > >> Sure, there are differences - GIN indexes don't handle UNIQUE indexes, > > > The difference between btree_gin and btree is not only UNIQUE feature. > 1) There is no gingettuple in GIN. GIN supports only bitmap scans. And it's not feasible to add gingettuple to GIN. Atleast with same semantics as it is in btree. > 2) GIN doesn't support multicolumn indexes in the way btree does. Multicolumn GIN is more like set of separate singlecolumnGINs: it doesn't have composite keys. > 3) btree_gin can't effectively handle range searches. "a < x < b" would be hangle as "a < x" intersect "x < b". That isextremely inefficient. It is possible to fix. However, there is no clear proposal how to fit this case into GIN interface,yet. > >> >> but the compression can only be effective when there are duplicate rows. So either the index is not UNIQUE (so the b-treefeature is not needed), or there are many updates. > > > From my observations users can use btree_gin only in some cases. 
They like compression, but can't use btree_gin mostlybecause of #1. > >> Which brings me to the other benefit of btree indexes - they are designed for high concurrency. How much is this goingto be affected by introducing the posting lists? > > > I'd notice that current duplicates handling in PostgreSQL is hack over original btree. It is designed so in btree accessmethod in PostgreSQL, not btree in general. > Posting lists shouldn't change concurrency much. Currently, in btree you have to lock one page exclusively when you'reinserting new value. > When posting list is small and fits one page you have to do similar thing: exclusive lock of one page to insert new value. > When you have posting tree, you have to do exclusive lock on one page of posting tree. > > One can say that concurrency would became worse because index would become smaller and number of pages would became smallertoo. Since number of pages would be smaller, backends are more likely concur for the same page. But this argumentcan be user against any compression and for any bloat. > > ------ > Alexander Korotkov > Postgres Professional: http://www.postgrespro.com > The Russian Postgres Company -- Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From
Anastasia Lubennikova
Date:
13.09.2019 4:04, Peter Geoghegan wrote:
> On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> I think that the new WAL record has to be created once per posting list that is generated, not once per page that is deduplicated -- that's the only way that I can see that avoids a huge increase in total WAL volume. Even if we assume that I am wrong about there being value in making deduplication incremental, it is still necessary to make the WAL-logging behave incrementally.
>
> It would be good to hear your thoughts on this _bt_dedup_one_page() WAL volume/"write amplification" issue.

Attached is v14 based on v12 (v13 changes are not merged).
In this version, I fixed the bug you mentioned and also fixed nbtinsert,
so that it doesn't save newposting in the xlog record anymore.
I tested the patch with nbtree_wal_test, and found out that the real issue is
not the dedup WAL records themselves, but the full page writes that they trigger.
Here are the test results (the config is standard, except fsync=off to speed up the tests):
'FPW on' and 'FPW off' are tests on v14.
NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().
+-------------------+-----------+-----------+----------------+-----------+
| --- | FPW on | FPW off | FORCE_NO_IMAGE | master |
+-------------------+-----------+-----------+----------------+-----------+
| time | 09:12 min | 06:56 min | 06:24 min | 08:10 min |
| nbtree_wal_volume | 8083 MB | 2128 MB | 2327 MB | 2439 MB |
| index_size | 169 MB | 169 MB | 169 MB | 1118 MB |
+-------------------+-----------+-----------+----------------+-----------+
With random insertions into a btree it's highly possible that deduplication will often be
the first write after a checkpoint, and thus will trigger an FPW, even if only a few tuples were compressed.
That's why there is no significant difference with the log_newpage_buffer() approach.
And that's why "lazy" deduplication doesn't help to decrease the amount of WAL.
Also, since the index is packed way better than before, it probably benefits less from wal_compression.
One possible "fix" to decrease WAL amplification is to add the REGBUF_NO_IMAGE flag to XLogRegisterBuffer() in bt_dedup_one_page().
As you can see from the test results, it really eliminates the problem of excessive WAL volume.
However, I doubt that it is a crash-safe idea.
Another, more realistic approach is to make deduplication less intensive:
if the freed space is less than some threshold, fall back to not changing the page at all and not generating an xlog record.
That was probably the reason why the patch became faster after I added BT_COMPRESS_THRESHOLD in early versions:
not because deduplication itself is CPU bound or something, but because the WAL load decreased.
So I propose to develop this idea. The question is how to choose the threshold.
I wouldn't like to introduce new user settings. Any ideas?
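One possible shape for such a threshold, sketched below with assumed byte counts (an illustration only, not part of the patch): tie the cutoff to the space the incoming tuple actually needs, so that deduplication -- and its WAL record -- is skipped whenever the projected savings would not even make room for the new item, without introducing a user-visible setting.

/*
 * Illustrative threshold check: only rewrite the page (and emit a WAL
 * record) if the projected savings at least cover the space the new
 * tuple needs.  Constants are assumptions, not nbtree values.
 */
#include <stdbool.h>
#include <stdio.h>

#define ALIGN8(x)       (((x) + 7) & ~((unsigned) 7))
#define LINE_POINTER    4

static bool
dedup_is_worthwhile(unsigned projected_savings, unsigned newitem_size)
{
    unsigned    needed = ALIGN8(newitem_size) + LINE_POINTER;

    return projected_savings >= needed;
}

int
main(void)
{
    printf("save 12 bytes, need room for a 24-byte tuple: %s\n",
           dedup_is_worthwhile(12, 24) ? "dedup" : "skip");
    printf("save 64 bytes, need room for a 24-byte tuple: %s\n",
           dedup_is_worthwhile(64, 24) ? "dedup" : "skip");
    return 0;
}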
I also noticed that the number of checkpoints differs between tests:
select checkpoints_req from pg_stat_bgwriter ;
+-----------------+---------+---------+----------------+--------+
| --- | FPW on | FPW off | FORCE_NO_IMAGE | master |
+-----------------+---------+---------+----------------+--------+
| checkpoints_req | 16 | 7 | 8 | 10 |
+-----------------+---------+---------+----------------+--------+
And I struggle to explain the reason for this.
Do you understand what can cause the difference?
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Attached is v14 based on v12 (v13 changes are not merged). > > In this version, I fixed the bug you mentioned and also fixed nbtinsert, > so that it doesn't save newposting in xlog record anymore. Cool. > I tested patch with nbtree_wal_test, and found out that the real issue is > not the dedup WAL records themselves, but the full page writes that they trigger. > Here are test results (config is standard, except fsync=off to speedup tests): > > 'FPW on' and 'FPW off' are tests on v14. > NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page(). I think that is makes sense to focus on synthetic cases without FPWs/FPIs from checkpoints. At least for now. > With random insertions into btree it's highly possible that deduplication will often be > the first write after checkpoint, and thus will trigger FPW, even if only a few tuples were compressed. I find that hard to believe. Deduplication only occurs when we're about to split the page. If that's almost as likely to occur as a simple insert, then we're in big trouble (maybe it's actually true, but if it is then that's the real problem). Also, fewer pages for the index naturally leads to far fewer FPIs after a checkpoint. I used "pg_waldump -z" and "pg_waldump --stats=record" to evaluate the same case on v13. It was practically the same as the master branch, apart from the huge difference in FPIs for the XLOG rmgr. Aside from that one huge difference, there was a similar volume of the same types of WAL records in each case. Mostly leaf inserts, and far fewer internal page inserts. I suppose this isn't surprising. It probably makes sense for the final version of the patch to increase the volume of WAL a little overall, since the savings for internal page stuff cannot make up for the cost of having to WAL log something extra (deduplication operations) on leaf pages, regardless of the size of those extra dedup WAL records (I am ignoring FPIs after a checkpoint in this analysis). So the patch is more or less certain to add *some* WAL overhead in cases that benefit, and that's okay. But, it adds way too much WAL overhead today (even in v14), for reasons that we don't understand yet, which is not okay. I may have misunderstood your approach to WAL-logging in v12. I thought that you were WAL-logging things that didn't change, which doesn't seem to be the case with v14. I thought that v12 was very similar to v11 (and my v13) in terms of how _bt_dedup_one_page() does its WAL-logging. v14 looks good, though. "pg_waldump -z" and "pg_waldump --stats=record" will break down the contributing factor of FPIs, so it should be possible to account for the overhead in the test case exactly. We can debug the problem by using pg_waldump to count the absolute number of each type of record, and the size of each type of record. (Thinks some more...) I think that the problem here is that you didn't copy this old code from _bt_split() over to _bt_dedup_one_page(): /* * Copy the original page's LSN into leftpage, which will become the * updated version of the page. We need this because XLogInsert will * examine the LSN and possibly dump it in a page image. */ PageSetLSN(leftpage, PageGetLSN(origpage)); isleaf = P_ISLEAF(oopaque); Note that this happens at the start of _bt_split() -- the temp page buffer based on origpage starts out with the same LSN as origpage. This is an important step of the WAL volume optimization used by _bt_split(). 
> That's why there is no significant difference with log_newpage_buffer() approach. > And that's why "lazy" deduplication doesn't help to decrease amount of WAL. The term "lazy deduplication" is seriously overloaded here. I think that this could cause miscommunications. Let me list the possible meanings of that term here: 1. First of all, the basic approach to deduplication is already lazy, unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid a page split. I'm 100% sure that we both think that that works well compared to an eager approach (like GIN's). 2. Second of all, there is the need to incrementally WAL log. It looks like v14 does that well, in that it doesn't create "xlrec_dedup.n_intervals" space when it isn't truly needed. That's good. 3. Third, there is incremental writing of the page itself -- avoiding using a temp buffer. Not sure where I stand on this. 4. Finally, there is the possibility that we could make deduplication incremental, in order to avoid work that won't be needed altogether -- this would probably be combined with 3. Not sure where I stand on this, either. We should try to be careful when using these terms, as there is a very real danger of talking past each other. > Another, and more realistic approach is to make deduplication less intensive: > if freed space is less than some threshold, fall back to not changing page at all and not generating xlog record. I see that v14 uses the "dedupInterval" struct, which provides a logical description of a deduplicated set of tuples. That general approach is at least 95% of what I wanted from the _bt_dedup_one_page() WAL-logging. > Probably that was the reason, why patch became faster after I added BT_COMPRESS_THRESHOLD in early versions, > not because deduplication itself is cpu bound or something, but because WAL load decreased. I think so too -- BT_COMPRESS_THRESHOLD definitely makes compression faster as things are. I am not against bringing back BT_COMPRESS_THRESHOLD. I just don't want to do it right now because I think that it's a distraction. It may hide problems that we want to fix. Like the PageSetLSN() problem I mentioned just now, and maybe others. We will definitely need to have page space accounting that's a bit similar to nbtsplitloc.c, to avoid the case where a leaf page is 100% full (or has 4 bytes left, or something). That happens regularly now. That must start with teaching _bt_dedup_one_page() about how much space it will free. Basing it on the number of items on the page or whatever is not going to work that well. I think that it would be possible to have something like BT_COMPRESS_THRESHOLD to prevent thrashing, and *also* make the deduplication incremental, in the sense that it can give up on deduplication when it frees enough space (i.e. something like v13's 0002-* patch). I said that these two things are closely related, which is true, but it's also true that they don't overlap. Don't forget the reason why I removed BT_COMPRESS_THRESHOLD: Doing so made a handful of specific indexes (mostly from TPC-H) significantly smaller. I never tried to debug the problem. It's possible that we could bring back BT_COMPRESS_THRESHOLD (or something fillfactor-like), but not use it on rightmost pages, and get the best of both worlds. IIRC right-heavy low cardinality indexes (e.g. a low cardinality date column) were improved by removing BT_COMPRESS_THRESHOLD, but we can debug that when the time comes. > So I propose to develop this idea. The question is how to choose threshold. 
> I wouldn't like to introduce new user settings. Any ideas?

I think that there should be a target fill factor that sometimes makes deduplication leave a small amount of free space. Maybe that means that the last posting list on the page is made a bit smaller than the other ones. It should be "goal orientated".

The loop within _bt_dedup_one_page() is very confusing in both v13 and v14 -- I couldn't figure out why the accounting worked like this:

> +         /*
> +          * Project size of new posting list that would result from merging
> +          * current tup with pending posting list (could just be prev item
> +          * that's "pending").
> +          *
> +          * This accounting looks odd, but it's correct because ...
> +          */
> +         projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
> +                                  (dedupState->ntuples + itup_ntuples + 1) *
> +                                  sizeof(ItemPointerData));

Why the "+1" here?

I have significantly refactored the _bt_dedup_one_page() loop in a way that seems like a big improvement. It allowed me to remove all of the small palloc() calls inside the loop, apart from the BTreeFormPostingTuple() palloc()s. It's also a lot faster -- it seems to have shaved about 2 seconds off the "land" unlogged table test, which was originally about 1 minute 2 seconds with v13's 0001-* patch (and without v13's 0002-* patch). It seems like it can easily be integrated with the approach to WAL logging taken in v14, so everything can be integrated soon. I'll work on that.

> I also noticed that the number of checkpoints differs between tests:
> select checkpoints_req from pg_stat_bgwriter ;
> And I struggle to explain the reason for this.
> Do you understand what can cause the difference?

I imagine that the additional WAL volume triggered a checkpoint earlier than in the more favorable test, which indirectly triggered more FPIs, which contributed to triggering a checkpoint even earlier... and so on. Synthetic test cases can avoid this. A useful synthetic test should have no checkpoints at all, so that we can see the broken down costs, without any second order effects that add more cost in weird ways.

-- Peter Geoghegan
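The projected-size arithmetic being questioned here can be restated in a standalone form: the merged tuple is the key portion (header plus key, everything except the TID array) plus one TID per heap tuple it will represent, maxaligned once at the end. The sketch below is a model with assumed byte counts, not the patch's exact accounting.

/*
 * Model of projecting a merged posting tuple's size.  keysize covers
 * the tuple header plus the key itself (everything except the TID
 * array); ntids is the total number of heap TIDs the merged tuple will
 * carry.
 */
#include <stdio.h>

#define MAXALIGN8(x)    (((x) + 7) & ~((size_t) 7))
#define TID_SIZE        6       /* sizeof(ItemPointerData) */

static size_t
projected_posting_size(size_t keysize, int ntids)
{
    return MAXALIGN8(keysize + (size_t) ntids * TID_SIZE);
}

int
main(void)
{
    size_t      keysize = 16;           /* assumed header plus an 8-byte key */
    int         pending_ntids = 3;      /* TIDs already in the pending tuple */
    int         next_ntids = 2;         /* TIDs contributed by the next tuple */

    printf("merged size: %zu bytes\n",
           projected_posting_size(keysize, pending_ntids + next_ntids));
    return 0;
}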
On Mon, Sep 16, 2019 at 11:58 AM Peter Geoghegan <pg@bowt.ie> wrote: > I think that the problem here is that you didn't copy this old code > from _bt_split() over to _bt_dedup_one_page(): > > /* > * Copy the original page's LSN into leftpage, which will become the > * updated version of the page. We need this because XLogInsert will > * examine the LSN and possibly dump it in a page image. > */ > PageSetLSN(leftpage, PageGetLSN(origpage)); > isleaf = P_ISLEAF(oopaque); I can confirm that this is what the problem was. Attached are two patches: * A version of your v14 from today with a couple of tiny changes to make it work against the current master branch -- I had to rebase the patch, but the changes made while rebasing were totally trivial. (I like to keep CFTester green.) * The second patch actually fixes the PageSetLSN() thing, setting the temp page buffer's LSN to match the original page before any real work is done, and before XLogInsert() is called. Just like _bt_split(). The test case now shows exactly what you reported for "FPWs off" when FPWs are turned on, at least on my machine and with my checkpoint settings. That is, there are *zero* FPIs/FPWs, so the final nbtree volume is 2128 MB. This means that the volume of additional WAL required over what the master branch requires for the same test case is very small (2128 MB compares well with master's 2011 MB of WAL). Maybe we could do better than 2128 MB with more work, but this is definitely already low enough overhead to be acceptable. This also passes "make check-world" testing. However, my usual wal_consistency_checking smoke test fails pretty quickly with the two patches applied: 3634/2019-09-16 13:53:22 PDT FATAL: inconsistent page found, rel 1663/16385/2673, forknum 0, blkno 13 3634/2019-09-16 13:53:22 PDT CONTEXT: WAL redo at 0/3202370 for Btree/DEDUPLICATE: items were deduplicated to 12 items 3633/2019-09-16 13:53:22 PDT LOG: startup process (PID 3634) exited with exit code 1 Maybe the lack of the PageSetLSN() thing masked a bug in WAL replay, since without that we effectively always just replay FPIs, never truly relying on redo. (I didn't try wal_consistency_checking without the second patch, but I assume that you did, and found no problems for this reason.) Can you produce a new version that integrates the PageSetLSN() thing, and fixes this bug? Thanks -- Peter Geoghegan
Attachment
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From
Anastasia Lubennikova
Date:
16.09.2019 21:58, Peter Geoghegan wrote:
> On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> I tested patch with nbtree_wal_test, and found out that the real issue is
>> not the dedup WAL records themselves, but the full page writes that they trigger.
>> Here are test results (config is standard, except fsync=off to speedup tests):
>>
>> 'FPW on' and 'FPW off' are tests on v14.
>> NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().
> I think that it makes sense to focus on synthetic cases without
> FPWs/FPIs from checkpoints. At least for now.
>
>> With random insertions into btree it's highly possible that deduplication will often be
>> the first write after checkpoint, and thus will trigger FPW, even if only a few tuples were compressed.
> <...>
>
> I think that the problem here is that you didn't copy this old code
> from _bt_split() over to _bt_dedup_one_page():
>
> /*
>  * Copy the original page's LSN into leftpage, which will become the
>  * updated version of the page. We need this because XLogInsert will
>  * examine the LSN and possibly dump it in a page image.
>  */
> PageSetLSN(leftpage, PageGetLSN(origpage));
> isleaf = P_ISLEAF(oopaque);
>
> Note that this happens at the start of _bt_split() -- the temp page
> buffer based on origpage starts out with the same LSN as origpage.
> This is an important step of the WAL volume optimization used by
> _bt_split().

That's it. I suspected that such an enormous amount of FPWs reflected some bug.

>> That's why there is no significant difference with log_newpage_buffer() approach.
>> And that's why "lazy" deduplication doesn't help to decrease amount of WAL.

My point was that the problem is the extra FPWs, so it doesn't matter whether we deduplicate just a few entries to free enough space or all of them.

> The term "lazy deduplication" is seriously overloaded here. I think
> that this could cause miscommunications. Let me list the possible
> meanings of that term here:
>
> 1. First of all, the basic approach to deduplication is already lazy,
> unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid
> a page split. I'm 100% sure that we both think that that works well
> compared to an eager approach (like GIN's).

Sure.

> 2. Second of all, there is the need to incrementally WAL log. It looks
> like v14 does that well, in that it doesn't create
> "xlrec_dedup.n_intervals" space when it isn't truly needed. That's
> good.

In v12-v15 I mostly concentrated on this feature. The last version looks good to me.

> 3. Third, there is incremental writing of the page itself -- avoiding
> using a temp buffer. Not sure where I stand on this.

I think it's a good idea. memmove must be much faster than copying items tuple by tuple. I'll send the next patch by the end of the week.

> 4. Finally, there is the possibility that we could make deduplication
> incremental, in order to avoid work that won't be needed altogether --
> this would probably be combined with 3. Not sure where I stand on
> this, either.
>
> We should try to be careful when using these terms, as there is a very
> real danger of talking past each other.
>
>> Another, and more realistic approach is to make deduplication less intensive:
>> if freed space is less than some threshold, fall back to not changing page at all and not generating xlog record.
> I see that v14 uses the "dedupInterval" struct, which provides a
> logical description of a deduplicated set of tuples. That general
> approach is at least 95% of what I wanted from the
> _bt_dedup_one_page() WAL-logging.
>
>> Probably that was the reason, why patch became faster after I added BT_COMPRESS_THRESHOLD in early versions,
>> not because deduplication itself is cpu bound or something, but because WAL load decreased.
> I think so too -- BT_COMPRESS_THRESHOLD definitely makes compression
> faster as things are. I am not against bringing back
> BT_COMPRESS_THRESHOLD. I just don't want to do it right now because I
> think that it's a distraction. It may hide problems that we want to
> fix. Like the PageSetLSN() problem I mentioned just now, and maybe
> others.
>
> We will definitely need to have page space accounting that's a bit
> similar to nbtsplitloc.c, to avoid the case where a leaf page is 100%
> full (or has 4 bytes left, or something). That happens regularly now.
> That must start with teaching _bt_dedup_one_page() about how much
> space it will free. Basing it on the number of items on the page or
> whatever is not going to work that well.
>
> I think that it would be possible to have something like
> BT_COMPRESS_THRESHOLD to prevent thrashing, and *also* make the
> deduplication incremental, in the sense that it can give up on
> deduplication when it frees enough space (i.e. something like v13's
> 0002-* patch). I said that these two things are closely related, which
> is true, but it's also true that they don't overlap.
>
> Don't forget the reason why I removed BT_COMPRESS_THRESHOLD: Doing so
> made a handful of specific indexes (mostly from TPC-H) significantly
> smaller. I never tried to debug the problem. It's possible that we
> could bring back BT_COMPRESS_THRESHOLD (or something fillfactor-like),
> but not use it on rightmost pages, and get the best of both worlds.
> IIRC right-heavy low cardinality indexes (e.g. a low cardinality date
> column) were improved by removing BT_COMPRESS_THRESHOLD, but we can
> debug that when the time comes.

Now that the extra FPWs are proven to be a bug, I agree that giving up on deduplication early is not necessary. My previous considerations were based on the idea that deduplication always adds considerable overhead, which is not true after recent optimizations.

>> So I propose to develop this idea. The question is how to choose threshold.
>> I wouldn't like to introduce new user settings. Any ideas?
> I think that there should be a target fill factor that sometimes makes
> deduplication leave a small amount of free space. Maybe that means
> that the last posting list on the page is made a bit smaller than the
> other ones. It should be "goal orientated".
>
> The loop within _bt_dedup_one_page() is very confusing in both v13 and
> v14 -- I couldn't figure out why the accounting worked like this:
>
>> + /*
>> +  * Project size of new posting list that would result from merging
>> +  * current tup with pending posting list (could just be prev item
>> +  * that's "pending").
>> +  *
>> +  * This accounting looks odd, but it's correct because ...
>> +  */
>> + projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
>> +                          (dedupState->ntuples + itup_ntuples + 1) *
>> +                          sizeof(ItemPointerData));
> Why the "+1" here?

I'll look at it.

> I have significantly refactored the _bt_dedup_one_page() loop in a way
> that seems like a big improvement. It allowed me to remove all of the
> small palloc() calls inside the loop, apart from the
> BTreeFormPostingTuple() palloc()s. It's also a lot faster -- it seems
> to have shaved about 2 seconds off the "land" unlogged table test,
> which was originally about 1 minute 2 seconds with v13's 0001-* patch
> (and without v13's 0002-* patch).
>
> It seems like it can easily be integrated with the approach to WAL
> logging taken in v14, so everything can be integrated soon. I'll work
> on that.

New version is attached. It is v14 (with the PageSetLSN fix) merged with v13. I also fixed a bug in btree_xlog_dedup() that was previously masked by FPWs.

v15 passes make installcheck. I haven't tested it with the land test yet. I will do that later this week.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Tue, Sep 17, 2019 at 9:43 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > > 3. Third, there is incremental writing of the page itself -- avoiding > > using a temp buffer. Not sure where I stand on this. > > I think it's a good idea. memmove must be much faster than copying > items tuple by tuple. > I'll send next patch by the end of the week. I think that the biggest problem is that we copy all of the tuples, including existing posting list tuples that can't be merged with anything. Even if you assume that we'll never finish early (e.g. by using logic like the "if (pagesaving >= newitemsz) deduplicate = false;" thing), this can still unnecessarily slow down deduplication. Very often, _bt_dedup_one_page() is called when 1/2 - 2/3 of the space on the page is already used by a small number of very large posting list tuples. > > The loop within _bt_dedup_one_page() is very confusing in both v13 and > > v14 -- I couldn't figure out why the accounting worked like this: > I'll look at it. I'm currently working on merging my refactored version of _bt_dedup_one_page() with your v15 WAL-logging. This is a bit tricky. (I have finished merging the other WAL-logging stuff, though -- that was easy.) The general idea is that the loop in _bt_dedup_one_page() now explicitly operates with a "base" tuple, rather than *always* saving the prev tuple from the last loop iteration. We always have a "pending posting list", which won't be written-out as a posting list if it isn't possible to merge at least one existing page item. The "base" tuple doesn't change. "pagesaving" space accounting works in a way that doesn't care about whether or not the base tuple was already a posting list -- it saves the size of the IndexTuple without any existing posting list size, and calculates the contribution to the total size of the new posting list separately (heap TIDs from the original base tuple and subsequent tuples are counted together). This has a number of advantages: * The loop is a lot clearer now, and seems to have slightly better space utilization because of improved accounting (with or without the "if (pagesaving >= newitemsz) deduplicate = false;" thing). * I think that we're going to need to be disciplined about which tuple is the "base" tuple for correctness reasons -- we should always use the leftmost existing tuple to form a new posting list tuple. I am concerned about rare cases where we deduplicate tuples that are equal according to _bt_keep_natts_fast()/datum_image_eq() that nonetheless have different sizes (and are not bitwise equal). There are rare cases involving TOAST compression where that is just about possible (see the temp comments I added to _bt_keep_natts_fast() in the patch). * It's clearly faster, because there is far less palloc() overhead -- the "land" unlogged table test completes in about 95.5% of the time taken by v15 (I disabled "if (pagesaving >= newitemsz) deduplicate = false;" for both versions here, to keep it simple and fair). This also suggests that making _bt_dedup_one_page() do raw page adds and page deletes to the page in shared_buffers (i.e. don't use a temp buffer page) could pay off. As I went into at the start of this e-mail, unnecessarily doing expensive things like copying large posting lists around is a real concern. Even if it isn't truly useful for _bt_dedup_one_page() to operate in a very incremental fashion, incrementalism is probably still a good thing to aim for -- it seems to make deduplication faster in all cases. -- Peter Geoghegan
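(For illustration, the loop structure described above might be pictured like this. It is only a sketch: BTDedupState and the _bt_dedup_* helper signatures are assumptions based on the names mentioned in this thread, not the actual v16 code.)

    #include "postgres.h"
    #include "access/nbtree.h"

    /*
     * Condensed sketch of the "base tuple" + "pending posting list" loop
     * described above.  Illustrative only; state fields and helper
     * signatures are assumptions.
     */
    static void
    dedup_page_sketch(Relation rel, BTDedupState state, Page page, Page newpage,
                      int nkeyatts)
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        OffsetNumber minoff = P_FIRSTDATAKEY(opaque);
        OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
        OffsetNumber offnum;

        /* First non-pivot item becomes the initial base tuple */
        _bt_dedup_start_pending(state,
                                (IndexTuple) PageGetItem(page, PageGetItemId(page, minoff)),
                                minoff);

        for (offnum = OffsetNumberNext(minoff);
             offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            IndexTuple  itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

            if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
                _bt_dedup_save_htid(state, itup))
            {
                /*
                 * Duplicate of the base tuple whose heap TIDs fit in the
                 * pending posting list: absorb it and move on.
                 */
            }
            else
            {
                /*
                 * Not a duplicate (or the pending posting list is as large as
                 * we're willing to make it): flush the pending interval to
                 * newpage, forming a posting tuple only if it absorbed at
                 * least one item, then start a new pending interval with
                 * this tuple as its base.
                 */
                _bt_dedup_finish_pending(state, newpage);
                _bt_dedup_start_pending(state, itup, offnum);
            }
        }

        /* Flush the final pending interval */
        _bt_dedup_finish_pending(state, newpage);
    }

The point of the structure is that the base tuple and the pending heap TID array are fixed state that lives across iterations, so nothing needs to be palloc()'d until a posting tuple is actually formed.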
On Wed, Sep 18, 2019 at 10:43 AM Peter Geoghegan <pg@bowt.ie> wrote: > This also suggests that making _bt_dedup_one_page() do raw page adds > and page deletes to the page in shared_buffers (i.e. don't use a temp > buffer page) could pay off. As I went into at the start of this > e-mail, unnecessarily doing expensive things like copying large > posting lists around is a real concern. Even if it isn't truly useful > for _bt_dedup_one_page() to operate in a very incremental fashion, > incrementalism is probably still a good thing to aim for -- it seems > to make deduplication faster in all cases. I think that I forgot to mention that I am concerned that the kill_prior_tuple/LP_DEAD optimization could be applied less often because _bt_dedup_one_page() operates too aggressively. That is a big part of my general concern. Maybe I'm wrong about this -- who knows? I definitely think that LP_DEAD setting by _bt_check_unique() is generally a lot more important than LP_DEAD setting by the kill_prior_tuple optimization, and the patch won't affect unique indexes. Only very serious benchmarking can give us a clear answer, though. -- Peter Geoghegan
On Wed, Sep 18, 2019 at 10:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I'm currently working on merging my refactored version of
> _bt_dedup_one_page() with your v15 WAL-logging. This is a bit tricky.
> (I have finished merging the other WAL-logging stuff, though -- that
> was easy.)

I attach version 16. This revision merges your recent work on WAL logging with my recent work on simplifying _bt_dedup_one_page(). See my e-mail from earlier today for details.

Hopefully this will be a bit easier to work with when you go to make _bt_dedup_one_page() do raw PageIndexMultiDelete() + PageAddItem() calls against the page contained in a buffer directly (rather than using a temp version of the page in local memory in the style of _bt_split()). I find the loop within _bt_dedup_one_page() much easier to follow now. While I'm looking forward to seeing the PageIndexMultiDelete()/PageAddItem() approach that you come up with, the basic design of _bt_dedup_one_page() seems to be in much better shape today than it was a few weeks ago.

I am going to spend the next few days teaching _bt_dedup_one_page() about space utilization. I'll probably make it respect a fillfactor-style target. I've noticed that it is often too aggressive about filling a page, though less often it actually shows the opposite problem: it fails to use more than about 2/3 of the page for the same value, again and again (must be something to do with the exact width of the tuples). In general, _bt_dedup_one_page() should know a few things about what nbtsplitloc.c will do when the page is very likely to be split soon.

I'll also spend some more time working on the opclass infrastructure that we need to disable deduplication with datatypes where it is unsafe [1].

Other changes:

* qsort() is no longer used by BTreeFormPostingTuple() in v16 -- we can easily make sorting the array of heap TIDs the caller's responsibility. Since the heap TID column is sorted in ascending order among duplicates on a page, and since TIDs within individual posting lists are also sorted in ascending order, there is no need to re-sort. I added a new assertion to BTreeFormPostingTuple() that verifies that its caller actually gets it right.

* The new nbtpage.c/VACUUM code has been tweaked to minimize the changes required against master. Nothing significant, though.

It was easier to refactor the _bt_dedup_one_page() stuff by temporarily making nbtsort.c not use it. I didn't want to delay getting v16 to you, so I didn't take the time to fix-up nbtsort.c to use the new stuff. It's actually using its own old copy of stuff that it should get from nbtinsert.c in v16 -- it calls _bt_dedup_item_tid_sort(), not the new _bt_dedup_save_htid() function. I'll update it soon, though.

[1] https://www.postgresql.org/message-id/flat/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
--
Peter Geoghegan
Attachment
On Wed, Sep 18, 2019 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I attach version 16. This revision merges your recent work on WAL
> logging with my recent work on simplifying _bt_dedup_one_page(). See
> my e-mail from earlier today for details.

I attach version 17. This version has changes that are focussed on further polishing certain things, including fixing some minor bugs. It seemed worth creating a new version for that. (I didn't get very far with the space utilization stuff I talked about, so no changes there.)

Changes in v17:

* nbtsort.c now has a loop structure that closely matches _bt_dedup_one_page() (I put this off in v16). We now reuse most of the nbtinsert.c deduplication routines.

* Further simplification of the btree_xlog_dedup() loop. Recovery no longer relies on local variables to track the progress of deduplication -- it uses dedup state (the state managed by nbtinsert.c's dedup routines) instead. This is easier to follow.

* Reworked _bt_split() comments on posting list splits that coincide with page splits.

* Fixed memory leaks in recovery code by creating a dedicated memory context that gets reset regularly. The context is created in a new rmgr "startup" callback I created for the B-Tree rmgr. We already do this for both GIN and GiST. More specifically, the REDO code calls MemoryContextReset() against its dedicated memory context after every record is processed by REDO, no matter what. The MemoryContextReset() call usually won't have to actually free anything, but that's okay because the no-free case does almost no work. I think that it makes sense to keep things as simple as possible for memory management during recovery -- it's too easy for a new memory leak to get introduced when a small change is made to the nbtinsert.c routines later on.

* Optimize VACUUMing of posting lists: we now only allocate memory for an array of still-live posting list items when the array will actually be needed. It is only needed when there are tuples to remove from the posting list, because only then do we need to create a replacement posting list that lacks the heap TIDs that VACUUM needs to delete. It seemed like a really good idea to not allocate any memory in the common case where VACUUM doesn't need to change a posting list tuple at all. ginVacuumItemPointers() has exactly the same optimization.

* Fixed an accounting bug in the output of VACUUM VERBOSE by changing some code in nbtree.c. The tuples_removed and num_index_tuples fields in IndexBulkDeleteResult are reported as "index row versions" by VACUUM VERBOSE. Everything but the index pages stat works at the level of "index row versions", which should not be affected by the deduplication patch. Of course, deduplication only changes the physical representation of items in the index -- never the logical contents of the index. This is what GIN does already.

Another infrastructure thing that the patch needs to handle to be committable:

We still haven't added an "off" switch to deduplication, which seems necessary. I suppose that this should look like GIN's "fastupdate" storage parameter. It's not obvious how to do this in a way that's easy to work with, though. Maybe we could do something like copy GIN's GinGetUseFastUpdate() macro, but the situation with nbtree is actually quite different. There are two questions for nbtree when it comes to deduplication within an index: 1) Does the user want to use deduplication, because that will help performance? And 2) Is it safe/possible to use deduplication at all?
I think that we should probably stash this information (deduplication is both possible and safe) in the metapage. Maybe we can copy it over to our insertion scankey, just like the "heapkeyspace" field -- that information also comes from the metapage (it's based on the nbtree version). The "heapkeyspace" field is a bit ugly, so maybe we shouldn't go further by adding something similar, but I don't see any great alternative right now. -- Peter Geoghegan
Attachment
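(For illustration, the recovery memory context arrangement described in the v17 notes above, modeled on what GIN and GiST already do -- a sketch only; the names opCtx, btree_xlog_startup() and btree_xlog_cleanup() are assumed for the example, not necessarily what the patch uses.)

    #include "postgres.h"
    #include "utils/memutils.h"

    /* Dedicated context for temporary allocations made during B-Tree REDO */
    static MemoryContext opCtx = NULL;

    void
    btree_xlog_startup(void)
    {
        opCtx = AllocSetContextCreate(CurrentMemoryContext,
                                      "btree recovery temporary context",
                                      ALLOCSET_DEFAULT_SIZES);
    }

    void
    btree_xlog_cleanup(void)
    {
        MemoryContextDelete(opCtx);
        opCtx = NULL;
    }

    /*
     * In btree_redo(), every record would then be processed inside opCtx and
     * the context reset unconditionally afterwards, e.g.:
     *
     *      oldCtx = MemoryContextSwitchTo(opCtx);
     *      switch (info) { ... dispatch on record type ... }
     *      MemoryContextSwitchTo(oldCtx);
     *      MemoryContextReset(opCtx);
     */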
On Mon, Sep 23, 2019 at 5:13 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I attach version 17.

I attach a patch that applies on top of v17. It adds support for deduplication within unique indexes. Actually, this is a snippet of code that appeared in my prototype from August 5 (we need very little extra code for this now). Unique index support kind of looked like a bad idea at the time, but things have changed a lot since then.

I benchmarked this overnight using a custom pgbench-based test that used a Zipfian distribution, with a single-row SELECT and an UPDATE of pgbench_accounts per xact. I used regular/logged tables this time around, since WAL-logging is now fairly efficient. I added a separate low cardinality index on pgbench_accounts(abalance). A low cardinality index is the most interesting case for this patch, obviously, but it also serves to prevent all HOT updates, increasing bloat for both indexes. We want a realistic case that creates a lot of index bloat.

This wasn't a rigorous enough benchmark to present here in full, but the results were very encouraging. With reasonable client counts for the underlying hardware, we seem to have a small increase in TPS, and a small decrease in latency. There is a regression with 128 clients, when contention is ridiculously high (this is my home server, which only has 4 cores). More importantly:

* The low cardinality index is almost 3x smaller with the patch -- no surprises there.

* The read latency is where latency goes up, if it goes up at all. Whatever it is that might increase latency, it doesn't look like it's deduplication itself. Yeah, deduplication is expensive, but it's not nearly as expensive as a page split. (I'm talking about the immediate cost, not the bigger picture, though the bigger picture matters even more.)

* The growth in primary key size over time is the thing I find really interesting. The patch seems to really control the number of page splits over many hours with lots of non-HOT updates. I think that a timeline of days or weeks could be really interesting.

I am now about 75% convinced that adding deduplication to unique indexes is a good idea, at least as an option that is disabled by default. We're already doing well here, even though there has been no work on optimizing deduplication in unique indexes. Further optimizations seem quite possible, though. I'm mostly thinking about optimizing non-HOT updates by teaching nbtree some basic things about versioning with unique indexes.

For example, we could remember "recently dead" duplicates of the value we are about to insert (as part of an UPDATE statement) from within _bt_check_unique(). Then, when it looks like a page split may be necessary, we can hint to _bt_dedup_one_page(): "please just deduplicate the group of duplicates starting from this offset, which are duplicates of this new item I am inserting -- do not create a posting list that I will have to split, though". We already cache the binary search bounds established within _bt_check_unique() in insertstate, so perhaps we could reuse that to make this work.

The goal here is that the old/recently dead versions end up together in their own posting list (or maybe two posting lists), whereas our new/most current tuple is on its own. There is a very good chance that our transaction will commit, leaving somebody else to set the LP_DEAD bit on the posting list that contains those old versions. In short, we'd be making deduplication and opportunistic garbage collection cooperate closely.
This can work both ways -- maybe we should also teach _bt_vacuum_one_page() to cooperate with _bt_dedup_one_page(). That is, if we add the mechanism I just described in the last paragraph, maybe _bt_dedup_one_page() marks the posting list that is likely to get its LP_DEAD bit set soon with a new hint bit -- the LP_REDIRECT bit. Here, LP_REDIRECT means "somebody is probably going to set the LP_DEAD bit on this posting list tuple very soon". That way, if nobody actually does set the LP_DEAD bit, _bt_vacuum_one_page() still has options. If it goes to the heap and finds the latest version, and that that version is visible to any possible MVCC snapshot, that means that it's safe to kill all the other versions, even without the LP_DEAD bit set -- this is a unique index. So, it often gets to kill lots of extra garbage that it wouldn't get to kill, preventing page splits. The cost is pretty low: the risk that the single heap page check will be a wasted effort. (Of course, we still have to visit the heap pages of things that we go on to kill, to get the XIDs to generate recovery conflicts -- the important point is that we only need to visit one heap page in _bt_vacuum_one_page(), to *decide* if it's possible to do all this -- cases that don't benefit at all also don't pay very much.) I don't think that we need to do either of these two other things to justify committing the patch with unique index support. But, teaching nbtree a little bit about versioning like this could work rather well in practice, without it really mattering that it will get the wrong idea at times (e.g. when transactions abort a lot). This all seems promising as a family of techniques for unique indexes. It's worth doing extra work if it might delay a page split, since delaying can actually fully prevent page splits that are mostly caused by non-HOT updates. Most primary key indexes are serial PKs, or some kind of counter. Postgres should mostly do page splits for these kinds of primary keys indexes in the places that make sense based on the dataset, and not because of "write amplification". -- Peter Geoghegan
Attachment
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From
Anastasia Lubennikova
Date:
24.09.2019 3:13, Peter Geoghegan wrote:
> On Wed, Sep 18, 2019 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> I attach version 16. This revision merges your recent work on WAL
>> logging with my recent work on simplifying _bt_dedup_one_page(). See
>> my e-mail from earlier today for details.
> I attach version 17. This version has changes that are focussed on
> further polishing certain things, including fixing some minor bugs. It
> seemed worth creating a new version for that. (I didn't get very far
> with the space utilization stuff I talked about, so no changes there.)

Attached is v18. In this version _bt_dedup_one_page() is refactored so that:
- no temp page is used; all updates are applied to the original page.
- each posting tuple is WAL-logged separately.
This also allowed me to simplify btree_xlog_dedup() significantly.

> Another infrastructure thing that the patch needs to handle to be committable:
>
> We still haven't added an "off" switch to deduplication, which seems
> necessary. I suppose that this should look like GIN's "fastupdate"
> storage parameter. It's not obvious how to do this in a way that's
> easy to work with, though. Maybe we could do something like copy GIN's
> GinGetUseFastUpdate() macro, but the situation with nbtree is actually
> quite different. There are two questions for nbtree when it comes to
> deduplication within an index: 1) Does the user want to use
> deduplication, because that will help performance? And 2) Is it
> safe/possible to use deduplication at all?

I'll send another version with the dedup option soon.

> I think that we should probably stash this information (deduplication
> is both possible and safe) in the metapage. Maybe we can copy it over
> to our insertion scankey, just like the "heapkeyspace" field -- that
> information also comes from the metapage (it's based on the nbtree
> version). The "heapkeyspace" field is a bit ugly, so maybe we
> shouldn't go further by adding something similar, but I don't see any
> great alternative right now.

Why is it necessary to save this information anywhere but rel->rd_options, when we can easily access that field from _bt_findinsertloc() and _bt_load()?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
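(For illustration, the in-place approach described in the v18 notes above might look roughly like this for a single merged interval. This is a sketch under assumptions: the function name, the offset bookkeeping, and the dedup WAL record details are made up here, not taken from v18.)

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "access/xloginsert.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /*
     * Illustrative sketch: replace one run of duplicates with a single
     * posting tuple directly on the page in shared_buffers (no temp page),
     * WAL-logging just that one step.
     */
    static void
    dedup_write_interval_sketch(Relation rel, Buffer buf,
                                OffsetNumber *deletable, int ndeletable,
                                IndexTuple postingtup, OffsetNumber insertoff)
    {
        Page        page = BufferGetPage(buf);

        START_CRIT_SECTION();

        /* Remove the original duplicates, then add the merged posting tuple */
        PageIndexMultiDelete(page, deletable, ndeletable);
        if (PageAddItem(page, (Item) postingtup, IndexTupleSize(postingtup),
                        insertoff, false, false) == InvalidOffsetNumber)
            elog(PANIC, "deduplication failed to add posting list tuple");

        MarkBufferDirty(buf);

        if (RelationNeedsWAL(rel))
        {
            XLogRecPtr  recptr;

            XLogBeginInsert();
            XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
            /* main record data (the deduplicated interval) omitted here */
            recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);     /* record type name assumed */
            PageSetLSN(page, recptr);
        }

        END_CRIT_SECTION();
    }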
On Wed, Sep 25, 2019 at 8:05 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Attached is v18. In this version bt_dedup_one_page() is refactored so that: > - no temp page is used, all updates are applied to the original page. > - each posting tuple wal logged separately. > This also allowed to simplify btree_xlog_dedup significantly. This looks great! Even if it isn't faster than using a temp page buffer, the flexibility seems like an important advantage. We can do things like have the _bt_dedup_one_page() caller hint that deduplication should start at a particular offset number. If that doesn't work out by the time the end of the page is reached (whatever "works out" may mean), then we can just start at the beginning of the page, and work through the items we skipped over initially. > > We still haven't added an "off" switch to deduplication, which seems > > necessary. I suppose that this should look like GIN's "fastupdate" > > storage parameter. > Why is it necessary to save this information somewhere but rel->rd_options, > while we can easily access this field from _bt_findinsertloc() and > _bt_load(). Maybe, but we also need to access a flag that says it's safe to use deduplication. Obviously deduplication is not safe for datatypes like numeric and text with a nondeterministic collation. The "is deduplication safe for this index?" mechanism will probably work by doing several catalog lookups. This doesn't seem like something we want to do very often, especially with a buffer lock held -- ideally it will be somewhere that's convenient to access. Do we want to do that separately, and have a storage parameter that says "I would like to use deduplication in principle, if it's safe"? Or, do we store both pieces of information together, and forbid setting the storage parameter to on when it's known to be unsafe for the underlying opclasses used by the index? I don't know. I think that you can start working on this without knowing exactly how we'll do those catalog lookups. What you come up with has to work with that before the patch can be committed, though. -- Peter Geoghegan
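(For illustration, the reloption side of this could take a GinGetUseFastUpdate()-style shape -- a sketch only; the BTOptions struct layout and the BtreeGetDoDedup() macro shown here are assumptions, not the patch's actual code.)

    #include "postgres.h"
    #include "utils/rel.h"

    /* Assumed parsed-reloptions struct for btree, for the sake of the example */
    typedef struct BTOptions
    {
        int32       vl_len_;        /* varlena header (do not touch directly!) */
        int         fillfactor;     /* standard btree fillfactor reloption */
        bool        deduplication;  /* use deduplication, where safe? */
    } BTOptions;

    #define BtreeGetDoDedup(relation) \
        ((relation)->rd_options ? \
         ((BTOptions *) (relation)->rd_options)->deduplication : true)

The "is it safe at all?" part is the harder half: it has to come from somewhere cheaper to consult than repeated catalog lookups, such as the metapage idea mentioned earlier or a flag cached with the relation.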
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From
Anastasia Lubennikova
Date:
25.09.2019 22:14, Peter Geoghegan wrote:
>
>>> We still haven't added an "off" switch to deduplication, which seems
>>> necessary. I suppose that this should look like GIN's "fastupdate"
>>> storage parameter.
>> Why is it necessary to save this information somewhere but rel->rd_options,
>> while we can easily access this field from _bt_findinsertloc() and
>> _bt_load().
> Maybe, but we also need to access a flag that says it's safe to use
> deduplication. Obviously deduplication is not safe for datatypes like
> numeric and text with a nondeterministic collation. The "is
> deduplication safe for this index?" mechanism will probably work by
> doing several catalog lookups. This doesn't seem like something we
> want to do very often, especially with a buffer lock held -- ideally
> it will be somewhere that's convenient to access.
>
> Do we want to do that separately, and have a storage parameter that
> says "I would like to use deduplication in principle, if it's safe"?
> Or, do we store both pieces of information together, and forbid
> setting the storage parameter to on when it's known to be unsafe for
> the underlying opclasses used by the index? I don't know.
>
> I think that you can start working on this without knowing exactly how
> we'll do those catalog lookups. What you come up with has to work with
> that before the patch can be committed, though.
>
Attached is v19.

* It adds a new btree reloption, "deduplication". I decided to refactor the code and move BtreeOptions into a separate structure, rather than adding a new btree-specific value to StdRdOptions. Now it can be set even for indexes that do not support deduplication. In that case it will be ignored. Should we add this check to option validation?

* By default deduplication is on for non-unique indexes and off for unique ones.

* New function _bt_dedup_is_possible() is intended to be a single place to perform all the checks. For now it's just a stub to ensure that it works. Is there a way to extract this from existing opclass information, or do we need to add a new opclass field? Have you already started this work? I recall there was another thread, but I didn't manage to find it.

* I also integrated into this version your latest patch that enables deduplication on unique indexes, since now it can be easily switched on/off.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
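(For illustration, a _bt_dedup_is_possible()-style check might eventually look something like this. Only a sketch: get_collation_isdeterministic() exists today, but the opclass-level "equality implies image equality" test is a placeholder for the infrastructure being discussed on the other thread.)

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "utils/lsyscache.h"
    #include "utils/rel.h"

    /* Placeholder for the opclass infrastructure discussed on the other thread */
    static bool
    opclass_equality_implies_image_equality(Relation index, int attno)
    {
        return true;            /* assumed safe, for the sake of the sketch */
    }

    /*
     * Sketch of a _bt_dedup_is_possible()-style check.  Deduplication relies
     * on datum_image_eq(), so it is only safe when "equal" always means
     * "bitwise interchangeable": that rules out text with a nondeterministic
     * collation and types like numeric (display scale).  Illustrative only.
     */
    static bool
    btree_dedup_is_possible_sketch(Relation index)
    {
        int         nkeyatts = IndexRelationGetNumberOfKeyAttributes(index);
        int         i;

        for (i = 0; i < nkeyatts; i++)
        {
            Oid         colloid = index->rd_indcollation[i];

            /* Nondeterministic collations treat distinct representations as equal */
            if (OidIsValid(colloid) && !get_collation_isdeterministic(colloid))
                return false;

            /* Opclass must promise that equality implies image equality */
            if (!opclass_equality_implies_image_equality(index, i))
                return false;
        }

        return true;
    }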
On Fri, Sep 27, 2019 at 9:43 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Attached is v19. Cool. > * By default deduplication is on for non-unique indexes and off for > unique ones. I think that it makes sense to enable deduplication by default -- even with unique indexes. It looks like deduplication can be very helpful with non-HOT updates. I have been benchmarking this using more or less standard pgbench at scale 200, with one big difference -- I also create an index on "pgbench_accounts (abalance)". This is a low cardinality index, which ends up about 3x smaller with the patch, as expected. It also makes all updates non-HOT updates, greatly increasing index bloat in the primary key of the accounts table -- this is what I found really interesting about this workload. The theory behind deduplication within unique indexes seems quite different to the cases we've focussed on so far -- that's why my working copy of the patch makes a few small changes to how _bt_dedup_one_page() works with unique indexes specifically (more on that later). With unique indexes, deduplication doesn't help by creating space -- it helps by creating *time* for garbage collection to run before the real "damage" is done -- it delays page splits. This is only truly valuable when page splits caused by non-HOT updates are delayed by so much that they're actually prevented entirely, typically because the _bt_vacuum_one_page() stuff can now happen before pages split, not after. In general, these page splits are bad because they degrade the B-Tree structure, more or less permanently (it's certainly permanent with this workload). Having a huge number of page splits *purely* because of non-HOT updates is particular bad -- it's just awful. I believe that this is the single biggest problem with the Postgres approach to versioned storage (we know that other DB systems have no primary key page splits with this kind of workload). Anyway, if you run this pgbench workload without rate-limiting, so that a patched Postgres does as much work as physically possible, the accounts table primary key (pgbench_accounts_pkey) at least grows at a slower rate -- the patch clearly beats master at the start of the benchmark/test (as measured by index size). As the clients are ramped up by my testing script, and as time goes on, eventually the size of the pgbench_accounts_pkey index "catches up" with master. The patch delays page splits, but ultimately the system as a whole cannot prevent the page splits altogether, since the server is being absolutely hammered by pgbench. Actually, the index is *exactly* the same size for both the master case and the patch case when we reach this "bloat saturation point". We can delay the problem, but we cannot prevent it. But what about a more realistic workload, with rate-limiting? When I add some rate limiting, so that the TPS/throughput is at about 50% of what I got the first time around (i.e. 50% of what is possible), or 15k TPS, it's very different. _bt_dedup_one_page() can now effectively cooperate with _bt_vacuum_one_page(). Now deduplication is able to "soak up all the extra garbage tuples" for long enough to delay and ultimately *prevent* almost all page splits. pgbench_accounts_pkey starts off at 428 MB for both master and patch (CREATE INDEX makes it that size). After about an hour, the index is 447 MB with the patch. The master case ends up with a pgbench_accounts_pkey size of 854 MB, though (this is very close to 857 MB, the "saturation point" index size from before). 
This is a very significant improvement, obviously -- the patch has an index that is ~52% of the size seen for the same index with the master branch! Here is how I changed _bt_dedup_one_page() for unique indexes to get this result: * We limit the size of posting lists to 5 heap TIDs in the checkingunique case. Right now, we will actually accept a checkingunique page split before we'll merge together items that result in a posting list with more heap TIDs than that (not sure about these details at all, though). * Avoid creating a new posting list that caller will have to split immediately anyway (this is based on details of _bt_dedup_one_page() caller's newitem tuple). (Not sure how much this customization contributes to this favorable test result -- maybe it doesn't make that much difference.) The goal here is for duplicates that are close together in both time and space to get "clumped together" into their own distinct, small-ish posting list tuples with no more than 5 TIDs. This is intended to help _bt_vacuum_one_page(), which is known to be a very important mechanism for indexes like our pgbench_accounts_pkey index (LP_DEAD bits are set very frequently within _bt_check_unique()). The general idea is to balance deduplication against LP_DEAD killing, and to increase spatial/temporal locality within these smaller posting lists. If we have one huge posting list for each value, then we can't set the LP_DEAD bit on anything at all, which is very bad. If we have a few posting lists that are not so big for each distinct value, we can often kill most of them within _bt_vacuum_one_page(), which is very good, and has minimal downside (i.e. we still get most of the benefits of aggressive deduplication). Interestingly, these non-HOT page splits all seem to "come in waves". I noticed this because I carefully monitored the benchmark/test case over time. The patch doesn't prevent the "waves of page splits" pattern, but it does make it much much less noticeable. > * New function _bt_dedup_is_possible() is intended to be a single place > to perform all the checks. Now it's just a stub to ensure that it works. > > Is there a way to extract this from existing opclass information, > or we need to add new opclass field? Have you already started this work? > I recall there was another thread, but didn't manage to find it. The thread is here: https://www.postgresql.org/message-id/flat/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com -- Peter Geoghegan
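(For illustration, the "no more than 5 heap TIDs per posting list in the checkingunique case" heuristic described above amounts to a check like the following inside the dedup loop -- a sketch; the constant and function names are assumptions.)

    #include "postgres.h"

    /*
     * Sketch of the checkingunique heuristic described above: keep posting
     * lists small in unique indexes, so that _bt_check_unique() can still
     * set LP_DEAD bits on most of the old versions.
     */
    #define BTDEDUP_UNIQUE_MAX_HTIDS    5

    static bool
    dedup_can_absorb(bool checkingunique, int pending_nhtids, int itup_nhtids)
    {
        if (checkingunique &&
            pending_nhtids + itup_nhtids > BTDEDUP_UNIQUE_MAX_HTIDS)
            return false;       /* flush the pending posting list instead */

        return true;
    }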
On Fri, Sep 27, 2019 at 7:02 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that it makes sense to enable deduplication by default -- even > with unique indexes. It looks like deduplication can be very helpful > with non-HOT updates. Attached is v20, which adds a custom strategy for the checkingunique (unique index) case to _bt_dedup_one_page(). It also makes deduplication the default for both unique and non-unique indexes. I simply altered your new BtreeDefaultDoDedup() macro from v19 to make nbtree use deduplication wherever it is safe to do so. This default may not be the best one in the end, though deduplication in unique indexes now looks very compelling. The new checkingunique heuristics added to _bt_dedup_one_page() were developed experimentally, based on pgbench tests. The general idea with the new checkingunique stuff is to make deduplication *extremely* lazy. We want to avoid making _bt_vacuum_one_page() garbage collection less effective by being too aggressive with deduplication -- workloads with lots of non-HOT-updates into unique indexes are greatly dependent on the LP_DEAD bit setting in _bt_check_unique(). At the same time, _bt_dedup_one_page() can be just as effective at delaying page splits as it is with non-unique indexes. I've found that my "regular pgbench, but with a low cardinality index on pgbench_accounts(abalance)" benchmark works best with the specific heuristics used in the patch, especially over many hours. I spent nearly 24 hours running the test at full speed (no throttling this time), at scale 500, and with very very aggressive autovacuum settings (autovacuum_vacuum_cost_delay=0ms, autovacuum_vacuum_scale_factor=0.02). Each run lasted one hour, with alternating runs of 4, 8, and 16 clients. Towards the end, the patch had about 5% greater throughput at lower client counts, and never seemed to be significantly slower (it was very slightly slower once or twice, but I think that that was just noise). More importantly, the indexes looked like this on master: bloated_abalance: 3017 MB pgbench_accounts_pkey: 2142 MB pgbench_branches_pkey: 1352 kB pgbench_tellers_pkey: 3416 kB And like this with the patch: bloated_abalance: 1015 MB pgbench_accounts_pkey: 1745 MB pgbench_branches_pkey: 296 kB pgbench_tellers_pkey: 888 kB * bloated_abalance is about 3x smaller here, as usual -- no surprises there. * pgbench_accounts_pkey is the most interesting case. You might think that it isn't that great that pgbench_accounts_pkey is 1745 MB with the patch, since it starts out at only 1071 MB (and would go back down to 1071 MB again if we were to do a REINDEX). However, you have to bear in mind that it takes a long time for it to get that big -- the growth over time is very important here. Even after the first run with 16 clients, it only reached 1160 MB -- that's an increase of ~8%. The master case had already reached 2142 MB ("bloat saturation point") by then, though. I could easily have stopped the benchmark there, or used rate-limiting, or excluded the 16 client case -- that would have allowed me to claim that the growth was under 10% for a workload where the master case has an index that doubles in size. On the other hand, if autovacuum wasn't configured to run very frequently, then the patch wouldn't look nearly this good. Deduplication helped autovacuum by "soaking up" the "recently dead" index tuples that cannot be killed right away. In short, the patch ameliorates weaknesses of the existing garbage collection mechanisms without changing them. 
The patch smoothed out the growth of pgbench_accounts_pkey over many hours. As I said, it was only 1160 MB after the first 3 hours/first 16 client run. It was 1356 MB after the second 16 client run (i.e. after running another round of one hour 4/8/16 client runs), finally finishing up at 1745 MB. So the growth in the size of pgbench_accounts_pkey for the patch was significantly improved, and the *rate* of growth over time was also improved.

The master branch had a terrible jerky growth in the size of pgbench_accounts_pkey. The master branch did mostly keep up at first (i.e. the size of pgbench_accounts_pkey wasn't too different at first). But once we got to 16 clients for the first time, after a couple of hours, pgbench_accounts_pkey almost doubled in size over a period of only 10 or 20 minutes! The index size *exploded* in a very short period of time, starting only a few hours into the benchmark. (Maybe we don't see anything like this with the patch because with the patch backends are more concerned about helping VACUUM, and less concerned about creating a mess that VACUUM must clean up. Not sure.)

* We also manage to make the small pgbench indexes (pgbench_branches_pkey and pgbench_tellers_pkey) over 4x smaller here (without doing anything to force more non-HOT updates on the underlying tables). This result for the two small indexes looks good, but I should point out that we still only fit ~15 or so tuples on each leaf page with the patch when everything is over -- far far less than the number that CREATE INDEX stored on the leaf pages immediately (it leaves 366 items on each leaf page). This is kind of an extreme case, because there is so much contention, but space utilization with the patch is actually very bad here. The master branch is very very very bad, though, so we're at least down to only a single "very" here. Progress.

Any thoughts on the approach taken for unique indexes within _bt_dedup_one_page() in v20? Obviously that stuff needs to be examined critically -- it's possible that it wouldn't do as well as it could or should with other workloads that I haven't thought about. Please take a look at the details.

--
Peter Geoghegan
Attachment
On Mon, Sep 30, 2019 at 7:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I've found that my "regular pgbench, but with a low cardinality index
> on pgbench_accounts(abalance)" benchmark works best with the specific
> heuristics used in the patch, especially over many hours.

I ran pgbench without the pgbench_accounts(abalance) index, and with slightly adjusted client counts -- you could say that this was a classic pgbench benchmark of v20 of the patch. Still scale 500, with single hour runs. Here are the results for each 1 hour run, with client counts of 8, 16, and 32, with two rounds of runs total:

master_1_run_8.out:  "tps = 25156.689415 (including connections establishing)"
patch_1_run_8.out:   "tps = 25135.472084 (including connections establishing)"
master_1_run_16.out: "tps = 30947.053714 (including connections establishing)"
patch_1_run_16.out:  "tps = 31225.044305 (including connections establishing)"
master_1_run_32.out: "tps = 29550.231969 (including connections establishing)"
patch_1_run_32.out:  "tps = 29425.011249 (including connections establishing)"
master_2_run_8.out:  "tps = 24678.792084 (including connections establishing)"
patch_2_run_8.out:   "tps = 24891.130465 (including connections establishing)"
master_2_run_16.out: "tps = 30878.930585 (including connections establishing)"
patch_2_run_16.out:  "tps = 30982.306091 (including connections establishing)"
master_2_run_32.out: "tps = 29555.453436 (including connections establishing)"
patch_2_run_32.out:  "tps = 29591.767136 (including connections establishing)"

This interlaced order is the same order that each 1 hour pgbench run actually ran in. The patch wasn't expected to do any better here -- it was expected to not be any slower for a workload that it cannot really help. Though the two small pgbench indexes do remain a lot smaller with the patch.

While a lot of work remains to validate the performance of the patch, this looks good to me.

--
Peter Geoghegan
On Mon, Sep 30, 2019 at 7:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v20, which adds a custom strategy for the checkingunique
> (unique index) case to _bt_dedup_one_page(). It also makes
> deduplication the default for both unique and non-unique indexes. I
> simply altered your new BtreeDefaultDoDedup() macro from v19 to make
> nbtree use deduplication wherever it is safe to do so. This default
> may not be the best one in the end, though deduplication in unique
> indexes now looks very compelling.

Attached is v21, which fixes some bitrot -- v20 of the patch was made totally unusable by today's commit 8557a6f1. Other changes:

* New datum_image_eq() patch fixes up datum_image_eq() to work with cstring/name columns, which we rely on. No need for a Valgrind suppression anymore. The suppression was only needed to paper over the fact that datum_image_eq() would not really work properly with cstring datums (the suppression was papering over a legitimate complaint, but we fix the underlying problem with 8557a6f1 and the v21-0001-* patch).

* New nbtdedup.c file added. This has all of the functions that dealt with deduplication and posting lists that were previously in nbtinsert.c and nbtutils.c. I think that this separation is somewhat cleaner.

* Additional tweaks to the custom checkingunique algorithm used by deduplication. This is based on further tuning from benchmarking. This is certainly not final yet.

* Greatly simplified the code for unique index LP_DEAD killing in _bt_check_unique(). This was pretty sloppy in v20 of the patch (it had two "goto" labels). Now it works with the existing loop conditions that advance to the next equal item on the page.

* Additional adjustments to the nbtree.h comments about the on-disk format.

Can you take a quick look at the first patch (the v21-0001-* patch), Anastasia? I would like to get that one out of the way soon.

--
Peter Geoghegan
Attachment
On Mon, Nov 4, 2019 at 11:52 AM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v21, which fixes some bitrot -- v20 of the patch was made > totally unusable by today's commit 8557a6f1. Other changes: There is more bitrot, so I attach v22. This also has some new changes centered around fixing particular issues with space utilization. These changes are: * nbtsort.c now intelligently considers the contribution of suffix truncation of posting list tuples when considering whether or not a leaf page is "full". I mean "full" in the sense that it has exceeded the soft limit (fillfactor-wise limit) on space utilization for the page (no change in how the hard limit in _bt_buildadd() works). We don't usually bother predicting the space saving from suffix truncation when considering split points, even in nbtsplitloc.c, but it's worth making an exception for posting lists (actually, this is the same exception that nbtsplitloc.c already had in much earlier versions of the patch). Posting lists are very often large enough to really make a big contribution to how balanced free space is. I now observe that weird cases where CREATE INDEX packs leaf pages too empty (or too full) are now all but eliminated. CREATE INDEX now does a pretty good job of respecting leaf fillfactor, while also allowing deduplication to be very effective (CREATE INDEX did neither of these two things in earlier versions of the patch). * Added "single value" strategy for retail insert deduplication -- this is closely related to nbtsplitloc.c's single value strategy. The general idea is that _bt_dedup_one_page() anticipates that a future "single value" page split is likely to occur, and therefore limits deduplication after two "1/3 of a page"-wide posting lists at the start of the page. It arranges for deduplication to leave a neat split point for nbtsplitloc.c to use when the time comes. In other words, the patch now allows "single value" page splits to leave leaf pages BTREE_SINGLEVAL_FILLFACTOR% full, just like v12/master. Leaving a small amount of free space on pages that are packed full of duplicates is always a good idea. Also, we no longer force page splits to leave pages 2/3 full (only two large posting lists plus a high key), which sometimes happened with v21. On balance, this change seems to slightly improve space utilization. In general, it's now unusual for retail insertions to get better space utilization than CREATE INDEX -- in that sense normality/balance has been restored in v22. Actually, I wrote the v22 changes by working through a list of weird space utilization issues from my personal notes. I'm pretty sure I've fixed all of those -- only nbtsplitloc.c's single value strategy wants to split at a point that leaves a heap TID in the new high key for the page, so that's the only thing we need to worry about within nbtdedup.c. * "deduplication" storage parameter now has psql completion. I intend to push the datum_image_eq() preparatory patch soon. I will also push a commit that makes _bt_keep_natts_fast() use datum_image_eq() separately. Anybody have an opinion on that? -- Peter Geoghegan
Attachment
On Fri, Nov 8, 2019 at 10:35 AM Peter Geoghegan <pg@bowt.ie> wrote: > There is more bitrot, so I attach v22. The patch has stopped applying once again, so I attach v23. One reason for the bitrot is that I pushed preparatory commits, including today's "Make _bt_keep_natts_fast() use datum_image_eq()" commit. Good to get that out of the way. Other changes: * Decided to go back to turning deduplication on by default with non-unique indexes, and off by default using unique indexes. The unique index stuff was regressed enough with INSERT-heavy workloads that I was put off, despite my initial enthusiasm for enabling deduplication everywhere. * Disabled deduplication in system catalog indexes by deeming it generally unsafe. I realized that it would be impossible to provide a way to disable deduplication in system catalog indexes if it was enabled at all. The reason for this is simple: in general, it's not possible to set storage parameters for system catalog indexes. While I think that deduplication should work with system catalog indexes on general principle, this is about an existing limitation. Deduplication in catalog indexes can be revisited if and when somebody figures out a way to make storage parameters work with system catalog indexes. * Basic user documentation -- this still needs work, but the basic shape is now in place. I think that we should outline how the feature works by describing the internals, including details of the data structures. This provides guidance to users on when they should disable or enable the feature. This is discussed in the existing chapter on B-Tree internals. This felt natural because it's similar to how GIN explains its compression related features -- the discussion of the storage parameters in the CREATE INDEX page of the docs links to a description of GIN internals from "66.4. Implementation [of GIN]". * nbtdedup.c "single value" strategy stuff now considers the contribution of the page high key when considering how to deduplicate such that nbtsplitloc.c's "single value" strategy has a usable split point that helps it to hit its target free space. Not a very important detail. It's nice to be consistent with the corresponding code within nbtsplitloc.c. * Worked through all remaining XXX/TODO/FIXME comments, except one: The one that talks about the need for opclass infrastructure to deal with cases like btree/numeric_ops, or text with a nondeterministic collation. The user docs now reference the BITWISE opclass stuff that we're discussing over on the other thread. That's the only really notable open item now IMV. -- Peter Geoghegan
Attachment
On Tue, Nov 12, 2019 at 6:22 PM Peter Geoghegan <pg@bowt.ie> wrote: > * Disabled deduplication in system catalog indexes by deeming it > generally unsafe. I (continue to) think that deduplication is a terrible name, because you're not getting rid of the duplicates. You are using a compressed representation of the duplicates. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Nov 13, 2019 at 11:33 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 12, 2019 at 6:22 PM Peter Geoghegan <pg@bowt.ie> wrote: > > * Disabled deduplication in system catalog indexes by deeming it > > generally unsafe. > > I (continue to) think that deduplication is a terrible name, because > you're not getting rid of the duplicates. You are using a compressed > representation of the duplicates. "Deduplication" never means that you get rid of duplicates. According to Wikipedia's deduplication article: "Whereas compression algorithms identify redundant data inside individual files and encodes this redundant data more efficiently, the intent of deduplication is to inspect large volumes of data and identify large sections – such as entire files or large sections of files – that are identical, and replace them with a shared copy". This seemed like it fit what this patch does. We're concerned with a specific, simple kind of redundancy. Also: * From the user's point of view, we're merging together what they'd call duplicates. They don't really think of the heap TID as part of the key. * The term "compression" suggests a decompression penalty when reading, which is not the case here. * The term "compression" confuses the feature added by the patch with TOAST compression. Now we may have two very different varieties of compression in the same index. Can you suggest an alternative? -- Peter Geoghegan
On Wed, Nov 13, 2019 at 2:51 PM Peter Geoghegan <pg@bowt.ie> wrote: > "Deduplication" never means that you get rid of duplicates. According > to Wikipedia's deduplication article: "Whereas compression algorithms > identify redundant data inside individual files and encodes this > redundant data more efficiently, the intent of deduplication is to > inspect large volumes of data and identify large sections – such as > entire files or large sections of files – that are identical, and > replace them with a shared copy". Hmm. Well, maybe I'm just behind the times. But that same wikipedia article also says that deduplication works on large chunks "such as entire files or large sections of files" thus differentiating it from compression algorithms which work on the byte level, so it seems to me that what you are doing still sounds more like ad-hoc compression. > Can you suggest an alternative? My instinct is to pick a name that somehow involves compression and just put enough other words in there to make it clear e.g. duplicate value compression, or something of that sort. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Nov 15, 2019 at 5:16 AM Robert Haas <robertmhaas@gmail.com> wrote: > Hmm. Well, maybe I'm just behind the times. But that same wikipedia > article also says that deduplication works on large chunks "such as > entire files or large sections of files" thus differentiating it from > compression algorithms which work on the byte level, so it seems to me > that what you are doing still sounds more like ad-hoc compression. I see your point. One reason for my avoiding the word "compression" is that other DB systems that have something similar don't use the word compression either. Actually, they don't really call it *anything*. Posting lists are simply the way that secondary indexes work. The "Modern B-Tree techniques" book/survey paper mentions the idea of using a TID list in its "3.7 Duplicate Key Values" section, not in the two related sections that follow ("Bitmap Indexes", and "Data Compression"). That doesn't seem like a very good argument, now that I've typed it out. The patch applies deduplication/compression/whatever at the point where we'd otherwise have to split the page, unlike GIN. GIN eagerly maintains posting lists (doing in-place updates for most insertions seems pretty bad to me). My argument could reasonably be made about GIN, which really does consider posting lists the natural way to store duplicate tuples. I cannot really make that argument about nbtree with this patch, though -- delaying a page split by re-encoding tuples (changing their physical representation without changing their logical contents) justifies using the word "compression" in the name. > > Can you suggest an alternative? > > My instinct is to pick a name that somehow involves compression and > just put enough other words in there to make it clear e.g. duplicate > value compression, or something of that sort. Does anyone else want to weigh in on this? Anastasia? I will go along with whatever the consensus is. I'm very close to the problem we're trying to solve, which probably isn't helping me here. -- Peter Geoghegan
On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I haven't measured how these changes affect WAL size yet. > > Do you have any suggestions on how to automate testing of new WAL records? > > Is there any suitable place in regression tests? > > I don't know about the regression tests (I doubt that there is a > natural place for such a test), but I came up with a rough test case. > I more or less copied the approach that you took with the index build > WAL reduction patches, though I also figured out a way of subtracting > heapam WAL overhead to get a real figure. I attach the test case -- > note that you'll need to use the "land" database with this. (This test > case might need to be improved, but it's a good start.) Today I used a test script similar to the "nbtree_wal_test.sql" test script I posted on September 11th. I am concerned about the WAL overhead for cases that don't benefit from the patch (usually because they turn off deduplication altogether). The details of the index tested were different this time, though. I used an index that had the smallest possible tuple size: 16 bytes (this is the smallest possible size on 64-bit systems, but that's what almost everybody uses these days). So any index with one or two int4 columns (or one int8 column) will generally have 16 byte IndexTuples, at least when there are no NULLs in the index. In general, 16 byte wide tuples are very, very common. What I saw suggests that we will need to remove the new "postingoff" field from xl_btree_insert. (We can create a new XLog record for leaf page inserts that also need to split a posting list, without changing much else.) The way that *alignment* of WAL records affects these common 16 byte IndexTuple cases is the real problem. Adding "postingoff" to xl_btree_insert increases the WAL required for INSERT_LEAF records by two bytes (sizeof(OffsetNumber)), as you'd expect -- pg_waldump output shows that they're 66 bytes, whereas they're only 64 bytes on the master branch. That doesn't sound that bad, but once you consider the alignment of whole records, it's really an extra 8 bytes. That is totally unacceptable. The vast majority of nbtree WAL records are bound to be INSERT_LEAF records, so as things stand we have added (almost) 12.5% space overhead to nbtree for these common cases, which don't benefit. I haven't really looked into other types of WAL record just yet. The real-world overhead that we're adding to xl_btree_vacuum records is something that I will have to look into separately. I'm already pretty sure that adding two bytes to xl_btree_split is okay, though, because they're far less numerous than xl_btree_insert records, and aren't affected by alignment in the same way (they're already several hundred bytes in almost all cases). I also noticed something positive: The overhead of xl_btree_dedup WAL records seems to be very low with indexes that have hundreds of logical tuples for each distinct integer value. We don't seem to have a problem with "deduplication thrashing". -- Peter Geoghegan
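To make the alignment point concrete, here is a minimal standalone sketch (not taken from the patch) that applies the usual 8-byte MAXALIGN-style rounding to the record payload sizes quoted above; the 64 and 66 byte figures are the pg_waldump numbers from this email:

    #include <stdio.h>

    /* Round up to an 8 byte boundary, as MAXALIGN() does on typical 64-bit builds */
    #define ALIGN8(len) (((len) + 7UL) & ~7UL)

    int
    main(void)
    {
        unsigned long master = 64;       /* INSERT_LEAF record payload today */
        unsigned long patched = 64 + 2;  /* plus a two byte "postingoff" field */

        printf("master:  %lu bytes -> %lu after alignment\n", master, ALIGN8(master));
        printf("patched: %lu bytes -> %lu after alignment\n", patched, ALIGN8(patched));
        /* Prints 64 -> 64 and 66 -> 72: the 2 byte field really costs 8 bytes */
        return 0;
    }

That 64 -> 72 jump is where the (almost) 12.5% figure for INSERT_LEAF records comes from.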
On 11/13/19 11:51 AM, Peter Geoghegan wrote: > Can you suggest an alternative? Dupression -- Mark Dilger
On Fri, Nov 15, 2019 at 5:43 PM Mark Dilger <hornschnorter@gmail.com> wrote: > On 11/13/19 11:51 AM, Peter Geoghegan wrote: > > Can you suggest an alternative? > > Dupression This suggestion makes me feel better about "deduplication". -- Peter Geoghegan
On Sun, Sep 15, 2019 at 3:47 AM Oleg Bartunov <obartunov@postgrespro.ru> wrote: > Is it worth to make a provision to add an ability to control how > duplicates are sorted ? Duplicates will continue to be sorted based on TID, in effect. We want to preserve the ability to perform retail index tuple deletion. I believe that that will become important in the future. > If we speak about GIN, why not take into > account our experiments with RUM (https://github.com/postgrespro/rum) > ? FWIW, I think that it's confusing that RUM almost shares its name with the "RUM conjecture": http://daslab.seas.harvard.edu/rum-conjecture/ -- Peter Geoghegan
Moin, On 2019-11-16 01:04, Peter Geoghegan wrote: > On Fri, Nov 15, 2019 at 5:16 AM Robert Haas <robertmhaas@gmail.com> > wrote: >> Hmm. Well, maybe I'm just behind the times. But that same wikipedia >> article also says that deduplication works on large chunks "such as >> entire files or large sections of files" thus differentiating it from >> compression algorithms which work on the byte level, so it seems to me >> that what you are doing still sounds more like ad-hoc compression. > > I see your point. > > One reason for my avoiding the word "compression" is that other DB > systems that have something similar don't use the word compression > either. Actually, they don't really call it *anything*. Posting lists > are simply the way that secondary indexes work. The "Modern B-Tree > techniques" book/survey paper mentions the idea of using a TID list in > its "3.7 Duplicate Key Values" section, not in the two related > sections that follow ("Bitmap Indexes", and "Data Compression"). > > That doesn't seem like a very good argument, now that I've typed it > out. The patch applies deduplication/compression/whatever at the point > where we'd otherwise have to split the page, unlike GIN. GIN eagerly > maintains posting lists (doing in-place updates for most insertions > seems pretty bad to me). My argument could reasonably be made about > GIN, which really does consider posting lists the natural way to store > duplicate tuples. I cannot really make that argument about nbtree with > this patch, though -- delaying a page split by re-encoding tuples > (changing their physical representation without changing their logical > contents) justifies using the word "compression" in the name. > >> > Can you suggest an alternative? >> >> My instinct is to pick a name that somehow involves compression and >> just put enough other words in there to make it clear e.g. duplicate >> value compression, or something of that sort. > > Does anyone else want to weigh in on this? Anastasia? > > I will go along with whatever the consensus is. I'm very close to the > problem we're trying to solve, which probably isn't helping me here. I'm in favor of deduplication and not compression. Compression is a more generic term and can involve deduplication, but it doesn't have to (it could, for instance, just encode things in a more compact form). Deduplication, on the other hand, does not involve compression: it just means storing each thing only once, which happens to save space much as compression does. ZFS also follows this by having both deduplication (store the same blocks only once, with references) and compression (compress block contents, regardless of whether they are stored once or many times). So my vote is for deduplication (if I understand the thread correctly, this is what the code now does: instead of storing the exact same key many times, it is stored only once, with references or a count?). best regards, Tels
On Fri, Nov 15, 2019 at 5:02 PM Peter Geoghegan <pg@bowt.ie> wrote: > What I saw suggests that we will need to remove the new "postingoff" > field from xl_btree_insert. (We can create a new XLog record for leaf > page inserts that also need to split a posting list, without changing > much else.) Attached is v24. This revision doesn't fix the problem with xl_btree_insert record bloat, but it does fix the bitrot against the master branch that was caused by commit 50d22de9. (This patch has had a surprisingly large number of conflicts against the master branch recently.) Other changes: * The pageinspect patch has been cleaned up. I now propose that it be committed alongside the main patch. The big change here is that posting lists are represented as an array of TIDs within bt_page_items(), much like gin_leafpage_items(). Also added documentation that goes into the ways in which ctid can be used to encode information (arguably some of this should have been included with the Postgres 12 B-Tree work). * Basic tests that cover deduplication within unique indexes. We ought to have code coverage of the case where _bt_check_unique() has to step right (actually, we don't have that on the master branch either). -- Peter Geoghegan
On Mon, Nov 18, 2019 at 05:26:37PM -0800, Peter Geoghegan wrote: > Attached is v24. This revision doesn't fix the problem with > xl_btree_insert record bloat, but it does fix the bitrot against the > master branch that was caused by commit 50d22de9. (This patch has had > a surprisingly large number of conflicts against the master branch > recently.) Please note that I have moved this patch to next CF per this last update. Anastasia, the ball is waiting on your side of the field, as the CF entry is marked as waiting on author for some time now. -- Michael
On Mon, Nov 18, 2019 at 5:26 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v24. This revision doesn't fix the problem with > xl_btree_insert record bloat Attached is v25. This version: * Adds more documentation. * Adds a new GUC -- btree_deduplication. A new GUC seems necessary. Users will want to be able to configure the feature system-wide. A storage parameter won't let them do that -- only a GUC will. This also makes it easy to enable the feature with unique indexes. * Fixes the xl_btree_insert record bloat issue. * Fixes a smaller issue with VACUUM/xl_btree_vacuum record bloat. We shouldn't be using noticeably more WAL than before, at least in cases that don't use deduplication. These two items fix cases where that was possible. There is a new refactoring patch included with v25 that helps with the xl_btree_vacuum issue. This new patch removes unnecessary "pin scan" code used by B-Tree VACUUMs, which was effectively disabled by commit 3e4b7d87 without being removed. This is independently useful work that I planned on doing already, and it also cleans things up for VACUUM with posting list tuples. It reclaims some space within the xl_btree_vacuum record type that was wasted (we don't even use the lastBlockVacuumed field anymore), allowing us to use that space for new deduplication-related fields without increasing total WAL space. Anastasia: I hope to be able to commit the first patch before too long. It would be great if you could review that. -- Peter Geoghegan
On Tue, Nov 12, 2019 at 3:21 PM Peter Geoghegan <pg@bowt.ie> wrote: > * Decided to go back to turning deduplication on by default with > non-unique indexes, and off by default using unique indexes. > > The unique index stuff was regressed enough with INSERT-heavy > workloads that I was put off, despite my initial enthusiasm for > enabling deduplication everywhere. I have changed my mind about this again. I now think that it would make sense to treat deduplication within unique indexes as a separate feature that cannot be disabled by the GUC at all (though we'd probably still respect the storage parameter for debugging purposes). I have found that fixing the WAL record size issue has helped remove what looked like a performance penalty for deduplication (but was actually just a general regression). Also, I have found a way of selectively applying deduplication within unique indexes that seems to have no downside, and considerable upside. The new criteria/heuristic for unique indexes is very simple: If a unique index has an existing item that is a duplicate on the incoming item at the point that we might have to split the page, then apply deduplication. Otherwise (when the incoming item has no duplicates), don't apply deduplication at all -- just accept that we'll have to split the page. We already cache the bounds of our initial binary search in insert state, so we can reuse that information within _bt_findinsertloc() when considering deduplication in unique indexes. This heuristic makes sense because deduplication within unique indexes should only target leaf pages that cannot possibly receive new values. In many cases, the only reason why almost all primary key leaf pages can ever split is because of non-HOT updates whose new HOT chain needs a new, equal entry in the primary key. This is the case with your standard identity column/serial primary key, for example (only the rightmost page will have a page split due to the insertion of new logical rows -- every other variety of page split must be due to new physical tuples/versions). I imagine that it is possible for a leaf page to be a "mixture" of these two basic/general tendencies, but not for long. It really doesn't matter if we occasionally fail to delay a page split where that was possible, nor does it matter if we occasionally apply deduplication when that won't delay a split for very long -- pretty soon the page will split anyway. A split ought to separate the parts of the keyspace that exhibit each tendency. In general, we're only interested in delaying page splits in unique indexes *indefinitely*, since in effect that will prevent them *entirely*. (So the goal is *significantly* different to our general goal for deduplication -- it's about buying time for VACUUM to run or whatever, rather than buying space.) This heuristic quite noticeably keeps the TPC-C "old order" table's PK from bloating, since that was the only unique index that is really affected by non-HOT UPDATEs (i.e. the UPDATE queries that touch that table happen to not be HOT-safe in general, which is not the case for any other table). It doesn't regress anything else from TPC-C, since there really isn't a benefit for other tables. More importantly, the working/draft version of the patch will often avoid a huge amount of bloat in a pgbench-style workload that has an extra index on the pgbench_accounts table, to prevent HOT updates. The accounts primary key (pgbench_accounts_pkey) hardly grows at all with the patch, but grows 2x on master.
This 2x space saving seems to occur reliably, unless there is a lot of contention on individual *pages*, in which case the bloat can be delayed but not prevented. We get that 2x space saving with either uniformly distributed random updates on pgbench_accounts (i.e. the pgbench default), or with a skewed distribution that hashes the PRNG's value. Hashing like this simulates a workload where the skew isn't concentrated in one part of the key space (i.e. there is skew, but very popular values are scattered throughout the index evenly, rather than being concentrated together in just a few leaf pages). Can anyone think of an adversarial case that we may not do so well on with the new "only deduplicate within unique indexes when new item already has a duplicate" strategy? I'm having difficulty identifying some kind of worst case. -- Peter Geoghegan
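In rough C, the trigger described in this email boils down to something like the sketch below. All of the names here are illustrative only -- the actual patch reuses the binary search bounds cached in the insertion state, inside _bt_findinsertloc():

    #include <stdbool.h>

    /*
     * Sketch of the unique-index deduplication trigger.  Illustrative names,
     * not the patch's actual interfaces.
     */
    typedef struct UniqueInsertState
    {
        bool    page_is_full;            /* incoming tuple would force a page split */
        bool    incoming_has_duplicate;  /* cached binary search found an equal key */
    } UniqueInsertState;

    typedef enum LeafAction
    {
        ACTION_PLAIN_INSERT,
        ACTION_DEDUPLICATE,              /* try to make room by merging duplicates */
        ACTION_SPLIT_PAGE
    } LeafAction;

    static LeafAction
    choose_leaf_action(const UniqueInsertState *state)
    {
        if (!state->page_is_full)
            return ACTION_PLAIN_INSERT;

        /*
         * Only deduplicate when the incoming item already has a duplicate on
         * the page -- the signature of version churn from non-HOT updates.
         * Otherwise the page holds distinct values, a split is the right
         * long-term outcome, and there is no point in delaying it.
         */
        if (state->incoming_has_duplicate)
            return ACTION_DEDUPLICATE;

        return ACTION_SPLIT_PAGE;
    }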
On Tue, Dec 3, 2019 at 12:13 PM Peter Geoghegan <pg@bowt.ie> wrote: > The new criteria/heuristic for unique indexes is very simple: If a > unique index has an existing item that is a duplicate on the incoming > item at the point that we might have to split the page, then apply > deduplication. Otherwise (when the incoming item has no duplicates), > don't apply deduplication at all -- just accept that we'll have to > split the page. > the working/draft version of the patch will often avoid a huge amount of > bloat in a pgbench-style workload that has an extra index on the > pgbench_accounts table, to prevent HOT updates. The accounts primary > key (pgbench_accounts_pkey) hardly grows at all with the patch, but > grows 2x on master. I have numbers from my benchmark against my working copy of the patch, with this enhanced design for unique index deduplication. With an extra index on pgbench_accounts's abalance column (that is configured to not use deduplication for the test), and with the aid variable (i.e. UPDATEs on pgbench_accounts) configured to use skew, I have a variant of the standard pgbench TPC-B like benchmark. The pgbench script I used was as follows:

\set r random_gaussian(1, 100000 * :scale, 4.0)
\set aid abs(hash(:r)) % (100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

Results from interlaced 2 hour runs at pgbench scale 5,000 are as follows (shown in reverse chronological order):

master_2_run_16.out: "tps = 7263.948703 (including connections establishing)"
patch_2_run_16.out: "tps = 7505.358148 (including connections establishing)"
master_1_run_32.out: "tps = 9998.868764 (including connections establishing)"
patch_1_run_32.out: "tps = 9781.798606 (including connections establishing)"
master_1_run_16.out: "tps = 8812.269270 (including connections establishing)"
patch_1_run_16.out: "tps = 9455.476883 (including connections establishing)"

The patch comes out ahead in the first 2 hour run, with later runs looking like a more even match. I think that each run didn't last long enough to even out the effects of autovacuum, but this is really about index size rather than overall throughput, so it's not that important. (I need to get a large server to do further performance validation work, rather than just running overnight benchmarks on my main work machine like this.) The primary key index (pgbench_accounts_pkey) starts out at 10.45 GiB in size, and ends at 12.695 GiB in size with the patch. Whereas with master, it also starts out at 10.45 GiB, but finishes off at 19.392 GiB. Clearly this is a significant difference -- the index is only ~65% of its master-branch size with the patch. See attached tar archive with logs, and pg_buffercache output after each run. (The extra index on pgbench_accounts.abalance is pretty much the same size for patch/master, since deduplication was disabled for the patch runs.) And, as I said, I believe that we can make this unique index deduplication stuff an internal thing that isn't even documented (maybe a passing reference is appropriate when talking about general deduplication).
-- Peter Geoghegan
On Tue, Dec 3, 2019 at 12:13 PM Peter Geoghegan <pg@bowt.ie> wrote: > The new criteria/heuristic for unique indexes is very simple: If a > unique index has an existing item that is a duplicate on the incoming > item at the point that we might have to split the page, then apply > deduplication. Otherwise (when the incoming item has no duplicates), > don't apply deduplication at all -- just accept that we'll have to > split the page. We already cache the bounds of our initial binary > search in insert state, so we can reuse that information within > _bt_findinsertloc() when considering deduplication in unique indexes. Attached is v26, which adds this new criteria/heuristic for unique indexes. We now seem to consistently get good results with unique indexes. Other changes: * A commit message is now included for the main patch/commit. * The btree_deduplication GUC is now a boolean, since it is no longer up to the user to indicate when deduplication is appropriate in unique indexes (the new heuristic does that instead). The GUC now only affects non-unique indexes. * Simplified the user docs. They now only mention deduplication of unique indexes in passing, in line with the general idea that deduplication in unique indexes is an internal optimization. * Fixed bug that made backwards scans that touch posting lists fail to set LP_DEAD bits when that was possible (i.e. the kill_prior_tuple optimization wasn't always applied there with posting lists, for no good reason). Also documented the assumptions made by the new code in _bt_readpage()/_bt_killitems() -- if that was clearer in the first place, then the LP_DEAD/kill_prior_tuple bug might never have happened. * Fixed some memory leaks in nbtree VACUUM. Still waiting for some review of the first patch, to get it out of the way. Anastasia? -- Peter Geoghegan
On Thu, Dec 12, 2019 at 06:21:20PM -0800, Peter Geoghegan wrote: > On Tue, Dec 3, 2019 at 12:13 PM Peter Geoghegan <pg@bowt.ie> wrote: > > The new criteria/heuristic for unique indexes is very simple: If a > > unique index has an existing item that is a duplicate on the incoming > > item at the point that we might have to split the page, then apply > > deduplication. Otherwise (when the incoming item has no duplicates), > > don't apply deduplication at all -- just accept that we'll have to > > split the page. We already cache the bounds of our initial binary > > search in insert state, so we can reuse that information within > > _bt_findinsertloc() when considering deduplication in unique indexes. > > Attached is v26, which adds this new criteria/heuristic for unique > indexes. We now seem to consistently get good results with unique > indexes. In the past we tried to increase the number of cases where HOT updates can happen but were unable to. Would this help with non-HOT updates? Do we have any benchmarks where non-HOT updates cause slowdowns that we can test on this? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Tue, Dec 17, 2019 at 1:58 PM Bruce Momjian <bruce@momjian.us> wrote: > > Attached is v26, which adds this new criteria/heuristic for unique > > indexes. We now seem to consistently get good results with unique > > indexes. > > In the past we tried to increase the number of cases where HOT updates > can happen but were unable to. Right -- the WARM project. The Z-heap project won't change the fundamentals here. It isn't going to solve the fundamental problem of requiring that the index AM create a new set of physical index tuples in at least *some* cases. A heap tuple cannot be updated in-place when even one indexed column changes -- you're not much better off than you were with the classic heapam, because indexes get bloated in a way that wouldn't happen with Oracle. (Even still, Z-heap isn't sensitive to when and how opportunistic heap pruning takes place, and doesn't have the same issue with having to fit the heap tuple on the same page or create a new HOT chain. This will make things much better with some workloads.) > Would this help with non-HOT updates? Definitely, yes. The strategy used with unique indexes is specifically designed to avoid "unnecessary" page splits altogether -- it only makes sense because of the possibility of non-HOT UPDATEs with mostly-unchanged index tuples. Thinking about what's going on here from first principles is what drove the unique index deduplication design: With many real world unique indexes, the true reason behind most or all B-Tree page splits is "version churn". I view these page splits as a permanent solution to a temporary problem -- we *permanently* degrade the index structure in order to deal with a *temporary* burst in versions that need to be stored. That's really bad. Consider a classic pgbench workload, for example. The smaller indexes on the smaller tables (pgbench_tellers_pkey and pgbench_branches_pkey) have leaf pages that will almost certainly be split a few minutes in, even though the UPDATEs on the underlying tables never modify indexed columns (i.e. even though HOT is as effective as it possibly could be with this unthrottled workload). Actually, even the resulting split pages will themselves usually be split again, and maybe even once more after that. We started out with leaf pages that stored just under 370 items on each leaf page (with fillfactor 90 + 8KiB BLCKSZ), and end up with leaf pages that often have less than 50 items (sometimes as few as 10). Even though the "logical contents" of the index are *totally* unchanged. This could almost be considered pathological by users. Of course, it's easy to imagine a case where it matters a lot more than classic pgbench (pgbench_tellers_pkey and pgbench_branches_pkey are always small, so it's easy to see the effect, which is why I went with that example). For example, you could have a long running transaction, which would probably have the effect of significantly bloating even the large pgbench index (pgbench_accounts_pkey) -- typically you won't see that with classic pgbench until you do something to frustrate VACUUM (and opportunistic cleanup). (I have mostly been using non-HOT UPDATEs to test the patch, though.) In theory we could go even further than this by having some kind of version store for indexes, and using this to stash old versions rather than performing a page split. Then you wouldn't have any page splits in the pgbench indexes; VACUUM would eventually be able to return the index to its "pristine" state. 
The trade-off with that design would be that index scans would have to access two index pages for a while (a leaf page, plus its subsidiary old version page). Maybe we can actually go that far in the future -- there are various database research papers that describe designs like this (the designs described within these papers do things like determine whether a "version split" or a "value split" should be performed). What we have now is an incremental improvement, that doesn't have any apparent downside with unique indexes -- the way that deduplication is triggered for unique indexes is almost certain to be a win. When deduplication isn't triggered, everything works in the same way as before -- it's "zero overhead" for unique indexes that don't benefit. The design augments existing garbage collection mechanisms, particularly the way in which we set LP_DEAD bits within _bt_check_unique(). > Do we have any benchmarks where non-HOT updates cause slowdowns that we > can test on this? AFAICT, any workload that has lots of non-HOT updates will benefit at least a little bit -- indexes will finish up smaller, there will be higher throughput, and there will be a reduction in latency for queries. With the right distribution of values, it's not that hard to mostly control bloat in an index that doubles in size without the optimization, which is much more significant. I have already reported on this [1]. I've also been able to observe increases of 15%-20% in TPS with similar workloads (with commensurate reductions in query latency) more recently. This was with a simple gaussian distribution for pgbench_accounts.aid, and a non-unique index with deduplication enabled on pgbench_accounts.abalance. (The patch helps control the size of both indexes, especially the extra non-unique one.) [1] https://postgr.es/m/CAH2-WzkXHhjhmUYfVvu6afbojU97MST8RUT1U=hLd2W-GC5FNA@mail.gmail.com -- Peter Geoghegan
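As a back-of-the-envelope check on the "just under 370 items" figure mentioned above, the arithmetic works out as follows (assuming the usual 24 byte page header, a 16 byte nbtree special area, 4 byte line pointers, and 16 byte index tuples):

    #include <stdio.h>

    int
    main(void)
    {
        int     blcksz = 8192;        /* BLCKSZ */
        int     page_header = 24;     /* page header, MAXALIGN'd */
        int     special = 16;         /* nbtree special space */
        int     line_pointer = 4;     /* per-item line pointer */
        int     tuple = 16;           /* smallest possible IndexTuple */
        double  fillfactor = 0.90;    /* leaf fillfactor 90 */

        double  usable = (blcksz - page_header - special) * fillfactor;
        int     items = (int) (usable / (tuple + line_pointer));

        printf("~%d items per leaf page\n", items);   /* prints ~366 */
        return 0;
    }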
On Tue, Dec 17, 2019 at 03:30:33PM -0800, Peter Geoghegan wrote: > With many real world unique indexes, the true reason behind most or > all B-Tree page splits is "version churn". I view these page splits as > a permanent solution to a temporary problem -- we *permanently* > degrade the index structure in order to deal with a *temporary* burst > in versions that need to be stored. That's really bad. Yes, I was wondering why we would need to optimize duplicates in a unique index, but then remembered that it is a version problem. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Tue, Dec 17, 2019 at 5:18 PM Bruce Momjian <bruce@momjian.us> wrote: > On Tue, Dec 17, 2019 at 03:30:33PM -0800, Peter Geoghegan wrote: > > With many real world unique indexes, the true reason behind most or > > all B-Tree page splits is "version churn". I view these page splits as > > a permanent solution to a temporary problem -- we *permanently* > > degrade the index structure in order to deal with a *temporary* burst > > in versions that need to be stored. That's really bad. > > Yes, I was wondering why we would need to optimize duplicates in a unique > index, but then remembered that it is a version problem. The whole idea of deduplication in unique indexes is hard to explain. It just sounds odd. Also, it works using the same infrastructure as regular deduplication, while having rather different goals. Fortunately, it seems like we don't really have to tell users about it in order for them to see a benefit -- there will be no choice for them to make there (they just get it). The regular deduplication stuff isn't confusing at all, though. It has a noticeable though small downside, so it will be documented and configurable. (I'm optimistic that it can be enabled by default, because even with high cardinality non-unique indexes the downside is rather small -- we waste some CPU cycles just before a page is split.) -- Peter Geoghegan
On Thu, Dec 12, 2019 at 6:21 PM Peter Geoghegan <pg@bowt.ie> wrote: > Still waiting for some review of the first patch, to get it out of the > way. Anastasia? I plan to commit this first patch [1] in the next day or two, barring any objections. It's clear that the nbtree "pin scan" VACUUM code is totally unnecessary -- it really should have been fully removed by commit 3e4b7d87 back in 2016. [1] https://www.postgresql.org/message-id/flat/CAH2-WzkWLRDzCaxsGvA_pZoaix_2AC9S6%3D-D6JMLkQYhqrJuEg%40mail.gmail.com#daed349a71ff9d7ac726cc0e3e01a436 -- Peter Geoghegan
On Tue, Dec 17, 2019 at 7:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > I plan to commit this first patch [1] in the next day or two, barring > any objections. I pushed this earlier today -- it became commit 9f83468b. Attached is v27, which fixes the bitrot against the master branch. Other changes: * Updated _bt_form_posting() to consistently MAXALIGN(). No behavioral changes here. The defensive SHORTALIGN()s we had in v26 should have been defensive MAXALIGN()s -- this has been fixed. We now also explain our precise assumptions around alignment. * Cleared up the situation around _bt_dedup_one_page()'s responsibilities as far as LP_DEAD items go. * Fixed a bug in 32 KiB BLCKSZ builds. We now apply an additional INDEX_SIZE_MASK cap on posting list tuple size. -- Peter Geoghegan
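To spell out why a 32 KiB BLCKSZ needs the extra cap: an index tuple's length is stored in the low 13 bits of t_info (INDEX_SIZE_MASK is 0x1FFF), so no posting list tuple can be allowed to grow past 8191 bytes regardless of page size. A rough sketch of the sizing rule, with the alignment handled more loosely than the real _bt_form_posting() does:

    #include <stddef.h>

    #define ALIGNOF_MAX         8      /* MAXIMUM_ALIGNOF on most platforms */
    #define MAXALIGN_SKETCH(l)  ((((size_t) (l)) + (ALIGNOF_MAX - 1)) & ~((size_t) (ALIGNOF_MAX - 1)))
    #define INDEX_SIZE_MASK     0x1FFF /* 13 bit size field in t_info */
    #define TID_SIZE            6      /* sizeof(ItemPointerData) */

    /* Rough size of a posting list tuple: key portion plus an array of heap TIDs */
    static size_t
    posting_tuple_size(size_t base_key_size, int nhtids)
    {
        return MAXALIGN_SKETCH(base_key_size + (size_t) nhtids * TID_SIZE);
    }

    /* Usable only if it fits both the page-based limit and t_info's size field */
    static int
    posting_tuple_size_ok(size_t size, size_t page_limit)
    {
        return size <= page_limit && size <= INDEX_SIZE_MASK;
    }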
On Thu, Dec 19, 2019 at 6:55 PM Peter Geoghegan <pg@bowt.ie> wrote: > I pushed this earlier today -- it became commit 9f83468b. Attached is > v27, which fixes the bitrot against the master branch. Attached is v28, which fixes bitrot from my recent commits to refactor VACUUM-related code in nbtpage.c. Other changes: * A big overhaul of the nbtree README changes -- "posting list splits" now becomes its own section. I tried to get the general idea across about posting lists in this new section without repeating myself too much. Posting list splits are probably the most subtle part of the overall design of the patch. Posting lists piggy-back on a standard atomic action (insertion into a leaf page, or leaf page split) on the one hand. On the other hand, they're a separate and independent step at the conceptual level. Hopefully the general idea comes across as clearly as possible. Some feedback on that would be good. * PageIndexTupleOverwrite() is now used for VACUUM's "updates", and has been taught to not unset an LP_DEAD bit that happens to already be set. As the comments added by my recent commit 4b25f5d0 now mention, it's important that VACUUM not unset LP_DEAD bits accidentally. VACUUM will falsely unset the BTP_HAS_GARBAGE page flag at times, which isn't ideal. Even still, unsetting LP_DEAD bits themselves is much worse (even though BTP_HAS_GARBAGE exists purely to hint that one or more LP_DEAD bits are set on the page). Maybe we should go further here, and reconsider whether or not VACUUM should *ever* unset BTP_HAS_GARBAGE. AFAICT, the only advantage of nbtree VACUUM clearing it is that doing so might save a backend a useless scan of the line pointer array to check for the LP_DEAD bits directly. But the backend will have to split the page when that happens anyway, which is a far greater cost. It's probably not even noticeable, since we're already doing lots of stuff with the page when it happens. The BTP_HAS_GARBAGE hint probably mattered back when the "getting tired" mechanism was used (i.e. prior to commit dd299df8). VACUUM sometimes had a choice to make about which page to use, so quickly getting an idea about LP_DEAD bits made a certain amount of sense...but that's not how it works anymore. (Granted, we still do it that way with pg_upgrade'd indexes from before Postgres 12, but I don't think that that needs to be given any weight now.) Thoughts on this? -- Peter Geoghegan
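For what it's worth, the LP_DEAD requirement is easy to picture from the caller's side. A sketch of what the overwrite must amount to (in the patch the preservation happens inside PageIndexTupleOverwrite() itself, so callers don't actually do this):

    #include "postgres.h"
    #include "access/itup.h"
    #include "storage/bufpage.h"
    #include "storage/itemid.h"

    /*
     * Sketch only: replace the tuple at "offnum" with a shrunken posting list
     * tuple without losing an LP_DEAD hint that is already set.
     */
    static void
    overwrite_preserving_dead_hint(Page page, OffsetNumber offnum,
                                   IndexTuple newtup, Size newsize)
    {
        bool    was_dead = ItemIdIsDead(PageGetItemId(page, offnum));

        if (!PageIndexTupleOverwrite(page, offnum, (Item) newtup, newsize))
            elog(ERROR, "failed to overwrite index tuple");

        /* The overwrite resets the line pointer's flags, so put the hint back */
        if (was_dead)
            ItemIdMarkDead(PageGetItemId(page, offnum));
    }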
On 04/01/2020 03:47, Peter Geoghegan wrote: > Attached is v28, which fixes bitrot from my recent commits to refactor > VACUUM-related code in nbtpage.c. I started to read through this gigantic patch. I got about 1/3 way through. I wrote minor comments directly in the attached patch file, search for "HEIKKI:". I wrote them as I read the patch from beginning to end, so it's possible that some of my questions are answered later in the patch. I didn't have the stamina to read through the whole patch yet, I'll continue later. One major design question here is about the LP_DEAD tuples. There's quite a lot of logic and heuristics and explanations related to unique indexes. To make them behave differently from non-unique indexes, to keep the LP_DEAD optimization effective. What if we had a separate LP_DEAD flag for every item in a posting list, instead? I think we wouldn't need to treat unique indexes differently from non-unique indexes, then. I tried to search this thread to see if that had been discussed already, but I didn't see anyone proposing that approach. Another important decision here is the on-disk format of these tuples. The format of IndexTuples on a b-tree page has become really complicated. The v12 changes to store TIDs in order did a lot of that, but this makes it even more complicated. I know there are strong backwards-compatibility reasons for the current format, but nevertheless, if we were to design this from scratch, what would the B-tree page and tuple format be like? - Heikki
On Wed, Jan 8, 2020 at 5:56 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On 04/01/2020 03:47, Peter Geoghegan wrote: > > Attached is v28, which fixes bitrot from my recent commits to refactor > > VACUUM-related code in nbtpage.c. > > I started to read through this gigantic patch. Oh come on, it's not that big. :-) > I got about 1/3 way > through. I wrote minor comments directly in the attached patch file, > search for "HEIKKI:". I wrote them as I read the patch from beginning to > end, so it's possible that some of my questions are answered later in > the patch. I didn't have the stamina to read through the whole patch > yet, I'll continue later. Thanks for the review! Anything that you've written that I do not respond to directly can be assumed to have been accepted by me. I'll start with responses to the points that you raise in your patch that need a response Patch comments ============== * Furthermore, deduplication can be turned on or off as needed, or applied HEIKKI: When would it be needed? I believe that hardly anybody will want to turn off deduplication in practice. My point here is that we're flexible -- we're not maintaining posting lists like GIN. We're just deduplicating as and when needed. We can change our preference about that any time. Turning off deduplication won't magically undo past deduplications, of course, but everything mostly works in the same way when deduplication is on or off. We're just talking about an alternative physical representation of the same logical contents. * HEIKKI: How do LP_DEAD work on posting list tuples? Same as before, except that it applies to all TIDs in the tuple together (will mention this in commit message, though). Note that the fact that we delay deduplication also means that we delay merging the LP_DEAD bits. And we always prefer to remove LP_DEAD items. Finally, we refuse to do a posting list split when its LP_DEAD bit is set, so it's now possible to delete LP_DEAD bit set tuples a little early, before a page split has to be avoided -- see the new code and comments at the end of _bt_findinsertloc(). See also: my later response to your e-mail remarks on LP_DEAD bits, unique indexes, and space accounting. * HEIKKI: When is it [deduplication] not safe? With opclasses like btree/numeric_ops, where display scale messes things up. See this thread for more information on the infrastructure that we need for that: https://www.postgresql.org/message-id/flat/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com * HEIKKI: Why is it safe to read on version 3 indexes? Because unused space is set to zeros? Yes. Same applies to version 4 indexes that come from Postgres 12 -- users must REINDEX to call _bt_opclasses_support_dedup() and set metapage field, but we can rely on the new field being all zeroes before that happens. (It would be possible to teach pg_upgrade to set the field for compatible indexes from Postgres 12, but I don't want to bother with that. We probably cannot safely call _bt_opclasses_support_dedup() with a buffer lock held, so that seems like the only way.) * HEIKKI: Do we need it as a separate flag, isn't it always safe with version 4 indexes, and never with version 3? No, it isn't *always* safe with version 4 indexes, for reasons that have nothing to do with the on-disk representation (like the display scale issue, nondeterministic collations, etc). It really is a distinct condition. (Deduplication is never safe with version 3 indexes, obviously.) 
It occurs to me now that we probably don't even want to make the metapage field about deduplication (though that's what it says right now). Rather, it should be about supporting a general category of optimizations that include deduplication, and might also include prefix compression in the future. Note that whether or not we should actually apply these optimizations is always a separate question. * + * Non-pivot tuples complement pivot tuples, which only have key columns. HEIKKI: What does it mean that they complement pivot tuples? It means that all tuples are either pivot tuples, or are non-pivot tuples. * + * safely (index storage parameter separately indicates if deduplication is HEIKKI: Is there really an "index storage parameter" for that? What is that, something in the WITH clause? Yes, there is actually an index storage parameter named "deduplication" (something in the WITH clause). This is deliberately not named "btree_deduplication", the current name of the GUC. This exists to make the optimization controllable at the index level. (Though I should probably mention the GUC first in this code comment, or not even mention the less significant storage parameter.) * HEIKKI: How much memory does this [BTScanPosData.items array of width MaxBTreeIndexTuplesPerPage] need now? Should we consider pallocing this separately? But BTScanPosData isn't allocated on the stack anyway. * HEIKKI: Would it be more clear to have a separate struct for the posting list split case? (i.e. don't reuse xl_btree_insert) I doubt it, but I'm open to it. We don't do it that way in a number of existing cases. * HEIKKI: Do we only generate one posting list in one WAL record? I would assume it's better to deduplicate everything on the page, since we're modifying it anyway. You might be right about that. Let me get back to you on that. HEIKKI: Does this [xl_btree_vacuum WAL record] store a whole copy of the remaining posting list on an updated tuple? Wouldn't it be simpler and more space-efficient to store just the deleted TIDs? It would certainly be more space efficient in cases where we delete some but not all TIDs -- hard to know how much that matters. Don't think that it would be simpler, though. I have an open mind about this. I can try it the other way if you like. * HEIKKI: Do we ever do that? Do we ever set the LP_DEAD bit on a posting list tuple? As I said, we are able to set LP_DEAD bits on posting list tuples, if and only if all the TIDs are dead (i.e. if all-but-one TID is dead, it cannot be set). This limitation does not seem to matter in practice, in part because LP_DEAD bits can be set before we deduplicate -- that's another benefit of delaying deduplication until the point where we'd usually have to split the page. See also: my later response to your e-mail remarks on LP_DEAD bits, unique indexes, and space accounting. * HEIKKI: Well, it's optimized for that today, but if it [a posting list] was compressed, a btree would be useful in more situations... I agree, but I think that we should do compression by inventing a new type of leaf page that only stores TIDs, and use that when we do a single value mode split in nbtsplitloc.c. So we don't even use tuples at that point (except the high key), and we compress the entire page. That way, we don't have to worry about posting list splits and stuff like that, which seems like the best of both worlds. Maybe we can use a true bitmap on these special leaf pages. ... Now to answer the feedback from your actual e-mail ... 
E-mail ====== > One major design question here is about the LP_DEAD tuples. There's > quite a lot of logic and heuristics and explanations related to unique > indexes. The unique index stuff hasn't really been discussed on the thread until now. Those parts are all my work. > To make them behave differently from non-unique indexes, to > keep the LP_DEAD optimization effective. What if we had a separate > LP_DEAD flag for every item in a posting list, instead? I think we > wouldn't need to treat unique indexes differently from non-unique > indexes, then. I don't think that that's quite true -- it's not so much about LP_DEAD bits as it is about our *goals* with unique indexes. We have no reason to deduplicate other than to delay an immediate page split, so it isn't really about space efficiency. Having individual LP_DEAD bits for each TID wouldn't change the picture for _bt_dedup_one_page() -- I would still want a checkingunique flag there. But individual LP_DEAD bits would make a lot of other things much more complicated. Unique indexes are kind of special, in general. The thing that I prioritized keeping simple in the patch is page space accounting, particularly the nbtsplitloc.c logic, which doesn't need any changes to continue to work (it's also important for page space accounting to be simple within _bt_dedup_one_page()). I did teach nbtsplitloc.c to take posting lists from the firstright tuple into account, but only because they're often unusually large, making it a worthwhile optimization. Exactly the same thing could take place with non-key INCLUDE columns, but typically the extra payload is not very large, so I haven't bothered with that before now. If you had a "supplemental header" to store per-TID LP_DEAD bits, that would make things complicated for page space accounting. Even if it was only one byte, you'd have to worry about it taking up an entire extra MAXALIGN() quantum within _bt_dedup_one_page(). And then there is the complexity within _bt_killitems(), needed to make the kill_prior_tuple stuff work. I might actually try to do it that way if I thought that it would perform better, or be simpler than what I came up with. I doubt that, though. In summary: while it would be possible to have per-TID LP_DEAD bits, but I don't think it would be even remotely worth it. I can go into my doubts about the performance benefits if you want. Note also: I tend to think of the LP_DEAD bit setting within _bt_check_unique() as almost a separate optimization to the kill_prior_tuple stuff, even though they both involve LP_DEAD bits. The former is much more important than the latter. The kill_prior_tuple thing was severely regressed in Postgres 9.5 without anyone really noticing [1]. > Another important decision here is the on-disk format of these tuples. > The format of IndexTuples on a b-tree page has become really > complicated. The v12 changes to store TIDs in order did a lot of that, > but this makes it even more complicated. It adds two new functions: BTreeTupleIsPivot(), and BTreeTupleIsPosting(). This means that there are three basic kinds of B-Tree tuple layout. We can detect which kind any given tuple is in a low context way. The three possible cases are: * Pivot tuples. * Posting list tuples (non-pivot tuples that have at least two head TIDs). * Regular/small non-pivot tuples -- this representation has never changed in all the time I've worked on Postgres. 
You'll notice that there are lots of new assertions, including in places that don't have anything to do with the new code -- BTreeTupleIsPivot() and BTreeTupleIsPosting() assertions. I think that there is only really one wart here that tends to come up outside the nbtree.h code itself again and again: the fact that !heapkeyspace indexes may give false negatives when BTreeTupleIsPivot() is used. So any BTreeTupleIsPivot() assertion has to include some nearby heapkeyspace field to cover that case (or else make sure that the index is a v4+/heapkeyspace index in some other way). Note, however, that we can safely assert !BTreeTupleIsPivot() -- that won't produce spurious assertion failures with heapkeyspace indexes. Note also that the new BTreeTupleIsPosting() function works reliably on all B-Tree versions. The only future requirements that I can anticipate for the tuple format in are: 1. The need to support wider TIDs. (I am strongly of the opinion that this shouldn't work all that differently to what we have now.) 2. The need for a page-level prefix compression feature. This can work by requiring decompression code to assume that the common prefix for the page just isn't present. This seems doable within the confines of the current/proposed B-Tree tuple format. Though we still need to have a serious discussion about the future of TIDs in light of stuff like ZedStore. I think that fully logical table identifiers are worth supporting, but they had better behave pretty much like a TID within index access method code -- they better show temporal and spatial locality in about the same way TIDs do. They should be compared as generic integers, and accept reasonable limits on TID width. It should be possible to do cheap binary searches on posting lists in about the same way. > I know there are strong > backwards-compatibility reasons for the current format, but > nevertheless, if we were to design this from scratch, what would the > B-tree page and tuple format be like? That's a good question, but my answer depends on the scope of the question. If you define "from scratch" to mean "5 years ago", then I believe that it would be exactly the same as what we have now. I specifically anticipated the need to have posting list TIDs (existing v12 era comments in nbtree.h and amcheck things about posting lists). And what I came up with is almost the same as the GIN format, except that we have explicit pivot tuples (to make suffix truncation work), and use the 13th IndexTupleData header bit (INDEX_ALT_TID_MASK) in a way that makes it possible to store non-pivot tuples in a space-efficient way when they are all unique. A plain GIN tuple used an extra MAXALIGN() quantum to store an entry tree tuple that only has one TID. If, on the other hand, you're talking about a totally green field situation, then I would probably not use IndexTuple at all. I think that a representation that stores offsets right in the tuple header (so no separate varlena headers) has more advantages than disadvantages. It would make it easier to do both suffix truncation and prefix compression. It also makes it cheap to skip to the end of the tuple. In general, it would be nice if the IndexTupleData TID was less special, but that assumption is baked into a lot of code -- most of which is technically not in nbtree. We expect very specific things about the alignment of TIDs -- they are assumed to be 3 SHORTALIGN()'d uint16 fields. 
Within nbtree, we assume SHORTALIGN()'d access to the t_info field by IndexTupleSize() will be okay within btree_xlog_split(). I bet that there are a number of subtle assumptions about our use of IndexTupleData + ItemPointerData that we have no idea about right now. So changing it won't be that easy. As for page level stuff, I believe that we mostly do things the right way already. I would prefer it if the line pointer array was at the end of the page so that tuples could go at the start of the page, and be appending front to back (maybe the special area would still be at the end). That's a very general issue, though -- Andres says that that would help branch prediction, though I'm not sure of the details offhand. Questions ========= Finally, I have some specific questions for you about the patch: 1. How do you feel about the design of posting list splits, and my explanation of that design in the nbtree README? 2. How do you feel about the idea of stopping VACUUM from clearing the BTP_HAS_GARBAGE page level flag? I suspect that it's much better to have it falsely set than to have it falsely unset. The extra cost is that we do a useless extra call to _bt_vacuum_one_page(), but that's very cheap in the context of having to deal with a page that's full, that we might have to split (or deduplicate) anyway. But the extra benefit could perhaps be quite large. This question doesn't really have that much to do with deduplication. [1] https://www.postgresql.org/message-id/flat/CAH2-Wz%3DSfAKVMv1x9Jh19EJ8am8TZn9f-yECipS9HrrRqSswnA%40mail.gmail.com#b20ead9675225f12b6a80e53e19eed9d -- Peter Geoghegan
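To visualize the three layouts listed above, they amount to a two-bit classifier along these lines (placeholder flag names, and ignoring the !heapkeyspace false-negative caveat; the real macros are BTreeTupleIsPivot() and BTreeTupleIsPosting()):

    /*
     * Illustrative classifier for the three nbtree tuple layouts.  ALT_FORMAT
     * and IS_POSTING are placeholders, not the patch's actual flag bits.
     */
    typedef enum BTTupleKind
    {
        BT_REGULAR_NONPIVOT,    /* plain heap TID in t_tid; unchanged format */
        BT_PIVOT,               /* key columns only, possibly truncated */
        BT_POSTING              /* key plus a sorted list of two or more heap TIDs */
    } BTTupleKind;

    #define ALT_FORMAT  0x0001  /* placeholder: tuple uses an alternative layout */
    #define IS_POSTING  0x0002  /* placeholder: the alternative layout is a posting list */

    static BTTupleKind
    classify_tuple(unsigned int flags)
    {
        if ((flags & ALT_FORMAT) == 0)
            return BT_REGULAR_NONPIVOT;
        if (flags & IS_POSTING)
            return BT_POSTING;
        return BT_PIVOT;
    }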
On Wed, Jan 8, 2020 at 2:56 PM Peter Geoghegan <pg@bowt.ie> wrote: > Thanks for the review! Anything that you've written that I do not > respond to directly can be assumed to have been accepted by me. Here is a version with most of the individual changes you asked for -- this is v29. I just pushed a couple of small tweaks to nbtree.h, that you suggested I go ahead with immediately. v29 also refactors some of the "single value strategy" stuff in nbtdedup.c. This is code that anticipates the needs of nbtsplitloc.c's single value strategy -- deduplication is designed to work together with page splits/nbtsplitloc.c. Still, v29 doesn't resolve the following points you've raised, where I haven't reached a final opinion on what to do myself. These items are as follows (I'm quoting your modified patch file sent on January 8th here): * HEIKKI: Do we only generate one posting list in one WAL record? I would assume it's better to deduplicate everything on the page, since we're modifying it anyway. * HEIKKI: Does xl_btree_vacuum WAL record store a whole copy of the remaining posting list on an updated tuple? Wouldn't it be simpler and more space-efficient to store just the deleted TIDs? * HEIKKI: Would it be more clear to have a separate struct for the posting list split case? (i.e. don't reuse xl_btree_insert) v29 of the patch also doesn't change anything about how LP_DEAD bits work, apart from going into the LP_DEAD stuff in the commit message. This doesn't seem to be in the same category as the other three open items, since it seems like we disagree here -- that must be worked out through further discussion and/or benchmarking. -- Peter Geoghegan
On Fri, Jan 10, 2020 at 1:36 PM Peter Geoghegan <pg@bowt.ie> wrote: > * HEIKKI: Do we only generate one posting list in one WAL record? I > would assume it's better to deduplicate everything on the page, since > we're modifying it anyway. Still thinking about this one. > * HEIKKI: Does xl_btree_vacuum WAL record store a whole copy of the > remaining posting list on an updated tuple? Wouldn't it be simpler and > more space-efficient to store just the deleted TIDs? This probably makes sense. The btreevacuumposting() code that generates "updated" index tuples (tuples that VACUUM uses to replace existing ones when some but not all of the TIDs need to be removed) was derived from GIN's ginVacuumItemPointers(). That approach works well enough, but we can do better now. It shouldn't be that hard. My preferred approach is a little different to your suggested approach of storing the deleted TIDs directly. I would like to make it work by storing an array of uint16 offsets into a posting list, one array per "updated" tuple (with one offset per deleted TID within each array). These arrays (which must include an array size indicator at the start) can appear in the xl_btree_vacuum record, at the same place the patch currently puts a raw IndexTuple. They'd be equivalent to a raw IndexTuple -- the REDO routine would reconstruct the same raw posting list tuple on its own. This approach seems simpler, and is clearly very space efficient. This approach is similar to the approach used by REDO routines to handle posting list splits. Posting list splits must call _bt_swap_posting() on the primary, while the corresponding REDO routines also call _bt_swap_posting(). For space efficient "updates", we'd have to invent a sibling utility function -- we could call it _bt_delete_posting(), and put it next to _bt_swap_posting() within nbtdedup.c. How do you feel about that approach? (And how do you feel about the existing "REDO routines call _bt_swap_posting()" business that it's based on?) > * HEIKKI: Would it be more clear to have a separate struct for the > posting list split case? (i.e. don't reuse xl_btree_insert) I've concluded that this one probably isn't worthwhile. We'd have to carry a totally separate record on the stack within _bt_insertonpg(). If you feel strongly about it, I will reconsider. -- Peter Geoghegan
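For concreteness, the shape of one such per-tuple entry in the xl_btree_vacuum record might look roughly like this (a hypothetical struct, just to show the layout being proposed):

    #include <stdint.h>

    /*
     * Hypothetical sketch of one "update" entry in an xl_btree_vacuum record.
     * Only the positions of the TIDs to delete are logged; REDO re-reads the
     * existing posting list tuple and rebuilds it without those TIDs.
     */
    typedef struct xl_btree_update_sketch
    {
        uint16_t    ndeletedtids;   /* length of the array that follows */
        /* uint16_t deletetids[ndeletedtids] follows, sorted in ascending order */
    } xl_btree_update_sketch;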
On Fri, Jan 10, 2020 at 1:36 PM Peter Geoghegan <pg@bowt.ie> wrote: > Still, v29 doesn't resolve the following points you've raised, where I > haven't reached a final opinion on what to do myself. These items are > as follows (I'm quoting your modified patch file sent on January 8th > here): Still no progress on these items, but I am now posting v30. A new version seems warranted, because I now want to revive a patch from a couple of years back as part of the deduplication project -- it would be good to get feedback on that sooner rather than later. This is a patch that you [Heikki] are already familiar with -- the patch to speed up compactify_tuples() [1]. Sokolov Yura is CC'd here, since he is the original author. The deduplication patch is much faster with this in place. For example, with v30:

pg@regression:5432 [25216]=# create unlogged table foo(bar int4);
CREATE TABLE
pg@regression:5432 [25216]=# create index unlogged_foo_idx on foo(bar);
CREATE INDEX
pg@regression:5432 [25216]=# insert into foo select g from generate_series(1, 1000000) g, generate_series(1,10) i;
INSERT 0 10000000
Time: 17842.455 ms (00:17.842)

If I revert the "Bucket sort for compactify_tuples" commit locally, then the same insert statement takes 31.614 seconds! In other words, the insert statement is made ~77% faster by that commit alone. The improvement is stable and reproducible. Clearly there is a big compactify_tuples() bottleneck that comes from PageIndexMultiDelete(). The hot spot is quite visible with "perf top -e branch-misses". The compactify_tuples() patch stalled because it wasn't clear if it was worth the trouble at the time. It was originally written to address a much smaller PageRepairFragmentation() bottleneck in heap pruning. ISTM that deduplication alone is a good enough reason to commit this patch. I haven't really changed anything about the 2017/2018 patch -- I need to do more review of that. We probably don't need the qsort() inlining stuff (the bucket sort thing is the real win), but I included it in v30 all the same. Other changes in v30: * We now avoid extra _bt_compare() calls within _bt_check_unique() -- no need to call _bt_compare() once per TID (once per equal tuple is quite enough). This is a noticeable performance win, even though the change was originally intended to make the logic in _bt_check_unique() clearer. * Reduced the limit on the size of a posting list tuple to 1/6 of a page -- down from 1/3. This seems like a good idea on the grounds that it keeps our options open if we split a page full of duplicates due to UPDATEs rather than INSERTs (i.e. we split a page full of duplicates that isn't also the rightmost page among pages that store only those duplicates). A lower limit is more conservative, and yet doesn't cost us that much space. * Refined nbtsort.c/CREATE INDEX to work sensibly with non-standard fillfactor settings. This last item is a minor bugfix, really. [1] https://commitfest.postgresql.org/14/1138/ -- Peter Geoghegan
On Tue, Jan 14, 2020 at 6:08 PM Peter Geoghegan <pg@bowt.ie> wrote: > Still no progress on these items, but I am now posting v30. A new > version seems warranted, because I now want to revive a patch from a > couple of years back as part of the deduplication project -- it would > be good to get feedback on that sooner rather than later. Actually, I decided that this wasn't necessary -- I won't be touching compactify_tuples() at all (at least not right now). Deduplication doesn't even need to use PageIndexMultiDelete() in the attached revision of the patch, v31, so speeding up compactify_tuples() is no longer relevant. v31 simplifies everything quite a bit. This is something that I came up with more or less as a result of following Heikki's feedback. I found that reviving the v17 approach of using a temp page buffer in _bt_dedup_one_page() (much like _bt_split() always has) was a good idea. This approach was initially revived in order to make dedup WAL logging work on a whole-page basis -- Heikki suggested we do it that way, and so now we do. But this approach is also a lot faster in general, and has additional benefits besides that. When we abandoned the temp buffer approach back in September of last year, the unique index stuff was totally immature and unsettled, and it looked like a very incremental approach might make sense for unique indexes. It doesn't seem like a good idea now, though. In fact, I no longer even believe that a custom checkingunique/unique index strategy in _bt_dedup_one_page() is useful. That is also removed in v31, which will also make Heikki happy -- he expressed a separate concern about the extra complexity there. I've done a lot of optimization work since September, making these simplification possible now. The problems that I saw that justified the complexity seem to have gone away now. I'm pretty sure that the recent _bt_check_unique() posting list tuple _bt_compare() optimization is the biggest part of that. The checkingunique/unique index strategy in _bt_dedup_one_page() always felt overfit to my microbenchmarks, so I'm glad to be rid of it. Note that v31 changes nothing about how we think about deduplication in unique indexes in general, nor how it is presented to users. There is still special criteria around how deduplication is *triggered* in unique indexes. We continue to trigger a deduplication pass based on seeing a duplicate within _bt_check_unique() + _bt_findinsertloc() -- otherwise we never attempt deduplication in a unique index (same as before). Plus the GUC still doesn't affect unique indexes, unique index deduplication still isn't really documented in the user docs (it just gets a passing mention in B-Tree internals section), etc. In my opinion, the patch is now pretty close to being committable. I do have two outstanding open items for the patch, though. These items are: * We still need infrastructure that marks B-Tree opclasses as safe for deduplication, to avoid things like the numeric display scale problem, collations that are unsafe for deduplication because they're nondeterministic, etc. I talked to Anastasia about this over private e-mail recently. This is going well; I'm expecting a revision later this week. It will be based on all feedback to date over on the other thread [1] that we have for this part of the project. * Make VACUUM's WAL record more space efficient when it contains one or more "updates" to an existing posting list tuple. 
Currently, when VACUUM must delete some but not all TIDs from a posting list, we generate a new posting list tuple and dump it into the WAL stream -- the REDO routine simply overwrites the existing item with a version lacking the TIDs that have to go. This could be space inefficient with certain workloads, such as workloads where only one or two TIDs are deleted from a very large posting list tuple again and again. Heikki suggested I do something about this. I intend to at least research the problem, and can probably go ahead with implementing it without any trouble.

What nbtree VACUUM does in the patch right now is roughly the same as what GIN's VACUUM does for posting lists within posting tree pages -- see ginVacuumPostingTreeLeaf() (we're horribly inefficient about WAL logging when VACUUM'ing a GIN entry tree leaf page, which works differently, and isn't what I'm talking about -- see ginVacuumEntryPage()). We might as well do better than GIN/ginVacuumPostingTreeLeaf() here if we can.

The patch is pretty clever about minimizing the volume of WAL in all other contexts, managing to avoid any other case of what could be described as "WAL space amplification". Maybe we should do the same with the xl_btree_vacuum record just to be *consistent* about it.

[1] https://www.postgresql.org/message-id/flat/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
--
Peter Geoghegan
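To make the unique-index trigger condition described above concrete, here is a minimal sketch of a workload where a unique index accumulates duplicates (the table, index, and row counts are illustrative only, not taken from the patch or its tests):

CREATE TABLE accounts (id integer PRIMARY KEY, balance integer NOT NULL);
CREATE INDEX accounts_balance_idx ON accounts (balance);
INSERT INTO accounts SELECT g, 0 FROM generate_series(1, 1000000) g;
-- Updating an indexed column defeats HOT, so every updated row gets a
-- second accounts_pkey entry for its (unchanged) id until VACUUM runs.
-- Each of those insertions passes through _bt_check_unique(), which sees
-- the older version of the same id -- the duplicate-within-a-unique-index
-- condition that triggers a deduplication pass in the patch.
UPDATE accounts SET balance = balance + 1;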
On Tue, Jan 28, 2020 at 5:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
> In my opinion, the patch is now pretty close to being committable.

Attached is v32, which is even closer to being committable.

> I do have two outstanding open items for the patch, though. These items
> are:
>
> * We still need infrastructure that marks B-Tree opclasses as safe for
> deduplication, to avoid things like the numeric display scale problem,
> collations that are unsafe for deduplication because they're
> nondeterministic, etc.

No progress on this item for v32, though. It's now my only open item for this entire project. Getting very close.

> * Make VACUUM's WAL record more space efficient when it contains one
> or more "updates" to an existing posting list tuple.

* I've focussed on this item in v32 -- it has been closed out.

v32 doesn't explicitly WAL-log post-update index tuples during vacuuming of posting list tuples, making the WAL records a lot smaller in some cases. v32 represents the posting list TIDs that must be deleted instead. It does this in the most WAL-space-efficient manner possible: by storing an array of uint16 offsets for each "updated" posting list within xl_btree_vacuum records -- each entry in each array is an offset to remove (i.e. a TID that should not appear in the updated version of the tuple).

We use a new nbtdedup.c utility function for this, _bt_update_posting(). The new function is similar to its neighbor function, _bt_swap_posting(), which is the nbtdedup.c utility function used during posting list splits. Just like _bt_swap_posting(), we call _bt_update_posting() both during the initial action, and from the REDO routine that replays that action.

Performing vacuuming of posting list tuples this way seems to matter with larger databases that depend on deduplication to control bloat, though I haven't taken the time to figure out exactly how much it matters. I'm satisfied that this is worth having based on microbenchmarks that measure WAL volume using pg_waldump. One microbenchmark showed something like a 10x decrease in the size of all xl_btree_vacuum records taken together compared to v31.

I'm pretty sure that v32 makes it all but impossible for deduplication to write out more WAL than an equivalent case with deduplication disabled (I'm excluding FPIs here, of course -- full_page_writes=on cases will see significant benefits from reduced FPIs, simply by having fewer index pages). The per-leaf-page WAL record header accounts for a lot of the space overhead of xl_btree_vacuum records, and we naturally reduce that overhead when deduplicating, so we can now noticeably come out ahead when it comes to overall WAL volume. I wouldn't say that reducing WAL volume (other than FPIs) is actually a goal of this project, but it might end up happening anyway. Apparently Microsoft Azure PostgreSQL uses full_page_writes=off, so not everyone cares about the number of FPIs (everyone cares about raw record size, though).

* Removed the GUC that controls the use of deduplication in this new version, per discussion with Robert over on the "Enabling B-Tree deduplication by default" thread. Perhaps we can get by with only an index storage parameter. Let's defer this until after the Postgres 13 beta period is over, and we get feedback from testers.

* Turned the documentation on deduplication in the B-Tree internals chapter into a more general discussion of the on-disk format that covers deduplication.
Deduplication enhances this on-disk representation, and discussing it outside that wider context always felt awkward to me. Having this kind of discussion in the docs seems like a good idea anyway. -- Peter Geoghegan
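As a sketch of the kind of setup that exercises the xl_btree_vacuum path discussed above -- large posting lists from which VACUUM trims only a few TIDs at a time -- something like the following could be used; the table, index name, and proportions are made up for illustration, and the resulting WAL volume would then be compared with pg_waldump, as described in the mail:

CREATE TABLE dup (val integer);
CREATE INDEX dup_val_idx ON dup (val);
-- 100 distinct key values with 10,000 rows each: plenty of large
-- posting lists after deduplication.
INSERT INTO dup SELECT g % 100 FROM generate_series(1, 1000000) g;
-- Delete roughly 1% of the rows, scattered across all keys, so VACUUM
-- must remove a handful of TIDs from many large posting list tuples.
DELETE FROM dup WHERE random() < 0.01;
VACUUM dup;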
On Thu, Feb 6, 2020 at 6:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v32, which is even closer to being committable.

Attached is v33, which adds the last piece we need: opclass infrastructure that tells nbtree whether or not deduplication can be applied safely. This is based on work by Anastasia that was shared with me privately.

I may not end up committing 0001-* as a separate patch, but it makes sense to post it that way to make review easier -- this is supposed to be infrastructure that isn't just useful for the deduplication patch. 0001-* adds a new C function, _bt_allequalimage(), which only actually gets called within code added by 0002-* (i.e. the patch that adds the deduplication feature).

At this point, my main concern is that I might not have the API exactly right in a world where these new support functions are used by more than just the nbtree deduplication feature. I would like to get detailed review of the new opclass infrastructure stuff, and have asked for it directly, but I don't think that committing the patch needs to block on that.

I've now written a fair amount of documentation for both the feature and the underlying opclass infrastructure. It probably needs a bit more copy-editing, but I think that it's generally in fairly good shape. It might be a good idea for those who would like to review the opclass stuff to start with some of my btree.sgml changes, and work backwards -- the shape of the API itself is the important thing within the 0001-* patch.

New opclass proc
================

In general, supporting deduplication is the rule for B-Tree opclasses, rather than the exception. Most can use the generic btequalimagedatum() routine as their support function 4, which unconditionally indicates that deduplication is safe. There is a new test that tries to catch opclasses that omitted to do this. Here are the opr_sanity.out changes added by the first patch:

-- Almost all Btree opclasses can use the generic btequalimagedatum function
-- as their equalimage proc (support function 4). Look for opclasses that
-- don't do so; newly added Btree opclasses will usually be able to support
-- deduplication with little trouble.
SELECT amproc::regproc AS proc, opf.opfname AS opfamily_name,
       opc.opcname AS opclass_name, opc.opcintype::regtype AS opcintype
FROM pg_am am
JOIN pg_opclass opc ON opc.opcmethod = am.oid
JOIN pg_opfamily opf ON opc.opcfamily = opf.oid
LEFT JOIN pg_amproc ON amprocfamily = opf.oid AND
    amproclefttype = opcintype AND
    amprocnum = 4
WHERE am.amname = 'btree' AND
    amproc IS DISTINCT FROM 'btequalimagedatum'::regproc
ORDER BY amproc::regproc::text, opfamily_name, opclass_name;
       proc        |  opfamily_name   |   opclass_name   |    opcintype
-------------------+------------------+------------------+------------------
 bpchar_equalimage | bpchar_ops       | bpchar_ops       | character
 btnameequalimage  | text_ops         | name_ops         | name
 bttextequalimage  | text_ops         | text_ops         | text
 bttextequalimage  | text_ops         | varchar_ops      | text
                   | array_ops        | array_ops        | anyarray
                   | enum_ops         | enum_ops         | anyenum
                   | float_ops        | float4_ops       | real
                   | float_ops        | float8_ops       | double precision
                   | jsonb_ops        | jsonb_ops        | jsonb
                   | money_ops        | money_ops        | money
                   | numeric_ops      | numeric_ops      | numeric
                   | range_ops        | range_ops        | anyrange
                   | record_image_ops | record_image_ops | record
                   | record_ops       | record_ops       | record
                   | tsquery_ops      | tsquery_ops      | tsquery
                   | tsvector_ops     | tsvector_ops     | tsvector
(16 rows)

Those types/opclasses that you see here with a "proc" that is NULL cannot use deduplication under any circumstances -- they have no pg_amproc entry for B-Tree support function 4. The other four rows at the start (those with a non-NULL "proc") are for collatable types, where using deduplication is conditioned on not using a nondeterministic collation. The details are in the sgml docs for the second patch, where I go into the issue with numeric display scale, why nondeterministic collations disable the use of deduplication, etc.

Note that these "equalimage" procs don't take any arguments, which is a first for an index AM support function. Even still, we can take a collation at CREATE INDEX time using the standard PG_GET_COLLATION() mechanism. I suppose that it's a little bit odd to have no arguments but still call PG_GET_COLLATION() in certain support functions. Still, it works just fine, at least as far as the needs of deduplication are concerned.

Since using deduplication is supposed to pretty much be the norm from now on, it seemed like it might make sense to add a NOTICE about it during CREATE INDEX -- a notice letting the user know that it isn't being used due to a lack of opclass support:

regression=# create table foo(bar numeric);
CREATE TABLE
regression=# create index on foo(bar);
NOTICE: index "foo_bar_idx" cannot use deduplication
CREATE INDEX

Note that this NOTICE isn't seen with an INCLUDE index, since that's expected to not support deduplication.

I have a feeling that not everybody will like this, which is why I'm pointing it out.

Thoughts?

--
Peter Geoghegan
From: Anastasia Lubennikova
On 14.02.2020 05:57, Peter Geoghegan wrote: > Attached is v33, which adds the last piece we need: opclass > infrastructure that tells nbtree whether or not deduplication can be > applied safely. This is based on work by Anastasia that was shared > with me privately. Thank you for this work. I've looked through the patches and they seem to be ready for commit. I haven't yet read recent documentation and readme changes, so maybe I'll send some more feedback tomorrow. > New opclass proc > ================ > > In general, supporting deduplication is the rule for B-Tree opclasses, > rather than the exception. Most can use the generic > btequalimagedatum() routine as their support function 4, which > unconditionally indicates that deduplication is safe. There is a new > test that tries to catch opclasses that omitted to do this. Here is > the opr_sanity.out changes added by the first patch: > > -- Almost all Btree opclasses can use the generic btequalimagedatum function > -- as their equalimage proc (support function 4). Look for opclasses that > -- don't do so; newly added Btree opclasses will usually be able to support > -- deduplication with little trouble. > SELECT amproc::regproc AS proc, opf.opfname AS opfamily_name, > opc.opcname AS opclass_name, opc.opcintype::regtype AS opcintype > FROM pg_am am > JOIN pg_opclass opc ON opc.opcmethod = am.oid > JOIN pg_opfamily opf ON opc.opcfamily = opf.oid > LEFT JOIN pg_amproc ON amprocfamily = opf.oid AND > amproclefttype = opcintype AND > amprocnum = 4 > WHERE am.amname = 'btree' AND > amproc IS DISTINCT FROM 'btequalimagedatum'::regproc > ORDER BY amproc::regproc::text, opfamily_name, opclass_name; > proc | opfamily_name | opclass_name | opcintype > -------------------+------------------+------------------+------------------ > bpchar_equalimage | bpchar_ops | bpchar_ops | character > btnameequalimage | text_ops | name_ops | name > bttextequalimage | text_ops | text_ops | text > bttextequalimage | text_ops | varchar_ops | text > | array_ops | array_ops | anyarray > | enum_ops | enum_ops | anyenum > | float_ops | float4_ops | real > | float_ops | float8_ops | double precision > | jsonb_ops | jsonb_ops | jsonb > | money_ops | money_ops | money > | numeric_ops | numeric_ops | numeric > | range_ops | range_ops | anyrange > | record_image_ops | record_image_ops | record > | record_ops | record_ops | record > | tsquery_ops | tsquery_ops | tsquery > | tsvector_ops | tsvector_ops | tsvector > (16 rows) > Is there any specific reason, why we need separate btnameequalimage, bpchar_equalimage and bttextequalimage functions? As far as I see, they have the same implementation. > Since using deduplication is supposed to pretty much be the norm from > now on, it seemed like it might make sense to add a NOTICE about it > during CREATE INDEX -- a notice letting the user know that it isn't > being used due to a lack of opclass support: > > regression=# create table foo(bar numeric); > CREATE TABLE > regression=# create index on foo(bar); > NOTICE: index "foo_bar_idx" cannot use deduplication > CREATE INDEX > > Note that this NOTICE isn't seen with an INCLUDE index, since that's > expected to not support deduplication. > > I have a feeling that not everybody will like this, which is why I'm > pointing it out. > > Thoughts? I would simply move it to debug level for all cases. Since from user's perspective it doesn't differ that much from the case where deduplication is applicable in general, but not very efficient due to data distribution. 
I also noticed that this is not consistent with ALTER INDEX. For example, alter index idx_n set (deduplicate_items = true); won't show any message about deduplication.

I've tried several combinations with an index on a numeric column:

1)
postgres=# create index idx_nd on tbl (n) with (deduplicate_items = true);
NOTICE: index "idx_nd" cannot use deduplication
CREATE INDEX

Here the message seems appropriate. I don't think we should restrict creation of the index even when the deduplicate_items parameter is set explicitly; rather, we may warn the user that it won't be efficient.

2)
postgres=# create index idx_n on tbl (n) with (deduplicate_items = false);
NOTICE: index "idx_n" cannot use deduplication
CREATE INDEX

In this case the message seems slightly strange to me. Why should we show a notice about the fact that deduplication is not possible if that is exactly what was requested?

3)
postgres=# create index idx on tbl (n);
NOTICE: index "idx" cannot use deduplication

In my opinion, this message is too specific for default behavior. It exposes internal details without explanation and may look to the user like something went wrong.
On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Thank you for this work. I've looked through the patches and they seem > to be ready for commit. > I haven't yet read recent documentation and readme changes, so maybe > I'll send some more feedback tomorrow. Great. > Is there any specific reason, why we need separate btnameequalimage, > bpchar_equalimage and bttextequalimage functions? > As far as I see, they have the same implementation. Not really. This approach allows us to reverse the decision to enable deduplication in a point release, which is theoretically useful. OTOH, if that's so important, why not have many more support function 4 implementations (one per opclass)? I suspect that we would just disable deduplication in a hard-coded fashion if we needed to disable it due to some issue that transpired. For example, we could do this by modifying _bt_allequalimage() itself. > I would simply move it to debug level for all cases. Since from user's > perspective it doesn't differ that much from the case where > deduplication is applicable in general, but not very efficient due to > data distribution. I was more concerned about cases where the user would really like to use deduplication, but wants to make sure that it gets used. And doesn't want to install pageinspect to find out. > I also noticed that this is not consistent with ALTER index. For > example, alter index idx_n set (deduplicate_items =true); won't show any > message about deduplication. But that's a change in the user's preference. Not a change in whether or not it's safe in principle. > In my opinion, this message is too specific for default behavior. It > exposes internal details without explanation and may look to user like > something went wrong. You're probably right about that. I just wish that there was some way of showing the same information that was discoverable, and didn't require the use of pageinspect. If I make it a DEBUG1 message, then it cannot really be documented. -- Peter Geoghegan
On Wed, Feb 19, 2020 at 11:16 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: > > Thank you for this work. I've looked through the patches and they seem > > to be ready for commit. > > I haven't yet read recent documentation and readme changes, so maybe > > I'll send some more feedback tomorrow. > > Great. I should add: I plan to commit the patch within the next 7 days. I believe that the design of deduplication itself is solid; it has many more strengths than weaknesses. It works in a way that complements the existing approach to page splits. The optimization can easily be turned off (and easily turned back on again). contrib/amcheck can detect almost any possible form of corruption that could affect a B-Tree index that has posting list tuples. I have spent months microbenchmarking every little aspect of this patch in isolation. I've also spent a lot of time on conventional benchmarking. It seems quite possible that somebody won't like some aspect of the user interface. I am more than willing to work with other contributors on any issue in that area that comes to light. I don't see any point in waiting for other hackers to speak up before the patch is committed, though. Anastasia posted the first version of this patch in August of 2015, and there have been over 30 revisions of it since the project was revived in 2019. Everyone has been given ample opportunity to offer input. -- Peter Geoghegan
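For anyone who wants to exercise that amcheck coverage themselves, a minimal sketch follows; the index name is hypothetical, while bt_index_check() and its heapallindexed argument are the existing contrib/amcheck interface:

CREATE EXTENSION IF NOT EXISTS amcheck;
-- Verify the index structure, and also confirm that every heap tuple has
-- a matching index entry -- posting list tuples included.
SELECT bt_index_check('some_dedup_idx'::regclass, true);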
From: Anastasia Lubennikova
On 19.02.2020 22:16, Peter Geoghegan wrote:
> On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> Thank you for this work. I've looked through the patches and they seem
>> to be ready for commit.
>> I haven't yet read recent documentation and readme changes, so maybe
>> I'll send some more feedback tomorrow.

The only thing I found is a typo in the comment:

+ int nhtids; /* Number of heap TIDs in nhtids array */

s/nhtids/htids

I don't think this patch really needs more nitpicking )

>> In my opinion, this message is too specific for default behavior. It
>> exposes internal details without explanation and may look to user like
>> something went wrong.
> You're probably right about that. I just wish that there was some way
> of showing the same information that was discoverable, and didn't
> require the use of pageinspect. If I make it a DEBUG1 message, then it
> cannot really be documented.

A user can discover this with a complex query to pg_index and pg_opclass. To simplify this, we can probably wrap this into a function or some field in pg_indexes. Anyway, I would wait for feedback from pre-release testers.
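For illustration, here is one version of the kind of catalog query being described -- a sketch only: it checks whether each opclass used by a given (hypothetical) index has a B-Tree support function 4 entry, and does not account for nondeterministic collations or INCLUDE columns:

SELECT i.indexrelid::regclass AS index,
       opc.opcname,
       p.amproc IS NOT NULL AS opclass_has_equalimage_proc
FROM pg_index i
JOIN pg_opclass opc ON opc.oid = ANY (i.indclass::oid[])
LEFT JOIN pg_amproc p
       ON p.amprocfamily = opc.opcfamily
      AND p.amproclefttype = opc.opcintype
      AND p.amprocrighttype = opc.opcintype
      AND p.amprocnum = 4
WHERE i.indexrelid = 'foo_bar_idx'::regclass;  -- hypothetical index name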
On Thu, Feb 20, 2020 at 7:38 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > I don't think this patch really needs more nitpicking ) But when has that ever stopped it? :-) > User can discover this with a complex query to pg_index and pg_opclass. > To simplify this, we can probably wrap this into function or some field > in pg_indexes. A function isn't a real user interface, though -- it probably won't be noticed. I think that there is a good chance that it just won't matter. The number of indexes that won't be able to support deduplication will be very small in practice. The important exceptions are INCLUDE indexes and nondeterministic collations. These exceptions make sense intuitively, and will be documented as limitations of those other features. The numeric/float thing doesn't really make intuitive sense, and numeric is an important datatype. Still, numeric columns and float columns seem to rarely get indexed. That just leaves container type opclasses, like anyarray and jsonb. Nobody cares about indexing container types with a B-Tree index, with the possible exception of expression indexes on a jsonb column. I don't see a way around that, but it doesn't seem all that important. Again, applications are unlikely to have more than one or two of those. The *overall* space saving will probably be almost as good as if the limitation did not exist. > Anyway, I would wait for feedback from pre-release testers. Right -- let's delay making a final decision on it. Just like the decision to enable it by default. It will work this way in the committed version, but that isn't supposed to be the final word on it. -- Peter Geoghegan
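As a concrete illustration of those two exceptions (a sketch only -- the collation definition is the ICU example from the CREATE COLLATION documentation and requires an ICU-enabled build; the table and index names are made up):

-- A nondeterministic collation: an index on a column using it cannot use
-- deduplication, because values that compare equal under the collation
-- need not be bitwise identical.
CREATE COLLATION case_insensitive
    (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
CREATE TABLE t (a text, b integer);
CREATE INDEX t_a_ci_idx ON t (a COLLATE case_insensitive);
-- An INCLUDE index: also unable to use deduplication.
CREATE INDEX t_b_incl_idx ON t (b) INCLUDE (a);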
On Thu, Feb 20, 2020 at 10:58 AM Peter Geoghegan <pg@bowt.ie> wrote: > I think that there is a good chance that it just won't matter. The > number of indexes that won't be able to support deduplication will be > very small in practice. The important exceptions are INCLUDE indexes > and nondeterministic collations. These exceptions make sense > intuitively, and will be documented as limitations of those other > features. I wasn't clear about the implication of what I was saying here, which is: I will make the NOTICE a DEBUG1 message, and leave everything else as-is in the initial committed version. -- Peter Geoghegan
On Thu, Feb 20, 2020 at 12:59 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I wasn't clear about the implication of what I was saying here, which
> is: I will make the NOTICE a DEBUG1 message, and leave everything else
> as-is in the initial committed version.

Attached is v34, which has this change. My plan is to commit something very close to this on Wednesday morning (barring any objections).

Other changes:

* Now, equalimage functions take a pg_type OID argument, allowing us to reuse the same generic pg_proc-wise function across many of the operator classes from the core distribution.

* Rewrote the docs for equalimage functions in the 0001-* patch.

* Lots of copy-editing of the "Implementation" section of the B-Tree doc chapter, most of which is about deduplication specifically.

--
Peter Geoghegan
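To show what that pg_type OID argument looks like from an operator class author's point of view, here is a sketch of an opclass opting in to deduplication. The type, comparison function, and operator class are hypothetical; the support function 4 name and signature follow the v34 patch as posted and may differ from what is ultimately committed:

CREATE OPERATOR CLASS mytype_ops
    DEFAULT FOR TYPE mytype USING btree AS
        OPERATOR 1 < ,
        OPERATOR 2 <= ,
        OPERATOR 3 = ,
        OPERATOR 4 >= ,
        OPERATOR 5 > ,
        FUNCTION 1 mytype_cmp(mytype, mytype),
        -- Support function 4: the generic routine that reports
        -- deduplication as unconditionally safe for this opclass.
        FUNCTION 4 btequalimagedatum(oid);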
On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v34, which has this change. My plan is to commit something > very close to this on Wednesday morning (barring any objections). Pushed. I'm going to delay committing the pageinspect patch until tomorrow, since I haven't thought about that aspect of the project in a while. Seems like a good idea to go through it one more time, once it's clear that the buildfarm is stable. The buildfarm appears to be stable now, though there was an issue with a compiler warning earlier. I quickly pushed a fix for that, and can see that longfin is green/passing now. Thanks for sticking with this project, Anastasia. -- Peter Geoghegan
On 2020/02/27 7:43, Peter Geoghegan wrote:
> On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> Attached is v34, which has this change. My plan is to commit something
>> very close to this on Wednesday morning (barring any objections).
>
> Pushed.

Thanks for committing this nice feature! Here is one minor comment.

+ <primary><varname>deduplicate_items</varname></primary>
+ <secondary>storage parameter</secondary>

This should be

<primary><varname>deduplicate_items</varname> storage parameter</primary>

<secondary> for a reloption is necessary only when a GUC parameter with the same name as the reloption exists. So, for example, you can see that <secondary> is used for vacuum_cleanup_index_scale_factor but not for the buffering reloption.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
On Wed, Feb 26, 2020 at 10:03 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > Thanks for committing this nice feature! You're welcome! > Here is one minor comment. > > + <primary><varname>deduplicate_items</varname></primary> > + <secondary>storage parameter</secondary> > > This should be > > <primary><varname>deduplicate_items</varname> storage parameter</primary> I pushed a fix for this. Thanks -- Peter Geoghegan
On 2020-02-26 14:43:27 -0800, Peter Geoghegan wrote: > On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Attached is v34, which has this change. My plan is to commit something > > very close to this on Wednesday morning (barring any objections). > > Pushed. Congrats!
On Fri, Mar 6, 2020 at 11:00 AM Andres Freund <andres@anarazel.de> wrote: > > Pushed. > > Congrats! Thanks Andres! -- Peter Geoghegan