Thread: [PROPOSAL] Effective storage of duplicates in B-tree index.
Hi, hackers!

I'm going to begin work on effective storage of duplicate keys in B-tree indexes.
The main idea is to implement posting lists and posting trees for B-tree index pages, as is already done for GIN.

In a nutshell, effective storage of duplicates in GIN is organised as follows.
The index stores a single index tuple for each unique key. That index tuple points to a posting list which contains pointers to heap tuples (TIDs). If too many rows have the same key, multiple pages are allocated for the TIDs, and these constitute a so-called posting tree.
You can find wonderful, detailed descriptions in the gin readme
(https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README)
and articles (http://www.cybertec.at/gin-just-an-index-type/).
It also makes it possible to apply a compression algorithm to the posting list/tree and significantly decrease index size. Read more in the presentation (part 1)
(http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf).

Now a new B-tree index tuple must be inserted for each table row that we index.
This can cause page splits. Because of MVCC, even a unique index can contain duplicates.
Storing duplicates in a posting list/tree helps to avoid superfluous splits.

So it seems to be a very useful improvement. Of course it requires a lot of changes in the B-tree implementation, so I need approval from the community.

1. Compatibility.
It's important to preserve compatibility with older index versions.
I'm going to change BTREE_VERSION to 3,
and use the new (posting) features for v3, keeping the old implementation for v2.
Any objections?

2. There are several tricks to handle non-unique keys in B-tree.
More info in the btree readme
(https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/README)
(chapter "Differences to the Lehman & Yao algorithm").
In the new version they'll become useless. Am I right?

3. Microvacuum.
Killed items are marked LP_DEAD and can be deleted from a page at the time of insertion.
Now that's fine, because each item corresponds to a separate TID. But the posting list implementation requires another approach. I've got two ideas:
The first is to mark LP_DEAD only those tuples where all TIDs are not visible.
The second is to add an LP_DEAD flag to each TID in the posting list (tree). This requires a bit more space, but allows microvacuum of the posting list/tree.
Which one is better?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
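[Editor's note: to make the target workload concrete, here is a minimal, purely illustrative setup; the table, column, and index names are invented. On an unpatched btree, the index stores the key once per row, so its size grows with the row count rather than with the number of distinct keys.]

-- illustrative only: a duplicate-heavy btree index on current PostgreSQL
create table events (id serial primary key, status int);        -- hypothetical table
insert into events (status) select i % 10 from generate_series(1, 1000000) i;
create index events_status_idx on events (status);              -- only 10 distinct keys
select pg_size_pretty(pg_relation_size('events_status_idx'));   -- key repeated in every leaf tuple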
Hi,

On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote:
...

In general, index size is often a serious issue - cases where indexes need
more space than the tables themselves are not uncommon in my experience. So I
think efforts to lower the space requirements of indexes are good.

But if we introduce posting lists into btree indexes, how different are they
from GIN? It seems to me that if I create a GIN index (using btree_gin), I get
mostly the same thing you propose, no?

Sure, there are differences - GIN indexes don't handle UNIQUE indexes - but
the compression can only be effective when there are duplicate rows. So either
the index is not UNIQUE (so the b-tree feature is not needed), or there are
many updates.

Which brings me to the other benefit of btree indexes - they are designed for
high concurrency. How much is this going to be affected by introducing the
posting lists?

kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote:
>> ...
>
> In general, index size is often a serious issue - cases where indexes need
> more space than the tables themselves are not uncommon in my experience. So I
> think efforts to lower the space requirements of indexes are good.
>
> But if we introduce posting lists into btree indexes, how different are they
> from GIN? It seems to me that if I create a GIN index (using btree_gin), I get
> mostly the same thing you propose, no?
Yes. In general, GIN is a btree with effective duplicate handling plus support for splitting single datums into multiple keys.
This proposal is mostly about porting the duplicate handling from GIN to btree.
> Sure, there are differences - GIN indexes don't handle UNIQUE indexes,
The difference between btree_gin and btree is not only the UNIQUE feature.
1) There is no gingettuple in GIN. GIN supports only bitmap scans, and it's not feasible to add gingettuple to GIN, at least not with the same semantics as in btree.
2) GIN doesn't support multicolumn indexes the way btree does. A multicolumn GIN is more like a set of separate single-column GINs: it doesn't have composite keys.
3) btree_gin can't handle range searches effectively. "a < x < b" would be handled as "a < x" intersected with "x < b", which is extremely inefficient (see the example below). It is possible to fix, but there is no clear proposal yet for how to fit this case into the GIN interface.
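[Editor's note: a rough illustration of point 3; btree_gin is a real contrib extension, but the table and index names below are invented and the exact plan chosen depends on the planner.]

-- illustrative: a btree_gin index evaluating a range predicate
create extension if not exists btree_gin;
create table t (x int);                                   -- hypothetical table
insert into t select i % 1000 from generate_series(1, 1000000) i;
create index t_x_gin on t using gin (x);                  -- btree_gin opclass for int4
-- the range is decomposed into two one-sided conditions that are intersected,
-- rather than a single descent to the lower bound as a plain btree would do
explain analyze select count(*) from t where x > 100 and x < 200;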
> but the compression can only be effective when there are duplicate rows. So
> either the index is not UNIQUE (so the b-tree feature is not needed), or there
> are many updates.
From my observations, users can use btree_gin only in limited cases. They like the compression, but mostly can't use btree_gin because of #1.
> Which brings me to the other benefit of btree indexes - they are designed for
> high concurrency. How much is this going to be affected by introducing the
> posting lists?
I'd note that the current handling of duplicates in PostgreSQL is a hack on top of the original btree design. It is specific to the btree access method in PostgreSQL, not to btrees in general.
Posting lists shouldn't change concurrency much. Currently, in btree you have to lock one page exclusively when inserting a new value.
When a posting list is small and fits in one page, you do a similar thing: exclusively lock one page to insert the new value.
When you have a posting tree, you have to take an exclusive lock on one page of the posting tree.
One could say that concurrency would get worse because the index becomes smaller and has fewer pages, so backends are more likely to contend for the same page. But that argument can be used against any compression and in favour of any bloat.
--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 09/01/2015 11:31 AM, Alexander Korotkov wrote: ... > > Yes, In general GIN is a btree with effective duplicates handling + > support of splitting single datums into multiple keys. > This proposal is mostly porting duplicates handling from GIN to btree. > > Sure, there are differences - GIN indexes don't handle UNIQUE indexes, > > > The difference between btree_gin and btree is not only UNIQUE feature. > 1) There is no gingettuple in GIN. GIN supports only bitmap scans. And > it's not feasible to add gingettuple to GIN. At least with same > semantics as it is in btree. > 2) GIN doesn't support multicolumn indexes in the way btree does. > Multicolumn GIN is more like set of separate singlecolumn GINs: it > doesn't have composite keys. > 3) btree_gin can't effectively handle range searches. "a < x < b" would > be hangle as "a < x" intersect "x < b". That is extremely inefficient. > It is possible to fix. However, there is no clear proposal how to fit > this case into GIN interface, yet. > > but the compression can only be effective when there are duplicate > rows. So either the index is not UNIQUE (so the b-tree feature is > not needed), or there are many updates. > > From my observations users can use btree_gin only in some cases. They > like compression, but can't use btree_gin mostly because of #1. Thanks for the explanation! I'm not that familiar with GIN internals, but this mostly matches my understanding. I have only mentioned UNIQUE because the lack of gettuple() method seems obvious - and it works fine when GIN indexes are used as "bitmap indexes". But you're right - we can't do index only scans on GIN indexes, which is a huge benefit of btree indexes. > > Which brings me to the other benefit of btree indexes - they are > designed for high concurrency. How much is this going to be affected > by introducing the posting lists? > > > I'd notice that current duplicates handling in PostgreSQL is hack over > original btree. It is designed so in btree access method in PostgreSQL, > not btree in general. > Posting lists shouldn't change concurrency much. Currently, in btree you > have to lock one page exclusively when you're inserting new value. > When posting list is small and fits one page you have to do similar > thing: exclusive lock of one page to insert new value. > When you have posting tree, you have to do exclusive lock on one page of > posting tree. OK. > > One can say that concurrency would became worse because index would > become smaller and number of pages would became smaller too. Since > number of pages would be smaller, backends are more likely concur for > the same page. But this argument can be user against any compression and > for any bloat. Which might be a problem for some use cases, but I assume we could add an option disabling this per-index. Probably having it "off" by default, and only enabling the compression explicitly. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Aug 31, 2015 at 12:41 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Now a new B-tree index tuple must be inserted for each table row that we
> index.
> This can cause page splits. Because of MVCC, even a unique index can contain
> duplicates.
> Storing duplicates in a posting list/tree helps to avoid superfluous splits.

I'm glad someone is thinking about this, because it is certainly needed. I
thought about working on it myself, but there is always something else to do.
I should be able to assist with review, though.

> 1. Compatibility.
> It's important to preserve compatibility with older index versions.
> I'm going to change BTREE_VERSION to 3,
> and use the new (posting) features for v3, keeping the old implementation
> for v2.
> Any objections?

It might be better to just have a flag bit for pages that are compressed --
there are IIRC 8 free bits in the B-Tree page special area flags variable. But
no real opinion on this from me, yet. You have plenty of bitspace to work with
to mark B-Tree pages, in any case.

> 2. There are several tricks to handle non-unique keys in B-tree.
> More info in the btree readme (chapter "Differences to the Lehman & Yao
> algorithm").
> In the new version they'll become useless. Am I right?

I think that the L&Y algorithm makes assumptions for the sake of simplicity,
rather than because they really believed that there were real problems. For
example, they say that deletion can occur offline or something along those
lines, even though that's clearly impractical. They say that because they
didn't want to write a paper about deletion within B-Trees, I suppose.

See also my opinion of how they claim to not need read locks [1]. Also, note
that despite the fact that the GIN README mentions "Lehman & Yao style right
links", it doesn't actually do the L&Y trick of avoiding lock coupling -- the
whole point of L&Y -- so that remark is misleading. This must be why B-Tree
has much better concurrency than GIN in practice.

Anyway, the way that I always imagined this would work is a layer "below" the
current implementation. In other words, you could easily have prefix
compression with a prefix that could end at a point within a reference
IndexTuple. It could be any arbitrary point in the second or subsequent
attribute, and would not "care" about the structure of the IndexTuple when it
comes to where attributes begin and end, etc. (although in reality it probably
would end up caring, because of the complexity -- not caring is the ideal
only, at least to me). As Alexander pointed out, GIN does not care about
composite keys. That seems quite different to a GIN posting list (something
that I know way less about, FYI). So I'm really talking about a slightly
different thing -- prefix compression, rather than handling duplicates.

Whether or not you should do prefix compression instead of deduplication is
certainly not clear to me, but it should be considered. Also, I always
imagined that prefix compression would use the highkey as the thing that is
offset for each "real" IndexTuple, because it's there anyway, and that's
simple. However, I suppose that means that duplicate handling can't really
work in a way that gives duplicates a fixed cost, which may be a particularly
important property to you.

> 3. Microvacuum.
> Killed items are marked LP_DEAD and can be deleted from a page at the time
> of insertion.
> Now that's fine, because each item corresponds to a separate TID. But the
> posting list implementation requires another approach. I've got two ideas:
> The first is to mark LP_DEAD only those tuples where all TIDs are not
> visible.
> The second is to add an LP_DEAD flag to each TID in the posting list (tree).
> This requires a bit more space, but allows microvacuum of the posting
> list/tree.

No real opinion on this point, except that I agree that doing something is
necessary.

A couple of further thoughts on this general topic:

* Currently, B-Tree must be able to store at least 3 items on each page, for
the benefit of the L&Y algorithm. You need room for 1 "highkey", plus 2
downlink IndexTuples. Obviously an internal B-Tree page is redundant if you
cannot get to any child page based on the scanKey value differing one way or
the other (so 2 downlinks are a sensible minimum), plus a highkey is usually
needed (just not on the rightmost page). As you probably know, we enforce this
by making sure every IndexTuple is no more than 1/3 of the size that will fit.
You should start thinking about how to deal with this in a world where the
physical size could actually be quite variable. The solution is probably to
simply pretend that every IndexTuple is its original size. This applies to
both prefix compression and duplicate suppression, I suppose.

* Since everything is aligned within B-Tree, it's probably worth considering
the alignment boundaries when doing prefix compression, if you want to go that
way. We can probably imagine a world where alignment is not required for
B-Tree, which would work on x86 machines, but I can't see it happening soon.
It isn't worth compressing unless it compresses enough to cross an "alignment
boundary", where we're not actually obliged to store as much data on disk.
This point may be obvious, not sure.

[1] http://www.postgresql.org/message-id/flat/CAM3SWZT-T9o_dchK8E4_YbKQ+LPJTpd89E6dtPwhXnBV_5NE3Q@mail.gmail.com#CAM3SWZT-T9o_dchK8E4_YbKQ+LPJTpd89E6dtPwhXnBV_5NE3Q@mail.gmail.com

--
Peter Geoghegan
01.09.2015 21:23, Peter Geoghegan:
> I'm glad someone is thinking about this, because it is certainly needed. I
> thought about working on it myself, but there is always something else to do.
> I should be able to assist with review, though.

Thank you)

> It might be better to just have a flag bit for pages that are compressed --
> there are IIRC 8 free bits in the B-Tree page special area flags variable.

Hmm.. If we are talking about storing duplicates in posting lists (and trees)
as in GIN, I don't see a way to apply it to some pages while not applying it
to others. See the notes below.

> I think that the L&Y algorithm makes assumptions for the sake of simplicity,
> rather than because they really believed that there were real problems.
> ...

Yes, thanks for the extensive explanation. I mean such tricks as moving right
in _bt_findinsertloc(), for example:

    /*----------
     * If we will need to split the page to put the item on this page,
     * check whether we can put the tuple somewhere to the right,
     * instead.  Keep scanning right until we
     *      (a) find a page with enough free space,
     *      (b) reach the last page where the tuple can legally go, or
     *      (c) get tired of searching.
     * (c) is not flippant; it is important because if there are many
     * pages' worth of equal keys, it's better to split one of the early
     * pages than to scan all the way to the end of the run of equal keys
     * on every insert.  We implement "get tired" as a random choice,
     * since stopping after scanning a fixed number of pages wouldn't work
     * well (we'd never reach the right-hand side of previously split
     * pages).  Currently the probability of moving right is set at 0.99,
     * which may seem too high to change the behavior much, but it does an
     * excellent job of preventing O(N^2) behavior with many equal keys.
     *----------
     */

If there are no multiple tuples with the same key, we shouldn't care about
this at all. It would be possible to skip these steps in the "effective B-tree
implementation". That's why I want to change btree_version.

> So I'm really talking about a slightly different thing -- prefix compression,
> rather than handling duplicates. Whether or not you should do prefix
> compression instead of deduplication is certainly not clear to me, but it
> should be considered.

You're right, those are two different techniques.

1. Effective storage of duplicates, which I propose, works with equal keys and
allows us to avoid repeating them. Index tuples are stored like this:

IndexTupleData + Attrs (key) | IndexTupleData + Attrs (key) | IndexTupleData + Attrs (key)

If all Attrs are equal, it seems reasonable not to repeat them, so we can
store them in the following structure:

MetaData + Attrs (key) | IndexTupleData | IndexTupleData | IndexTupleData

This is a posting list. It doesn't require significant changes to the index
page layout, because we can use an ordinary IndexTupleData for the meta
information. Each IndexTupleData has a fixed size, so it's easy to handle the
posting list as an array.

2. Prefix compression handles different keys and somehow compresses them. I
think that it will require non-trivial changes to the btree index tuple
representation. Furthermore, any compression leads to extra computation. For
now, I don't have a clear idea of how to implement this technique.

> * Currently, B-Tree must be able to store at least 3 items on each page, for
> the benefit of the L&Y algorithm. ... As you probably know, we enforce this
> by making sure every IndexTuple is no more than 1/3 of the size that will
> fit.

That is the point where a too-big posting list turns into a posting tree. But
I think that in the first patch I'll do it another way: just by splitting a
long posting list into 2 lists of appropriate length.

> * Since everything is aligned within B-Tree, it's probably worth considering
> the alignment boundaries when doing prefix compression, if you want to go
> that way. ... It isn't worth compressing unless it compresses enough to cross
> an "alignment boundary", where we're not actually obliged to store as much
> data on disk.

That is another reason why I doubt prefix compression, whereas effective
duplicate storage doesn't have this problem.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
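[Editor's note: the "before" layout described above, where every leaf item carries its own copy of the key, can be seen with pageinspect on any duplicate-heavy index; the index name here is the hypothetical one from the earlier example.]

-- illustrative: each leaf item of an unpatched btree repeats the key in 'data'
create extension if not exists pageinspect;
select itemoffset, ctid, itemlen, data
from bt_page_items('events_status_idx', 1)   -- block 0 is the metapage; block 1 is typically the first leaf
limit 10;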
On Thu, Sep 3, 2015 at 8:35 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
>> * Since everything is aligned within B-Tree, it's probably worth considering
>> the alignment boundaries when doing prefix compression, if you want to go
>> that way. ...
>
> That is another reason why I doubt prefix compression, whereas effective
> duplicate storage doesn't have this problem.

Okay. That sounds reasonable. I think duplicate handling is a good project.

A good learning tool for Postgres B-Trees -- or at least one of the better
ones -- is my amcheck tool. See:

https://github.com/petergeoghegan/postgres/tree/amcheck

This is a tool for verifying that B-Tree invariants hold, which is loosely
based on pageinspect. It checks that certain conditions hold for B-Trees. A
simple example is that all items on each page be in the correct, logical
order. Some invariants checked are far more complicated, though, and span
multiple pages or multiple levels. See the source code for exact details.

This tool works well when running the regression tests (see stress.sql -- I
used it with pgbench), with no problems reported last I checked. It often only
needs light locks on relations, and single shared locks on buffers. (Buffers
are copied to local memory for the tool to operate on, much like
contrib/pageinspect.)

While I have yet to formally submit amcheck to a CF (I once asked for input on
the goals for the project on -hackers), the comments are fairly comprehensive,
and it wouldn't be too hard to adapt this to guide your work on duplicate
handling. Maybe it'll happen for 9.6. Feedback appreciated.

The tool calls _bt_compare() for many things currently, but doesn't care about
many lower-level details, which is (very roughly speaking) the level that
duplicate handling will work at. You aren't actually proposing to change
anything about the fundamental structure that B-Tree indexes have, so the tool
could be quite useful and low-effort for debugging your code during
development.

Debugging this stuff is sometimes like keyhole surgery. If you could just see
at/get to the structure that you care about, it would be 10 times easier.
Hopefully this tool makes it easier to identify problems.

--
Peter Geoghegan
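[Editor's note: a hedged usage sketch only; the exact function names exposed by the linked branch are not shown in the thread, so bt_index_check(regclass) here is an assumption, and the index name is the hypothetical one from the earlier example.]

-- hypothetical invocation of the btree checker described above
create extension amcheck;
-- verify btree invariants (item ordering, page relationships) for one index,
-- taking only light relation locks and shared buffer locks
select bt_index_check('events_status_idx'::regclass);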
On Sun, Sep 27, 2015 at 4:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Debugging this stuff is sometimes like keyhole surgery. If you could just see
> at/get to the structure that you care about, it would be 10 times easier.
> Hopefully this tool makes it easier to identify problems.

I should add that the way that the L&Y technique works, and the way that
Postgres code is generally very robust/defensive, can make direct testing a
difficult thing. I have seen cases where a completely messed up B-Tree still
gave correct results most of the time, and was just slower. That can happen,
for example, because the "move right" thing results in a degenerate linear
scan of the entire index. The comparisons in the internal pages were totally
messed up, but it "didn't matter" once a scan could get to leaf pages and
could move right and find the value that way.

I wrote amcheck because I thought it was scary how B-Tree indexes could be
*completely* messed up without it being obvious; what hope is there of a test
finding a subtle problem in their structure, then? Testing the invariants
directly seemed like the only way to have a chance of not introducing bugs
when adding new stuff to the B-Tree code. I believe that adding optimizations
to the B-Tree code will be important in the next couple of years, and there is
no other way to approach it IMV.

--
Peter Geoghegan
31.08.2015 10:41, Anastasia Lubennikova:
> ...
I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists the same way as GIN does it.
Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
It seems that backward compatibility works well without any changes. But I haven't tested it properly yet.
Here are some test results. They were obtained with the test functions test_btbuild and test_ginbuild, which you can find in the attached sql file.
i - number of distinct values in the index. So i=1 means that all rows have the same key, and i=10000000 means that all keys are different.
The other columns contain the index size (MB).
        i | B-tree Old | B-tree New |         GIN
        1 | 214,234375 | 87,7109375 |  10,2109375
       10 | 214,234375 | 87,7109375 |    10,71875
      100 | 214,234375 |    87,4375 |   15,640625
     1000 | 214,234375 | 86,2578125 |   31,296875
    10000 | 214,234375 |  78,421875 | 104,3046875
   100000 | 214,234375 |  65,359375 |   49,078125
  1000000 | 214,234375 |  90,140625 | 106,8203125
 10000000 | 214,234375 | 214,234375 |    534,0625
You can note that the last row contains the same index sizes for B-tree, which is quite logical - there is no compression if all the keys are distinct.
The other cases look really nice to me.
Next thing to say is that I haven't implemented posting list compression yet, so there is still potential to decrease the size of the compressed btree further.
I'm almost sure there are still some tiny bugs and missing functions, but on the whole the patch is ready for testing.
I'd like to get feedback from testing the patch on some real datasets. Any bug reports and suggestions are welcome.
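[Editor's note: the attached test file is not reproduced here; a rough stand-alone equivalent of one build test, with an invented table name and i = 1000 distinct values over 10M rows, might look like this.]

-- build a 10M-row table with 1000 distinct key values and measure the index
create table btbuild_test (val int);
insert into btbuild_test select n % 1000 from generate_series(1, 10000000) n;
create index btbuild_test_idx on btbuild_test using btree (val);
select pg_size_pretty(pg_relation_size('btbuild_test_idx'));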
Here are a couple of useful queries to inspect the data inside the index pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral bt_page_items('idx', n) as bt;
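[Editor's note: the queries above look only at page 1; a sketch of the same idea extended to every page of the index follows. 'idx' is a placeholder name, and page 0 (the metapage) is skipped.]

-- per-page stats for the whole index
select n, bt.type, bt.live_items, bt.avg_item_size, bt.free_size
from generate_series(1, pg_relation_size('idx') / current_setting('block_size')::int - 1) as n,
     lateral bt_page_stats('idx', n::int) as bt;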
And finally, the list of items I'm going to complete in the near future:
1. Add a storage parameter 'enable_compression' for the btree access method, which specifies whether the index compresses duplicates. Default is 'off' (see the sketch below).
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with the compressed btree, which is obviously not what we want.
4. Clean up the code and comments, add related documentation.
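[Editor's note: purely a sketch of how item 1 could look from the user's side; this syntax does not exist yet, and the parameter name is simply the one proposed above.]

-- hypothetical: per-index opt-in to duplicate compression via a reloption
create index events_status_idx on events (status) with (enable_compression = on);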
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 28 January 2016 at 14:06, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
...
This doesn't apply cleanly against current git head. Have you caught up past commit 65c5fcd35?
Thom
28.01.2016 18:12, Thom Brown:
> This doesn't apply cleanly against current git head. Have you caught up past
> commit 65c5fcd35?

Thank you for the notice. New patch is attached.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 28 January 2016 at 16:12, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> Thank you for the notice. New patch is attached.
Thanks for the quick rebase.
Okay, a quick check with pgbench:
CREATE INDEX ON pgbench_accounts(bid);
Timing
Scale: master / patch
100: 10657ms / 13555ms (rechecked and got 9745ms)
500: 56909ms / 56985ms
Size
Scale: master / patch
100: 214MB / 87MB (40.7%)
500: 1071MB / 437MB (40.8%)
No performance issues from what I can tell.
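[Editor's note: for reference, a minimal way to reproduce the size comparison above, assuming a database initialised with pgbench at scale 100 (10 million rows in pgbench_accounts, 100 distinct bid values); the index name is PostgreSQL's default.]

-- build the index used in the timing test and check its on-disk size
create index on pgbench_accounts (bid);
select pg_size_pretty(pg_relation_size('pgbench_accounts_bid_idx'));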
I'm surprised that efficiencies can't be realised beyond this point. Your results show a sweet spot at around 1000 / 10000000, with it getting slightly worse beyond that. I kind of expected a lot of efficiency where all the values are the same, but perhaps that's due to my lack of understanding regarding the way they're being stored.
Thom
On Thu, Jan 28, 2016 at 9:03 AM, Thom Brown <thom@linux.com> wrote:
> I'm surprised that efficiencies can't be realised beyond this point. Your
> results show a sweet spot at around 1000 / 10000000, with it getting slightly
> worse beyond that. I kind of expected a lot of efficiency where all the
> values are the same, but perhaps that's due to my lack of understanding
> regarding the way they're being stored.

I think that you'd need an I/O bound workload to see significant benefits.
That seems unsurprising. I believe that random I/O from index writes is a big
problem for us.

--
Peter Geoghegan
On 28 January 2016 at 17:09, Peter Geoghegan <pg@heroku.com> wrote:
> I think that you'd need an I/O bound workload to see significant benefits.
> That seems unsurprising. I believe that random I/O from index writes is a big
> problem for us.

I was thinking more from the point of view of the index size. An index
containing 10 million duplicate values is around 40% of the size of an index
with 10 million unique values.

Thom
On 28 January 2016 at 17:03, Thom Brown <thom@linux.com> wrote:
...
Okay, now for some badness. I've restored a database containing 2 tables, one 318MB, another 24kB. The 318MB table contains 5 million rows with a sequential id column. I get a problem if I try to delete many rows from it:
# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
# delete from contacts where id % 3 != 0 ;
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
WARNING: out of shared memory
The query completes, but I get this message a lot before it does.
This happens even if I drop the primary key and foreign key constraints, so somehow the memory usage has massively increased with this patch.
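[Editor's note: a rough reconstruction of the failing case based only on the description above; the column layout and filler width are guesses, while the table name and row count come from the report.]

-- approximate reproduction: ~5M rows with a sequential id, then a bulk delete
create table contacts (id serial primary key, payload text);
insert into contacts (payload) select repeat('x', 40) from generate_series(1, 5000000);
delete from contacts where id % 3 != 0;   -- the statement that produced the "out of shared memory" warnings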
Thom
28.01.2016 20:03, Thom Brown:
> I'm surprised that efficiencies can't be realised beyond this point. Your
> results show a sweet spot at around 1000 / 10000000, with it getting slightly
> worse beyond that. I kind of expected a lot of efficiency where all the
> values are the same, but perhaps that's due to my lack of understanding
> regarding the way they're being stored.
Thank you for the prompt reply. I see what you're confused about. I'll try to clarify it.
First of all, what is implemented in the patch is not actually compression. It's more about index page layout changes to compact ItemPointers (TIDs).
Instead of TID + key, TID + key, ... we now store META + key + a list of TIDs (also known as a posting list).
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
TID (N item pointers, posting list offset) - this is the meta information. So, we have to store this meta information in addition to useful data.
The next point is the requirement of having a minimum of three tuples per page: we need at least two tuples to point to the children, plus the highkey.
This requirement leads to the limit on the maximum index tuple size (one third of the page).
That's why we have to store more meta information than meets the eye.
For example, say we have 100000 duplicates of the same key. It seems the compression should be really significant.
Something like 1 meta + 1 key instead of 100000 keys --> 6 bytes (the size of the meta TID) + key size instead of 600000.
But we have to split one huge posting list into smaller ones so that each fits into the index page.
It depends on the key size, of course. As I can see from pageinspect, an index on a single integer key has to split the tuples into pieces of size 2704 bytes, containing 447 TIDs per posting list.
So we have 1 meta + 1 key per 447 keys. As you can see, that is much less impressive than expected.
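[Editor's note: a back-of-envelope check of those numbers; it assumes an 8-byte IndexTupleData header, a 4-byte integer key, and 6-byte item pointers, and ignores alignment, so the exact overhead in the patch may differ slightly.]

-- how many item pointers fit in one 2704-byte chunk, roughly
select (2704 - 8 - 4) / 6 as approx_tids_per_chunk;   -- gives 448, close to the 447 reported by pageinspect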
There is also the idea of posting trees in GIN: the key is stored just once, and a posting list which doesn't fit into a page becomes a tree.
You can find a great article about it here: http://www.cybertec.at/2013/03/gin-just-an-index-type/
But I think that it's not the best way for the btree AM, because a posting tree isn't designed to handle concurrent insertions.
As I mentioned before, I'm going to implement prefix compression of the posting lists, which should be efficient and quite simple, since it's already implemented in GIN. You can find the presentation about it here: https://www.pgcon.org/2014/schedule/events/698.en.html
On 28 January 2016 at 16:12, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:28.01.2016 18:12, Thom Brown:Thank you for the notice. New patch is attached.On 28 January 2016 at 14:06, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:31.08.2015 10:41, Anastasia Lubennikova:Hi, hackers!
I'm going to begin work on effective storage of duplicate keys in B-tree index.
The main idea is to implement posting lists and posting trees for B-tree index pages as it's already done for GIN.
In a nutshell, effective storing of duplicates in GIN is organised as follows.
Index stores single index tuple for each unique key. That index tuple points to posting list which contains pointers to heap tuples (TIDs). If too many rows having the same key, multiple pages are allocated for the TIDs and these constitute so called posting tree.
You can find wonderful detailed descriptions in gin readme and articles.
It also makes possible to apply compression algorithm to posting list/tree and significantly decrease index size. Read more in presentation (part 1).
Now new B-tree index tuple must be inserted for each table row that we index.
It can possibly cause page split. Because of MVCC even unique index could contain duplicates.
Storing duplicates in posting list/tree helps to avoid superfluous splits.
I'd like to share the progress of my work. So here is a WIP patch.
It provides effective duplicate handling using posting lists the same way as GIN does it.
Layout of the tuples on the page is changed in the following way:
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
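To make the layout description above a bit more concrete, here is a rough, self-contained C sketch of how a posting tuple could look on a leaf page. The struct and field names are invented for illustration only; they are not the patch's actual definitions.

#include <stdint.h>

/*
 * Illustration only: a "meta" item pointer whose 6 bytes are reused to
 * store the number of heap TIDs and the offset of the posting list,
 * instead of a (block, offset) pair pointing to a single heap tuple.
 */
typedef struct SketchPostingMeta
{
    uint16_t    n_tids;         /* how many heap TIDs follow the key */
    uint16_t    unused;
    uint16_t    posting_offset; /* byte offset of the posting list
                                 * from the start of the index tuple */
} SketchPostingMeta;

/*
 * Conceptual layout of one compressed leaf tuple:
 *
 *   [ SketchPostingMeta | key value | TID, TID, TID, ... (n_tids items) ]
 *
 * whereas an uncompressed tuple is simply [ heap TID | key value ],
 * repeated once per duplicate.
 */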
It seems that backward compatibility works well without any changes. But I haven't tested it properly yet.
Here are some test results. They are obtained by test functions test_btbuild and test_ginbuild, which you can find in attached sql file.
i - number of distinct values in the index. So i=1 means that all rows have the same key, and i=10000000 means that all keys are different.
The other columns contain the index size (MB).
i          B-tree Old    B-tree New    GIN
1          214.234375    87.7109375    10.2109375
10         214.234375    87.7109375    10.71875
100        214.234375    87.4375       15.640625
1000       214.234375    86.2578125    31.296875
10000      214.234375    78.421875     104.3046875
100000     214.234375    65.359375     49.078125
1000000    214.234375    90.140625     106.8203125
10000000   214.234375    214.234375    534.0625
You can note that the last row contains the same index sizes for old and new B-tree, which is quite logical - there is no compression if all the keys are distinct.
The other cases look really nice to me.
The next thing to say is that I haven't implemented posting list compression yet, so there is still potential to decrease the compressed btree's size further.
I'm almost sure there are still some tiny bugs and missing functions, but on the whole the patch is ready for testing.
I'd like to get a feedback about the patch testing on some real datasets. Any bug reports and suggestions are welcome.
Here are a couple of useful queries to inspect the data inside the index pages:
create extension pageinspect;
select * from bt_metap('idx');
select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', n) as bt;
select n, bt.* from generate_series(1,1) as n, lateral bt_page_items('idx', n) as bt;
And at last, the list of items I'm going to complete in the near future:
1. Add a storage parameter 'enable_compression' for the btree access method, which specifies whether the index compresses duplicates. The default is 'off'.
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with the compressed btree, which is obviously not what we want.
4. Clean the code and comments, add related documentation.

This doesn't apply cleanly against current git head. Have you caught up past commit 65c5fcd35?

Thanks for the quick rebase. Okay, a quick check with pgbench:

CREATE INDEX ON pgbench_accounts(bid);

Timing
Scale: master / patch
100: 10657ms / 13555ms (rechecked and got 9745ms)
500: 56909ms / 56985ms

Size
Scale: master / patch
100: 214MB / 87MB (40.7%)
500: 1071MB / 437MB (40.8%)

No performance issues from what I can tell.

I'm surprised that efficiencies can't be realised beyond this point. Your results show a sweet spot at around 1000 / 10000000, with it getting slightly worse beyond that. I kind of expected a lot of efficiency where all the values are the same, but perhaps that's due to my lack of understanding regarding the way they're being stored.
Thank you for the prompt reply. I see what you're confused about. I'll try to clarify it.
First of all, what is implemented in the patch is not actually compression. It's more about index page layout changes to compact ItemPointers (TIDs).
Instead of TID+key, TID+key, ..., we now store META + key + list of TIDs (also known as a posting list).
before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
TID (N item pointers, posting list offset) - this is the meta information. So, we have to store this meta information in addition to useful data.
The next point is the requirement to fit at least three tuples on a page: we need at least two tuples to point to the children, plus the high key.
This requirement leads to the limit on the maximum index tuple size.
/*
 * Maximum size of a btree index entry, including its tuple header.
 *
 * We actually need to be able to fit three items on every page,
 * so restrict any one item to 1/3 the per-page available space.
 */
#define BTMaxItemSize(page) \
	MAXALIGN_DOWN((PageGetPageSize(page) - \
				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)

Although, I thought just now that this size could be increased for compressed tuples, at least for leaf pages.
That's the reason why we have to store more meta information than meets the eye.
For example, suppose we have 100000 duplicates of the same key. It seems that compression should be really significant:
something like 1 meta + 1 key instead of 100000 keys --> 6 bytes (the size of the meta TID) + keysize instead of 600000.
But we have to split one huge posting list into smaller ones so that each fits into an index page.
It depends on the key size, of course. As I can see from pageinspect, the index on a single integer key has to split the tuples into pieces of 2704 bytes, each containing 447 TIDs in its posting list.
So we have 1 meta + 1 key per 447 keys. As you can see, that is really less impressive than expected.
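As a sanity check of the 2704-byte / 447-TID figures above, here is a tiny standalone calculation. The 16-byte overhead value is an assumption of mine (index tuple header plus an integer key plus alignment padding), not a number taken from the patch.

#include <stdio.h>

int main(void)
{
    const int max_item = 2704;  /* BTMaxItemSize for an 8 kB page            */
    const int tid_size = 6;     /* sizeof(ItemPointerData)                   */
    const int overhead = 16;    /* assumed: tuple header + int key + padding */

    /* how many 6-byte TIDs fit in one max-sized posting tuple */
    printf("~%d TIDs fit in one posting tuple\n",
           (max_item - overhead) / tid_size);   /* prints ~448 */
    return 0;
}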
There is the idea of posting trees in GIN: the key is stored just once, and a posting list that doesn't fit into the page becomes a tree.
You can find an excellent article about it here: http://www.cybertec.at/2013/03/gin-just-an-index-type/
But I don't think that it's the best way for the btree am, because a posting tree is not designed to handle concurrent insertions.
As I mentioned before, I'm going to implement prefix compression of the posting list, which should be efficient and quite simple, since it's already implemented in GIN. You can find the presentation about it here: https://www.pgcon.org/2014/schedule/events/698.en.html
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
I tested this patch on x64 and ARM servers for a few hours today. The only problem I could find is that INSERT works considerably slower after applying the patch. Besides that, everything looks fine - no crashes, tests pass, memory doesn't seem to leak, etc.

> Okay, now for some badness. I've restored a database containing 2
> tables, one 318MB, another 24kB. The 318MB table contains 5 million
> rows with a sequential id column. I get a problem if I try to delete
> many rows from it:
> # delete from contacts where id % 3 != 0 ;
> WARNING: out of shared memory
> WARNING: out of shared memory
> WARNING: out of shared memory

I didn't manage to reproduce this. Thom, could you describe the exact steps to reproduce this issue, please?
On 29 January 2016 at 15:47, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: > I tested this patch on x64 and ARM servers for a few hours today. The > only problem I could find is that INSERT works considerably slower after > applying a patch. Beside that everything looks fine - no crashes, tests > pass, memory doesn't seem to leak, etc. > >> Okay, now for some badness. I've restored a database containing 2 >> tables, one 318MB, another 24kB. The 318MB table contains 5 million >> rows with a sequential id column. I get a problem if I try to delete >> many rows from it: >> # delete from contacts where id % 3 != 0 ; >> WARNING: out of shared memory >> WARNING: out of shared memory >> WARNING: out of shared memory > > I didn't manage to reproduce this. Thom, could you describe exact steps > to reproduce this issue please? Sure, I used my pg_rep_test tool to create a primary (pg_rep_test -r0), which creates an instance with a custom config, which is as follows: shared_buffers = 8MB max_connections = 7 wal_level = 'hot_standby' cluster_name = 'primary' max_wal_senders = 3 wal_keep_segments = 6 Then create a pgbench data set (I didn't originally use pgbench, but you can get the same results with it): createdb -p 5530 pgbench pgbench -p 5530 -i -s 100 pgbench And delete some stuff: thom@swift:~/Development/test$ psql -p 5530 pgbench Timing is on. psql (9.6devel) Type "help" for help. ➤ psql://thom@[local]:5530/pgbench # DELETE FROM pgbench_accounts WHERE aid % 3 != 0; WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory ... WARNING: out of shared memory WARNING: out of shared memory DELETE 6666667 Time: 22218.804 ms There were 358 lines of that warning message. I don't get these messages without the patch. Thom
29.01.2016 19:01, Thom Brown: > On 29 January 2016 at 15:47, Aleksander Alekseev > <a.alekseev@postgrespro.ru> wrote: >> I tested this patch on x64 and ARM servers for a few hours today. The >> only problem I could find is that INSERT works considerably slower after >> applying a patch. Beside that everything looks fine - no crashes, tests >> pass, memory doesn't seem to leak, etc. Thank you for testing. I rechecked that, and insertions are really very very very slow. It seems like a bug. >>> Okay, now for some badness. I've restored a database containing 2 >>> tables, one 318MB, another 24kB. The 318MB table contains 5 million >>> rows with a sequential id column. I get a problem if I try to delete >>> many rows from it: >>> # delete from contacts where id % 3 != 0 ; >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> WARNING: out of shared memory >> I didn't manage to reproduce this. Thom, could you describe exact steps >> to reproduce this issue please? > Sure, I used my pg_rep_test tool to create a primary (pg_rep_test > -r0), which creates an instance with a custom config, which is as > follows: > > shared_buffers = 8MB > max_connections = 7 > wal_level = 'hot_standby' > cluster_name = 'primary' > max_wal_senders = 3 > wal_keep_segments = 6 > > Then create a pgbench data set (I didn't originally use pgbench, but > you can get the same results with it): > > createdb -p 5530 pgbench > pgbench -p 5530 -i -s 100 pgbench > > And delete some stuff: > > thom@swift:~/Development/test$ psql -p 5530 pgbench > Timing is on. > psql (9.6devel) > Type "help" for help. > > > ➤ psql://thom@[local]:5530/pgbench > > # DELETE FROM pgbench_accounts WHERE aid % 3 != 0; > WARNING: out of shared memory > WARNING: out of shared memory > WARNING: out of shared memory > WARNING: out of shared memory > WARNING: out of shared memory > WARNING: out of shared memory > WARNING: out of shared memory > ... > WARNING: out of shared memory > WARNING: out of shared memory > DELETE 6666667 > Time: 22218.804 ms > > There were 358 lines of that warning message. I don't get these > messages without the patch. > > Thom Thank you for this report. I tried to reproduce it, but I couldn't. Debug will be much easier now. I hope I'll fix these issueswithin the next few days. BTW, I found a dummy mistake, the previous patch contains some unrelated changes. I fixed it in the new version (attached). -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 29 January 2016 at 16:50, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > 29.01.2016 19:01, Thom Brown: >> >> On 29 January 2016 at 15:47, Aleksander Alekseev >> <a.alekseev@postgrespro.ru> wrote: >>> >>> I tested this patch on x64 and ARM servers for a few hours today. The >>> only problem I could find is that INSERT works considerably slower after >>> applying a patch. Beside that everything looks fine - no crashes, tests >>> pass, memory doesn't seem to leak, etc. > > Thank you for testing. I rechecked that, and insertions are really very very > very slow. It seems like a bug. > >>>> Okay, now for some badness. I've restored a database containing 2 >>>> tables, one 318MB, another 24kB. The 318MB table contains 5 million >>>> rows with a sequential id column. I get a problem if I try to delete >>>> many rows from it: >>>> # delete from contacts where id % 3 != 0 ; >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>> >>> I didn't manage to reproduce this. Thom, could you describe exact steps >>> to reproduce this issue please? >> >> Sure, I used my pg_rep_test tool to create a primary (pg_rep_test >> -r0), which creates an instance with a custom config, which is as >> follows: >> >> shared_buffers = 8MB >> max_connections = 7 >> wal_level = 'hot_standby' >> cluster_name = 'primary' >> max_wal_senders = 3 >> wal_keep_segments = 6 >> >> Then create a pgbench data set (I didn't originally use pgbench, but >> you can get the same results with it): >> >> createdb -p 5530 pgbench >> pgbench -p 5530 -i -s 100 pgbench >> >> And delete some stuff: >> >> thom@swift:~/Development/test$ psql -p 5530 pgbench >> Timing is on. >> psql (9.6devel) >> Type "help" for help. >> >> >> ➤ psql://thom@[local]:5530/pgbench >> >> # DELETE FROM pgbench_accounts WHERE aid % 3 != 0; >> WARNING: out of shared memory >> WARNING: out of shared memory >> WARNING: out of shared memory >> WARNING: out of shared memory >> WARNING: out of shared memory >> WARNING: out of shared memory >> WARNING: out of shared memory >> ... >> WARNING: out of shared memory >> WARNING: out of shared memory >> DELETE 6666667 >> Time: 22218.804 ms >> >> There were 358 lines of that warning message. I don't get these >> messages without the patch. >> >> Thom > > > Thank you for this report. > I tried to reproduce it, but I couldn't. Debug will be much easier now. > > I hope I'll fix these issueswithin the next few days. > > BTW, I found a dummy mistake, the previous patch contains some unrelated > changes. I fixed it in the new version (attached). Thanks. Well I've tested this latest patch, and the warnings are no longer generated. However, the index sizes show that the patch doesn't seem to be doing its job, so I'm wondering if you removed too much from it. Thom
29.01.2016 20:43, Thom Brown: > On 29 January 2016 at 16:50, Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: >> 29.01.2016 19:01, Thom Brown: >>> On 29 January 2016 at 15:47, Aleksander Alekseev >>> <a.alekseev@postgrespro.ru> wrote: >>>> I tested this patch on x64 and ARM servers for a few hours today. The >>>> only problem I could find is that INSERT works considerably slower after >>>> applying a patch. Beside that everything looks fine - no crashes, tests >>>> pass, memory doesn't seem to leak, etc. >> Thank you for testing. I rechecked that, and insertions are really very very >> very slow. It seems like a bug. >> >>>>> Okay, now for some badness. I've restored a database containing 2 >>>>> tables, one 318MB, another 24kB. The 318MB table contains 5 million >>>>> rows with a sequential id column. I get a problem if I try to delete >>>>> many rows from it: >>>>> # delete from contacts where id % 3 != 0 ; >>>>> WARNING: out of shared memory >>>>> WARNING: out of shared memory >>>>> WARNING: out of shared memory >>>> I didn't manage to reproduce this. Thom, could you describe exact steps >>>> to reproduce this issue please? >>> Sure, I used my pg_rep_test tool to create a primary (pg_rep_test >>> -r0), which creates an instance with a custom config, which is as >>> follows: >>> >>> shared_buffers = 8MB >>> max_connections = 7 >>> wal_level = 'hot_standby' >>> cluster_name = 'primary' >>> max_wal_senders = 3 >>> wal_keep_segments = 6 >>> >>> Then create a pgbench data set (I didn't originally use pgbench, but >>> you can get the same results with it): >>> >>> createdb -p 5530 pgbench >>> pgbench -p 5530 -i -s 100 pgbench >>> >>> And delete some stuff: >>> >>> thom@swift:~/Development/test$ psql -p 5530 pgbench >>> Timing is on. >>> psql (9.6devel) >>> Type "help" for help. >>> >>> >>> ➤ psql://thom@[local]:5530/pgbench >>> >>> # DELETE FROM pgbench_accounts WHERE aid % 3 != 0; >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> ... >>> WARNING: out of shared memory >>> WARNING: out of shared memory >>> DELETE 6666667 >>> Time: 22218.804 ms >>> >>> There were 358 lines of that warning message. I don't get these >>> messages without the patch. >>> >>> Thom >> Thank you for this report. >> I tried to reproduce it, but I couldn't. Debug will be much easier now. >> >> I hope I'll fix these issueswithin the next few days. >> >> BTW, I found a dummy mistake, the previous patch contains some unrelated >> changes. I fixed it in the new version (attached). > Thanks. Well I've tested this latest patch, and the warnings are no > longer generated. However, the index sizes show that the patch > doesn't seem to be doing its job, so I'm wondering if you removed too > much from it. Huh, this patch seems to be enchanted) It works fine for me. Did you perform "make distclean"? Anyway, I'll send a new version soon. I just write here to say that I do not disappear and I do remember about the issue. I even almost fixed the insert speed problem. But I'm very very busy this week. I'll send an updated patch next week as soon as possible. Thank you for attention to this work. -- Anastasia Lubennikova Postgres Professional:http://www.postgrespro.com The Russian Postgres Company
On 2 February 2016 at 11:47, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > > > 29.01.2016 20:43, Thom Brown: > >> On 29 January 2016 at 16:50, Anastasia Lubennikova >> <a.lubennikova@postgrespro.ru> wrote: >>> >>> 29.01.2016 19:01, Thom Brown: >>>> >>>> On 29 January 2016 at 15:47, Aleksander Alekseev >>>> <a.alekseev@postgrespro.ru> wrote: >>>>> >>>>> I tested this patch on x64 and ARM servers for a few hours today. The >>>>> only problem I could find is that INSERT works considerably slower >>>>> after >>>>> applying a patch. Beside that everything looks fine - no crashes, tests >>>>> pass, memory doesn't seem to leak, etc. >>> >>> Thank you for testing. I rechecked that, and insertions are really very >>> very >>> very slow. It seems like a bug. >>> >>>>>> Okay, now for some badness. I've restored a database containing 2 >>>>>> tables, one 318MB, another 24kB. The 318MB table contains 5 million >>>>>> rows with a sequential id column. I get a problem if I try to delete >>>>>> many rows from it: >>>>>> # delete from contacts where id % 3 != 0 ; >>>>>> WARNING: out of shared memory >>>>>> WARNING: out of shared memory >>>>>> WARNING: out of shared memory >>>>> >>>>> I didn't manage to reproduce this. Thom, could you describe exact steps >>>>> to reproduce this issue please? >>>> >>>> Sure, I used my pg_rep_test tool to create a primary (pg_rep_test >>>> -r0), which creates an instance with a custom config, which is as >>>> follows: >>>> >>>> shared_buffers = 8MB >>>> max_connections = 7 >>>> wal_level = 'hot_standby' >>>> cluster_name = 'primary' >>>> max_wal_senders = 3 >>>> wal_keep_segments = 6 >>>> >>>> Then create a pgbench data set (I didn't originally use pgbench, but >>>> you can get the same results with it): >>>> >>>> createdb -p 5530 pgbench >>>> pgbench -p 5530 -i -s 100 pgbench >>>> >>>> And delete some stuff: >>>> >>>> thom@swift:~/Development/test$ psql -p 5530 pgbench >>>> Timing is on. >>>> psql (9.6devel) >>>> Type "help" for help. >>>> >>>> >>>> ➤ psql://thom@[local]:5530/pgbench >>>> >>>> # DELETE FROM pgbench_accounts WHERE aid % 3 != 0; >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> ... >>>> WARNING: out of shared memory >>>> WARNING: out of shared memory >>>> DELETE 6666667 >>>> Time: 22218.804 ms >>>> >>>> There were 358 lines of that warning message. I don't get these >>>> messages without the patch. >>>> >>>> Thom >>> >>> Thank you for this report. >>> I tried to reproduce it, but I couldn't. Debug will be much easier now. >>> >>> I hope I'll fix these issueswithin the next few days. >>> >>> BTW, I found a dummy mistake, the previous patch contains some unrelated >>> changes. I fixed it in the new version (attached). >> >> Thanks. Well I've tested this latest patch, and the warnings are no >> longer generated. However, the index sizes show that the patch >> doesn't seem to be doing its job, so I'm wondering if you removed too >> much from it. > > > Huh, this patch seems to be enchanted) It works fine for me. Did you perform > "make distclean"? Yes. Just tried it again: git clean -fd git stash make distclean patch -p1 < ~/Downloads/btree_compression_2.0.patch ../dopg.sh (script I've always used to build with) pg_ctl start createdb pgbench pgbench -i -s 100 pgbench $ psql pgbench Timing is on. psql (9.6devel) Type "help" for help. 
➤ psql://thom@[local]:5488/pgbench

# \di+
                                  List of relations
 Schema |         Name          | Type  | Owner |      Table       |  Size  | Description
--------+-----------------------+-------+-------+------------------+--------+-------------
 public | pgbench_accounts_pkey | index | thom  | pgbench_accounts | 214 MB |
 public | pgbench_branches_pkey | index | thom  | pgbench_branches | 24 kB  |
 public | pgbench_tellers_pkey  | index | thom  | pgbench_tellers  | 48 kB  |
(3 rows)

Previously, this would show an index size of 87MB for pgbench_accounts_pkey.

> Anyway, I'll send a new version soon.
> I just write here to say that I do not disappear and I do remember about the
> issue.
> I even almost fixed the insert speed problem. But I'm very very busy this
> week.
> I'll send an updated patch next week as soon as possible.

Thanks.

> Thank you for attention to this work.

Thanks for your awesome patches.

Thom
On Tue, Feb 2, 2016 at 3:59 AM, Thom Brown <thom@linux.com> wrote: > public | pgbench_accounts_pkey | index | thom | pgbench_accounts | 214 MB | > public | pgbench_branches_pkey | index | thom | pgbench_branches | 24 kB | > public | pgbench_tellers_pkey | index | thom | pgbench_tellers | 48 kB | I see the same. I use my regular SQL query to see the breakdown of leaf/internal/root pages: postgres=# with tots as ( SELECT count(*) c, avg(live_items) avg_live_items, avg(dead_items) avg_dead_items, u.type, r.oidfrom (select c.oid, c.relpages, generate_series(1, c.relpages - 1) i from pg_index i join pg_opclass op on i.indclass[0] = op.oid join pg_am am on op.opcmethod = am.oid join pg_class c on i.indexrelid= c.oid where am.amname = 'btree') r, lateral (select * from bt_page_stats(r.oid::regclass::text,i)) u group by r.oid, type) select ct.relname table_name, tots.oid::regclass::text index_name, (select relpages - 1 from pg_class c where c.oid = tots.oid)non_meta_pages, upper(type) page_type, c npages, to_char(avg_live_items, '990.999'), to_char(avg_dead_items, '990.999'),to_char(c/sum(c) over(partition by tots.oid) * 100, '990.999') || ' %' as prop_of_index from tots join pg_index i on i.indexrelid = tots.oid join pg_class ct on ct.oid = i.indrelid where tots.oid= 'pgbench_accounts_pkey'::regclass order by ct.relnamespace, table_name, index_name, npages, type; table_name │ index_name │ non_meta_pages │ page_type │ npages │ to_char │ to_char │ prop_of_index ──────────────────┼───────────────────────┼────────────────┼───────────┼────────┼──────────┼──────────┼───────────────pgbench_accounts │pgbench_accounts_pkey │ 27,421 │ R │ 1 │ 97.000 │ 0.000 │ 0.004 %pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ I │ 97 │ 282.670 │ 0.000 │ 0.354 %pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ L │ 27,323 │ 366.992 │ 0.000 │ 99.643 % (3 rows) But this looks healthy -- I see the same with master. And since the accounts table is listed as 1281 MB, this looks like a plausible ratio in the size of the table to its primary index (which I would not say is true of an 87MB primary key index). Are you sure you have the details right, Thom? -- Peter Geoghegan
On 4 February 2016 at 15:07, Peter Geoghegan <pg@heroku.com> wrote: > On Tue, Feb 2, 2016 at 3:59 AM, Thom Brown <thom@linux.com> wrote: >> public | pgbench_accounts_pkey | index | thom | pgbench_accounts | 214 MB | >> public | pgbench_branches_pkey | index | thom | pgbench_branches | 24 kB | >> public | pgbench_tellers_pkey | index | thom | pgbench_tellers | 48 kB | > > I see the same. > > I use my regular SQL query to see the breakdown of leaf/internal/root pages: > > postgres=# with tots as ( > SELECT count(*) c, > avg(live_items) avg_live_items, > avg(dead_items) avg_dead_items, > u.type, > r.oid > from (select c.oid, > c.relpages, > generate_series(1, c.relpages - 1) i > from pg_index i > join pg_opclass op on i.indclass[0] = op.oid > join pg_am am on op.opcmethod = am.oid > join pg_class c on i.indexrelid = c.oid > where am.amname = 'btree') r, > lateral (select * from bt_page_stats(r.oid::regclass::text, i)) u > group by r.oid, type) > select ct.relname table_name, > tots.oid::regclass::text index_name, > (select relpages - 1 from pg_class c where c.oid = tots.oid) non_meta_pages, > upper(type) page_type, > c npages, > to_char(avg_live_items, '990.999'), > to_char(avg_dead_items, '990.999'), > to_char(c/sum(c) over(partition by tots.oid) * 100, '990.999') || ' > %' as prop_of_index > from tots > join pg_index i on i.indexrelid = tots.oid > join pg_class ct on ct.oid = i.indrelid > where tots.oid = 'pgbench_accounts_pkey'::regclass > order by ct.relnamespace, table_name, index_name, npages, type; > table_name │ index_name │ non_meta_pages │ page_type > │ npages │ to_char │ to_char │ prop_of_index > ──────────────────┼───────────────────────┼────────────────┼───────────┼────────┼──────────┼──────────┼─────────────── > pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ R > │ 1 │ 97.000 │ 0.000 │ 0.004 % > pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ I > │ 97 │ 282.670 │ 0.000 │ 0.354 % > pgbench_accounts │ pgbench_accounts_pkey │ 27,421 │ L > │ 27,323 │ 366.992 │ 0.000 │ 99.643 % > (3 rows) > > But this looks healthy -- I see the same with master. And since the > accounts table is listed as 1281 MB, this looks like a plausible ratio > in the size of the table to its primary index (which I would not say > is true of an 87MB primary key index). > > Are you sure you have the details right, Thom? *facepalm* No, I'm not. I've just realised that all I've been checking is the primary key expecting it to change in size, which is, of course, nonsense. I should have been creating an index on the bid field of pgbench_accounts and reviewing the size of that. Now I've checked it with the latest patch, and can see it working fine. Apologies for the confusion. Thom
On Thu, Feb 4, 2016 at 8:25 AM, Thom Brown <thom@linux.com> wrote: > > No, I'm not. I've just realised that all I've been checking is the > primary key expecting it to change in size, which is, of course, > nonsense. I should have been creating an index on the bid field of > pgbench_accounts and reviewing the size of that. Right. Because, apart from everything else, unique indexes are not currently supported. -- Peter Geoghegan
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > I fixed it in the new version (attached). Some quick remarks on your V2.0: * Seems unnecessary that _bt_binsrch() is passed a real pointer by all callers. Maybe the one current posting list caller _bt_findinsertloc(), or its caller, _bt_doinsert(), should do this work itself: @@ -373,7 +377,17 @@ _bt_binsrch(Relation rel, * scan key), which could be the last slot + 1. */ if (P_ISLEAF(opaque)) + { + if (low <= PageGetMaxOffsetNumber(page)) + { + IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low)); + /* one excessive check of equality. for possible posting tuple update or creation */ + if ((_bt_compare(rel, keysz, scankey, page, low) == 0) + && (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page))) + *updposing = true; + } return low; + } * ISTM that you should not use _bt_compare() above, in any case. Consider this: postgres=# select 5.0 = 5.000;?column? ──────────t (1 row) B-Tree operator class indicates equality here. And yet, users will expect to see the original value in an index-only scan, including the trailing zeroes as they were originally input. So this should be a bit closer to HeapSatisfiesHOTandKeyUpdate() (actually, heap_tuple_attr_equals()), which looks for strict binary equality for similar reasons. * Is this correct?: @@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup) * it off the old page, not the new one, in case we are not at leaf * level. */ - state->btps_minkey = CopyIndexTuple(oitup); + ItemId iihk = PageGetItemId(opage, P_HIKEY); + IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk); + state->btps_minkey = CopyIndexTuple(hikey); How this code has changed from the master branch is not clear to me. I understand that this code in incomplete/draft: +#define MaxPackedIndexTuplesPerPage \ + ((int) ((BLCKSZ - SizeOfPageHeaderData) / \ + (sizeof(ItemPointerData)))) But why is it different to the old (actually unchanged) MaxIndexTuplesPerPage? I would like to see comments explaining your understanding, even if they are quite rough. Why did GIN never require this change to a generic header (itup.h)? Should such a change live in that generic header file, and not another one more localized to nbtree? * More explanation of the design would be nice. I suggest modifying the nbtree README file, so it's easy to tell what the current design is. It's hard to follow this from the thread. When I reviewed Heikki's B-Tree patches from a couple of years ago, we spent ~75% of the time on design, and only ~25% on code. * I have a paranoid feeling that the deletion locking protocol (VACUUMing index tuples concurrently and safely) may need special consideration here. Basically, with the B-Tree code, there are several complicated locking protocols, like for page splits, page deletion, and interlocking with vacuum ("super exclusive lock" stuff). These are why the B-Tree code is complicated in general, and it's very important to pin down exactly how we deal with each. Ideally, you'd have an explanation for why your code was correct in each of these existing cases (especially deletion). With very complicated and important code like this, it's often wise to be very clear about when we are talking about your design, and when we are talking about your code. It's generally too hard to review both at the same time. 
Ideally, when you talk about your design, you'll be able to say things like "it's clear that this existing thing is correct; at least we have no complaints from the field. Therefore, it must be true that my new technique is also correct, because it makes that general situation no worse". Obviously that kind of rigor is just something we aspire to, and still fall short of at times. Still, it would be nice to specifically see a reason why the new code isn't special from the point of view of the super-exclusive lock thing (which is what I mean by deletion locking protocol + special consideration). Or why it is special, but that's okay, or whatever. This style of review is normal when writing B-Tree code. Some other things don't need this rigor, or have no invariants that need to be respected/used. Maybe this is obvious to you already, but it isn't obvious to me. It's okay if you don't know why, but knowing that you don't have a strong opinion about something is itself useful information. * I see you disabled the LP_DEAD thing; why? Just because that made bugs go away? * Have you done much stress testing? Using pgbench with many concurrent VACUUM FREEZE operations would be a good idea, if you haven't already, because that is insistent about getting super exclusive locks, unlike regular VACUUM. * Are you keeping the restriction of 1/3 of a buffer page, but that just includes the posting list now? That's the kind of detail I'd like to see in the README now. * Why not support unique indexes? The obvious answer is that it isn't worth it, but why? How useful would that be (a bit, just not enough)? What's the trade-off? Anyway, this is really cool work; I have often thought that we don't have nearly enough people thinking about how to optimize B-Tree indexing. It is hard, but so is anything worthwhile. That's all I have for now. Just a quick review focused on code and correctness (and not on the benefits). I want to do more on this, especially the benefits, because it deserves more attention. -- Peter Geoghegan
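Peter's `select 5.0 = 5.000` point above is easy to reproduce outside the server. The toy program below only illustrates the distinction between operator-class equality and binary (image) equality; it is plain C, not PostgreSQL code.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *a = "5.0", *b = "5.000";    /* two numeric input texts */

    int op_equal  = (atof(a) == atof(b));   /* "operator class" equality     */
    int img_equal = (strcmp(a, b) == 0);    /* binary/image equality         */

    /* prints: operator equal: 1, image equal: 0 -- equal keys can still
     * have different stored representations, so posting-list merging must
     * be stricter than _bt_compare()-style equality. */
    printf("operator equal: %d, image equal: %d\n", op_equal, img_equal);
    return 0;
}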
04.02.2016 20:16, Peter Geoghegan:
On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
I fixed it in the new version (attached).
Thank you for the review.
At last, there is a new patch version 3.0. After some refactoring it looks much better.
I described all details of the compression in this document https://goo.gl/50O8Q0 (the same text without pictures is attached in btc_readme_1.0.txt).
Consider it as a rough copy of readme. It contains some notes about tricky moments of implementation and questions about future work.
Please don't hesitate to comment it.
Thank you for the notice. Fixed.

Some quick remarks on your V2.0:

* Seems unnecessary that _bt_binsrch() is passed a real pointer by all callers. Maybe the one current posting list caller _bt_findinsertloc(), or its caller, _bt_doinsert(), should do this work itself:

@@ -373,7 +377,17 @@ _bt_binsrch(Relation rel,
	 * scan key), which could be the last slot + 1.
	 */
	if (P_ISLEAF(opaque))
+	{
+		if (low <= PageGetMaxOffsetNumber(page))
+		{
+			IndexTuple oitup = (IndexTuple) PageGetItem(page, PageGetItemId(page, low));
+			/* one excessive check of equality. for possible posting tuple update or creation */
+			if ((_bt_compare(rel, keysz, scankey, page, low) == 0)
+				&& (IndexTupleSize(oitup) + sizeof(ItemPointerData) < BTMaxItemSize(page)))
+				*updposing = true;
+		}
		return low;
+	}

* ISTM that you should not use _bt_compare() above, in any case. Consider this:

postgres=# select 5.0 = 5.000;
 ?column?
──────────
 t
(1 row)

B-Tree operator class indicates equality here. And yet, users will expect to see the original value in an index-only scan, including the trailing zeroes as they were originally input. So this should be a bit closer to HeapSatisfiesHOTandKeyUpdate() (actually, heap_tuple_attr_equals()), which looks for strict binary equality for similar reasons.
Yes, it is. I completed the comment above.

* Is this correct?:

@@ -555,7 +662,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
	 * it off the old page, not the new one, in case we are not at leaf
	 * level.
	 */
-	state->btps_minkey = CopyIndexTuple(oitup);
+	ItemId iihk = PageGetItemId(opage, P_HIKEY);
+	IndexTuple hikey = (IndexTuple) PageGetItem(opage, iihk);
+	state->btps_minkey = CopyIndexTuple(hikey);

How this code has changed from the master branch is not clear to me.
I agree.

I understand that this code is incomplete/draft:

+#define MaxPackedIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
+			(sizeof(ItemPointerData))))

But why is it different to the old (actually unchanged) MaxIndexTuplesPerPage? I would like to see comments explaining your understanding, even if they are quite rough. Why did GIN never require this change to a generic header (itup.h)? Should such a change live in that generic header file, and not another one more localized to nbtree?
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
18.02.2016 20:18, Anastasia Lubennikova:
04.02.2016 20:16, Peter Geoghegan:

Sorry, previous patch was dirty. Hotfix is attached.

On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
I fixed it in the new version (attached).
Thank you for the review.
At last, there is a new patch version 3.0. After some refactoring it looks much better.
I described all details of the compression in this document https://goo.gl/50O8Q0 (the same text without pictures is attached in btc_readme_1.0.txt).
Consider it as a rough copy of readme. It contains some notes about tricky moments of implementation and questions about future work.
Please don't hesitate to comment it.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Hi Anastasia, On 2/18/16 12:29 PM, Anastasia Lubennikova wrote: > 18.02.2016 20:18, Anastasia Lubennikova: >> 04.02.2016 20:16, Peter Geoghegan: >>> On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova >>> <a.lubennikova@postgrespro.ru> wrote: >>>> I fixed it in the new version (attached). >> >> Thank you for the review. >> At last, there is a new patch version 3.0. After some refactoring it >> looks much better. >> I described all details of the compression in this document >> https://goo.gl/50O8Q0 (the same text without pictures is attached in >> btc_readme_1.0.txt). >> Consider it as a rough copy of readme. It contains some notes about >> tricky moments of implementation and questions about future work. >> Please don't hesitate to comment it. >> > Sorry, previous patch was dirty. Hotfix is attached. This looks like an extremely valuable optimization for btree indexes but unfortunately it is not getting a lot of attention. It still applies cleanly for anyone interested in reviewing. It's not clear to me that you answered all of Peter's questions in [1]. I understand that you've provided a README but itmay not be clear if the answers are in there (and where). Also, at the end of the README it says: 13. Xlog. TODO. Does that mean the patch is not yet complete? Thanks, -- -David david@pgmasters.net [1] http://www.postgresql.org/message-id/CAM3SWZQ3_PLQCH4w7uQ8q_f2t4HEseKTr2n0rQ5pxA18OeRTJw@mail.gmail.com
14.03.2016 16:02, David Steele: > Hi Anastasia, > > On 2/18/16 12:29 PM, Anastasia Lubennikova wrote: >> 18.02.2016 20:18, Anastasia Lubennikova: >>> 04.02.2016 20:16, Peter Geoghegan: >>>> On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova >>>> <a.lubennikova@postgrespro.ru> wrote: >>>>> I fixed it in the new version (attached). >>> >>> Thank you for the review. >>> At last, there is a new patch version 3.0. After some refactoring it >>> looks much better. >>> I described all details of the compression in this document >>> https://goo.gl/50O8Q0 (the same text without pictures is attached in >>> btc_readme_1.0.txt). >>> Consider it as a rough copy of readme. It contains some notes about >>> tricky moments of implementation and questions about future work. >>> Please don't hesitate to comment it. >>> >> Sorry, previous patch was dirty. Hotfix is attached. > > This looks like an extremely valuable optimization for btree indexes > but unfortunately it is not getting a lot of attention. It still > applies cleanly for anyone interested in reviewing. > Thank you for attention. I would be indebted to all reviewers, who can just try this patch on real data and workload (except WAL for now). B-tree needs very much testing. > It's not clear to me that you answered all of Peter's questions in > [1]. I understand that you've provided a README but it may not be > clear if the answers are in there (and where). I described in README all the points Peter asked. But I see that it'd be better to answer directly. Thanks for reminding, I'll do it tomorrow. > Also, at the end of the README it says: > > 13. Xlog. TODO. > > Does that mean the patch is not yet complete? Yes, you're right. Frankly speaking, I supposed that someone will help me with that stuff, but now I almost completed it. I'll send updated patch in the next letter. I'm still doubtful about some patch details. I mentioned them in readme (bold type). But they are mostly about future improvements. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Please, find the new version of the patch attached. Now it has WAL functionality. Detailed description of the feature you can find in README draft https://goo.gl/50O8Q0 This patch is pretty complicated, so I ask everyone, who interested in this feature, to help with reviewing and testing it. I will be grateful for any feedback. But please, don't complain about code style, it is still work in progress. Next things I'm going to do: 1. More debugging and testing. I'm going to attach in next message couple of sql scripts for testing. 2. Fix NULLs processing 3. Add a flag into pg_index, that allows to enable/disable compression for each particular index. 4. Recheck locking considerations. I tried to write code as less invasive as possible, but we need to make sure that algorithm is still correct. 5. Change BTMaxItemSize 6. Bring back microvacuum functionality. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 18.03.2016 20:19, Anastasia Lubennikova wrote:
> Please, find the new version of the patch attached. Now it has WAL functionality.
>
> Detailed description of the feature you can find in README draft https://goo.gl/50O8Q0
>
> This patch is pretty complicated, so I ask everyone, who interested in this feature,
> to help with reviewing and testing it. I will be grateful for any feedback.
> But please, don't complain about code style, it is still work in progress.
>
> Next things I'm going to do:
> 1. More debugging and testing. I'm going to attach in next message couple of sql scripts for testing.
> 2. Fix NULLs processing
> 3. Add a flag into pg_index, that allows to enable/disable compression for each particular index.
> 4. Recheck locking considerations. I tried to write code as less invasive as possible, but we need to make sure that algorithm is still correct.
> 5. Change BTMaxItemSize
> 6. Bring back microvacuum functionality.

Hi, hackers.

It's my first review, so do not be strict to me.

I have tested this patch on the following table:

create table message
(
    id serial,
    usr_id integer,
    text text
);
CREATE INDEX message_usr_id ON message (usr_id);

The table has 10000000 records.

I found the following: the fewer unique keys, the smaller the index.

The next two tables demonstrate it.

New B-tree
Count of unique keys (usr_id), index's size, time of creation
10000000   214 MB   00:00:34.193441
3333333    214 MB   00:00:45.731173
2000000    129 MB   00:00:41.445876
1000000    129 MB   00:00:38.455616
100000     86 MB    00:00:40.887626
10000      79 MB    00:00:47.199774

Old B-tree
Count of unique keys (usr_id), index's size, time of creation
10000000   214 MB   00:00:35.043677
3333333    286 MB   00:00:40.922845
2000000    300 MB   00:00:46.454846
1000000    278 MB   00:00:42.323525
100000     287 MB   00:00:47.438132
10000      280 MB   00:01:00.307873

I inserted data randomly and sequentially; it did not influence the index's size.
The time of select, insert and update of random rows is not changed. That is great, but it certainly needs some more detailed study.

Alexander Popov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Mar 18, 2016 at 1:19 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Please, find the new version of the patch attached. Now it has WAL > functionality. > > Detailed description of the feature you can find in README draft > https://goo.gl/50O8Q0 > > This patch is pretty complicated, so I ask everyone, who interested in this > feature, > to help with reviewing and testing it. I will be grateful for any feedback. > But please, don't complain about code style, it is still work in progress. > > Next things I'm going to do: > 1. More debugging and testing. I'm going to attach in next message couple of > sql scripts for testing. > 2. Fix NULLs processing > 3. Add a flag into pg_index, that allows to enable/disable compression for > each particular index. > 4. Recheck locking considerations. I tried to write code as less invasive as > possible, but we need to make sure that algorithm is still correct. > 5. Change BTMaxItemSize > 6. Bring back microvacuum functionality. I really like this idea, and the performance results seem impressive, but I think we should push this out to 9.7. A btree patch that didn't have WAL support until two and a half weeks into the final CommitFest just doesn't seem to me like a good candidate. First, as a general matter, if a patch isn't code-complete at the start of a CommitFest, it's reasonable to say that it should be reviewed but not necessarily committed in that CommitFest. This patch has had some review, but I'm not sure how deep that review is, and I think it's had no code review at all of the WAL logging changes, which were submitted only a week ago, well after the CF deadline. Second, the btree AM is a particularly poor place to introduce possibly destabilizing changes. Everybody depends on it, all the time, for everything. And despite new tools like amcheck, it's not a particularly easy thing to debug. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 24, 2016 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Mar 18, 2016 at 1:19 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Please, find the new version of the patch attached. Now it has WAL
> functionality.
>
> Detailed description of the feature you can find in README draft
> https://goo.gl/50O8Q0
>
> This patch is pretty complicated, so I ask everyone, who interested in this
> feature,
> to help with reviewing and testing it. I will be grateful for any feedback.
> But please, don't complain about code style, it is still work in progress.
>
> Next things I'm going to do:
> 1. More debugging and testing. I'm going to attach in next message couple of
> sql scripts for testing.
> 2. Fix NULLs processing
> 3. Add a flag into pg_index, that allows to enable/disable compression for
> each particular index.
> 4. Recheck locking considerations. I tried to write code as less invasive as
> possible, but we need to make sure that algorithm is still correct.
> 5. Change BTMaxItemSize
> 6. Bring back microvacuum functionality.
I really like this idea, and the performance results seem impressive,
but I think we should push this out to 9.7. A btree patch that didn't
have WAL support until two and a half weeks into the final CommitFest
just doesn't seem to me like a good candidate. First, as a general
matter, if a patch isn't code-complete at the start of a CommitFest,
it's reasonable to say that it should be reviewed but not necessarily
committed in that CommitFest. This patch has had some review, but I'm
not sure how deep that review is, and I think it's had no code review
at all of the WAL logging changes, which were submitted only a week
ago, well after the CF deadline. Second, the btree AM is a
particularly poor place to introduce possibly destabilizing changes.
Everybody depends on it, all the time, for everything. And despite
new tools like amcheck, it's not a particularly easy thing to debug.
It's all true. But:
1) It's a great feature many users dream about.
2) Patch is not very big.
3) The patch doesn't introduce significant infrastructural changes. It just changes some well-isolated places.
Let's give it a chance. I've signed up as an additional reviewer and I'll do my best to spot all possible issues in this patch.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I really like this idea, and the performance results seem impressive, > but I think we should push this out to 9.7. A btree patch that didn't > have WAL support until two and a half weeks into the final CommitFest > just doesn't seem to me like a good candidate. First, as a general > matter, if a patch isn't code-complete at the start of a CommitFest, > it's reasonable to say that it should be reviewed but not necessarily > committed in that CommitFest. This patch has had some review, but I'm > not sure how deep that review is, and I think it's had no code review > at all of the WAL logging changes, which were submitted only a week > ago, well after the CF deadline. Second, the btree AM is a > particularly poor place to introduce possibly destabilizing changes. > Everybody depends on it, all the time, for everything. And despite > new tools like amcheck, it's not a particularly easy thing to debug. Regrettably, I must agree. I don't see a plausible path to commit for this patch in the ongoing CF. I think that Anastasia did an excellent job here, and I wish I could have been of greater help sooner. Nevertheless, it would be unwise to commit this given the maturity of the code. There have been very few instances of performance improvements to the B-Tree code for as long as I've been interested, because it's so hard, and the standard is so high. The only example I can think of from the last few years is Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d both of which were far less invasive, and Simon's commit c7111d11b1, which we just outright reverted from 9.5 due to subtle bugs (and even that was significantly less invasive than this patch). Improving nbtree is something that requires several rounds of expert review, and that's something that's in short supply for the B-Tree code in particular. I think that a new testing strategy is needed to make this easier, and I hope to get that going with amcheck. I need help with formalizing a "testing first" approach for improving the B-Tree code, because I think it's the only way that we can move forward with projects like this. It's *incredibly* hard to push forward patches like this given our current, limited testing strategy. -- Peter Geoghegan
On 3/24/16 10:21 AM, Alexander Korotkov wrote: > 1) It's a great feature many users dream about. Doesn't matter if it starts eating their data... > 2) Patch is not very big. > 3) Patch doesn't introduce significant infrastructural changes. It just > change some well-isolated placed. It doesn't really matter how big the patch is, it's a question of "What did the patch fail to consider?". With something as complicated as the btree code, there's ample opportunities for missing things. (And FWIW, I'd argue that a 51kB patch is certainly not small, and a patch that is doing things in critical sections isn't terribly isolated). I do think this will be a great addition, but it's just too late to be adding this to 9.6. (BTW, I'm getting bounces from a.lebedev@postgrespro.ru, as well as postmaster@. I emailed info@postgrespro.ru about this but never heard back.) -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
25.03.2016 01:12, Peter Geoghegan: > On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I really like this idea, and the performance results seem impressive, >> but I think we should push this out to 9.7. A btree patch that didn't >> have WAL support until two and a half weeks into the final CommitFest >> just doesn't seem to me like a good candidate. First, as a general >> matter, if a patch isn't code-complete at the start of a CommitFest, >> it's reasonable to say that it should be reviewed but not necessarily >> committed in that CommitFest. You're right. Frankly, I thought that someone will help me with the path, but I had to finish it myself. *off-topic* I wonder, if we can add new flag to commitfest. Something like "Needs assistance", which will be used to mark big and complicated patches in progress. While "Needs review" means that the patch is almost ready and only requires the final review. >> This patch has had some review, but I'm >> not sure how deep that review is, and I think it's had no code review >> at all of the WAL logging changes, which were submitted only a week >> ago, well after the CF deadline. Second, the btree AM is a >> particularly poor place to introduce possibly destabilizing changes. >> Everybody depends on it, all the time, for everything. And despite >> new tools like amcheck, it's not a particularly easy thing to debug. > Regrettably, I must agree. I don't see a plausible path to commit for > this patch in the ongoing CF. > > I think that Anastasia did an excellent job here, and I wish I could > have been of greater help sooner. Nevertheless, it would be unwise to > commit this given the maturity of the code. There have been very few > instances of performance improvements to the B-Tree code for as long > as I've been interested, because it's so hard, and the standard is so > high. The only example I can think of from the last few years is > Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d both of which > were far less invasive, and Simon's commit c7111d11b1, which we just > outright reverted from 9.5 due to subtle bugs (and even that was > significantly less invasive than this patch). Improving nbtree is > something that requires several rounds of expert review, and that's > something that's in short supply for the B-Tree code in particular. I > think that a new testing strategy is needed to make this easier, and I > hope to get that going with amcheck. I need help with formalizing a > "testing first" approach for improving the B-Tree code, because I > think it's the only way that we can move forward with projects like > this. It's *incredibly* hard to push forward patches like this given > our current, limited testing strategy. Unfortunately, I must agree. This patch seems to be far from final version until the feature freeze. I'll move it to the future commitfest. Anyway it means, that now we have more time to improve the patch. If you have any ideas related to this patch like prefix/suffix compression, I'll be glad to discuss them. Same for any other ideas of B-tree optimization. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Thu, Mar 24, 2016 at 7:12 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I really like this idea, and the performance results seem impressive, >> but I think we should push this out to 9.7. A btree patch that didn't >> have WAL support until two and a half weeks into the final CommitFest >> just doesn't seem to me like a good candidate. First, as a general >> matter, if a patch isn't code-complete at the start of a CommitFest, >> it's reasonable to say that it should be reviewed but not necessarily >> committed in that CommitFest. This patch has had some review, but I'm >> not sure how deep that review is, and I think it's had no code review >> at all of the WAL logging changes, which were submitted only a week >> ago, well after the CF deadline. Second, the btree AM is a >> particularly poor place to introduce possibly destabilizing changes. >> Everybody depends on it, all the time, for everything. And despite >> new tools like amcheck, it's not a particularly easy thing to debug. > > Regrettably, I must agree. I don't see a plausible path to commit for > this patch in the ongoing CF. > > I think that Anastasia did an excellent job here, and I wish I could > have been of greater help sooner. Nevertheless, it would be unwise to > commit this given the maturity of the code. There have been very few > instances of performance improvements to the B-Tree code for as long > as I've been interested, because it's so hard, and the standard is so > high. The only example I can think of from the last few years is > Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d both of which > were far less invasive, and Simon's commit c7111d11b1, which we just > outright reverted from 9.5 due to subtle bugs (and even that was > significantly less invasive than this patch). Improving nbtree is > something that requires several rounds of expert review, and that's > something that's in short supply for the B-Tree code in particular. I > think that a new testing strategy is needed to make this easier, and I > hope to get that going with amcheck. I need help with formalizing a > "testing first" approach for improving the B-Tree code, because I > think it's the only way that we can move forward with projects like > this. It's *incredibly* hard to push forward patches like this given > our current, limited testing strategy. I've been toying (having gotten nowhere concrete really) with prefix compression myself, I agree that messing with btree code is quite harder than it ought to be. Perhaps trying experimental format changes in a separate experimental am wouldn't be all that bad (say, nxbtree?). People could opt-in to those, by creating the indexes with nxbtree instead of plain btree (say in development environments) and get some testing going without risking much. Normally the same effect should be achievable with mere flags, but since format changes to btree tend to be rather invasive, ensuring the patch doesn't change behavior with the flag off is hard as well, hence the wholly separate am idea.
On 18/03/16 19:19, Anastasia Lubennikova wrote:
> Please, find the new version of the patch attached. Now it has WAL
> functionality.
>
> Detailed description of the feature you can find in README draft
> https://goo.gl/50O8Q0
>
> This patch is pretty complicated, so I ask everyone, who interested in
> this feature,
> to help with reviewing and testing it. I will be grateful for any feedback.
> But please, don't complain about code style, it is still work in progress.
>
> Next things I'm going to do:
> 1. More debugging and testing. I'm going to attach in next message
> couple of sql scripts for testing.
> 2. Fix NULLs processing
> 3. Add a flag into pg_index, that allows to enable/disable compression
> for each particular index.
> 4. Recheck locking considerations. I tried to write code as less
> invasive as possible, but we need to make sure that algorithm is still
> correct.
> 5. Change BTMaxItemSize
> 6. Bring back microvacuum functionality.

I think we should pack the TIDs more tightly, like GIN does with the varbyte encoding. It's tempting to commit this without it for now, and add the compression later, but I'd like to avoid having to deal with multiple binary-format upgrades, so let's figure out the final on-disk format that we want, right from the beginning.

It would be nice to reuse the varbyte encoding code from GIN, but we might not want to use that exact scheme for B-tree. Firstly, an important criterion when we designed GIN's encoding scheme was to avoid expanding on-disk size for any data set, which meant that a TID had to always be encoded in 6 bytes or less. We don't have that limitation with B-tree, because in B-tree, each item is currently stored as a separate IndexTuple, which is much larger. So we are free to choose an encoding scheme that's better at packing some values, at the expense of using more bytes for other values, if we want to. Some analysis on what we want would be nice. (It's still important that removing a TID from the list never makes the list larger, for VACUUM.)

Secondly, to be able to just always enable this feature, without a GUC or reloption, we might need something that's faster for random access than GIN's posting lists. Or perhaps we can just add the setting, but it would be nice to have some more analysis of the worst-case performance before we decide on that.

I find the macros in nbtree.h in the patch quite confusing. They're similar to what we did in GIN, but again we might want to choose differently here. So some discussion on the desired IndexTuple layout is in order. (One clear bug is that using the high bit of BlockNumber for the BT_POSTING flag will fail for a table larger than 2^31 blocks.)

- Heikki
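To make the varbyte idea above concrete, here is a minimal sketch of delta-plus-varbyte encoding for a sorted TID list. It only illustrates the general technique under discussion -- it is not GIN's actual ginpostinglist.c format -- and the helper names are made up:

    #include "postgres.h"
    #include "storage/itemptr.h"

    /* Collapse a heap TID into one 48-bit integer: (block << 16) | offset. */
    static uint64
    tid_to_uint64(ItemPointer tid)
    {
        return ((uint64) ItemPointerGetBlockNumber(tid) << 16) |
            ItemPointerGetOffsetNumber(tid);
    }

    /* Append one unsigned value, 7 data bits per byte, high bit = "more". */
    static unsigned char *
    encode_varbyte(uint64 val, unsigned char *ptr)
    {
        while (val >= 0x80)
        {
            *ptr++ = (unsigned char) ((val & 0x7F) | 0x80);
            val >>= 7;
        }
        *ptr++ = (unsigned char) val;
        return ptr;
    }

    /*
     * Encode 'ntids' sorted TIDs into 'buf', storing each TID after the
     * first as the delta from its predecessor.  Duplicates tend to point
     * at nearby heap pages, so most deltas fit in one or two bytes.
     * Returns the number of bytes used.
     */
    static Size
    encode_tid_list(ItemPointer tids, int ntids, unsigned char *buf)
    {
        unsigned char *ptr = buf;
        uint64      prev = 0;
        int         i;

        for (i = 0; i < ntids; i++)
        {
            uint64      cur = tid_to_uint64(&tids[i]);

            ptr = encode_varbyte(cur - prev, ptr);
            prev = cur;
        }
        return ptr - buf;
    }

Decoding just reverses the process by accumulating the deltas. Note Heikki's point above: whatever scheme is ultimately chosen, removing a TID from the list must never make the encoded list larger.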
On Mon, Jul 4, 2016 at 2:30 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I think we should pack the TIDs more tightly, like GIN does with the varbyte > encoding. It's tempting to commit this without it for now, and add the > compression later, but I'd like to avoid having to deal with multiple > binary-format upgrades, so let's figure out the final on-disk format that we > want, right from the beginning. While the idea of duplicate storage is pretty obviously compelling, there could be other, non-obvious benefits. I think that it could bring further benefits if we could use duplicate storage to change this property of nbtree (this is from the README): """ Lehman and Yao assume that the key range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page. This does not work for nonunique keys (for example, if we have enough equal keys to spread across several leaf pages, there *must* be some equal bounding keys in the first level up). Therefore we assume Ki <= v <= Ki+1 instead. A search that finds exact equality to a bounding key in an upper tree level must descend to the left of that key to ensure it finds any equal keys in the preceding page. An insertion that sees the high key of its target page is equal to the key to be inserted has a choice whether or not to move right, since the new key could go on either page. (Currently, we try to find a page where there is room for the new key without a split.) """ If we could *guarantee* that all keys in the index are unique, then we could maintain the keyspace as L&Y originally described. The practical benefits to this would be: * We wouldn't need to take the extra step described above -- finding a bounding key/separator key that's fully equal to our scankey would no longer necessitate a probably-useless descent to the left of that key. (BTW, I wonder if we could get away with not inserting a downlink into parent when a leaf page split finds an identical IndexTuple in parent, *without* changing the keyspace invariant I mention -- if we're always going to go to the left of an equal-to-scankey key in an internal page, why even have more than one?) * This would make suffix truncation of internal index tuples easier, and that's important. The traditional reason why suffix truncation is important is that it can keep the tree a lot shorter than it would otherwise be. These days, that might not seem that important, because even if you have twice the number of internal pages than strictly necessary, that still isn't that many relative to typical main memory size (and even CPU cache sizes, perhaps). The reason I think it's important these days is that not having suffix truncation makes our "separator keys" overly prescriptive about what part of the keyspace is owned by each internal page. With a pristine index (following REINDEX), this doesn't matter much. But, I think that we get much bigger problems with index bloat due to the poor fan-out that we sometimes see due to not having suffix truncation, *combined* with the page deletion algorithms restriction on deleting internal pages (it can only be done for internal pages with *no* children). Adding another level or two to the B-Tree makes it so that your workload's "sparse deletion patterns" really don't need to be that sparse in order to bloat the B-Tree badly, necessitating a REINDEX to get back to acceptable performance (VACUUM won't do it). 
To avoid this, we should make the internal pages represent the key space in the least restrictive way possible, by applying suffix truncation so that it's much more likely that things will *stay* balanced as churn occurs. This is probably a really bad problem with things like composite indexes over text columns, or indexes with many NULL values. -- Peter Geoghegan
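As a rough sketch of what suffix truncation amounts to here: the separator key placed in the parent only needs as many leading attributes as it takes to distinguish the last tuple on the left half of a split from the first tuple on the right half. The helper index_attrs_equal() below is hypothetical, not actual nbtree code:

    #include "postgres.h"
    #include "access/itup.h"
    #include "utils/rel.h"

    /*
     * Sketch only: how many leading key attributes does a new separator key
     * need in order to separate the two halves of a leaf page split?
     * index_attrs_equal() is a hypothetical helper that compares a single
     * attribute of two index tuples using the opclass comparator.
     */
    static int
    separator_natts_needed(Relation rel, IndexTuple lastleft, IndexTuple firstright)
    {
        int         nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
        int         attnum;

        for (attnum = 1; attnum <= nkeyatts; attnum++)
        {
            if (!index_attrs_equal(rel, lastleft, firstright, attnum))
                return attnum;  /* first attribute that differs is enough */
        }

        /*
         * All key attributes are equal.  With guaranteed-unique keys (as
         * discussed above) this cannot happen; otherwise a heap TID
         * tiebreaker would have to be kept in the separator as well.
         */
        return nkeyatts;
    }

Everything after the first distinguishing attribute can be dropped from the separator, which is what keeps the parent level from being overly prescriptive about which leaf page owns which part of the keyspace.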
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From: Anastasia Lubennikova
The new version of the patch is attached. This version is even simpler than the previous one, thanks to the recent btree design changes and all the feedback I received. I consider it ready for review and testing.

[feature overview]

This patch implements deduplication of btree non-pivot tuples on leaf pages, in a manner similar to GIN index "posting lists".

A non-pivot posting tuple has the following format:

t_tid | t_info | key values | posting_list[]

where the t_tid and t_info fields are used to store meta information about the tuple's posting list. The posting list itself is an array of ItemPointerData.

Currently, compression is applied to all indexes except system indexes, unique indexes, and indexes with included columns.

On insertion, compression is applied not to each tuple, but to the whole page before a split: if the target page is full, we try to compress it.

[benchmark results]

idx ON tbl(c1); the index contains 10000000 integer values. i is the number of distinct values in the index, so i=1 means that all rows have the same key, and i=10000000 means that all keys are different.

i           old size (MB)   new size (MB)
1                     215              88
1000                  215              90
100000                215              71
10000000              214             214

For more, see the attached diagram with the test results.

[future work]

Many things can be improved in this feature. Personally, I'd prefer to keep this patch as small as possible and work on other improvements after the basic part is committed, though I understand that some of these may be considered essential for this patch to be approved.

1. Implement a split of the posting tuples on a page split.
2. Implement microvacuum of posting tuples.
3. Add a flag into pg_index which allows enabling/disabling compression for a particular index.
4. Implement posting list compression.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
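For illustration, the tuple format described above can be pictured roughly like this. It is only a sketch: the real patch reuses the ordinary IndexTupleData header plus a flag combination in t_info rather than a dedicated struct, and the struct name here is made up.

    #include "postgres.h"
    #include "storage/itemptr.h"

    /*
     * Sketch of a non-pivot posting tuple:
     *     t_tid | t_info | key values | posting_list[]
     */
    typedef struct BTPostingTupleSketch
    {
        ItemPointerData t_tid;      /* reused to store metadata about the
                                     * tuple's posting list rather than a
                                     * single heap TID */
        unsigned short  t_info;     /* tuple length + flag bits, including a
                                     * flag combination marking the tuple as
                                     * a posting tuple */
        /* ... key values, in the usual index-attribute format ... */
        /* ... posting_list[]: sorted array of ItemPointerData (heap TIDs) ... */
    } BTPostingTupleSketch;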
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> i - number of distinct values in the index.
> So i=1 means that all rows have the same key,
> and i=10000000 means that all keys are different.
>
> i / old size (MB) / new size (MB)
> 1 215 88
> 1000 215 90
> 100000 215 71
> 10000000 214 214
>
> For more, see the attached diagram with test results.

I tried this on my own "UK land registry" test data [1], which was originally used for the v12 nbtree work. My test case has a low cardinality, multi-column text index. I chose this test case because it was convenient for me.

On v12/master, the index is 1100MB. Whereas with your patch, it ends up being 196MB -- over 5.5x smaller!

I also tried it out with the "Mouse genome informatics" database [2], which was already improved considerably by the v12 work on duplicates. This is helped tremendously by your patch. It's not quite 5.5x across the board, of course. There are 187 indexes (on 28 tables), and almost all of the indexes are smaller. Actually, *most* of the indexes are *much* smaller. Very often 50% smaller.

I don't have time to do an in-depth analysis of these results today, but clearly the patch is very effective on real world data. I think that we tend to underestimate just how common indexes with a huge number of duplicates are.

[1] https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
[2] http://www.informatics.jax.org/software.shtml
--
Peter Geoghegan
On Thu, Jul 4, 2019 at 10:38 AM Peter Geoghegan <pg@bowt.ie> wrote: > I tried this on my own "UK land registry" test data [1], which was > originally used for the v12 nbtree work. My test case has a low > cardinality, multi-column text index. I chose this test case because > it was convenient for me. > > On v12/master, the index is 1100MB. Whereas with your patch, it ends > up being 196MB -- over 5.5x smaller! I also see a huge and consistent space saving for TPC-H. All 9 indexes are significantly smaller. The lineitem orderkey index is "just" 1/3 smaller, which is the smallest improvement among TPC-H indexes in my index bloat test case. The two largest indexes after the initial bulk load are *much* smaller: the lineitem parts supplier index is ~2.7x smaller, while the lineitem ship date index is a massive ~4.2x smaller. Also, the orders customer key index is ~2.8x smaller, and the order date index is ~2.43x smaller. Note that the test involved retail insertions, not CREATE INDEX. I haven't seen any regression in the size of any index so far, including when the number of internal pages is all that we measure. Actually, there seems to be cases where there is a noticeably larger reduction in internal pages than in leaf pages, probably because of interactions with suffix truncation. This result is very impressive. We'll need to revisit what the right trade-off is for the compression scheme, which Heikki had some thoughts on when we left off 3 years ago, but that should be a lot easier now. I am very encouraged by the fact that this relatively simple approach already works quite nicely. It's also great to see that bulk insertions with lots of compression are very clearly faster with this latest revision of your patch, unlike earlier versions from 2016 that made those cases slower (though I haven't tested indexes that don't really use compression). I think that this is because you now do the compression lazily, at the point where it looks like we may need to split the page. Previous versions of the patch had to perform compression eagerly, just like GIN, which is not really appropriate for nbtree. -- Peter Geoghegan
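To show the shape of the lazy approach described above, here is a heavily simplified sketch of a single deduplication pass over a full leaf page, of the kind performed just before the page would otherwise have to split. All function names other than the core PostgreSQL page routines (PageAddItem() and friends) are hypothetical placeholders, and the sketch pretends that no tuple on the page is already a posting tuple:

    #include "postgres.h"
    #include "access/itup.h"
    #include "access/nbtree.h"
    #include "storage/bufpage.h"

    typedef struct DedupRun
    {
        IndexTuple      base;       /* first tuple of the current run of equal keys */
        ItemPointerData tids[MaxIndexTuplesPerPage];
        int             ntids;
    } DedupRun;

    /* Emit the current run onto the rebuilt page, merged if it has > 1 TID. */
    static void
    flush_run(DedupRun *run, Page newpage)
    {
        IndexTuple  out;

        if (run->ntids == 1)
            out = run->base;    /* nothing to merge */
        else
            out = form_posting_tuple(run->base, run->tids, run->ntids); /* hypothetical */

        if (PageAddItem(newpage, (Item) out, IndexTupleSize(out),
                        InvalidOffsetNumber, false, false) == InvalidOffsetNumber)
            elog(ERROR, "failed to add tuple while deduplicating page");
    }

    /* One lazy deduplication pass: merge runs of equal keys into posting tuples. */
    static void
    dedup_page_sketch(Page page, Page newpage)
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
        OffsetNumber off;
        DedupRun    run = {0};

        for (off = P_FIRSTDATAKEY(opaque); off <= maxoff; off = OffsetNumberNext(off))
        {
            IndexTuple  itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, off));

            if (run.ntids > 0 && keys_equal(run.base, itup))    /* hypothetical comparison */
                run.tids[run.ntids++] = itup->t_tid;    /* same key: remember the heap TID */
            else
            {
                if (run.ntids > 0)
                    flush_run(&run, newpage);
                run.base = itup;
                run.tids[0] = itup->t_tid;
                run.ntids = 1;
            }
        }
        if (run.ntids > 0)
            flush_run(&run, newpage);
    }

Because the pass runs only when the page is about to split, pages that never fill up never pay for it, which is presumably why the retail-insertion benchmarks above improved relative to the eager, GIN-style versions of the patch from 2016.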
On Thu, Jul 4, 2019 at 05:06:09PM -0700, Peter Geoghegan wrote: > This result is very impressive. We'll need to revisit what the right > trade-off is for the compression scheme, which Heikki had some > thoughts on when we left off 3 years ago, but that should be a lot > easier now. I am very encouraged by the fact that this relatively > simple approach already works quite nicely. It's also great to see > that bulk insertions with lots of compression are very clearly faster > with this latest revision of your patch, unlike earlier versions from > 2016 that made those cases slower (though I haven't tested indexes > that don't really use compression). I think that this is because you > now do the compression lazily, at the point where it looks like we may > need to split the page. Previous versions of the patch had to perform > compression eagerly, just like GIN, which is not really appropriate > for nbtree. I am also encouraged and am happy we can finally move this duplicate optimization forward. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > The new version of the patch is attached. > This version is even simpler than the previous one, > thanks to the recent btree design changes and all the feedback I received. > I consider it ready for review and testing. I took a closer look at this patch, and have some general thoughts on its design, and specific feedback on the implementation. Preserving the *logical contents* of B-Tree indexes that use compression is very important -- that should not change in a way that outside code can notice. The heap TID itself should count as logical contents here, since we want to be able to implement retail index tuple deletion in the future. Even without retail index tuple deletion, amcheck's "rootdescend" verification assumes that it can find one specific tuple (which could now just be one specific "logical tuple") using specific key values from the heap, including the heap tuple's heap TID. This requirement makes things a bit harder for your patch, because you have to deal with one or two edge-cases that you currently don't handle: insertion of new duplicates that fall inside the min/max range of some existing posting list. That should be rare enough in practice, so the performance penalty won't be too bad. This probably means that code within _bt_findinsertloc() and/or _bt_binsrch_insert() will need to think about a logical tuple as a distinct thing from a physical tuple, though that won't be necessary in most places. The need to "preserve the logical contents" also means that the patch will need to recognize when indexes are not safe as a target for compression/deduplication (maybe we should call this feature deduplilcation, so it's clear how it differs from TOAST?). For example, if we have a case-insensitive ICU collation, then it is not okay to treat an opclass-equal pair of text strings that use the collation as having the same value when considering merging the two into one. You don't actually do that in the patch, but you also don't try to deal with the fact that such a pair of strings are equal, and so must have their final positions determined by the heap TID column (deduplication within _bt_compress_one_page() must respect that). Possibly equal-but-distinct values seems like a problem that's not worth truly fixing, but it will be necessary to store metadata about whether or not we're willing to do deduplication in the meta page, based on operator class and collation details. That seems like a restriction that we're just going to have to accept, though I'm not too worried about exactly what that will look like right now. We can work it out later. I think that the need to be careful about the logical contents of indexes already causes bugs, even with "safe for compression" indexes. For example, I can sometimes see an assertion failure within_bt_truncate(), at the point where we check if heap TID values are safe: /* * Lehman and Yao require that the downlink to the right page, which is to * be inserted into the parent page in the second phase of a page split be * a strict lower bound on items on the right page, and a non-strict upper * bound for items on the left page. Assert that heap TIDs follow these * invariants, since a heap TID value is apparently needed as a * tiebreaker. */ #ifndef DEBUG_NO_TRUNCATE Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft), BTreeTupleGetMinTID(firstright)) < 0); ... 
This bug is not that easy to see, but it will happen with a big index, even without updates or deletes. I think that this happens because compression can allow the "logical tuples" to be in the wrong heap TID order when there are multiple posting lists for the same value. As I said, I think that it's necessary to see a posting list as being comprised of multiple logical tuples in the context of inserting new tuples, even when you're not performing compression or splitting the page. I also see that amcheck's bt_index_parent_check() function fails, though bt_index_check() does not fail when I don't use any of its extra verification options. (You haven't updated amcheck, but I don't think that you need to update it for these basic checks to continue to work.) Other feedback on specific things: * A good way to assess whether or not the "logical tuple versus physical tuple" thing works is to make sure that amcheck's "rootdescend" verification works with a variety of indexes. As I said, it has the same requirements for nbtree as retail index tuple deletion will. * _bt_findinsertloc() should not call _bt_compress_one_page() for !heapkeyspace (version 3) indexes -- the second call to _bt_compress_one_page() should be removed. * Why can't compression be used on system catalog indexes? I understand that they are not a compelling case, but we tend to do things the same way with catalog tables and indexes unless there is a very good reason not to (e.g. HOT, suffix truncation). I see that the tests fail when that restriction is removed, but I don't think that that has anything to do with system catalogs. I think that that's due to a bug somewhere else. Why have this restriction at all? * It looks like we could be less conservative in nbtsplitloc.c to good effect. We know for sure that a posting list will be truncated down to one heap TID even in the worst case, so we can safely assume that the new high key will be a lot smaller than the firstright tuple that it is based on when it has a posting list. We only have to keep one TID. This will allow us to leave more tuples on the left half of the page in certain cases, further improving space utilization. * Don't you need to update nbtdesc.c? * Maybe we could do compression with unique indexes when inserting values with NULLs? Note that we now treat an insertion of a tuple with NULLs into a unique index as if it wasn't even a unique index -- see the "checkingunique" optimization at the beginning of _bt_doinsert(). Having many NULL values in a unique index is probably fairly common. * It looks like amcheck's heapallindexed verification needs to have normalization added, to avoid false positives. This situation is specifically anticipated by existing comments above bt_normalize_tuple(). Again, being careful about "logical versus physical tuple" seems necessary. * Doesn't the nbtsearch.c/_bt_readpage() code that deals with backwards scans need to return posting lists backwards, not forwards? It seems like a good idea to try to "preserve the logical contents" here too, just to be conservative. Within nbtsort.c: * Is the new code in _bt_buildadd() actually needed? If so, why? * insert_itupprev_to_page_buildadd() is only called within nbtsort.c, and so should be static. The name also seems very long. * add_item_to_posting() is called within both nbtsort.c and nbtinsert.c, and so should remain non-static, but have less generic (and shorter) name. (Use the usual _bt_* style instead.) * Is nbtsort.c the right place for these functions, anyway? 
(Maybe, but maybe not, IMV.) I ran pgindent on the patch, and made some small manual whitespace adjustments, which is attached. There are no real changes, but some of the formatting in the original version you posted was hard to read. Please work off this for your next revision. -- Peter Geoghegan
On Sat, Jul 6, 2019 at 4:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I took a closer look at this patch, and have some general thoughts on
> its design, and specific feedback on the implementation.

I have some high level concerns about how the patch might increase contention, which could make queries slower. Apparently that is a real problem in other systems that use MVCC when their bitmap index feature is used -- they are never really supposed to be used with OLTP apps. This patch makes nbtree behave rather a lot like a bitmap index. That's not exactly true, because you're not storing a bitmap or compressing the TID lists, but they're definitely quite similar. It's easy to imagine a hybrid approach that starts with a B-Tree with deduplication/TID lists, and eventually becomes a bitmap index as more duplicates are added [1].

It doesn't seem like it would be practical for these other MVCC database systems to have standard B-Tree secondary indexes that compress duplicates gracefully in the way that you propose to with this patch, because lock contention would presumably be a big problem for the same reason as it is with their bitmap indexes (whatever the true reason actually is). Is it really possible to have something that's adaptive, offering the best of both worlds? Having dug into it some more, I think that the answer for us might actually be "yes, we can have it both ways".

Other database systems that are also based on MVCC still probably use a limited form of index locking, even in READ COMMITTED mode, though this isn't very widely known. They need this for unique indexes, but they also need it for transaction rollback, to remove old entries from the index when the transaction must abort. The section "6.7 Standard Practice" from the paper "Architecture of a Database System" [2] goes into this, saying:

"All production databases today support ACID transactions. As a rule, they use write-ahead logging for durability, and two-phase locking for concurrency control. An exception is PostgreSQL, which uses multiversion concurrency control throughout."

I suggest reading "6.7 Standard Practice" in full.

Anyway, I think that *hundreds* or even *thousands* of rows are effectively locked all at once when a bitmap index needs to be updated in these other systems -- and I mean a heavyweight lock that lasts until the xact commits or aborts, like a Postgres row lock. As I said, this is necessary simply because the transaction might need to roll back. Of course, your patch never needs to do anything like that -- the only risk is that buffer lock contention will be increased. Maybe VACUUM isn't so bad after all!

Doing deduplication adaptively and automatically in nbtree seems like it might play to the strengths of Postgres, while also ameliorating its weaknesses. As the same paper goes on to say, it's actually quite unusual that PostgreSQL has *transactional* full text search built in (using GIN), and offers transactional, high concurrency spatial indexing (using GiST). Actually, this is an additional advantage of our "pure" approach to MVCC -- we can add new high concurrency, transactional access methods relatively easily.

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.3159&rep=rep1&type=pdf
[2] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
--
Peter Geoghegan
On Wed, Jul 10, 2019 at 09:53:04PM -0700, Peter Geoghegan wrote: > Anyway, I think that *hundreds* or even *thousands* of rows are > effectively locked all at once when a bitmap index needs to be updated > in these other systems -- and I mean a heavyweight lock that lasts > until the xact commits or aborts, like a Postgres row lock. As I said, > this is necessary simply because the transaction might need to roll > back. Of course, your patch never needs to do anything like that -- > the only risk is that buffer lock contention will be increased. Maybe > VACUUM isn't so bad after all! > > Doing deduplication adaptively and automatically in nbtree seems like > it might play to the strengths of Postgres, while also ameliorating > its weaknesses. As the same paper goes on to say, it's actually quite > unusual that PostgreSQL has *transactional* full text search built in > (using GIN), and offers transactional, high concurrency spatial > indexing (using GiST). Actually, this is an additional advantages of > our "pure" approach to MVCC -- we can add new high concurrency, > transactional access methods relatively easily. Wow, I never thought of that. The only things I know we lock until transaction end are rows we update (against concurrent updates), and additions to unique indexes. By definition, indexes with many duplicates are not unique, so that doesn't apply. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Hi Peter, Thank you very much for your attention to this patch. Let me comment some points of your message. On Sun, Jul 7, 2019 at 2:09 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: > > The new version of the patch is attached. > > This version is even simpler than the previous one, > > thanks to the recent btree design changes and all the feedback I received. > > I consider it ready for review and testing. > > I took a closer look at this patch, and have some general thoughts on > its design, and specific feedback on the implementation. > > Preserving the *logical contents* of B-Tree indexes that use > compression is very important -- that should not change in a way that > outside code can notice. The heap TID itself should count as logical > contents here, since we want to be able to implement retail index > tuple deletion in the future. Even without retail index tuple > deletion, amcheck's "rootdescend" verification assumes that it can > find one specific tuple (which could now just be one specific "logical > tuple") using specific key values from the heap, including the heap > tuple's heap TID. This requirement makes things a bit harder for your > patch, because you have to deal with one or two edge-cases that you > currently don't handle: insertion of new duplicates that fall inside > the min/max range of some existing posting list. That should be rare > enough in practice, so the performance penalty won't be too bad. This > probably means that code within _bt_findinsertloc() and/or > _bt_binsrch_insert() will need to think about a logical tuple as a > distinct thing from a physical tuple, though that won't be necessary > in most places. Could you please elaborate more on preserving the logical contents? I can understand it as following: "B-Tree should have the same structure and invariants as if each TID in posting list be a separate tuple". So, if we imagine each TID to become separate tuple it would be the same B-tree, which just can magically sometimes store more tuples in page. Is my understanding correct? But outside code will still notice changes as soon as it directly accesses B-tree pages (like contrib/amcheck does). Do you mean we need an API for accessing logical B-tree tuples or something? > The need to "preserve the logical contents" also means that the patch > will need to recognize when indexes are not safe as a target for > compression/deduplication (maybe we should call this feature > deduplilcation, so it's clear how it differs from TOAST?). For > example, if we have a case-insensitive ICU collation, then it is not > okay to treat an opclass-equal pair of text strings that use the > collation as having the same value when considering merging the two > into one. You don't actually do that in the patch, but you also don't > try to deal with the fact that such a pair of strings are equal, and > so must have their final positions determined by the heap TID column > (deduplication within _bt_compress_one_page() must respect that). > Possibly equal-but-distinct values seems like a problem that's not > worth truly fixing, but it will be necessary to store metadata about > whether or not we're willing to do deduplication in the meta page, > based on operator class and collation details. That seems like a > restriction that we're just going to have to accept, though I'm not > too worried about exactly what that will look like right now. We can > work it out later. 
I think in order to deduplicate "equal but distinct" values we need at least to give up with index only scans. Because we have no restriction that equal according to B-tree opclass values are same for other operations and/or user output. > I think that the need to be careful about the logical contents of > indexes already causes bugs, even with "safe for compression" indexes. > For example, I can sometimes see an assertion failure > within_bt_truncate(), at the point where we check if heap TID values > are safe: > > /* > * Lehman and Yao require that the downlink to the right page, which is to > * be inserted into the parent page in the second phase of a page split be > * a strict lower bound on items on the right page, and a non-strict upper > * bound for items on the left page. Assert that heap TIDs follow these > * invariants, since a heap TID value is apparently needed as a > * tiebreaker. > */ > #ifndef DEBUG_NO_TRUNCATE > Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft), > BTreeTupleGetMinTID(firstright)) < 0); > ... > > This bug is not that easy to see, but it will happen with a big index, > even without updates or deletes. I think that this happens because > compression can allow the "logical tuples" to be in the wrong heap TID > order when there are multiple posting lists for the same value. As I > said, I think that it's necessary to see a posting list as being > comprised of multiple logical tuples in the context of inserting new > tuples, even when you're not performing compression or splitting the > page. I also see that amcheck's bt_index_parent_check() function > fails, though bt_index_check() does not fail when I don't use any of > its extra verification options. (You haven't updated amcheck, but I > don't think that you need to update it for these basic checks to > continue to work.) Do I understand correctly that current patch may produce posting lists of the same value with overlapping ranges of TIDs? If so, it's definitely wrong. > * Maybe we could do compression with unique indexes when inserting > values with NULLs? Note that we now treat an insertion of a tuple with > NULLs into a unique index as if it wasn't even a unique index -- see > the "checkingunique" optimization at the beginning of _bt_doinsert(). > Having many NULL values in a unique index is probably fairly common. I think unique indexes may benefit from deduplication not only because of NULL values. Non-HOT updates produce duplicates of non-NULL values in unique indexes. And those duplicates can take significant space. ------ Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Thu, Jul 11, 2019 at 7:53 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Anyway, I think that *hundreds* or even *thousands* of rows are
> effectively locked all at once when a bitmap index needs to be updated
> in these other systems -- and I mean a heavyweight lock that lasts
> until the xact commits or aborts, like a Postgres row lock. As I said,
> this is necessary simply because the transaction might need to roll
> back. Of course, your patch never needs to do anything like that --
> the only risk is that buffer lock contention will be increased. Maybe
> VACUUM isn't so bad after all!
>
> Doing deduplication adaptively and automatically in nbtree seems like
> it might play to the strengths of Postgres, while also ameliorating
> its weaknesses. As the same paper goes on to say, it's actually quite
> unusual that PostgreSQL has *transactional* full text search built in
> (using GIN), and offers transactional, high concurrency spatial
> indexing (using GiST). Actually, this is an additional advantages of
> our "pure" approach to MVCC -- we can add new high concurrency,
> transactional access methods relatively easily.

Good finding, thank you!

BTW, I think deduplication could cause some small performance degradation in some particular cases, because page-level locks become more coarse-grained once pages hold more tuples. However, this doesn't seem like something we should care much about. Providing an option to turn deduplication off seems like enough to me.

Regarding bitmap indexes themselves, I think our BRIN could provide them. However, it would be useful to have opclass parameters to make them tunable.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Sun, 7 Jul 2019 at 01:08, Peter Geoghegan <pg@bowt.ie> wrote:
> * Maybe we could do compression with unique indexes when inserting
> values with NULLs? Note that we now treat an insertion of a tuple with

+1

I tried this patch and found the improvements impressive. However, when I tried it with multi-column indexes it wasn't giving any improvement; is this a known limitation of the patch?

I am surprised to find that such a patch has been on the radar for quite some years now and is not yet committed.

Going through the patch, here are a few comments from me.

/* Add the new item into the page */
+ offnum = OffsetNumberNext(offnum);
+
+ elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d IndexTupleSize %zu free %zu",
+ compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
+

and other such DEBUG4 statements are meant to be removed, right...? I didn't find any other such statements in this API, and there are many in this patch, so I am not sure how much they are needed.

/*
 * If we have only 10 uncompressed items on the full page, it probably
 * won't worth to compress them.
 */
if (maxoff - n_posting_on_page < 10)
    return;

Is this a magic number...?

/*
 * We do not expect to meet any DEAD items, since this function is
 * called right after _bt_vacuum_one_page(). If for some reason we
 * found dead item, don't compress it, to allow upcoming microvacuum
 * or vacuum clean it up.
 */
if (ItemIdIsDead(itemId))
    continue;

This makes me wonder about those 'some' reasons.

Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * he will get what he expects

This can be re-framed to make the caller more gender neutral.

Other than that, I am curious about the plans for its backward compatibility.

--
Regards,
Rafia Sabih
On Thu, Jul 11, 2019 at 7:30 AM Bruce Momjian <bruce@momjian.us> wrote: > Wow, I never thought of that. The only things I know we lock until > transaction end are rows we update (against concurrent updates), and > additions to unique indexes. By definition, indexes with many > duplicates are not unique, so that doesn't apply. Right. Another advantage of their approach is that you can make queries like this work: UPDATE tab SET unique_col = unique_col + 1 This will not throw a unique violation error on most/all other DB systems when the updated column (in this case "unique_col") has a unique constraint/is the primary key. This behavior is actually required by the SQL standard. An SQL statement is supposed to be all-or-nothing, which Postgres doesn't quite manage here. The section "6.6 Interdependencies of Transactional Storage" from the paper "Architecture of a Database System" provides additional background information (I should have suggested reading both 6.6 and 6.7 together). -- Peter Geoghegan
On Thu, Jul 11, 2019 at 8:02 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > Could you please elaborate more on preserving the logical contents? I > can understand it as following: "B-Tree should have the same structure > and invariants as if each TID in posting list be a separate tuple". That's exactly what I mean. > So, if we imagine each TID to become separate tuple it would be the > same B-tree, which just can magically sometimes store more tuples in > page. Is my understanding correct? Yes. > But outside code will still > notice changes as soon as it directly accesses B-tree pages (like > contrib/amcheck does). Do you mean we need an API for accessing > logical B-tree tuples or something? Well, contrib/amcheck isn't really outside code. But amcheck's "rootdescend" option will still need to be able to supply a heap TID as just another column, and get back zero or one logical tuples from the index. This is important because retail index tuple deletion needs to be able to think about logical tuples in the same way. I also think that it might be useful for the planner to expect to get back duplicates in heap TID order in the future (or in reverse order in the case of a backwards scan). Query execution and VACUUM code outside of nbtree should be able to pretend that there is no such thing as a posting list. The main thing that the patch is missing that is needed to "preserve logical contents" is the ability to update/expand an *existing* posting list due to a retail insertion of a new duplicate that happens to be within the range of that existing posting list. This will usually be a non-HOT update that doesn't change the value for the row in the index -- that must change the posting list, even when there is available space on the page without recompressing. We must still occasionally be eager, like GIN always is, though in practice we'll almost always add to posting lists in a lazy fashion, when it looks like we might have to split the page -- the lazy approach seems to perform best. > I think in order to deduplicate "equal but distinct" values we need at > least to give up with index only scans. Because we have no > restriction that equal according to B-tree opclass values are same for > other operations and/or user output. We can either prevent index-only scans in the case of affected indexes, or prevent compression, or give the user a choice. I'm not too worried about how that will work for users just yet. > Do I understand correctly that current patch may produce posting lists > of the same value with overlapping ranges of TIDs? If so, it's > definitely wrong. Yes, it can, since the assertion fails. It looks like the assertion itself was changed to match what I expect, so I assume that this bug will be fixed in the next version of the patch. It fails with a fairly big index on text for me. > > * Maybe we could do compression with unique indexes when inserting > > values with NULLs? Note that we now treat an insertion of a tuple with > > NULLs into a unique index as if it wasn't even a unique index -- see > > the "checkingunique" optimization at the beginning of _bt_doinsert(). > > Having many NULL values in a unique index is probably fairly common. > > I think unique indexes may benefit from deduplication not only because > of NULL values. Non-HOT updates produce duplicates of non-NULL values > in unique indexes. And those duplicates can take significant space. I agree that we should definitely have an open mind about unique indexes, even with non-NULL values. 
If we can prevent a page split by deduplicating the contents of a unique index page, then we'll probably win. Why not try? This will need to be tested. -- Peter Geoghegan
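To make the missing piece concrete: when a new duplicate's heap TID falls inside an existing posting list's min/max range, the list itself has to grow while keeping its TIDs sorted. A minimal, heavily simplified sketch of just that step follows (in reality the tuple has to be rebuilt on the page and WAL-logged, and may no longer fit, which is where the page-split handling comes in):

    #include "postgres.h"
    #include "storage/itemptr.h"

    /*
     * Sketch only: insert 'newtid' into a sorted in-memory array of heap
     * TIDs, preserving TID order.  'tids' must have room for one more entry.
     */
    static void
    posting_list_add_tid(ItemPointerData *tids, int *ntids, ItemPointer newtid)
    {
        int         lo = 0;
        int         hi = *ntids;

        /* binary search for the insertion position */
        while (lo < hi)
        {
            int         mid = lo + (hi - lo) / 2;

            if (ItemPointerCompare(&tids[mid], newtid) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }

        /* shift the tail right and place the new TID */
        memmove(&tids[lo + 1], &tids[lo],
                (*ntids - lo) * sizeof(ItemPointerData));
        tids[lo] = *newtid;
        (*ntids)++;
    }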
On Thu, Jul 11, 2019 at 8:09 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > BTW, I think deduplication could cause some small performance > degradation in some particular cases, because page-level locks became > more coarse grained once pages hold more tuples. However, this > doesn't seem like something we should much care about. Providing an > option to turn deduplication off looks enough for me. There was an issue like this with my v12 work on nbtree, with the TPC-C indexes. They were always ~40% smaller, but there was a regression when TPC-C was used with a small number of warehouses, when the data could easily fit in memory (which is not allowed by the TPC-C spec, in effect). TPC-C is very write-heavy, which combined with everything else causes this problem. I wasn't doing anything too fancy there -- the regression seemed to happen simply because the index was smaller, not because of the overhead of doing page splits differently or anything like that (there were far fewer splits). I expect there to be some regression for workloads like this. I am willing to accept that provided it's not too noticeable, and doesn't have an impact on other workloads. I am optimistic about it. > Regarding bitmap indexes itself, I think our BRIN could provide them. > However, it would be useful to have opclass parameters to make them > tunable. I thought that we might implement them in nbtree myself. But we don't need to decide now. -- Peter Geoghegan
On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> I tried this patch and found the improvements impressive. However,
> when I tried with multi-column indexes it wasn't giving any
> improvement, is it the known limitation of the patch?

It'll only deduplicate full duplicates. It works with multi-column indexes, provided the entire set of values is duplicated -- not just a prefix. Prefix compression is possible, but it's more complicated. It seems to generally require the DBA to specify a prefix length, expressed as a number of prefix columns.

> I am surprised to find that such a patch is on radar since quite some
> years now and not yet committed.

The v12 work on nbtree (making heap TID a tiebreaker column) seems to have made the general approach a lot more effective. Compression is performed lazily, not eagerly, which seems to work a lot better.

> + elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d
> IndexTupleSize %zu free %zu",
> + compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page));
> +
> and other such DEBUG4 statements are meant to be removed, right...?

I hope so too.

> /*
> * If we have only 10 uncompressed items on the full page, it probably
> * won't worth to compress them.
> */
> if (maxoff - n_posting_on_page < 10)
> return;
>
> Is this a magic number...?

I think that this should be a constant or something.

> /*
> * We do not expect to meet any DEAD items, since this function is
> * called right after _bt_vacuum_one_page(). If for some reason we
> * found dead item, don't compress it, to allow upcoming microvacuum
> * or vacuum clean it up.
> */
> if (ItemIdIsDead(itemId))
> continue;
>
> This makes me wonder about those 'some' reasons.

I think that this is just defensive. Note that _bt_vacuum_one_page() is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag is set for the page.

> Caller is responsible for checking BTreeTupleIsPosting to ensure that
> + * he will get what he expects
>
> This can be re-framed to make the caller more gender neutral.

Agreed. I also don't like anthropomorphizing code like this.

> Other than that, I am curious about the plans for its backward compatibility.

Me too. There is something about a new version 5 in comments in nbtree.h, but the version number isn't changed. I think that we may be able to get away with not increasing the B-Tree version from 4 to 5, actually. Deduplication is performed lazily when it looks like we might have to split the page, so there isn't any expectation that tuples will either be compressed or uncompressed in any context.

--
Peter Geoghegan
On Thu, Jul 11, 2019 at 10:42 AM Peter Geoghegan <pg@bowt.ie> wrote: > > I think unique indexes may benefit from deduplication not only because > > of NULL values. Non-HOT updates produce duplicates of non-NULL values > > in unique indexes. And those duplicates can take significant space. > > I agree that we should definitely have an open mind about unique > indexes, even with non-NULL values. If we can prevent a page split by > deduplicating the contents of a unique index page, then we'll probably > win. Why not try? This will need to be tested. I thought about this some more. I believe that the LP_DEAD bit setting within _bt_check_unique() is generally more important than the more complicated kill_prior_tuple mechanism for setting LP_DEAD bits, even though the _bt_check_unique() thing can only be used with unique indexes. Also, I have often thought that we don't do enough to take advantage of the special characteristics of unique indexes -- they really are quite different. I believe that other database systems do this in various ways. Maybe we should too. Unique indexes are special because there can only ever be zero or one tuples of the same value that are visible to any possible MVCC snapshot. Within the index AM, there is little difference between an UPDATE by a transaction and a DELETE + INSERT of the same value by a transaction. If there are 3 or 5 duplicates within a unique index, then there is a strong chance that VACUUM could reclaim some of them, given the chance. It is worth going to a little effort to find out. In a traditional serial/bigserial primary key, the key space that is typically "owned" by the left half of a rightmost page split describes a range of about ~366 items, with few or no gaps for other values that didn't exist at the time of the split (i.e. the two pivot tuples on each side cover a range that is equal to the number of items itself). If the page ever splits again, the chances of it being due to non-HOT updates is perhaps 100%. Maybe VACUUM just didn't get around to the index in time, or maybe there is a long running xact, or whatever. If we can delay page splits in indexes like this, then we could easily prevent them from *ever* happening. Our first line of defense against page splits within unique indexes will probably always be LP_DEAD bits set within _bt_check_unique(), because it costs so little -- same as today. We could also add a second line of defense: deduplication -- same as with non-unique indexes with the patch. But we can even add a third line of defense on top of those two: more aggressive reclaiming of posting list space, by going to the heap to check the visibility status of earlier posting list entries. We can do this optimistically when there is no LP_DEAD bit set, based on heuristics. The high level principle here is that we can justify going to a small amount of extra effort for the chance to avoid a page split, and maybe even more than a small amount. Our chances of reversing the split by merging pages later on are almost zero. The two halves of the split will probably each get dirtied again and again in the future if we cannot avoid it, plus we have to dirty the parent page, and the old sibling page (to update its left link). In general, a page split is already really expensive. 
We could do something like amortize the cost of accessing the heap a second time for tuples that we won't have considered setting the LP_DEAD bit on within _bt_check_unique() by trying the *same* heap page a *second* time where possible (distinct values are likely to be nearby on the same page). I think that an approach like this could work quite well for many workloads. You only pay a cost (visiting the heap an extra time) when it looks like you'll get a benefit (not splitting the page). As you know, Andres already changed nbtree to get an XID for conflict purposes on the primary by visiting the heap a second time (see commit 558a9165e08), when we need to actually reclaim LP_DEAD space. I anticipated that we could extend this to do more clever/eager/lazy cleanup of additional items before that went in, which is a closely related idea. See: https://www.postgresql.org/message-id/flat/CAH2-Wznx8ZEuXu7BMr6cVpJ26G8OSqdVo6Lx_e3HSOOAU86YoQ%40mail.gmail.com#46ffd6f32a60e086042a117f2bfd7df7 I know that this is a bit hand-wavy; the details certainly need to be worked out. However, it is not so different to the "ghost bit" design that other systems use with their non-unique indexes (though this idea applies specifically to unique indexes in our case). The main difference is that we're going to the heap rather than to UNDO, because that's where we store our visibility information. That doesn't seem like such a big difference -- we are also reasonably confident that we'll find that the TID is dead, even without LP_DEAD bits being set, because we only do the extra stuff with unique indexes. And, we do it lazily. -- Peter Geoghegan
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From: Anastasia Lubennikova
11.07.2019 21:19, Peter Geoghegan wrote: > On Thu, Jul 11, 2019 at 8:34 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote: Hi, Peter, Rafia, thanks for the review. New version is attached. >> + elog(DEBUG4, "insert_itupprev_to_page. compressState->ntuples %d >> IndexTupleSize %zu free %zu", >> + compressState->ntuples, IndexTupleSize(to_insert), PageGetFreeSpace(page)); >> + >> and other such DEBUG4 statements are meant to be removed, right...? > I hope so too. Yes, these messages are only for debugging. I haven't delete them since this is still work in progress and it's handy to be able to print inner details. Maybe I should also write a patch for pageinspect. >> /* >> * If we have only 10 uncompressed items on the full page, it probably >> * won't worth to compress them. >> */ >> if (maxoff - n_posting_on_page < 10) >> return; >> >> Is this a magic number...? > I think that this should be a constant or something. Fixed. Now this is a constant in nbtree.h. I'm not 100% sure about the value. When the code will stabilize we can benchmark it and find optimal value. >> /* >> * We do not expect to meet any DEAD items, since this function is >> * called right after _bt_vacuum_one_page(). If for some reason we >> * found dead item, don't compress it, to allow upcoming microvacuum >> * or vacuum clean it up. >> */ >> if (ItemIdIsDead(itemId)) >> continue; >> >> This makes me wonder about those 'some' reasons. > I think that this is just defensive. Note that _bt_vacuum_one_page() > is prepared to find no dead items, even when the BTP_HAS_GARBAGE flag > is set for the page. You are right, now it is impossible to meet dead items in this function. Though it can change in the future if, for example, _bt_vacuum_one_page will behave lazily. So this is just a sanity check. Maybe it's worth to move it to Assert. > >> Caller is responsible for checking BTreeTupleIsPosting to ensure that >> + * he will get what he expects >> >> This can be re-framed to make the caller more gender neutral. > Agreed. I also don't like anthropomorphizing code like this. Fixed. >> Other than that, I am curious about the plans for its backward compatibility. > Me too. There is something about a new version 5 in comments in > nbtree.h, but the version number isn't changed. I think that we may be > able to get away with not increasing the B-Tree version from 4 to 5, > actually. Deduplication is performed lazily when it looks like we > might have to split the page, so there isn't any expectation that > tuples will either be compressed or uncompressed in any context. Current implementation is backward compatible. To distinguish posting tuples, it only adds one new flag combination. This combination was never possible before. Comment about version 5 is deleted. I also added a patch for amcheck. There is one major issue left - preserving TID order in posting lists. For a start, I added a sort into BTreeFormPostingTuple function. It turned out to be not very helpful, because we cannot check this invariant lazily. Now I work on patching _bt_binsrch_insert() and _bt_insertonpg() to implement insertion into the middle of the posting list. I'll send a new version this week. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From: Anastasia Lubennikova
17.07.2019 19:36, Anastasia Lubennikova:
>
> There is one major issue left - preserving TID order in posting lists.
> For a start, I added a sort into BTreeFormPostingTuple function.
> It turned out to be not very helpful, because we cannot check this
> invariant lazily.
>
> Now I work on patching _bt_binsrch_insert() and _bt_insertonpg() to
> implement
> insertion into the middle of the posting list. I'll send a new version
> this week.

Patch 0002 (which must be applied on top of 0001) implements preservation of correct TID order inside the posting list when inserting new tuples. This version passes all regression tests, including the amcheck test.

I also used the following script to test insertion into the posting list:

set client_min_messages to debug4;
drop table tbl;
create table tbl (i1 int, i2 int);
insert into tbl select 1, i from generate_series(0,1000) as i;
insert into tbl select 1, i from generate_series(0,1000) as i;
create index idx on tbl (i1);
delete from tbl where i2 < 500;
vacuum tbl;
insert into tbl select 1, i from generate_series(1001, 1500) as i;

The last insert triggers several insertions that can be seen in the debug messages.

I suppose this is not the final version of the patch yet, so I left some debug messages and TODO comments in to ease review.

Please, in your review, pay particular attention to the usage of BTreeTupleGetHeapTID. For posting tuples it returns the first TID from the posting list, like BTreeTupleGetMinTID, but maybe some callers are not ready for that and want BTreeTupleGetMaxTID instead. Incorrect usage of these macros may cause subtle bugs which are probably not covered by tests, so please double-check it.

Next week I'm going to check performance, try to find specific scenarios where this feature can lead to degradation, and measure it, to understand whether we need to make deduplication optional.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
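As an illustration of the min/max distinction mentioned above, the two accessors would behave roughly like this; is_posting_tuple(), posting_tuple_tids() and posting_tuple_ntids() are stand-ins for whatever the patch's nbtree.h macros are actually called:

    #include "postgres.h"
    #include "access/itup.h"

    /* Sketch only: smallest heap TID covered by a leaf tuple. */
    static ItemPointer
    tuple_min_heap_tid(IndexTuple itup)
    {
        if (is_posting_tuple(itup))
            return posting_tuple_tids(itup);    /* TIDs are sorted: first is the minimum */

        return &itup->t_tid;                    /* plain tuple: its single heap TID */
    }

    /* Sketch only: largest heap TID covered by a leaf tuple. */
    static ItemPointer
    tuple_max_heap_tid(IndexTuple itup)
    {
        if (is_posting_tuple(itup))
            return posting_tuple_tids(itup) + (posting_tuple_ntids(itup) - 1);

        return &itup->t_tid;
    }

Callers that really need an upper bound, such as the heap TID invariant asserted in _bt_truncate() quoted earlier in the thread, must use the max variant; quietly getting the minimum TID instead is exactly the kind of subtle bug being warned about here.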
On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Patch 0002 (must be applied on top of 0001) implements preserving of > correct TID order > inside posting list when inserting new tuples. > This version passes all regression tests including amcheck test. > I also used following script to test insertion into the posting list: Nice! > I suppose it is not the final version of the patch yet, > so I left some debug messages and TODO comments to ease review. I'm fine with leaving them in. I have sometimes distributed a separate patch with debug messages, but now that I think about it, that probably wasn't a good use of time. You will probably want to remove at least some of the debug messages during performance testing. I'm thinking of code that appears in very tight inner loops, such as the _bt_compare() code. > Please, in your review, pay particular attention to usage of > BTreeTupleGetHeapTID. > For posting tuples it returns the first tid from posting list like > BTreeTupleGetMinTID, > but maybe some callers are not ready for that and want > BTreeTupleGetMaxTID instead. > Incorrect usage of these macros may cause some subtle bugs, > which are probably not covered by tests. So, please double-check it. One testing strategy that I plan to use for the patch is to deliberately corrupt a compressed index in a subtle way using pg_hexedit, and then see if amcheck detects the problem. For example, I may swap the order of two TIDs in the middle of a posting list, which is something that is unlikely to produce wrong answers to queries, and won't even be detected by the "heapallindexed" check, but is still wrong. If we can detect very subtle, adversarial corruption like this, then we can detect any real-world problem. Once we have confidence in amcheck's ability to detect problems with posting lists in general, we can use it in many different contexts without much thought. For example, we'll probably need to do long running benchmarks to validate the performance of the patch. It's easy to add amcheck testing at the end of each run. Every benchmark is now also a correctness/stress test, for free. > Next week I'm going to check performance and try to find specific > scenarios where this > feature can lead to degradation and measure it, to understand if we need > to make this deduplication optional. Sounds good, though I think it might be a bit too early to decide whether or not it needs to be enabled by default. For one thing, the approach to WAL-logging within _bt_compress_one_page() is probably fairly inefficient, which may be a problem for certain workloads. It's okay to leave it that way for now, because it is not relevant to the core design of the patch. I'm sure that _bt_compress_one_page() can be carefully optimized when the time comes. My current focus is not on the raw performance itself. For now, I am focussed on making sure that the compression works well, and that the resulting indexes "look nice" in general. FWIW, the first few versions of my v12 work on nbtree didn't actually make *anything* go faster. It took a couple of months to fix the more important regressions, and a few more months to fix all of them. I think that the work on this patch may develop in a similar way. I am willing to accept regressions in the unoptimized code during development because it seems likely that you have the right idea about the data structure itself, which is the one thing that I *really* care about. 
Once you get that right, the remaining problems are very likely to either be fixable with further work on optimizing specific code, or a price that users will mostly be happy to pay to get the benefits. -- Peter Geoghegan
On Fri, Jul 19, 2019 at 12:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jul 19, 2019 at 10:53 AM Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
> > Patch 0002 (must be applied on top of 0001) implements preserving of
> > correct TID order inside posting list when inserting new tuples.
> > This version passes all regression tests including amcheck test.
> > I also used following script to test insertion into the posting list:
>
> Nice!

Hmm. So, the attached test case fails amcheck verification for me with the latest version of the patch:

$ psql -f amcheck-compress-test.sql
DROP TABLE
CREATE TABLE
CREATE INDEX
CREATE EXTENSION
INSERT 0 2001
psql:amcheck-compress-test.sql:6: ERROR: down-link lower bound invariant violated for index "idx_desc_nl"
DETAIL: Parent block=3 child index tid=(2,2) parent page lsn=10/F87A3438.

Note that this test only has an INSERT statement. You have to use bt_index_parent_check() to see the problem -- bt_index_check() will not detect the problem.

-- Peter Geoghegan
Attachment
On Fri, Jul 19, 2019 at 7:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > Hmm. So, the attached test case fails amcheck verification for me with > the latest version of the patch: Attached is a revised version of your v2 that fixes this issue -- I'll call this v3. In general, my goal for the revision was to make sure that all of my old tests from the v12 work passed, and to make sure that amcheck can detect almost any possible problem. I tested the amcheck changes by corrupting random state in a test index using pg_hexedit, then making sure that amcheck actually complained in each case. I also fixed one or two bugs in passing, including the bug that caused an assertion failure in _bt_truncate(). That was down to a subtle off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't make that many changes to your v2. There are probably some things about the patch that I still don't understand, or things that I have misunderstood. Other changes: * We now support system catalog indexes. There is no reason not to support them. * Removed unnecessary code from _bt_buildadd(). * Added my own new DEBUG4 trace to _bt_insertonpg_in_posting(), which I used to fix that bug I mentioned. I agree that we should keep the DEBUG4 traces around until the overall design settles down. I found the ones that you added helpful, too. * Added quite a few new assertions. For example, we need to still support !heapkeyspace (pre Postgres 12) nbtree indexes, but we cannot let them use compression -- new defensive assertions were added to make this break loudly. * Changed the custom binary search code within _bt_compare_posting() to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know of any reason not to do it that way? * Added quite a few "FIXME"/"XXX" comments at various points, to indicate where I have general concerns that need more discussion. * Included my own pageinspect hack to visualize the minimum TIDs in posting lists. It's broken out into a separate patch file. The code is very rough, but it might help someone else, so I thought I'd include it. I also have some new concerns about the code in the patch that I will point out now (though only as something to think about a solution on -- I am unsure myself): * It's a bad sign that compression involves calls to PageAddItem() that are allowed to fail (we just give up on compression when that happens). For one thing, all existing calls to PageAddItem() in Postgres are never expected to fail -- if they do fail we get a "can't happen" error that suggests corruption. It was a good idea to take this approach to get the patch to work, and to prove the general idea, but we now need to fully work out all the details about the use of space. This includes complicated new questions around how alignment is supposed to work. Alignment in nbtree is already complicated today -- you're supposed to MAXALIGN() everything in nbtree, so that the MAXALIGN() within bufpage.c routines cannot be different to the lp_len/IndexTupleSize() length (note that heapam can have tuples whose lp_len isn't aligned, so nbtree could do it differently if it proved useful). Code within nbtsplitloc.c fully understands the space requirements for the bufpage.c routines, and is very careful about it. (The bufpage.c details are supposed to be totally hidden from code like nbtsplitloc.c, but I guess that that ideal isn't quite possible in reality. Code comments don't really explain the situation today.) 
I'm not sure what it would look like for this patch to be as precise about free space as nbtsplitloc.c already is, even though that seems desirable (I just know that it would mean you would trust PageAddItem() to work in all cases). The patch is different to what we already have today in that it tries to add *less than* a single MAXALIGN() quantum at a time in some places (when a posting list needs to grow by one item). The devil is in the details. * As you know, the current approach to WAL logging is very inefficient. It's okay for now, but we'll need a fine-grained approach for the patch to be commitable. I think that this is subtly related to the last item (i.e. the one about alignment). I have done basic performance tests using unlogged tables. The patch seems to either make big INSERT queries run as fast or faster than before when inserting into unlogged tables, which is a very good start. * Since we can now split a posting list in two, we may also have to reconsider BTMaxItemSize, or some similar mechanism that worries about extreme cases where it becomes impossible to split because even two pages are not enough to fit everything. Think of what happens when there is a tuple with a single large datum, that gets split in two (the tuple is split, not the page), with each half receiving its own copy of the datum. I haven't proven to myself that this is broken, but that may just be because I haven't spent any time on it. OTOH, maybe you already have it right, in which case it seems like it should be explained somewhere. Possibly in nbtree.h. This is tricky stuff. * I agree with all of your existing TODO items -- most of them seem very important to me. * Do we really need to keep BTreeTupleGetHeapTID(), now that we have BTreeTupleGetMinTID()? Can't we combine the two macros into one, so that callers don't need to think about the pivot vs posting list thing themselves? See the new code added to _bt_mkscankey() by v3, for example. It now handles both cases/macros at once, in order to keep its amcheck caller happy. amcheck's verify_nbtree.c received similar ugly code in v3. * We should at least experiment with applying compression when inserting into unique indexes. Like Alexander, I think that compression in unique indexes might work well, given how they must work in Postgres. My next steps will be to study the design of the _bt_insertonpg_in_posting() stuff some more. It seems like you already have the right general idea there, but I would like to come up with a way of making _bt_insertonpg_in_posting() understand how to work with space on the page with total certainty, much like nbtsplitloc.c does today. This should allow us to make WAL-logging more precise/incremental. -- Peter Geoghegan
Attachment
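To make the alignment concern above concrete, here is a small standalone sketch (not patch code) showing how growing a posting list by one 6-byte TID sometimes costs zero extra bytes on the page and sometimes a whole MAXALIGN() quantum. It assumes MAXIMUM_ALIGNOF is 8, as on most 64-bit platforms, and uses a made-up fixed tuple overhead:

#include <stdio.h>
#include <stddef.h>

#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
	(((size_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

#define TID_SIZE		6		/* sizeof(ItemPointerData) */

int
main(void)
{
	size_t		header_and_key = 16;	/* hypothetical fixed part of a posting tuple */

	for (int ntids = 1; ntids <= 6; ntids++)
	{
		size_t		raw = header_and_key + ntids * TID_SIZE;
		size_t		onpage = MAXALIGN(raw);
		size_t		grows = MAXALIGN(raw + TID_SIZE) - onpage;

		/* "grows" alternates between 0 and 8 bytes as the quantum boundary moves */
		printf("%d TIDs: raw=%zu, MAXALIGN'd=%zu, one more TID grows the tuple by %zu bytes\n",
			   ntids, raw, onpage, grows);
	}
	return 0;
}

This is exactly the "less than a single MAXALIGN() quantum at a time" situation: the space a posting-list insertion actually needs on the page depends on where the current length falls within the alignment quantum.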
On Tue, Jul 23, 2019 at 6:22 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is a revised version of your v2 that fixes this issue -- I'll > call this v3. Remember that index that I said was 5.5x smaller with the patch applied, following retail insertions (a single big INSERT ... SELECT ...)? Well, it's 6.5x faster with this small additional patch applied on top of the v3 I posted yesterday. Many of the indexes in my test suite are about ~20% smaller __in addition to__ very big size reductions. Some are even ~30% smaller than they were with v3 of the patch. For example, the fair use implementation of TPC-H that my test data comes from has an index on the "orders" o_orderdate column, named idx_orders_orderdate, which is made ~30% smaller by the addition of this simple patch (once again, this is following a single big INSERT ... SELECT ...). This change makes idx_orders_orderdate ~3.3x smaller than it is with master/Postgres 12, in case you were wondering. This new patch teaches nbtsplitloc.c to subtract posting list overhead when sizing the new high key for the left half of a candidate split point, since we know for sure that _bt_truncate() will at least manage to truncate away that much from the new high key, even in the worst case. Since posting lists are often very large, this can make a big difference. This is actually just a bugfix, not a new idea -- I merely made nbtsplitloc.c understand how truncation works with posting lists. There seems to be a kind of "synergy" between the nbtsplitloc.c handling of pages that have lots of duplicates and posting list compression. It seems as if the former mechanism "sets up the bowling pins", while the latter mechanism "knocks them down", which is really cool. We should try to gain a better understanding of how that works, because it's possible that it could be even more effective in some cases. -- Peter Geoghegan
Attachment
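The accounting idea can be sketched in a few lines of standalone C. The helper name and the sizes below are hypothetical, and the real logic (including alignment) lives in nbtsplitloc.c and _bt_truncate(); this only illustrates the claim that the eventual high key can keep at most one heap TID out of a posting list, so the split-point logic may subtract the rest when sizing it:

#include <stdio.h>

#define TID_SIZE 6				/* sizeof(ItemPointerData) */

/* rough worst-case size of the new left-page high key derived from lastleft */
static int
estimate_highkey_size(int lastleft_size, int lastleft_ntids)
{
	if (lastleft_ntids > 1)
		return lastleft_size - (lastleft_ntids - 1) * TID_SIZE;
	return lastleft_size;
}

int
main(void)
{
	/* a 400-byte posting tuple holding 60 TIDs vs. a plain 40-byte tuple */
	printf("posting tuple high key estimate: %d bytes\n", estimate_highkey_size(400, 60));
	printf("plain tuple high key estimate:   %d bytes\n", estimate_highkey_size(40, 1));
	return 0;
}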
On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan <pg@bowt.ie> wrote: > There seems to be a kind of "synergy" between the nbtsplitloc.c > handling of pages that have lots of duplicates and posting list > compression. It seems as if the former mechanism "sets up the bowling > pins", while the latter mechanism "knocks them down", which is really > cool. We should try to gain a better understanding of how that works, > because it's possible that it could be even more effective in some > cases. I found another important way in which this synergy can fail to take place, which I can fix. By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes from my test suite become much smaller, while most are not affected. These indexes were not helped too much by the patch before. For example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full of duplicates of a single value (that's how it appears after an initial TPC-E bulk load), as are a couple of other TPC-E indexes. TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its idx_lineitem_orderkey index becomes ~15% smaller. I believe that this happened because rightmost page splits were an inefficient case for compression. But rightmost page split heavy indexes with lots of duplicates are not that uncommon. Think of any index with many NULL values, for example. I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm not sure what the idea is behind it. My sense is that we're likely to benefit by delaying page splits, no matter what. Though I am still looking at it purely from a space utilization point of view, at least for now. -- Peter Geoghegan
On Thu, 25 Jul 2019 at 05:49, Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan <pg@bowt.ie> wrote: > > There seems to be a kind of "synergy" between the nbtsplitloc.c > > handling of pages that have lots of duplicates and posting list > > compression. It seems as if the former mechanism "sets up the bowling > > pins", while the latter mechanism "knocks them down", which is really > > cool. We should try to gain a better understanding of how that works, > > because it's possible that it could be even more effective in some > > cases. > > I found another important way in which this synergy can fail to take > place, which I can fix. > > By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes > from my test suite become much smaller, while most are not affected. > These indexes were not helped too much by the patch before. For > example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full > of duplicates of a single value (that's how it appears after an > initial TPC-E bulk load), as are a couple of other TPC-E indexes. > TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its > idx_lineitem_orderkey index becomes ~15% smaller. > > I believe that this happened because rightmost page splits were an > inefficient case for compression. But rightmost page split heavy > indexes with lots of duplicates are not that uncommon. Think of any > index with many NULL values, for example. > > I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm > not sure what the idea is behind it. My sense is that we're likely to > benefit by delaying page splits, no matter what. Though I am still > looking at it purely from a space utilization point of view, at least > for now. > Minor comment fix, pointes-->pointer, plus, are we really doing the half, or is it just splitting into two. /* + * Split posting tuple into two halves. + * + * Left tuple contains all item pointes less than the new one and + * right tuple contains new item pointer and all to the right. + * + * TODO Probably we can come up with more clever algorithm. + */ Some remains of 'he'. +/* + * If tuple is posting, t_tid.ip_blkid contains offset of the posting list. + * Caller is responsible for checking BTreeTupleIsPosting to ensure that + * it will get what he expects + */ Everything reads just fine without 'us'. /* + * This field helps us to find beginning of the remaining tuples from + * postings which follow array of offset numbers. + */ -- Regards, Rafia Sabih
24.07.2019 4:22, Peter Geoghegan wrote: > > Attached is a revised version of your v2 that fixes this issue -- I'll > call this v3. In general, my goal for the revision was to make sure > that all of my old tests from the v12 work passed, and to make sure > that amcheck can detect almost any possible problem. I tested the > amcheck changes by corrupting random state in a test index using > pg_hexedit, then making sure that amcheck actually complained in each > case. > > I also fixed one or two bugs in passing, including the bug that caused > an assertion failure in _bt_truncate(). That was down to a subtle > off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't > make that many changes to your v2. There are probably some things > about the patch that I still don't understand, or things that I have > misunderstood. > Thank you for this review and fixes. > * Changed the custom binary search code within _bt_compare_posting() > to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know > of any reason not to do it that way? It's ok to update it. There was no particular reason, just my habit. > * Added quite a few "FIXME"/"XXX" comments at various points, to > indicate where I have general concerns that need more discussion. + * FIXME: The calls to BTreeGetNthTupleOfPosting() allocate memory, If we only need to check TIDs, we don't need BTreeGetNthTupleOfPosting(), we can use BTreeTupleGetPostingN() instead and iterate over TIDs, not tuples. Fixed in version 4. > * Included my own pageinspect hack to visualize the minimum TIDs in > posting lists. It's broken out into a separate patch file. The code is > very rough, but it might help someone else, so I thought I'd include > it. Cool, I think we should add it to the final patchset, probably, as separate function by analogy with tuple_data_split. > I also have some new concerns about the code in the patch that I will > point out now (though only as something to think about a solution on > -- I am unsure myself): > > * It's a bad sign that compression involves calls to PageAddItem() > that are allowed to fail (we just give up on compression when that > happens). For one thing, all existing calls to PageAddItem() in > Postgres are never expected to fail -- if they do fail we get a "can't > happen" error that suggests corruption. It was a good idea to take > this approach to get the patch to work, and to prove the general idea, > but we now need to fully work out all the details about the use of > space. This includes complicated new questions around how alignment is > supposed to work. The main reason to implement this gentle error handling is the fact that deduplication could cause storage overhead, which leads to running out of space on the page. First of all, it is a legacy of the previous versions where BTreeFormPostingTuple was not able to form non-posting tuple even in case where a number of posting items is 1. Another case that was in my mind is the situation where we have 2 tuples: t_tid | t_info | key + t_tid | t_info | key and compressed result is: t_tid | t_info | key | t_tid | t_tid If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple can be larger. It may happen if keysize <= 4 byte. In this situation original tuples must have been aligned to size 16 bytes each, and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is also safe. I changed DEBUG message to ERROR in v4 and it passes all regression tests. 
I doubt that it covers all corner cases, so I'll try to add more special tests. > Alignment in nbtree is already complicated today -- you're supposed to > MAXALIGN() everything in nbtree, so that the MAXALIGN() within > bufpage.c routines cannot be different to the lp_len/IndexTupleSize() > length (note that heapam can have tuples whose lp_len isn't aligned, > so nbtree could do it differently if it proved useful). Code within > nbtsplitloc.c fully understands the space requirements for the > bufpage.c routines, and is very careful about it. (The bufpage.c > details are supposed to be totally hidden from code like > nbtsplitloc.c, but I guess that that ideal isn't quite possible in > reality. Code comments don't really explain the situation today.) > > I'm not sure what it would look like for this patch to be as precise > about free space as nbtsplitloc.c already is, even though that seems > desirable (I just know that it would mean you would trust > PageAddItem() to work in all cases). The patch is different to what we > already have today in that it tries to add *less than* a single > MAXALIGN() quantum at a time in some places (when a posting list needs > to grow by one item). The devil is in the details. > > * As you know, the current approach to WAL logging is very > inefficient. It's okay for now, but we'll need a fine-grained approach > for the patch to be commitable. I think that this is subtly related to > the last item (i.e. the one about alignment). I have done basic > performance tests using unlogged tables. The patch seems to either > make big INSERT queries run as fast or faster than before when > inserting into unlogged tables, which is a very good start. > > * Since we can now split a posting list in two, we may also have to > reconsider BTMaxItemSize, or some similar mechanism that worries about > extreme cases where it becomes impossible to split because even two > pages are not enough to fit everything. Think of what happens when > there is a tuple with a single large datum, that gets split in two > (the tuple is split, not the page), with each half receiving its own > copy of the datum. I haven't proven to myself that this is broken, but > that may just be because I haven't spent any time on it. OTOH, maybe > you already have it right, in which case it seems like it should be > explained somewhere. Possibly in nbtree.h. This is tricky stuff. Hmm, I can't get the problem. In current implementation each posting tuple is smaller than BTMaxItemSize, so no split can lead to having tuple of larger size. > * I agree with all of your existing TODO items -- most of them seem > very important to me. > > * Do we really need to keep BTreeTupleGetHeapTID(), now that we have > BTreeTupleGetMinTID()? Can't we combine the two macros into one, so > that callers don't need to think about the pivot vs posting list thing > themselves? See the new code added to _bt_mkscankey() by v3, for > example. It now handles both cases/macros at once, in order to keep > its amcheck caller happy. amcheck's verify_nbtree.c received similar > ugly code in v3. No, we don't need them both. I don't mind combining them into one macro. Actually, we never needed BTreeTupleGetMinTID(), since its functionality is covered by BTreeTupleGetHeapTID. On the other hand, in some cases BTreeTupleGetMinTID() looks more readable. 
For example here:

> Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lefttup), BTreeTupleGetMinTID(righttup)) < 0);

> * We should at least experiment with applying compression when
> inserting into unique indexes. Like Alexander, I think that
> compression in unique indexes might work well, given how they must
> work in Postgres.

The main reason why I decided to avoid applying compression to unique indexes is the performance of microvacuum. It is not applied to items inside a posting tuple, and I expect it to be important for unique indexes, which ideally contain only a few live values.

One more thing I want to discuss:

/*
 * We do not expect to meet any DEAD items, since this function is
 * called right after _bt_vacuum_one_page(). If for some reason we
 * found dead item, don't compress it, to allow upcoming microvacuum
 * or vacuum clean it up.
 */
if (ItemIdIsDead(itemId))
    continue;

In the previous review Rafia asked about "some reason". Trying to figure out whether this situation is possible, I changed this line to Assert(!ItemIdIsDead(itemId)) in our test version, and it failed in a performance test. Unfortunately, I was not able to reproduce it. The explanation I see is that the page had DEAD items, but for some reason BTP_HAS_GARBAGE was not set, so _bt_vacuum_one_page() was not called. I find it difficult to understand what could lead to this situation, so we probably need to inspect it more closely to rule out the possibility of a bug.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
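A quick standalone check of the arithmetic above, assuming 8-byte maximum alignment; the sizes mirror the example in the mail (6-byte t_tid, 2-byte t_info, 4-byte key):

#include <stdio.h>

#define MAXALIGN(LEN)	(((LEN) + 7) & ~7)

int
main(void)
{
	int			t_tid = 6,
				t_info = 2,
				key = 4;
	int			plain = MAXALIGN(t_tid + t_info + key);	/* one original tuple: 16 */
	int			posting = MAXALIGN(t_tid + t_info + key + 2 * t_tid);	/* merged: 24 */

	printf("two plain tuples:               %d bytes on page\n", 2 * plain);
	printf("one posting tuple with 2 TIDs:  %d bytes on page\n", posting);
	return 0;
}

So even with the smallest possible key, the merged posting tuple (24 bytes) takes less page space than the two MAXALIGN'd originals (32 bytes), which is the "also safe" conclusion above.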
On Wed, Jul 31, 2019 at 9:23 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > > * Included my own pageinspect hack to visualize the minimum TIDs in > > posting lists. It's broken out into a separate patch file. The code is > > very rough, but it might help someone else, so I thought I'd include > > it. > Cool, I think we should add it to the final patchset, > probably, as separate function by analogy with tuple_data_split. Good idea. Attached is v5, which is based on your v4. The three main differences between this and v4 are: * Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my July 24 e-mail. We can always add something like this back during performance validation of the patch. Right now, having no BT_COMPRESS_THRESHOLD limit definitely improves space utilization for certain important cases, which seems more important than the uncertain/speculative downside. * We now have experimental support for unique indexes. This is broken out into its own patch. * We now handle LP_DEAD items in a special way within _bt_insertonpg_in_posting(). As you pointed out already, we do need to think about LP_DEAD items directly, rather than assuming that they cannot be on the page that _bt_insertonpg_in_posting() must process. More on that later. > If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple > can be > larger. It may happen if keysize <= 4 byte. > In this situation original tuples must have been aligned to size 16 > bytes each, > and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is > also safe. I still need to think about the exact details of alignment within _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I could be wrong. > I changed DEBUG message to ERROR in v4 and it passes all regression tests. > I doubt that it covers all corner cases, so I'll try to add more special > tests. It also passes my tests, FWIW. > Hmm, I can't get the problem. > In current implementation each posting tuple is smaller than BTMaxItemSize, > so no split can lead to having tuple of larger size. That sounds correct, then. > No, we don't need them both. I don't mind combining them into one macro. > Actually, we never needed BTreeTupleGetMinTID(), > since its functionality is covered by BTreeTupleGetHeapTID. I've removed BTreeTupleGetMinTID() in v5. I think it's fine to just have a comment next to BTreeTupleGetHeapTID(), and another comment next to BTreeTupleGetMaxTID(). > The main reason why I decided to avoid applying compression to unique > indexes > is the performance of microvacuum. It is not applied to items inside a > posting > tuple. And I expect it to be important for unique indexes, which ideally > contain only a few live values. I found that the performance of my experimental patch with unique index was significantly worse. It looks like this is a bad idea, as you predicted, though we may still want to do deduplication/compression with NULL values in unique indexes. I did learn a few things from implementing unique index support, though. BTW, there is a subtle bug in how my unique index patch does WAL-logging -- see my comments within index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if replication isn't used. I don't think that we're going to use this experimental patch at all, so I didn't bother fixing the bug. > if (ItemIdIsDead(itemId)) > continue; > > In the previous review Rafia asked about "some reason". 
> Trying to figure out if this situation possible, I changed this line to > Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in a > performance > test. Unfortunately, I was not able to reproduce it. I found it easy enough to see LP_DEAD items within _bt_insertonpg_in_posting() when running pgbench with the extra unique index patch. To give you a simple example of how this can happen, consider the comments about BTP_HAS_GARBAGE within _bt_delitems_vacuum(). That probably isn't the only way it can happen, either. ISTM that we need to be prepared for LP_DEAD items during deduplication, rather than trying to prevent deduplication from ever having to see an LP_DEAD item. v5 makes _bt_insertonpg_in_posting() prepared to overwrite an existing item if it's an LP_DEAD item that falls in the same TID range (that's _bt_compare()-wise "equal" to an existing tuple, which may or may not be a posting list tuple already). I haven't made this code do something like call index_compute_xid_horizon_for_tuples(), even though that's needed for correctness (i.e. this new code is currently broken in the same way that I mentioned unique index support is broken). I also added a nearby FIXME comment to _bt_insertonpg_in_posting() -- I don't think think that the code for splitting a posting list in two is currently crash-safe. How do you feel about officially calling this deduplication, not compression? I think that it's a more accurate name for the technique. -- Peter Geoghegan
Attachment
06.08.2019 4:28, Peter Geoghegan wrote: > Attached is v5, which is based on your v4. The three main differences > between this and v4 are: > > * Removed BT_COMPRESS_THRESHOLD stuff, for the reasons explained in my > July 24 e-mail. We can always add something like this back during > performance validation of the patch. Right now, having no > BT_COMPRESS_THRESHOLD limit definitely improves space utilization for > certain important cases, which seems more important than the > uncertain/speculative downside. Fair enough. I think we can measure performance and make a decision, when patch will stabilize. > * We now have experimental support for unique indexes. This is broken > out into its own patch. > > * We now handle LP_DEAD items in a special way within > _bt_insertonpg_in_posting(). > > As you pointed out already, we do need to think about LP_DEAD items > directly, rather than assuming that they cannot be on the page that > _bt_insertonpg_in_posting() must process. More on that later. > >> If sizeof(t_info) + sizeof(key) < sizeof(t_tid), resulting posting tuple >> can be >> larger. It may happen if keysize <= 4 byte. >> In this situation original tuples must have been aligned to size 16 >> bytes each, >> and resulting tuple is at most 24 bytes (6+2+4+6+6). So this case is >> also safe. > I still need to think about the exact details of alignment within > _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I > could be wrong. Could you explain more about these cases? Now I don't understand the problem. >> The main reason why I decided to avoid applying compression to unique >> indexes >> is the performance of microvacuum. It is not applied to items inside a >> posting >> tuple. And I expect it to be important for unique indexes, which ideally >> contain only a few live values. > I found that the performance of my experimental patch with unique > index was significantly worse. It looks like this is a bad idea, as > you predicted, though we may still want to do > deduplication/compression with NULL values in unique indexes. I did > learn a few things from implementing unique index support, though. > > BTW, there is a subtle bug in how my unique index patch does > WAL-logging -- see my comments within > index_compute_xid_horizon_for_tuples(). The bug shouldn't matter if > replication isn't used. I don't think that we're going to use this > experimental patch at all, so I didn't bother fixing the bug. Thank you for the patch. Still, I'd suggest to leave it as a possible future improvement, so that it doesn't distract us from the original feature. >> if (ItemIdIsDead(itemId)) >> continue; >> >> In the previous review Rafia asked about "some reason". >> Trying to figure out if this situation possible, I changed this line to >> Assert(!ItemIdIsDead(itemId)) in our test version. And it failed in a >> performance >> test. Unfortunately, I was not able to reproduce it. > I found it easy enough to see LP_DEAD items within > _bt_insertonpg_in_posting() when running pgbench with the extra unique > index patch. To give you a simple example of how this can happen, > consider the comments about BTP_HAS_GARBAGE within > _bt_delitems_vacuum(). That probably isn't the only way it can happen, > either. ISTM that we need to be prepared for LP_DEAD items during > deduplication, rather than trying to prevent deduplication from ever > having to see an LP_DEAD item. I added to v6 another related fix for _bt_compress_one_page(). 
Previous code was implicitly deleted DEAD items without calling index_compute_xid_horizon_for_tuples(). New code has a check whether DEAD items on the page exist and remove them if any. Another possible solution is to copy dead items as is from old page to the new one, but I think it's good to remove dead tuples as fast as possible. > v5 makes _bt_insertonpg_in_posting() prepared to overwrite an > existing item if it's an LP_DEAD item that falls in the same TID range > (that's _bt_compare()-wise "equal" to an existing tuple, which may or > may not be a posting list tuple already). I haven't made this code do > something like call index_compute_xid_horizon_for_tuples(), even > though that's needed for correctness (i.e. this new code is currently > broken in the same way that I mentioned unique index support is > broken). Is it possible that DEAD tuple to delete was smaller than itup? > I also added a nearby FIXME comment to > _bt_insertonpg_in_posting() -- I don't think think that the code for > splitting a posting list in two is currently crash-safe. > Good catch. It seems, that I need to rearrange the code. I'll send updated patch this week. > How do you feel about officially calling this deduplication, not > compression? I think that it's a more accurate name for the technique. I agree. Should I rename all related names of functions and variables in the patch? -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
13.08.2019 18:45, Anastasia Lubennikova wrote:
>> I also added a nearby FIXME comment to
>> _bt_insertonpg_in_posting() -- I don't think that the code for
>> splitting a posting list in two is currently crash-safe.
> Good catch. It seems that I need to rearrange the code.
> I'll send updated patch this week.
Attached is v7.
In this version of the patch, I heavily refactored the code for insertion into
a posting tuple. The _bt_split() logic is quite complex, so I omitted a couple of
optimizations; they are mentioned in TODO comments.
Now the algorithm is the following (a small standalone sketch of the decision logic follows the list):
- If _bt_findinsertloc() finds that the new tuple belongs to an existing posting tuple's
TID interval, it sets the 'in_posting_offset' variable and passes it to
_bt_insertonpg().
- If 'in_posting_offset' is valid and origtup is valid,
merge our itup into origtup.
This can result in one tuple, neworigtup, that must replace origtup, or in two tuples,
neworigtup and newrighttup, if the result exceeds BTMaxItemSize.
- If the new tuple(s) fit into the old page, we're lucky:
call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to
atomically replace oldtup with the new tuple(s) and generate an xlog record.
- If a page split is needed, pass both tuples to _bt_split().
_bt_findsplitloc() is now aware of the upcoming replacement of origtup with
neworigtup, so it uses the correct item size where needed.
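Here is the promised standalone sketch of the decision logic. All names and sizes are illustrative stand-ins (the BTMaxItemSize value, page free space, and tuple overhead are just plausible numbers), not the patch's actual code:

#include <stdio.h>

#define TID_SIZE	  6			/* sizeof(ItemPointerData) */
#define MAX_ITEM_SIZE 2700		/* stand-in for BTMaxItemSize() */
#define TUPLE_HEADER  16		/* hypothetical fixed overhead of a second tuple */

static void
insert_into_posting(int origtup_size, int page_free_space)
{
	int			merged = origtup_size + TID_SIZE;	/* origtup plus the new TID */
	int			ntuples = (merged <= MAX_ITEM_SIZE) ? 1 : 2;
	int			total = (ntuples == 1) ? merged : merged + TUPLE_HEADER;

	if (total - origtup_size <= page_free_space)
		printf("origtup=%d free=%d -> replace in place with %d new tuple(s)\n",
			   origtup_size, page_free_space, ntuples);
	else
		printf("origtup=%d free=%d -> page split, hand %d new tuple(s) to _bt_split()\n",
			   origtup_size, page_free_space, ntuples);
}

int
main(void)
{
	insert_into_posting(600, 200);	/* fits: a single neworigtup */
	insert_into_posting(2698, 200);	/* exceeds the max item size: neworigtup + newrighttup */
	insert_into_posting(2698, 8);	/* two tuples, but no room on the page: page split */
	return 0;
}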
It seems that now all replace operations are crash-safe. The new patch passes
all regression tests, so I think it's ready for review again.
In the meantime, I'll run more stress-tests.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Tue, Aug 13, 2019 at 8:45 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > > I still need to think about the exact details of alignment within > > _bt_insertonpg_in_posting(). I'm worried about boundary cases there. I > > could be wrong. > Could you explain more about these cases? > Now I don't understand the problem. Maybe there is no problem. > Thank you for the patch. > Still, I'd suggest to leave it as a possible future improvement, so that > it doesn't > distract us from the original feature. I don't even think that it's useful work for the future. It's just nice to be sure that we could support unique index deduplication if it made sense. Which it doesn't. If I didn't write the patch that implements deduplication for unique indexes, I might still not realize that we need the index_compute_xid_horizon_for_tuples() stuff in certain other places. I'm not serious about it at all, except as a learning exercise/experiment. > I added to v6 another related fix for _bt_compress_one_page(). > Previous code was implicitly deleted DEAD items without > calling index_compute_xid_horizon_for_tuples(). > New code has a check whether DEAD items on the page exist and remove > them if any. > Another possible solution is to copy dead items as is from old page to > the new one, > but I think it's good to remove dead tuples as fast as possible. I think that what you've done in v7 is probably the best way to do it. It's certainly simple, which is appropriate given that we're not really expecting to see LP_DEAD items within _bt_compress_one_page() (we just need to be prepared for them). > > v5 makes _bt_insertonpg_in_posting() prepared to overwrite an > > existing item if it's an LP_DEAD item that falls in the same TID range > > (that's _bt_compare()-wise "equal" to an existing tuple, which may or > > may not be a posting list tuple already). I haven't made this code do > > something like call index_compute_xid_horizon_for_tuples(), even > > though that's needed for correctness (i.e. this new code is currently > > broken in the same way that I mentioned unique index support is > > broken). > Is it possible that DEAD tuple to delete was smaller than itup? I'm not sure what you mean by this. I suppose that it doesn't matter, since we both prefer the alternative that you came up with anyway. > > How do you feel about officially calling this deduplication, not > > compression? I think that it's a more accurate name for the technique. > I agree. > Should I rename all related names of functions and variables in the patch? Please rename them when convenient. -- Peter Geoghegan
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Now the algorithm is the following: > > - If bt_findinsertloc() found out that tuple belongs to existing posting tuple's > TID interval, it sets 'in_posting_offset' variable and passes it to > _bt_insertonpg() > > - If 'in_posting_offset' is valid and origtup is valid, > merge our itup into origtup. > > It can result in one tuple neworigtup, that must replace origtup; or two tuples: > neworigtup and newrighttup, if the result exceeds BTMaxItemSize, That sounds like the right way to do it. > - If two new tuple(s) fit into the old page, we're lucky. > call _bt_delete_and_insert(..., neworigtup, newrighttup, newitemoff) to > atomically replace oldtup with new tuple(s) and generate xlog record. > > - In case page split is needed, pass both tuples to _bt_split(). > _bt_findsplitloc() is now aware of upcoming replacement of origtup with > neworigtup, so it uses correct item size where needed. That makes sense, since _bt_split() is responsible for both splitting the page, and inserting the new item on either the left or right page, as part of the first phase of a page split. In other words, if you're adding something new to _bt_insertonpg(), you probably also need to add something new to _bt_split(). So that's what you did. > It seems that now all replace operations are crash-safe. The new patch passes > all regression tests, so I think it's ready for review again. I'm looking at it now. I'm going to spend a significant amount of time on this tomorrow. I think that we should start to think about efficient WAL-logging now. > In the meantime, I'll run more stress-tests. As you probably realize, wal_consistency_checking is a good thing to use with your tests here. -- Peter Geoghegan
20.08.2019 4:04, Peter Geoghegan wrote:
> On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> It seems that now all replace operations are crash-safe. The new patch passes
>> all regression tests, so I think it's ready for review again.
> I'm looking at it now. I'm going to spend a significant amount of time
> on this tomorrow.
>
> I think that we should start to think about efficient WAL-logging now.

Thank you for the review. The new version, v8, is attached.
Compared to the previous version, this patch includes updated btree_xlog_insert() and btree_xlog_split(), so that WAL records now only contain data about the updated posting tuple and don't require full-page writes.
I haven't updated pg_waldump yet; that is postponed until we agree on the nbtxlog changes.
Also in this patch I renamed all 'compress' keywords to 'deduplicate' and did a minor cleanup of outdated comments.
I'm going to look through the patch once more to update the nbtxlog comments where needed, and to answer your remarks that are still left in the comments.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> I'm going to look through the patch once more to update nbtxlog
> comments, where needed and
> answer to your remarks that are still left in the comments.

Have you been using amcheck's rootdescend verification? I see this problem with v8, with the TPC-H test data:

DEBUG: finished verifying presence of 1500000 tuples from table "customer" with bitset 51.09% set
ERROR: could not find tuple using search from root page in index "idx_customer_nationkey2"

I've been running my standard amcheck query with these databases, which is:

SELECT bt_index_parent_check(index => c.oid, heapallindexed => true, rootdescend => true),
       c.relname,
       c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree'
  AND c.relpersistence != 't'
  AND c.relkind = 'i'
  AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;

There were many large indexes that amcheck didn't detect a problem with. I don't yet understand what the problem is, or why we only see the problem for a small number of indexes. Note that all of these indexes passed verification with v5, so this is some kind of regression.

I also noticed that there were some regressions in the size of indexes -- indexes were not nearly as small as they were in v5 in some cases. The overall picture was a clear regression in how effective deduplication is.

I think that it would save time if you had direct access to my test data, even though it's a bit cumbersome. You'll have to download about 10GB of dumps, which require plenty of disk space when restored:

regression=# \l+
                                              List of databases
    Name    | Owner | Encoding |  Collate   |   Ctype    | Access privileges |  Size   | Tablespace |         Description
------------+-------+----------+------------+------------+-------------------+---------+------------+--------------------------------------------
 land       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 6425 MB | pg_default |
 mgd        | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 61 GB   | pg_default |
 postgres   | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 7753 kB | pg_default | default administrative connection database
 regression | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 886 MB  | pg_default |
 template0  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | unmodifiable empty database
            |       |          |            |            | pg=CTc/pg         |         |            |
 template1  | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/pg            +| 7609 kB | pg_default | default template for new databases
            |       |          |            |            | pg=CTc/pg         |         |            |
 tpcc       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 10 GB   | pg_default |
 tpce       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 26 GB   | pg_default |
 tpch       | pg    | UTF8     | en_US.UTF8 | en_US.UTF8 |                   | 32 GB   | pg_default |
(9 rows)

I have found it very valuable to use this test data when changing nbtsplitloc.c, or anything that could affect where page splits make free space available. If this is too much data to handle conveniently, then you could skip "mgd" and almost have as much test coverage.

There really does seem to be a benefit to using diverse test cases like this, because sometimes regressions only affect a small number of specific indexes for specific reasons. For example, only TPC-H has a small number of indexes that have tuples that are inserted in order, but also have many duplicates. Removing the BT_COMPRESS_THRESHOLD stuff really helped with those indexes.
Want me to send this data and the associated tests script over to you? -- Peter Geoghegan
23.08.2019 7:33, Peter Geoghegan wrote: > On Wed, Aug 21, 2019 at 10:19 AM Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: >> I'm going to look through the patch once more to update nbtxlog >> comments, where needed and >> answer to your remarks that are still left in the comments. > Have you been using amcheck's rootdescend verification? No, I haven't checked it with the latest version yet. > There were many large indexes that amcheck didn't detect a problem > with. I don't yet understand what the problem is, or why we only see > the problem for a small number of indexes. Note that all of these > indexes passed verification with v5, so this is some kind of > regression. > > I also noticed that there were some regressions in the size of indexes > -- indexes were not nearly as small as they were in v5 in some cases. > The overall picture was a clear regression in how effective > deduplication is. Do these indexes have something in common? Maybe some specific workload? Are there any error messages in log? I'd like to specify what caused the problem. There were several major changes between v5 and v8: - dead tuples handling added in v6; - _bt_split changes for posting tuples in v7; - WAL logging of posting tuple changes in v8. I don't think the last one could break regular indexes on master. Do you see the same regression in v6, v7? > I think that it would save time if you had direct access to my test > data, even though it's a bit cumbersome. You'll have to download about > 10GB of dumps, which require plenty of disk space when restored: > > > Want me to send this data and the associated tests script over to you? > Yes, I think it will help me to debug the patch faster. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Now the algorithm is the following: > - In case page split is needed, pass both tuples to _bt_split(). > _bt_findsplitloc() is now aware of upcoming replacement of origtup with > neworigtup, so it uses correct item size where needed. > > It seems that now all replace operations are crash-safe. The new patch passes > all regression tests, so I think it's ready for review again. I think that the way this works within nbtsplitloc.c is too complicated. In v5, the only thing that nbtsplitloc.c knew about deduplication was that it could be sure that suffix truncation would at least make a posting list into a single heap TID in the worst case. This consideration was mostly about suffix truncation, not deduplication, which seemed like a good thing to me. _bt_split() and _bt_findsplitloc() should know as little as possible about posting lists. Obviously it will sometimes be necessary to deal with the case where a posting list is about to become too big (i.e. it's about to go over BTMaxItemSize()), and so must be split. Less often, a page split will be needed because of one of these posting list splits. These are two complicated areas (posting list splits and page splits), and it would be a good idea to find a way to separate them as much as possible. Remember, nbtsplitloc.c works by pretending that the new item that cannot fit on the page is already on its own imaginary version of the page that *can* fit the new item, along with everything else from the original/actual page. That gets *way* too complicated when it has to deal with the fact that the new item is being merged with an existing item. Perhaps nbtsplitloc.c could also "pretend" that the new item is always a plain tuple, without knowing anything about posting lists. Almost like how it worked in v5. We always want posting lists to be as close to the BTMaxItemSize() size as possible, because that helps with space utilization. In v5 of the patch, this was what happened, because, in effect, we didn't try to do anything complicated with the new item. This worked well, apart from the crash safety issue. Maybe we can simulate the v5 approach, giving us the best of all worlds (good space utilization, simplicity, and crash safety). Something like this: * Posting list splits should always result in one posting list that is at or just under BTMaxItemSize() in size, plus one plain tuple to its immediate right on the page. This is similar to the more common case where we cannot add additional tuples to a posting list due to the BTMaxItemSize() restriction, and so end up with a single tuple (or a smaller posting list with the same value) to the right of a BTMaxItemSize()-sized posting list tuple. I don't see a reason to split a posting list in the middle -- we should always split to the right, leaving the posting list as large as possible. * When there is a simple posting list split, with no page split, the logic required is fairly straightforward: We rewrite the posting list in-place so that our new item goes wherever it belongs in the existing posting list on the page (we memmove() the posting list to make space for the new TID, basically). The old last/rightmost TID in the original posting list becomes a new, plain tuple. We may need a new WAL record for this, but it's not that different to a regular leaf page insert. 
* When this happens to result in a page split, we then have a "fake" new item -- the right half of the posting list that we split, which is always a plain item. Obviously we need to be a bit careful with the WAL logging, but the space accounting within _bt_split() and _bt_findsplitloc() can work just the same as now. nbtsplitloc.c can work like it did in v5, when the only thing it knew about posting lists was that _bt_truncate() always removes them, maybe leaving a single TID behind in the new high key. (Note also that it's not okay to remove the conservative assumption about at least having space for one heap TID within _bt_recsplitloc() -- that needs to be restored to its v5 state in the next version of the patch.) Because deduplication is lazy, there is little value in doing deduplication of the new item (which may or may not be the fake new item). The nbtsplitloc.c logic will "trap" duplicates on the same page today, so we can just let deduplication of the new item happen at a later time. _bt_split() can almost pretend that posting lists don't exist, and nbtsplitloc.c needs to know nothing about posting lists (apart from the way that _bt_truncate() behaves with posting lists). We "lie" to _bt_findsplitloc(), and tell it that the new item is our fake new item -- it doesn't do anything that will be broken by that lie, because it doesn't care about the actual content of posting lists. And, we can fix the "fake new item is not actually real new item" issue at one point within _bt_split(), just as we're about to WAL log. What do you think of that approach? -- Peter Geoghegan
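A minimal standalone illustration of the "split to the right" idea described above, using plain integers as stand-ins for heap TIDs (not the patch's code): the incoming TID is memmove()'d into place, and the displaced rightmost TID becomes the plain item to the immediate right of the posting list.

#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Insert newtid into tids[0..ntids-1] (sorted, already at capacity) and
 * return the displaced rightmost TID, which becomes a plain tuple placed
 * to the immediate right of the posting list.
 */
static int
posting_list_split_right(int *tids, int ntids, int newtid)
{
	int			displaced = tids[ntids - 1];
	int			pos = ntids - 1;

	assert(newtid < displaced);	/* otherwise newtid itself is the plain item */
	while (pos > 0 && tids[pos - 1] > newtid)
		pos--;
	memmove(&tids[pos + 1], &tids[pos], (ntids - 1 - pos) * sizeof(int));
	tids[pos] = newtid;
	return displaced;
}

int
main(void)
{
	int			tids[5] = {10, 20, 30, 40, 50};	/* posting list at capacity */
	int			plain = posting_list_split_right(tids, 5, 25);

	for (int i = 0; i < 5; i++)
		printf("%d ", tids[i]);		/* 10 20 25 30 40 */
	printf("\nplain item to the right: %d\n", plain);	/* 50 */
	return 0;
}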
28.08.2019 6:19, Peter Geoghegan wrote: > On Fri, Aug 16, 2019 at 8:56 AM Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: >> Now the algorithm is the following: >> - In case page split is needed, pass both tuples to _bt_split(). >> _bt_findsplitloc() is now aware of upcoming replacement of origtup with >> neworigtup, so it uses correct item size where needed. >> >> It seems that now all replace operations are crash-safe. The new patch passes >> all regression tests, so I think it's ready for review again. > I think that the way this works within nbtsplitloc.c is too > complicated. In v5, the only thing that nbtsplitloc.c knew about > deduplication was that it could be sure that suffix truncation would > at least make a posting list into a single heap TID in the worst case. > This consideration was mostly about suffix truncation, not > deduplication, which seemed like a good thing to me. _bt_split() and > _bt_findsplitloc() should know as little as possible about posting > lists. > > Obviously it will sometimes be necessary to deal with the case where a > posting list is about to become too big (i.e. it's about to go over > BTMaxItemSize()), and so must be split. Less often, a page split will > be needed because of one of these posting list splits. These are two > complicated areas (posting list splits and page splits), and it would > be a good idea to find a way to separate them as much as possible. > Remember, nbtsplitloc.c works by pretending that the new item that > cannot fit on the page is already on its own imaginary version of the > page that *can* fit the new item, along with everything else from the > original/actual page. That gets *way* too complicated when it has to > deal with the fact that the new item is being merged with an existing > item. Perhaps nbtsplitloc.c could also "pretend" that the new item is > always a plain tuple, without knowing anything about posting lists. > Almost like how it worked in v5. > > We always want posting lists to be as close to the BTMaxItemSize() > size as possible, because that helps with space utilization. In v5 of > the patch, this was what happened, because, in effect, we didn't try > to do anything complicated with the new item. This worked well, apart > from the crash safety issue. Maybe we can simulate the v5 approach, > giving us the best of all worlds (good space utilization, simplicity, > and crash safety). Something like this: > > * Posting list splits should always result in one posting list that is > at or just under BTMaxItemSize() in size, plus one plain tuple to its > immediate right on the page. This is similar to the more common case > where we cannot add additional tuples to a posting list due to the > BTMaxItemSize() restriction, and so end up with a single tuple (or a > smaller posting list with the same value) to the right of a > BTMaxItemSize()-sized posting list tuple. I don't see a reason to > split a posting list in the middle -- we should always split to the > right, leaving the posting list as large as possible. > > * When there is a simple posting list split, with no page split, the > logic required is fairly straightforward: We rewrite the posting list > in-place so that our new item goes wherever it belongs in the existing > posting list on the page (we memmove() the posting list to make space > for the new TID, basically). The old last/rightmost TID in the > original posting list becomes a new, plain tuple. 
We may need a new > WAL record for this, but it's not that different to a regular leaf > page insert. > > * When this happens to result in a page split, we then have a "fake" > new item -- the right half of the posting list that we split, which is > always a plain item. Obviously we need to be a bit careful with the > WAL logging, but the space accounting within _bt_split() and > _bt_findsplitloc() can work just the same as now. nbtsplitloc.c can > work like it did in v5, when the only thing it knew about posting > lists was that _bt_truncate() always removes them, maybe leaving a > single TID behind in the new high key. (Note also that it's not okay > to remove the conservative assumption about at least having space for > one heap TID within _bt_recsplitloc() -- that needs to be restored to > its v5 state in the next version of the patch.) > > Because deduplication is lazy, there is little value in doing > deduplication of the new item (which may or may not be the fake new > item). The nbtsplitloc.c logic will "trap" duplicates on the same page > today, so we can just let deduplication of the new item happen at a > later time. _bt_split() can almost pretend that posting lists don't > exist, and nbtsplitloc.c needs to know nothing about posting lists > (apart from the way that _bt_truncate() behaves with posting lists). > We "lie" to _bt_findsplitloc(), and tell it that the new item is our > fake new item -- it doesn't do anything that will be broken by that > lie, because it doesn't care about the actual content of posting > lists. And, we can fix the "fake new item is not actually real new > item" issue at one point within _bt_split(), just as we're about to > WAL log. > > What do you think of that approach? I think it's a good idea. Thank you for such a detailed description of various cases. I already started to simplify this code, while debugging amcheck error in v8. At first, I rewrote it to split posting tuple into a posting and a regular tuple instead of two posting tuples. Your explanation helped me to understand that this approach can be extended to the case of insertion into posting list, that doesn't trigger posting split, and that nbtsplitloc indeed doesn't need to know about posting tuples specific. The code is much cleaner now. The new version is attached. It passes regression tests. I also run land and tpch test. They pass amcheck rootdescend and if I interpreted results correctly, the new version shows slightly better compression. \l+ tpch | anastasia | UTF8 | ru_RU.UTF-8 | ru_RU.UTF-8 | | 31 GB | pg_default | land | anastasia | UTF8 | ru_RU.UTF-8 | ru_RU.UTF-8 | | 6380 MB | pg_default | Some individual indexes are larger, some are smaller compared to the expected output. This patch is based on v6, so it again contains "compression" instead of "deduplication" in variable names and comments. I will rename them when code becomes more stable. -- Anastasia Lubennikova Postgres Professional:http://www.postgrespro.com The Russian Postgres Company
Attachment
On Thu, Aug 29, 2019 at 5:13 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Your explanation helped me to understand that this approach can be > extended to > the case of insertion into posting list, that doesn't trigger posting > split, > and that nbtsplitloc indeed doesn't need to know about posting tuples > specific. > The code is much cleaner now. Fantastic! > Some individual indexes are larger, some are smaller compared to the > expected output. I agree that v9 might be ever so slightly more space efficient than v5 was, on balance. In any case v9 completely fixes the regression that I saw in the last version. I have pushed the changes to the test output for the serial tests that I privately maintain, that I gave you access to. The MGD test output also looks perfect. We may find that deduplication is a little too effective, in the sense that it packs so many tuples on to leaf pages that *concurrent* inserters will tend to get excessive page splits. We may find that it makes sense to aim for posting lists that are maybe 96% of BTMaxItemSize() -- note that BTREE_SINGLEVAL_FILLFACTOR is 96 for this reason. Concurrent inserters will tend to have heap TIDs that are slightly out of order, so we want to at least have enough space remaining on the left half of a "single value mode" split. We may end up with a design where deduplication anticipates what will be useful for nbtsplitloc.c. I still think that it's too early to start worrying about problems like this one -- I feel it will be useful to continue to focus on the code and the space utilization of the serial test cases for now. We can look at it at the same time that we think about adding back something like BT_COMPRESS_THRESHOLD. I am mentioning it now because it's probably a good time for you to start thinking about it, if you haven't already (actually, maybe I'm just describing what BT_COMPRESS_THRESHOLD was supposed to do in the first place). We'll need to have a good benchmark to assess these questions, and it's not obvious what that will be. Two possible candidates are TPC-H and TPC-E. (Of course, I mean running them for real -- not using their indexes to make sure that the nbtsplitloc.c stuff works well in isolation.) Any thoughts on a conventional benchmark that allows us to understand the patch's impact on both throughput and latency? BTW, I notice that we often have indexes that are quite a lot smaller when they were created with retail insertions rather than with CREATE INDEX/REINDEX. This is not new, but the difference is much larger than it typically is without the patch. For example, the TPC-E index on trade.t_ca_id (which is named "i_t_ca_id" or "i_t_ca_id2" in my test) is 162 MB with CREATE INDEX/REINDEX, and 121 MB with retail insertions (assuming the insertions use the actual order from the test). I'm not sure what to do about this, if anything. I mean, the reason that the retail insertions do better is that they have the nbtsplitloc.c stuff, and because we don't split the page until it's 100% full and until deduplication stops helping -- we could apply several rounds of deduplication before we actually have to split the cage. So the difference that we see here is both logical and surprising. How do you feel about this CREATE INDEX index-size-is-larger business? -- Peter Geoghegan
On Thu, Aug 29, 2019 at 5:07 PM Peter Geoghegan <pg@bowt.ie> wrote: > I agree that v9 might be ever so slightly more space efficient than v5 > was, on balance. I see some Valgrind errors on v9, all of which look like the following two sample errors I go into below. First one: ==11193== VALGRINDERROR-BEGIN ==11193== Unaddressable byte(s) found during client check request ==11193== at 0x4C0E03: PageAddItemExtended (bufpage.c:332) ==11193== by 0x20F6C3: _bt_split (nbtinsert.c:1643) ==11193== by 0x20F6C3: _bt_insertonpg (nbtinsert.c:1206) ==11193== by 0x21239B: _bt_doinsert (nbtinsert.c:306) ==11193== by 0x2150EE: btinsert (nbtree.c:207) ==11193== by 0x20D63A: index_insert (indexam.c:186) ==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393) ==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593) ==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219) ==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445) ==11193== by 0x36C738: ExecProcNode (executor.h:240) ==11193== by 0x36C738: ExecutePlan (execMain.c:1648) ==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365) ==11193== by 0x36C7DD: ExecutorRun (execMain.c:309) ==11193== by 0x4CC41A: ProcessQuery (pquery.c:161) ==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283) ==11193== by 0x4CD31C: PortalRun (pquery.c:796) ==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231) ==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256) ==11193== by 0x453650: BackendRun (postmaster.c:4446) ==11193== by 0x453650: BackendStartup (postmaster.c:4137) ==11193== by 0x453650: ServerLoop (postmaster.c:1704) ==11193== by 0x454CAC: PostmasterMain (postmaster.c:1377) ==11193== by 0x3B85A1: main (main.c:210) ==11193== Address 0x9c11350 is 0 bytes after a recently re-allocated block of size 8,192 alloc'd ==11193== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==11193== by 0x61085A: AllocSetAlloc (aset.c:914) ==11193== by 0x617AD8: palloc (mcxt.c:938) ==11193== by 0x21A829: _bt_mkscankey (nbtutils.c:107) ==11193== by 0x2118F3: _bt_doinsert (nbtinsert.c:93) ==11193== by 0x2150EE: btinsert (nbtree.c:207) ==11193== by 0x20D63A: index_insert (indexam.c:186) ==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393) ==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593) ==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219) ==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445) ==11193== by 0x36C738: ExecProcNode (executor.h:240) ==11193== by 0x36C738: ExecutePlan (execMain.c:1648) ==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365) ==11193== by 0x36C7DD: ExecutorRun (execMain.c:309) ==11193== by 0x4CC41A: ProcessQuery (pquery.c:161) ==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283) ==11193== by 0x4CD31C: PortalRun (pquery.c:796) ==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231) ==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256) ==11193== by 0x453650: BackendRun (postmaster.c:4446) ==11193== by 0x453650: BackendStartup (postmaster.c:4137) ==11193== by 0x453650: ServerLoop (postmaster.c:1704) ==11193== by 0x454CAC: PostmasterMain (postmaster.c:1377) ==11193== ==11193== VALGRINDERROR-END { <insert_a_suppression_name_here> Memcheck:User fun:PageAddItemExtended fun:_bt_split fun:_bt_insertonpg fun:_bt_doinsert fun:btinsert fun:index_insert fun:ExecInsertIndexTuples fun:ExecInsert fun:ExecModifyTable fun:ExecProcNodeFirst fun:ExecProcNode fun:ExecutePlan fun:standard_ExecutorRun fun:ExecutorRun fun:ProcessQuery fun:PortalRunMulti fun:PortalRun 
fun:exec_simple_query fun:PostgresMain fun:BackendRun fun:BackendStartup fun:ServerLoop fun:PostmasterMain fun:main } nbtinsert.c:1643 is the first PageAddItem() in _bt_split() -- the lefthikey call. Second one: ==11193== VALGRINDERROR-BEGIN ==11193== Invalid read of size 2 ==11193== at 0x20FDF5: _bt_insertonpg (nbtinsert.c:1126) ==11193== by 0x21239B: _bt_doinsert (nbtinsert.c:306) ==11193== by 0x2150EE: btinsert (nbtree.c:207) ==11193== by 0x20D63A: index_insert (indexam.c:186) ==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393) ==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593) ==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219) ==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445) ==11193== by 0x36C738: ExecProcNode (executor.h:240) ==11193== by 0x36C738: ExecutePlan (execMain.c:1648) ==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365) ==11193== by 0x36C7DD: ExecutorRun (execMain.c:309) ==11193== by 0x4CC41A: ProcessQuery (pquery.c:161) ==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283) ==11193== by 0x4CD31C: PortalRun (pquery.c:796) ==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231) ==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256) ==11193== by 0x453650: BackendRun (postmaster.c:4446) ==11193== by 0x453650: BackendStartup (postmaster.c:4137) ==11193== by 0x453650: ServerLoop (postmaster.c:1704) ==11193== by 0x454CAC: PostmasterMain (postmaster.c:1377) ==11193== by 0x3B85A1: main (main.c:210) ==11193== Address 0x9905b90 is 11,088 bytes inside a recently re-allocated block of size 524,288 alloc'd ==11193== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==11193== by 0x61085A: AllocSetAlloc (aset.c:914) ==11193== by 0x617AD8: palloc (mcxt.c:938) ==11193== by 0x1C5677: CopyIndexTuple (indextuple.c:508) ==11193== by 0x20E887: _bt_compress_one_page (nbtinsert.c:2751) ==11193== by 0x21241E: _bt_findinsertloc (nbtinsert.c:773) ==11193== by 0x21241E: _bt_doinsert (nbtinsert.c:303) ==11193== by 0x2150EE: btinsert (nbtree.c:207) ==11193== by 0x20D63A: index_insert (indexam.c:186) ==11193== by 0x36B7F2: ExecInsertIndexTuples (execIndexing.c:393) ==11193== by 0x391793: ExecInsert (nodeModifyTable.c:593) ==11193== by 0x3924DC: ExecModifyTable (nodeModifyTable.c:2219) ==11193== by 0x37306D: ExecProcNodeFirst (execProcnode.c:445) ==11193== by 0x36C738: ExecProcNode (executor.h:240) ==11193== by 0x36C738: ExecutePlan (execMain.c:1648) ==11193== by 0x36C738: standard_ExecutorRun (execMain.c:365) ==11193== by 0x36C7DD: ExecutorRun (execMain.c:309) ==11193== by 0x4CC41A: ProcessQuery (pquery.c:161) ==11193== by 0x4CC5EB: PortalRunMulti (pquery.c:1283) ==11193== by 0x4CD31C: PortalRun (pquery.c:796) ==11193== by 0x4C8EFC: exec_simple_query (postgres.c:1231) ==11193== by 0x4C9EE0: PostgresMain (postgres.c:4256) ==11193== by 0x453650: BackendRun (postmaster.c:4446) ==11193== by 0x453650: BackendStartup (postmaster.c:4137) ==11193== by 0x453650: ServerLoop (postmaster.c:1704) ==11193== ==11193== VALGRINDERROR-END { <insert_a_suppression_name_here> Memcheck:Addr2 fun:_bt_insertonpg fun:_bt_doinsert fun:btinsert fun:index_insert fun:ExecInsertIndexTuples fun:ExecInsert fun:ExecModifyTable fun:ExecProcNodeFirst fun:ExecProcNode fun:ExecutePlan fun:standard_ExecutorRun fun:ExecutorRun fun:ProcessQuery fun:PortalRunMulti fun:PortalRun fun:exec_simple_query fun:PostgresMain fun:BackendRun fun:BackendStartup fun:ServerLoop fun:PostmasterMain fun:main } nbtinsert.c:1126 is this code from _bt_insertonpg(): elog(DEBUG4, 
"dest before (%u,%u)", ItemPointerGetBlockNumberNoCheck((ItemPointer) dest), ItemPointerGetOffsetNumberNoCheck((ItemPointer) dest)); This is probably harmless, but it needs to be fixed. -- Peter Geoghegan
On Thu, Aug 29, 2019 at 10:10 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I see some Valgrind errors on v9, all of which look like the following two sample errors I go into below.

I've found a fix for these Valgrind issues. It's a matter of making sure that _bt_truncate() sizes new pivot tuples properly, which is quite subtle:

--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2155,8 +2155,11 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
     {
         BTreeTupleClearBtIsPosting(pivot);
         BTreeTupleSetNAtts(pivot, keepnatts);
-        pivot->t_info &= ~INDEX_SIZE_MASK;
-        pivot->t_info |= BTreeTupleGetPostingOffset(firstright);
+        if (keepnatts == natts)
+        {
+            pivot->t_info &= ~INDEX_SIZE_MASK;
+            pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+        }
     }

I'm varying how the new pivot tuple is sized here according to whether or not index_truncate_tuple() just does a CopyIndexTuple(). This very slightly changes the behavior of the nbtsplitloc.c stuff, but that's not a concern for me.

I will post a patch with this and other tweaks next week.

-- Peter Geoghegan
On Sat, Aug 31, 2019 at 1:04 AM Peter Geoghegan <pg@bowt.ie> wrote: > I've found a fix for these Valgrind issues. Attach is v10, which fixes the Valgrind issue. Other changes: * The code now fully embraces the idea that posting list splits involve "changing the incoming item" in a way that "avoids" having the new/incoming item overlap with an existing posting list tuple. This allowed me to cut down on the changes required within nbtinsert.c considerably. * Streamlined a lot of the code in nbtsearch.c. I was able to significantly simplify _bt_compare() and _bt_binsrch_insert(). * Removed the DEBUG4 traces. A lot of these had to go when I refactored nbtsearch.c code, so I thought I might as well removed the remaining ones. I hope that you don't mind (go ahead and add them back where that makes sense). * A backwards scan will return "logical tuples" in descending order now. We should do this on general principle, and also because of the possibility of future external code that expects and takes advantage of consistent heap TID order. This change might even have a small performance benefit today, though: Index scans that visit multiple heap pages but only match on a single key will only pin each heap page visited once. Visiting the heap pages in descending order within a B-Tree page full of duplicates, but ascending order within individual posting lists could result in unnecessary extra pinning. * Standardized terminology. We consistently call what the patch adds "deduplication" rather than "compression". * Added a new section on the design to the nbtree README. This is fairly high level, and talks about dynamics that we can't really talk about anywhere else, such as how nbtsplitloc.c "cooperates" with deduplication, producing an effect that is greater than the sum of its parts. * I also made some changes to the WAL logging for leaf page insertions and page splits. I didn't add the optimization that you anticipated in your nbtxlog.h comments (i.e. only WAL-log a rewritten posting list when it will go on the left half of the split, just like the new/incoming item thing we have already). I agree that that's a good idea, and should be added soon. Actually, I think the whole "new item vs. rewritten posting list item" thing makes the WAL logging confusing, so this is not really about performance. Maybe the easiest way to do this is also the way that performs best. I'm thinking of this: maybe we could completely avoid WAL-logging the entire rewritten/split posting list. After all, the contents of the rewritten posting list are derived from the existing/original posting list, as well as the new/incoming item. We can make the WAL record much smaller on average by making standbys repeat a little bit of the work performed on the primary. Maybe we could WAL-log "in_posting_offset" itself, and an ItemPointerData (obviously the new item offset number tells us the offset number of the posting list that must be replaced/memmoved()'d). Then have the standby repeat some of the work performed on the primary -- at least the work of swapping a heap TID could be repeated on standbys, since it's very little extra work for standbys, but could really reduce the WAL volume. This might actually be simpler. The WAL logging that I didn't touch in v10 is the most important thing to improve. I am talking about the WAL-logging that is performed as part of deduplicating all items on a page, to avoid a page split (i.e. the WAL-logging within _bt_dedup_one_page()). 
That still just does a log_newpage_buffer() in v10, which is pretty inefficient. Much like the posting list split WAL logging stuff, WAL logging in _bt_dedup_one_page() can probably be made more efficient by describing deduplication in terms of logical changes. For example, the WAL records should consist of metadata that could be read by a human as "merge the tuples from offset number 15 until offset number 27". Perhaps this could also share code with the posting list split stuff. What do you think? Once we make the WAL-logging within _bt_dedup_one_page() more efficient, that also makes it fairly easy to make the deduplication that it performs occur incrementally, maybe even very incrementally. I can imagine the _bt_dedup_one_page() caller specifying "my new tuple is 32 bytes, and I'd really like to not have to split the page, so please at least do enough deduplication to make it fit". Delaying deduplication increases the amount of time that we have to set the LP_DEAD bit for remaining items on the page, which might be important. Also, spreading out the volume of WAL produced by deduplication over time might be important with certain workloads. We would still probably do somewhat more work than strictly necessary to avoid a page split if we were to make _bt_dedup_one_page() incremental like this, though not by a huge amount. OTOH, maybe I am completely wrong about "incremental deduplication" being a good idea. It seems worth experimenting with, though. It's not that much more work on top of making the _bt_dedup_one_page() WAL-logging efficient, which seems like the thing we should focus on now. Thoughts? -- Peter Geoghegan
Attachment
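The "merge the tuples from offset number 15 until offset number 27" style of record can be modelled as a list of (offset, count) intervals, as in the standalone sketch below. The struct and function names are invented for illustration and are not the patch's actual WAL record layout; the point is that redo can rebuild each posting list from the items already on the page, so no tuple images need to travel in the record.

/*
 * Toy model of interval-based deduplication WAL logging.  A "page" is
 * just an array of keys with a TID count per item; "redo" collapses
 * each interval into a single posting item, guided only by
 * (start offset, number of items) pairs.
 */
#include <stdio.h>

typedef struct DedupInterval
{
    int         baseoff;        /* offset of the first item in the group */
    int         nitems;         /* number of items merged into one */
} DedupInterval;

typedef struct ToyPage
{
    int         keys[32];
    int         ntids[32];      /* heap TIDs represented by each item */
    int         nitems;
} ToyPage;

static void
dedup_redo(ToyPage *page, const DedupInterval *ivals, int nivals)
{
    ToyPage     out = {.nitems = 0};
    int         i = 0;

    for (int off = 0; off < page->nitems;)
    {
        if (i < nivals && off == ivals[i].baseoff)
        {
            int         merged = 0;

            for (int j = 0; j < ivals[i].nitems; j++)
                merged += page->ntids[off + j];
            out.keys[out.nitems] = page->keys[off];
            out.ntids[out.nitems++] = merged;
            off += ivals[i++].nitems;
        }
        else
        {
            out.keys[out.nitems] = page->keys[off];
            out.ntids[out.nitems++] = page->ntids[off];
            off++;
        }
    }
    *page = out;
}

int
main(void)
{
    ToyPage     page = {
        .keys = {1, 1, 1, 2, 3, 3},
        .ntids = {1, 1, 1, 1, 1, 1},
        .nitems = 6
    };
    /* the record says: merge 3 items at offset 0, and 2 items at offset 4 */
    DedupInterval ivals[] = {{0, 3}, {4, 2}};

    dedup_redo(&page, ivals, 2);
    for (int k = 0; k < page.nitems; k++)
        printf("key=%d ntids=%d\n", page.keys[k], page.ntids[k]);
    return 0;
}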
On Mon, Sep 2, 2019 at 6:53 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attach is v10, which fixes the Valgrind issue. Attached is v11, which makes the kill_prior_tuple optimization work with posting list tuples. The only catch is that it can only work when all "logical tuples" within a posting list are known-dead, since of course there is only one LP_DEAD bit available for each posting list. The hardest part of this kill_prior_tuple work was writing the new _bt_killitems() code, which I'm still not 100% happy with. Still, it seems to work well -- new pageinspect LP_DEAD status info was added to the second patch to verify that we're setting LP_DEAD bits as needed for posting list tuples. I also had to add a new nbtree-specific, posting-list-aware version of index_compute_xid_horizon_for_tuples() -- _bt_compute_xid_horizon_for_tuples(). Finally, it was necessary to avoid splitting a posting list with the LP_DEAD bit set. I took a naive approach to avoiding that problem, adding code to _bt_findinsertloc() to prevent it. Posting list splits are generally assumed to be rare, so the fact that this is slightly inefficient should be fine IMV. I also refactored deduplication itself in anticipation of making the WAL logging more efficient, and incremental. So, the structure of the code within _bt_dedup_one_page() was simplified, without really changing it very much (I think). I also fixed a bug in _bt_dedup_one_page(). The check for dead items was broken in previous versions, because the loop examined the high key tuple in every iteration. Making _bt_dedup_one_page() more efficient and incremental is still the most important open item for the patch. -- Peter Geoghegan
Attachment
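The LP_DEAD constraint described above boils down to a simple rule: a posting list's single LP_DEAD bit may be set only if every heap TID it contains is known to be dead. A minimal standalone model of that check follows (invented names and a linear lookup for brevity -- not the patch's _bt_killitems() code, which also has to worry about things like the kill array overflowing after a scan direction change).

/*
 * Model of the kill_prior_tuple rule for posting lists: the single
 * LP_DEAD bit may only be set if *every* heap TID in the posting list
 * is among the TIDs the scan saw as dead.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct TID
{
    unsigned int block;
    unsigned short offset;
} TID;

static bool
tid_eq(TID a, TID b)
{
    return a.block == b.block && a.offset == b.offset;
}

static bool
tid_is_killed(TID t, const TID *killed, int nkilled)
{
    for (int i = 0; i < nkilled; i++)
        if (tid_eq(t, killed[i]))
            return true;
    return false;
}

/* may we set the posting list's one-and-only LP_DEAD bit? */
static bool
can_mark_posting_dead(const TID *posting, int ntids,
                      const TID *killed, int nkilled)
{
    for (int i = 0; i < ntids; i++)
        if (!tid_is_killed(posting[i], killed, nkilled))
            return false;       /* at least one TID may still be visible */
    return true;
}

int
main(void)
{
    TID         posting[] = {{3, 1}, {3, 2}, {7, 5}};
    TID         partial[] = {{3, 1}, {7, 5}};
    TID         full[] = {{3, 1}, {3, 2}, {7, 5}};

    printf("partial kill -> LP_DEAD allowed? %d\n",
           can_mark_posting_dead(posting, 3, partial, 2));
    printf("full kill    -> LP_DEAD allowed? %d\n",
           can_mark_posting_dead(posting, 3, full, 3));
    return 0;
}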
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From
Anastasia Lubennikova
Date:
09.09.2019 22:54, Peter Geoghegan wrote:
> Attached is v11, which makes the kill_prior_tuple optimization work with posting list tuples. The only catch is that it can only work when all "logical tuples" within a posting list are known-dead, since of course there is only one LP_DEAD bit available for each posting list.
>
> The hardest part of this kill_prior_tuple work was writing the new _bt_killitems() code, which I'm still not 100% happy with. Still, it seems to work well -- new pageinspect LP_DEAD status info was added to the second patch to verify that we're setting LP_DEAD bits as needed for posting list tuples. I also had to add a new nbtree-specific, posting-list-aware version of index_compute_xid_horizon_for_tuples() -- _bt_compute_xid_horizon_for_tuples(). Finally, it was necessary to avoid splitting a posting list with the LP_DEAD bit set. I took a naive approach to avoiding that problem, adding code to _bt_findinsertloc() to prevent it. Posting list splits are generally assumed to be rare, so the fact that this is slightly inefficient should be fine IMV.
>
> I also refactored deduplication itself in anticipation of making the WAL logging more efficient, and incremental. So, the structure of the code within _bt_dedup_one_page() was simplified, without really changing it very much (I think). I also fixed a bug in _bt_dedup_one_page(). The check for dead items was broken in previous versions, because the loop examined the high key tuple in every iteration.
>
> Making _bt_dedup_one_page() more efficient and incremental is still the most important open item for the patch.

Hi, thank you for the fixes and improvements. I reviewed them and everything looks good except the idea of not splitting dead posting tuples. According to the comments about scan->ignore_killed_tuples in genam.c:107, it may lead to incorrect tuple order on a replica. I'm not sure whether it leads to any real problem, though, or whether it will be resolved by subsequent visibility checks. Anyway, it's worth adding more comments in _bt_killitems() explaining why it's safe.

Attached is v12, which contains WAL optimizations for the posting split and page deduplication. Changes relative to the prior version:

* The xl_btree_split record doesn't contain the posting tuple anymore; instead it keeps the 'in_posting offset' and repeats the logic of _bt_insertonpg(), as you proposed upthread.

* I introduced a new xlog record, XLOG_BTREE_DEDUP_PAGE, which contains info about the groups of tuples deduplicated into posting tuples. In principle, it is possible to fit it into some existing record, but I preferred to keep things clear. I haven't measured how these changes affect WAL size yet. Do you have any suggestions on how to automate testing of new WAL records? Is there any suitable place in the regression tests?

* I also noticed that _bt_dedup_one_page() can be optimized to return early when no tuples were deduplicated.

I wonder if we can introduce internal statistics to tune deduplication? That is returning to the idea of BT_COMPRESS_THRESHOLD, which can help to avoid extra work for pages that have very few duplicates or pages that are already full of posting lists. To be honest, I don't believe that incremental deduplication can really improve anything, because no matter how many items are compressed we still rewrite all items from the original page to the new one, so why not do our best? What do we save with this incremental approach?

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Wed, Sep 11, 2019 at 5:38 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > I reviewed them and everything looks good except the idea of not > splitting dead posting tuples. > According to comments to scan->ignore_killed_tuples in genam.c:107, > it may lead to incorrect tuple order on a replica. > I don't sure, if it leads to any real problem, though, or it will be > resolved > by subsequent visibility checks. Fair enough, but I didn't do that because it's compelling on its own -- it isn't. I did it because it seemed like the best way to handle posting list splits in a version of the patch where LP_DEAD bits can be set on posting list tuples. I think that we have 3 high level options here: 1. We don't support kill_prior_tuple/LP_DEAD bit setting with posting lists at all. This is clearly the easiest approach. 2. We do what I did in v11 of the patch -- we make it so that _bt_insertonpg() and _bt_split() never have to deal with LP_DEAD posting lists that they must split in passing. 3. We add additional code to _bt_insertonpg() and _bt_split() to deal with the rare case where they must split an LP_DEAD posting list, probably by unsetting the bit or something like that. Obviously it would be wrong to leave the LP_DEAD bit set for the newly inserted heap tuples TID that must go in a posting list that had its LP_DEAD bit set -- that would make it dead to index scans even after its xact successfully committed. I think that you already agree that we want to have the kill_prior_tuple optimizations with posting lists, so #1 isn't really an option. That just leaves #2 and #3. Since posting list splits are already assumed to be quite rare, it seemed far simpler to take the conservative approach of forcing clean-up that removes LP_DEAD bits so that _bt_insertonpg() and _bt_split() don't have to think about it. Obviously I think it's important that we make as few changes as possible to _bt_insertonpg() and _bt_split(), in general. I don't understand what you mean about visibility checks. There is nothing truly special about the way in which _bt_findinsertloc() will sometimes have to kill LP_DEAD items so that _bt_insertonpg() and _bt_split() don't have to think about LP_DEAD posting lists. As far as recovery is concerned, it is just another XLOG_BTREE_DELETE record, like any other. Note that there is a second call to _bt_binsrch_insert() within _bt_findinsertloc() when it has to generate a new XLOG_BTREE_DELETE record (by calling _bt_dedup_one_page(), which calls _bt_delitems_delete() in a way that isn't dependent on the BTP_HAS_GARBAGE status bit being set). > Anyway, it's worth to add more comments in > _bt_killitems() explaining why it's safe. There is no question that the little snippet of code I added to _bt_killitems() in v11 is still too complicated. We also have to consider cases where the array overflows because the scan direction was changed (see the kill_prior_tuple comment block in btgetuple()). Yeah, it's messy. > Attached is v12, which contains WAL optimizations for posting split and > page > deduplication. Cool. > * xl_btree_split record doesn't contain posting tuple anymore, instead > it keeps > 'in_posting offset' and repeats the logic of _bt_insertonpg() as you > proposed > upthread. That looks good. > * I introduced new xlog record XLOG_BTREE_DEDUP_PAGE, which contains > info about > groups of tuples deduplicated into posting tuples. In principle, it is > possible > to fit it into some existing record, but I preferred to keep things clear. 
I definitely think that inventing a new WAL record was the right thing to do. > I haven't measured how these changes affect WAL size yet. > Do you have any suggestions on how to automate testing of new WAL records? > Is there any suitable place in regression tests? I don't know about the regression tests (I doubt that there is a natural place for such a test), but I came up with a rough test case. I more or less copied the approach that you took with the index build WAL reduction patches, though I also figured out a way of subtracting heapam WAL overhead to get a real figure. I attach the test case -- note that you'll need to use the "land" database with this. (This test case might need to be improved, but it's a good start.) > * I also noticed that _bt_dedup_one_page() can be optimized to return early > when none tuples were deduplicated. I wonder if we can introduce inner > statistic to tune deduplication? That is returning to the idea of > BT_COMPRESS_THRESHOLD, which can help to avoid extra work for pages that > have > very few duplicates or pages that are already full of posting lists. I think that the BT_COMPRESS_THRESHOLD idea is closely related to making _bt_dedup_one_page() behave incrementally. On my machine, v12 of the patch actually uses slightly more WAL than v11 did with the nbtree_wal_test.sql test case -- it's 6510 MB of nbtree WAL in v12 vs. 6502 MB in v11 (note that v11 benefits from WAL compression, so if I turned that off v12 would probably win by a small amount). Both numbers are wildly excessive, though. The master branch figure is only 2011 MB, which is only about 1.8x the size of the index on the master branch. And this is for a test case that makes the index 6.5x smaller, so the gap between total index size and total WAL volume is huge here -- the volume of WAL is nearly 40x greater than the index size! You are right to wonder what the result would be if we put BT_COMPRESS_THRESHOLD back in. It would probably significantly reduce the volume of WAL, because _bt_dedup_one_page() would no longer "thrash". However, I strongly suspect that that wouldn't be good enough at reducing the WAL volume down to something acceptable. That will require an approach to WAL-logging that is much more logical than physical. The nbtree_wal_test.sql test case involves a case where page splits mostly don't WAL-log things that were previously WAL-logged by simple inserts, because nbtsplitloc.c has us split in a right-heavy fashion when there are lots of duplicates. In other words, the _bt_split() optimization to WAL volume naturally works very well with the test case, or really any case with lots of duplicates, so the "write amplification" to the total volume of WAL is relatively small on the master branch. I think that the new WAL record has to be created once per posting list that is generated, not once per page that is deduplicated -- that's the only way that I can see that avoids a huge increase in total WAL volume. Even if we assume that I am wrong about there being value in making deduplication incremental, it is still necessary to make the WAL-logging behave incrementally. Otherwise you end up needlessly rewriting things that didn't actually change way too often. That's definitely not okay. Why worry about bringing 40x down to 20x, or even 10x? It needs to be comparable to the master branch. 
> To be honest, I don't believe that incremental deduplication can really > improve > something, because no matter how many items were compressed we still > rewrite > all items from the original page to the new one, so, why not do our best. > What do we save by this incremental approach? The point of being incremental is not to save work in cases where a page split is inevitable anyway. Rather, the idea is that we can be even more lazy, and avoid doing work that will never be needed -- maybe delaying page splits actually means preventing them entirely. Or, we can spread out the work over time, so that the amount of WAL per checkpoint is smoother than what we would get with a batch approach. My mental model of page splits is that there are sometimes many of them on the same page again and again in a very short time period, but more often the chances of any individual page being split is low. Even the rightmost page of a serial PK index isn't truly an exception, because a new rightmost page isn't "the same page" as the original rightmost page -- it is its new right sibling. Since we're going to have to optimize the WAL logging anyway, it will be relatively easy to experiment with incremental deduplication within _bt_dedup_one_page(). The WAL logging is the the hard part, so let's focus on that rather than worrying too much about whether or not incrementally doing all the work (not just the WAL logging) makes sense. It's still too early to be sure about whether or not that's a good idea. -- Peter Geoghegan
Attachment
On Wed, Sep 11, 2019 at 5:38 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Attached is v12, which contains WAL optimizations for posting split and > page > deduplication. Hmm. So v12 seems to have some problems with the WAL logging for posting list splits. With wal_debug = on and wal_consistency_checking='all', I can get a replica to fail consistency checking very quickly when "make installcheck" is run on the primary: 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30423A0; LSN 0/30425A0: prev 0/3041C78; xid 506; len 3; blkref #0: rel 1663/16385/2608, blk 56 FPW - Heap/INSERT: off 20 flags 0x00 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30425A0; LSN 0/3042F78: prev 0/30423A0; xid 506; len 4; blkref #0: rel 1663/16385/2673, blk 13 FPW - Btree/INSERT_LEAF: off 138; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3042F78; LSN 0/3043788: prev 0/30425A0; xid 506; len 4; blkref #0: rel 1663/16385/2674, blk 37 FPW - Btree/INSERT_LEAF: off 68; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043788; LSN 0/30437C0: prev 0/3042F78; xid 506; len 28 - Transaction/ABORT: 2019-09-11 15:01:06.291717-07; rels: pg_tblspc/16388/PG_13_201909071/16385/16399 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30437C0; LSN 0/3043A30: prev 0/3043788; xid 507; len 3; blkref #0: rel 1663/16385/1247, blk 9 FPW - Heap/INSERT: off 9 flags 0x00 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043A30; LSN 0/3043D08: prev 0/30437C0; xid 507; len 4; blkref #0: rel 1663/16385/2703, blk 2 FPW - Btree/INSERT_LEAF: off 51; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3043D08; LSN 0/3044948: prev 0/3043A30; xid 507; len 4; blkref #0: rel 1663/16385/2704, blk 1 FPW - Btree/INSERT_LEAF: off 169; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3044948; LSN 0/3044B58: prev 0/3043D08; xid 507; len 3; blkref #0: rel 1663/16385/2608, blk 56 FPW - Heap/INSERT: off 21 flags 0x00 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3044B58; LSN 0/30454A0: prev 0/3044948; xid 507; len 4; blkref #0: rel 1663/16385/2673, blk 8 FPW - Btree/INSERT_LEAF: off 156; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30454A0; LSN 0/3045CC0: prev 0/3044B58; xid 507; len 4; blkref #0: rel 1663/16385/2674, blk 37 FPW - Btree/INSERT_LEAF: off 71; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3045CC0; LSN 0/3045F48: prev 0/30454A0; xid 507; len 3; blkref #0: rel 1663/16385/1247, blk 9 FPW - Heap/INSERT: off 10 flags 0x00 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3045F48; LSN 0/3046240: prev 0/3045CC0; xid 507; len 4; blkref #0: rel 1663/16385/2703, blk 2 FPW - Btree/INSERT_LEAF: off 51; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3046240; LSN 0/3046E70: prev 0/3045F48; xid 507; len 4; blkref #0: rel 1663/16385/2704, blk 1 FPW - Btree/INSERT_LEAF: off 44; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3046E70; LSN 0/3047090: prev 0/3046240; xid 507; len 3; blkref #0: rel 1663/16385/2608, blk 56 FPW - Heap/INSERT: off 22 flags 0x00 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3047090; LSN 0/30479E0: prev 0/3046E70; xid 507; len 4; blkref #0: rel 1663/16385/2673, blk 8 FPW - Btree/INSERT_LEAF: off 156; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30479E0; LSN 0/3048420: prev 0/3047090; xid 507; len 4; blkref #0: rel 1663/16385/2674, blk 38 FPW - Btree/INSERT_LEAF: off 10; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3048420; LSN 0/30486B0: prev 0/30479E0; xid 507; len 3; blkref #0: rel 1663/16385/1259, 
blk 0 FPW - Heap/INSERT: off 6 flags 0x00 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/30486B0; LSN 0/3048C30: prev 0/3048420; xid 507; len 4; blkref #0: rel 1663/16385/2662, blk 2 FPW - Btree/INSERT_LEAF: off 119; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3048C30; LSN 0/3049668: prev 0/30486B0; xid 507; len 4; blkref #0: rel 1663/16385/2663, blk 1 FPW - Btree/INSERT_LEAF: off 42; in_posting_offset 0 4448/2019-09-11 15:01:06 PDT LOG: REDO @ 0/3049668; LSN 0/304A550: prev 0/3048C30; xid 507; len 4; blkref #0: rel 1663/16385/3455, blk 1 FPW - Btree/INSERT_LEAF: off 2; in_posting_offset 1 4448/2019-09-11 15:01:06 PDT FATAL: inconsistent page found, rel 1663/16385/3455, forknum 0, blkno 1 4448/2019-09-11 15:01:06 PDT CONTEXT: WAL redo at 0/3049668 for Btree/INSERT_LEAF: off 2; in_posting_offset 1 4447/2019-09-11 15:01:06 PDT LOG: startup process (PID 4448) exited with exit code 1 4447/2019-09-11 15:01:06 PDT LOG: terminating any other active server processes 4447/2019-09-11 15:01:06 PDT LOG: database system is shut down I regularly use this test case for the patch -- I think that I fixed a similar problem in v11, when I changed the same WAL logging, but I didn't mention it until now. I will debug this myself in a few days, though you may prefer to do it before then. -- Peter Geoghegan
On Wed, Sep 11, 2019 at 3:09 PM Peter Geoghegan <pg@bowt.ie> wrote: > Hmm. So v12 seems to have some problems with the WAL logging for > posting list splits. With wal_debug = on and > wal_consistency_checking='all', I can get a replica to fail > consistency checking very quickly when "make installcheck" is run on > the primary I see the bug here. The problem is that we WAL-log a version of the new item that already has its heap TID changed. On the primary, the call to _bt_form_newposting() has a new item with the original heap TID, which is then rewritten before being inserted -- that's correct. But during recovery, we *start out with* a version of the new item that *already* had its heap TID swapped. So we have nowhere to get the original heap TID from during recovery. Attached patch fixes the problem in a hacky way -- it WAL-logs the original heap TID, just in case. Obviously this fix isn't usable, but it should make the problem clearer. Can you come up with a proper fix, please? I can think of one way of doing it, but I'll leave the details to you. The same issue exists in _bt_split(), so the tests will still fail with wal_consistency_checking -- it just takes a lot longer to reach a point where an inconsistent page is found, because posting list splits that occur at the same point that we need to split a page are much rarer than posting list splits that occur when we simply need to insert, without splitting the page. I suggest using wal_consistency_checking to test the fix that you come up with. As I mentioned, I regularly use it. Also note that there are further subtleties to doing this within _bt_split() -- see the FIXME comments there. Thanks -- Peter Geoghegan
Attachment
On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that the new WAL record has to be created once per posting > list that is generated, not once per page that is deduplicated -- > that's the only way that I can see that avoids a huge increase in > total WAL volume. Even if we assume that I am wrong about there being > value in making deduplication incremental, it is still necessary to > make the WAL-logging behave incrementally. Attached is v13 of the patch, which shows what I mean. You could say that v13 makes _bt_dedup_one_page() do a few extra things that are kind of similar to the things that nbtsplitloc.c does for _bt_split(). More specifically, the v13-0001-* patch includes code that makes _bt_dedup_one_page() "goal orientated" -- it calculates how much space will be freed when _bt_dedup_one_page() goes on to deduplicate those items on the page that it has already "decided to deduplicate". The v13-0002-* patch makes _bt_dedup_one_page() actually use this ability -- it makes _bt_dedup_one_page() give up on deduplication when it is clear that the items that are already "pending deduplication" will free enough space for its caller to at least avoid a page split. This revision of the patch doesn't truly make deduplication incremental. It is only a proof of concept that shows how _bt_dedup_one_page() can *decide* that it will free "enough" space, whatever that may mean, so that it can finish early. The task of making _bt_dedup_one_page() actually avoid lots of work when it finishes early remains. As I said yesterday, I'm not asking you to accept that v13-0002-* is an improvement. At least not yet. In fact, "finishes early" due to the v13-0002-* logic clearly makes everything a lot slower, since _bt_dedup_one_page() will "thrash" even more than earlier versions of the patch. This is especially problematic with WAL-logged relations -- the test case that I shared yesterday goes from about 6GB to 10GB with v13-0002-* applied. But we need to fundamentally rethink the approach to the rewriting + WAL-logging by _bt_dedup_one_page() anyway. (Note that total index space utilization is barely affected by the v13-0002-* patch, so clearly that much works well.) Other changes: * Small tweaks to amcheck (nothing interesting, really). * Small tweaks to the _bt_killitems() stuff. * Moved all of the deduplication helper functions to nbtinsert.c. This is where deduplication gets complicated, so I think that it should all live there. (i.e. nbtsort.c will call nbtinsert.c code, never the other way around.) Note that I haven't merged any of the changes from v12 of the patch from yesterday. I didn't merge the posting list WAL logging changes because of the bug I reported, but I would have were it not for that. The WAL logging for _bt_dedup_one_page() added to v12 didn't appear to be more efficient than your original approach (i.e. calling log_newpage_buffer()), so I have stuck with your original approach. It would be good to hear your thoughts on this _bt_dedup_one_page() WAL volume/"write amplification" issue. -- Peter Geoghegan
Attachment
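The "goal orientated" accounting can be pictured with the toy model below: walk the page's items, tally how much each run of duplicates would free if merged, and stop planning further merges once the projected savings cover what the caller needs to avoid a page split. The per-tuple overhead and TID size are assumed constants, and the code only illustrates the control flow, not the real _bt_dedup_one_page().

/*
 * Toy model of goal-oriented deduplication planning.  Merging a run of
 * n equal keys frees roughly the per-tuple overhead of n - 1 of them
 * (their TID bytes still have to be stored).  Planning stops as soon as
 * the projected savings reach the space the caller asked for, leaving
 * any remaining duplicates for a later pass.
 */
#include <stdio.h>

#define TUPLE_OVERHEAD  16      /* assumed per-tuple header + line pointer */
#define TID_SIZE        6       /* assumed heap TID width */

static int
plan_dedup(const int *keys, int nkeys, int space_needed, int *nmerges)
{
    int         freed = 0;

    *nmerges = 0;
    for (int i = 0; i < nkeys && freed < space_needed;)
    {
        int         runlen = 1;

        while (i + runlen < nkeys && keys[i + runlen] == keys[i])
            runlen++;
        if (runlen > 1)
        {
            freed += (runlen - 1) * (TUPLE_OVERHEAD - TID_SIZE);
            (*nmerges)++;
        }
        i += runlen;
    }
    return freed;
}

int
main(void)
{
    /* the trailing run of 5s is left alone once the goal has been met */
    int         keys[] = {1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 5, 5};
    int         nmerges;
    int         freed = plan_dedup(keys, 12, 40, &nmerges);

    printf("need 40 bytes: plan frees %d bytes using %d merge group(s)\n",
           freed, nmerges);
    return 0;
}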
On Tue, Sep 1, 2015 at 12:33 PM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > > Hi, Tomas! > > On Mon, Aug 31, 2015 at 6:26 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> On 08/31/2015 09:41 AM, Anastasia Lubennikova wrote: >>> >>> I'm going to begin work on effective storage of duplicate keys in B-tree >>> index. >>> The main idea is to implement posting lists and posting trees for B-tree >>> index pages as it's already done for GIN. >>> >>> In a nutshell, effective storing of duplicates in GIN is organised as >>> follows. >>> Index stores single index tuple for each unique key. That index tuple >>> points to posting list which contains pointers to heap tuples (TIDs). If >>> too many rows having the same key, multiple pages are allocated for the >>> TIDs and these constitute so called posting tree. >>> You can find wonderful detailed descriptions in gin readme >>> <https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README> >>> and articles <http://www.cybertec.at/gin-just-an-index-type/>. >>> It also makes possible to apply compression algorithm to posting >>> list/tree and significantly decrease index size. Read more in >>> presentation (part 1) >>> <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>. >>> >>> Now new B-tree index tuple must be inserted for each table row that we >>> index. >>> It can possibly cause page split. Because of MVCC even unique index >>> could contain duplicates. >>> Storing duplicates in posting list/tree helps to avoid superfluous splits. >>> >>> So it seems to be very useful improvement. Of course it requires a lot >>> of changes in B-tree implementation, so I need approval from community. >> >> >> In general, index size is often a serious issue - cases where indexes need more space than tables are not quite uncommonin my experience. So I think the efforts to lower space requirements for indexes are good. >> >> But if we introduce posting lists into btree indexes, how different are they from GIN? It seems to me that if I createa GIN index (using btree_gin), I do get mostly the same thing you propose, no? > > > Yes, In general GIN is a btree with effective duplicates handling + support of splitting single datums into multiple keys. > This proposal is mostly porting duplicates handling from GIN to btree. Is it worth to make a provision to add an ability to control how duplicates are sorted ? If we speak about GIN, why not take into account our experiments with RUM (https://github.com/postgrespro/rum) ? > >> Sure, there are differences - GIN indexes don't handle UNIQUE indexes, > > > The difference between btree_gin and btree is not only UNIQUE feature. > 1) There is no gingettuple in GIN. GIN supports only bitmap scans. And it's not feasible to add gingettuple to GIN. Atleast with same semantics as it is in btree. > 2) GIN doesn't support multicolumn indexes in the way btree does. Multicolumn GIN is more like set of separate singlecolumnGINs: it doesn't have composite keys. > 3) btree_gin can't effectively handle range searches. "a < x < b" would be hangle as "a < x" intersect "x < b". That isextremely inefficient. It is possible to fix. However, there is no clear proposal how to fit this case into GIN interface,yet. > >> >> but the compression can only be effective when there are duplicate rows. So either the index is not UNIQUE (so the b-treefeature is not needed), or there are many updates. > > > From my observations users can use btree_gin only in some cases. 
They like compression, but can't use btree_gin mostlybecause of #1. > >> Which brings me to the other benefit of btree indexes - they are designed for high concurrency. How much is this goingto be affected by introducing the posting lists? > > > I'd notice that current duplicates handling in PostgreSQL is hack over original btree. It is designed so in btree accessmethod in PostgreSQL, not btree in general. > Posting lists shouldn't change concurrency much. Currently, in btree you have to lock one page exclusively when you'reinserting new value. > When posting list is small and fits one page you have to do similar thing: exclusive lock of one page to insert new value. > When you have posting tree, you have to do exclusive lock on one page of posting tree. > > One can say that concurrency would became worse because index would become smaller and number of pages would became smallertoo. Since number of pages would be smaller, backends are more likely concur for the same page. But this argumentcan be user against any compression and for any bloat. > > ------ > Alexander Korotkov > Postgres Professional: http://www.postgrespro.com > The Russian Postgres Company -- Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From
Anastasia Lubennikova
Date:
13.09.2019 4:04, Peter Geoghegan wrote:
> On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> I think that the new WAL record has to be created once per posting list that is generated, not once per page that is deduplicated -- that's the only way that I can see that avoids a huge increase in total WAL volume. Even if we assume that I am wrong about there being value in making deduplication incremental, it is still necessary to make the WAL-logging behave incrementally.
>
> It would be good to hear your thoughts on this _bt_dedup_one_page() WAL volume/"write amplification" issue.

Attached is v14 based on v12 (v13 changes are not merged).
In this version, I fixed the bug you mentioned and also fixed nbtinsert,
so that it doesn't save newposting in the xlog record anymore.
I tested the patch with nbtree_wal_test, and found out that the real issue is
not the dedup WAL records themselves, but the full page writes that they trigger.
Here are the test results (the config is standard, except fsync=off to speed up the tests):
'FPW on' and 'FPW off' are tests on v14.
NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().
+-------------------+-----------+-----------+----------------+-----------+
| --- | FPW on | FPW off | FORCE_NO_IMAGE | master |
+-------------------+-----------+-----------+----------------+-----------+
| time | 09:12 min | 06:56 min | 06:24 min | 08:10 min |
| nbtree_wal_volume | 8083 MB | 2128 MB | 2327 MB | 2439 MB |
| index_size | 169 MB | 169 MB | 169 MB | 1118 MB |
+-------------------+-----------+-----------+----------------+-----------+
With random insertions into a btree it's highly possible that deduplication will often be
the first write after a checkpoint, and thus will trigger an FPW, even if only a few tuples were compressed.
That's why there is no significant difference with the log_newpage_buffer() approach.
And that's why "lazy" deduplication doesn't help to decrease the amount of WAL.
Also, since the index is packed way better than before, it probably benefits less from wal_compression.
One possible "fix" to decrease WAL amplification is to add the REGBUF_NO_IMAGE flag to XLogRegisterBuffer() in bt_dedup_one_page().
As you can see from the test results, it really eliminates the problem of excessive WAL volume.
However, I doubt that it is a crash-safe idea.
Another, more realistic approach is to make deduplication less intensive:
if the freed space is less than some threshold, fall back to not changing the page at all and not generating an xlog record.
That was probably the reason why the patch became faster after I added BT_COMPRESS_THRESHOLD in early versions:
not because deduplication itself is CPU bound or something, but because the WAL load decreased.
So I propose to develop this idea. The question is how to choose the threshold.
I wouldn't like to introduce new user settings. Any ideas?
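One possible shape for such a threshold, sketched below with assumed byte counts (an illustration only, not part of the patch): tie the cutoff to the space the incoming tuple actually needs, so that deduplication -- and its WAL record -- is skipped whenever the projected savings would not even make room for the new item, without introducing a user-visible setting.

/*
 * Illustrative threshold check: only rewrite the page (and emit a WAL
 * record) if the projected savings at least cover the space the new
 * tuple needs.  Constants are assumptions, not nbtree values.
 */
#include <stdbool.h>
#include <stdio.h>

#define ALIGN8(x)       (((x) + 7) & ~((unsigned) 7))
#define LINE_POINTER    4

static bool
dedup_is_worthwhile(unsigned projected_savings, unsigned newitem_size)
{
    unsigned    needed = ALIGN8(newitem_size) + LINE_POINTER;

    return projected_savings >= needed;
}

int
main(void)
{
    printf("save 12 bytes, need room for a 24-byte tuple: %s\n",
           dedup_is_worthwhile(12, 24) ? "dedup" : "skip");
    printf("save 64 bytes, need room for a 24-byte tuple: %s\n",
           dedup_is_worthwhile(64, 24) ? "dedup" : "skip");
    return 0;
}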
I also noticed that the number of checkpoints differs between tests:
select checkpoints_req from pg_stat_bgwriter ;
+-----------------+---------+---------+----------------+--------+
| --- | FPW on | FPW off | FORCE_NO_IMAGE | master |
+-----------------+---------+---------+----------------+--------+
| checkpoints_req | 16 | 7 | 8 | 10 |
+-----------------+---------+---------+----------------+--------+
And I struggle to explain the reason for this.
Do you understand what can cause the difference?
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Attached is v14 based on v12 (v13 changes are not merged). > > In this version, I fixed the bug you mentioned and also fixed nbtinsert, > so that it doesn't save newposting in xlog record anymore. Cool. > I tested patch with nbtree_wal_test, and found out that the real issue is > not the dedup WAL records themselves, but the full page writes that they trigger. > Here are test results (config is standard, except fsync=off to speedup tests): > > 'FPW on' and 'FPW off' are tests on v14. > NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page(). I think that is makes sense to focus on synthetic cases without FPWs/FPIs from checkpoints. At least for now. > With random insertions into btree it's highly possible that deduplication will often be > the first write after checkpoint, and thus will trigger FPW, even if only a few tuples were compressed. I find that hard to believe. Deduplication only occurs when we're about to split the page. If that's almost as likely to occur as a simple insert, then we're in big trouble (maybe it's actually true, but if it is then that's the real problem). Also, fewer pages for the index naturally leads to far fewer FPIs after a checkpoint. I used "pg_waldump -z" and "pg_waldump --stats=record" to evaluate the same case on v13. It was practically the same as the master branch, apart from the huge difference in FPIs for the XLOG rmgr. Aside from that one huge difference, there was a similar volume of the same types of WAL records in each case. Mostly leaf inserts, and far fewer internal page inserts. I suppose this isn't surprising. It probably makes sense for the final version of the patch to increase the volume of WAL a little overall, since the savings for internal page stuff cannot make up for the cost of having to WAL log something extra (deduplication operations) on leaf pages, regardless of the size of those extra dedup WAL records (I am ignoring FPIs after a checkpoint in this analysis). So the patch is more or less certain to add *some* WAL overhead in cases that benefit, and that's okay. But, it adds way too much WAL overhead today (even in v14), for reasons that we don't understand yet, which is not okay. I may have misunderstood your approach to WAL-logging in v12. I thought that you were WAL-logging things that didn't change, which doesn't seem to be the case with v14. I thought that v12 was very similar to v11 (and my v13) in terms of how _bt_dedup_one_page() does its WAL-logging. v14 looks good, though. "pg_waldump -z" and "pg_waldump --stats=record" will break down the contributing factor of FPIs, so it should be possible to account for the overhead in the test case exactly. We can debug the problem by using pg_waldump to count the absolute number of each type of record, and the size of each type of record. (Thinks some more...) I think that the problem here is that you didn't copy this old code from _bt_split() over to _bt_dedup_one_page(): /* * Copy the original page's LSN into leftpage, which will become the * updated version of the page. We need this because XLogInsert will * examine the LSN and possibly dump it in a page image. */ PageSetLSN(leftpage, PageGetLSN(origpage)); isleaf = P_ISLEAF(oopaque); Note that this happens at the start of _bt_split() -- the temp page buffer based on origpage starts out with the same LSN as origpage. This is an important step of the WAL volume optimization used by _bt_split(). 
> That's why there is no significant difference with log_newpage_buffer() approach. > And that's why "lazy" deduplication doesn't help to decrease amount of WAL. The term "lazy deduplication" is seriously overloaded here. I think that this could cause miscommunications. Let me list the possible meanings of that term here: 1. First of all, the basic approach to deduplication is already lazy, unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid a page split. I'm 100% sure that we both think that that works well compared to an eager approach (like GIN's). 2. Second of all, there is the need to incrementally WAL log. It looks like v14 does that well, in that it doesn't create "xlrec_dedup.n_intervals" space when it isn't truly needed. That's good. 3. Third, there is incremental writing of the page itself -- avoiding using a temp buffer. Not sure where I stand on this. 4. Finally, there is the possibility that we could make deduplication incremental, in order to avoid work that won't be needed altogether -- this would probably be combined with 3. Not sure where I stand on this, either. We should try to be careful when using these terms, as there is a very real danger of talking past each other. > Another, and more realistic approach is to make deduplication less intensive: > if freed space is less than some threshold, fall back to not changing page at all and not generating xlog record. I see that v14 uses the "dedupInterval" struct, which provides a logical description of a deduplicated set of tuples. That general approach is at least 95% of what I wanted from the _bt_dedup_one_page() WAL-logging. > Probably that was the reason, why patch became faster after I added BT_COMPRESS_THRESHOLD in early versions, > not because deduplication itself is cpu bound or something, but because WAL load decreased. I think so too -- BT_COMPRESS_THRESHOLD definitely makes compression faster as things are. I am not against bringing back BT_COMPRESS_THRESHOLD. I just don't want to do it right now because I think that it's a distraction. It may hide problems that we want to fix. Like the PageSetLSN() problem I mentioned just now, and maybe others. We will definitely need to have page space accounting that's a bit similar to nbtsplitloc.c, to avoid the case where a leaf page is 100% full (or has 4 bytes left, or something). That happens regularly now. That must start with teaching _bt_dedup_one_page() about how much space it will free. Basing it on the number of items on the page or whatever is not going to work that well. I think that it would be possible to have something like BT_COMPRESS_THRESHOLD to prevent thrashing, and *also* make the deduplication incremental, in the sense that it can give up on deduplication when it frees enough space (i.e. something like v13's 0002-* patch). I said that these two things are closely related, which is true, but it's also true that they don't overlap. Don't forget the reason why I removed BT_COMPRESS_THRESHOLD: Doing so made a handful of specific indexes (mostly from TPC-H) significantly smaller. I never tried to debug the problem. It's possible that we could bring back BT_COMPRESS_THRESHOLD (or something fillfactor-like), but not use it on rightmost pages, and get the best of both worlds. IIRC right-heavy low cardinality indexes (e.g. a low cardinality date column) were improved by removing BT_COMPRESS_THRESHOLD, but we can debug that when the time comes. > So I propose to develop this idea. The question is how to choose threshold. 
> I wouldn't like to introduce new user settings. Any ideas?

I think that there should be a target fill factor that sometimes makes deduplication leave a small amount of free space. Maybe that means that the last posting list on the page is made a bit smaller than the other ones. It should be "goal orientated".

The loop within _bt_dedup_one_page() is very confusing in both v13 and v14 -- I couldn't figure out why the accounting worked like this:

> +         /*
> +          * Project size of new posting list that would result from merging
> +          * current tup with pending posting list (could just be prev item
> +          * that's "pending").
> +          *
> +          * This accounting looks odd, but it's correct because ...
> +          */
> +         projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
> +                                  (dedupState->ntuples + itup_ntuples + 1) *
> +                                  sizeof(ItemPointerData));

Why the "+1" here?

I have significantly refactored the _bt_dedup_one_page() loop in a way that seems like a big improvement. It allowed me to remove all of the small palloc() calls inside the loop, apart from the BTreeFormPostingTuple() palloc()s. It's also a lot faster -- it seems to have shaved about 2 seconds off the "land" unlogged table test, which was originally about 1 minute 2 seconds with v13's 0001-* patch (and without v13's 0002-* patch). It seems like it can easily be integrated with the approach to WAL logging taken in v14, so everything can be integrated soon. I'll work on that.

> I also noticed that the number of checkpoints differs between tests:
> select checkpoints_req from pg_stat_bgwriter ;
> And I struggle to explain the reason for this.
> Do you understand what can cause the difference?

I imagine that the additional WAL volume triggered a checkpoint earlier than in the more favorable test, which indirectly triggered more FPIs, which contributed to triggering a checkpoint even earlier... and so on. Synthetic test cases can avoid this. A useful synthetic test should have no checkpoints at all, so that we can see the broken down costs, without any second order effects that add more cost in weird ways.

-- Peter Geoghegan
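The projected-size arithmetic being questioned here can be restated in a standalone form: the merged tuple is the key portion (header plus key, everything except the TID array) plus one TID per heap tuple it will represent, maxaligned once at the end. The sketch below is a model with assumed byte counts, not the patch's exact accounting.

/*
 * Model of projecting a merged posting tuple's size.  keysize covers
 * the tuple header plus the key itself (everything except the TID
 * array); ntids is the total number of heap TIDs the merged tuple will
 * carry.
 */
#include <stdio.h>

#define MAXALIGN8(x)    (((x) + 7) & ~((size_t) 7))
#define TID_SIZE        6       /* sizeof(ItemPointerData) */

static size_t
projected_posting_size(size_t keysize, int ntids)
{
    return MAXALIGN8(keysize + (size_t) ntids * TID_SIZE);
}

int
main(void)
{
    size_t      keysize = 16;           /* assumed header plus an 8-byte key */
    int         pending_ntids = 3;      /* TIDs already in the pending tuple */
    int         next_ntids = 2;         /* TIDs contributed by the next tuple */

    printf("merged size: %zu bytes\n",
           projected_posting_size(keysize, pending_ntids + next_ntids));
    return 0;
}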
On Mon, Sep 16, 2019 at 11:58 AM Peter Geoghegan <pg@bowt.ie> wrote: > I think that the problem here is that you didn't copy this old code > from _bt_split() over to _bt_dedup_one_page(): > > /* > * Copy the original page's LSN into leftpage, which will become the > * updated version of the page. We need this because XLogInsert will > * examine the LSN and possibly dump it in a page image. > */ > PageSetLSN(leftpage, PageGetLSN(origpage)); > isleaf = P_ISLEAF(oopaque); I can confirm that this is what the problem was. Attached are two patches: * A version of your v14 from today with a couple of tiny changes to make it work against the current master branch -- I had to rebase the patch, but the changes made while rebasing were totally trivial. (I like to keep CFTester green.) * The second patch actually fixes the PageSetLSN() thing, setting the temp page buffer's LSN to match the original page before any real work is done, and before XLogInsert() is called. Just like _bt_split(). The test case now shows exactly what you reported for "FPWs off" when FPWs are turned on, at least on my machine and with my checkpoint settings. That is, there are *zero* FPIs/FPWs, so the final nbtree volume is 2128 MB. This means that the volume of additional WAL required over what the master branch requires for the same test case is very small (2128 MB compares well with master's 2011 MB of WAL). Maybe we could do better than 2128 MB with more work, but this is definitely already low enough overhead to be acceptable. This also passes "make check-world" testing. However, my usual wal_consistency_checking smoke test fails pretty quickly with the two patches applied: 3634/2019-09-16 13:53:22 PDT FATAL: inconsistent page found, rel 1663/16385/2673, forknum 0, blkno 13 3634/2019-09-16 13:53:22 PDT CONTEXT: WAL redo at 0/3202370 for Btree/DEDUPLICATE: items were deduplicated to 12 items 3633/2019-09-16 13:53:22 PDT LOG: startup process (PID 3634) exited with exit code 1 Maybe the lack of the PageSetLSN() thing masked a bug in WAL replay, since without that we effectively always just replay FPIs, never truly relying on redo. (I didn't try wal_consistency_checking without the second patch, but I assume that you did, and found no problems for this reason.) Can you produce a new version that integrates the PageSetLSN() thing, and fixes this bug? Thanks -- Peter Geoghegan
Attachment
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From
Anastasia Lubennikova
Date:
16.09.2019 21:58, Peter Geoghegan wrote:
> On Mon, Sep 16, 2019 at 8:48 AM Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> I tested patch with nbtree_wal_test, and found out that the real issue is
>> not the dedup WAL records themselves, but the full page writes that they trigger.
>> Here are test results (config is standard, except fsync=off to speedup tests):
>>
>> 'FPW on' and 'FPW off' are tests on v14.
>> NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().
> I think that it makes sense to focus on synthetic cases without
> FPWs/FPIs from checkpoints. At least for now.
>
>> With random insertions into btree it's highly possible that deduplication will often be
>> the first write after checkpoint, and thus will trigger FPW, even if only a few tuples were compressed.
> <...>
>
> I think that the problem here is that you didn't copy this old code
> from _bt_split() over to _bt_dedup_one_page():
>
> /*
>  * Copy the original page's LSN into leftpage, which will become the
>  * updated version of the page. We need this because XLogInsert will
>  * examine the LSN and possibly dump it in a page image.
>  */
> PageSetLSN(leftpage, PageGetLSN(origpage));
> isleaf = P_ISLEAF(oopaque);
>
> Note that this happens at the start of _bt_split() -- the temp page
> buffer based on origpage starts out with the same LSN as origpage.
> This is an important step of the WAL volume optimization used by
> _bt_split().

That's it. I suspected that such an enormous amount of FPWs reflected some bug.

>> That's why there is no significant difference with log_newpage_buffer() approach.
>> And that's why "lazy" deduplication doesn't help to decrease amount of WAL.

My point was that the problem is the extra FPWs, so it doesn't matter whether we deduplicate just a few entries to free enough space or all of them.

> The term "lazy deduplication" is seriously overloaded here. I think
> that this could cause miscommunications. Let me list the possible
> meanings of that term here:
>
> 1. First of all, the basic approach to deduplication is already lazy,
> unlike GIN, in the sense that _bt_dedup_one_page() is called to avoid
> a page split. I'm 100% sure that we both think that that works well
> compared to an eager approach (like GIN's).

Sure.

> 2. Second of all, there is the need to incrementally WAL log. It looks
> like v14 does that well, in that it doesn't create
> "xlrec_dedup.n_intervals" space when it isn't truly needed. That's
> good.

In v12-v15 I mostly concentrated on this feature. The last version looks good to me.

> 3. Third, there is incremental writing of the page itself -- avoiding
> using a temp buffer. Not sure where I stand on this.

I think it's a good idea. memmove must be much faster than copying items tuple by tuple. I'll send the next patch by the end of the week.

> 4. Finally, there is the possibility that we could make deduplication
> incremental, in order to avoid work that won't be needed altogether --
> this would probably be combined with 3. Not sure where I stand on
> this, either.
>
> We should try to be careful when using these terms, as there is a very
> real danger of talking past each other.
>
>> Another, and more realistic approach is to make deduplication less intensive:
>> if freed space is less than some threshold, fall back to not changing page at all and not generating xlog record.
> I see that v14 uses the "dedupInterval" struct, which provides a
> logical description of a deduplicated set of tuples. That general
> approach is at least 95% of what I wanted from the
> _bt_dedup_one_page() WAL-logging.
>
>> Probably that was the reason, why patch became faster after I added BT_COMPRESS_THRESHOLD in early versions,
>> not because deduplication itself is cpu bound or something, but because WAL load decreased.
> I think so too -- BT_COMPRESS_THRESHOLD definitely makes compression
> faster as things are. I am not against bringing back
> BT_COMPRESS_THRESHOLD. I just don't want to do it right now because I
> think that it's a distraction. It may hide problems that we want to
> fix. Like the PageSetLSN() problem I mentioned just now, and maybe
> others.
>
> We will definitely need to have page space accounting that's a bit
> similar to nbtsplitloc.c, to avoid the case where a leaf page is 100%
> full (or has 4 bytes left, or something). That happens regularly now.
> That must start with teaching _bt_dedup_one_page() about how much
> space it will free. Basing it on the number of items on the page or
> whatever is not going to work that well.
>
> I think that it would be possible to have something like
> BT_COMPRESS_THRESHOLD to prevent thrashing, and *also* make the
> deduplication incremental, in the sense that it can give up on
> deduplication when it frees enough space (i.e. something like v13's
> 0002-* patch). I said that these two things are closely related, which
> is true, but it's also true that they don't overlap.
>
> Don't forget the reason why I removed BT_COMPRESS_THRESHOLD: Doing so
> made a handful of specific indexes (mostly from TPC-H) significantly
> smaller. I never tried to debug the problem. It's possible that we
> could bring back BT_COMPRESS_THRESHOLD (or something fillfactor-like),
> but not use it on rightmost pages, and get the best of both worlds.
> IIRC right-heavy low cardinality indexes (e.g. a low cardinality date
> column) were improved by removing BT_COMPRESS_THRESHOLD, but we can
> debug that when the time comes.

Now that the extra FPWs are proven to be a bug, I agree that giving up on deduplication early is not necessary. My previous considerations were based on the idea that deduplication always adds considerable overhead, which is not true after recent optimizations.

>> So I propose to develop this idea. The question is how to choose threshold.
>> I wouldn't like to introduce new user settings. Any ideas?
> I think that there should be a target fill factor that sometimes makes
> deduplication leave a small amount of free space. Maybe that means
> that the last posting list on the page is made a bit smaller than the
> other ones. It should be "goal orientated".
>
> The loop within _bt_dedup_one_page() is very confusing in both v13 and
> v14 -- I couldn't figure out why the accounting worked like this:
>
>> + /*
>> +  * Project size of new posting list that would result from merging
>> +  * current tup with pending posting list (could just be prev item
>> +  * that's "pending").
>> +  *
>> +  * This accounting looks odd, but it's correct because ...
>> +  */
>> + projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
>> +                          (dedupState->ntuples + itup_ntuples + 1) *
>> +                          sizeof(ItemPointerData));
> Why the "+1" here?

I'll look at it.

> I have significantly refactored the _bt_dedup_one_page() loop in a way
> that seems like a big improvement. It allowed me to remove all of the
> small palloc() calls inside the loop, apart from the
> BTreeFormPostingTuple() palloc()s. It's also a lot faster -- it seems
> to have shaved about 2 seconds off the "land" unlogged table test,
> which was originally about 1 minute 2 seconds with v13's 0001-* patch
> (and without v13's 0002-* patch).
>
> It seems like it can easily be integrated with the approach to WAL
> logging taken in v14, so everything can be integrated soon. I'll work
> on that.

New version is attached. It is v14 (with the PageSetLSN fix) merged with v13. I also fixed a bug in btree_xlog_dedup() that was previously masked by FPWs.

v15 passes make installcheck. I haven't tested it with the land test yet. I will do that later this week.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Tue, Sep 17, 2019 at 9:43 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > > 3. Third, there is incremental writing of the page itself -- avoiding > > using a temp buffer. Not sure where I stand on this. > > I think it's a good idea. memmove must be much faster than copying > items tuple by tuple. > I'll send next patch by the end of the week. I think that the biggest problem is that we copy all of the tuples, including existing posting list tuples that can't be merged with anything. Even if you assume that we'll never finish early (e.g. by using logic like the "if (pagesaving >= newitemsz) deduplicate = false;" thing), this can still unnecessarily slow down deduplication. Very often, _bt_dedup_one_page() is called when 1/2 - 2/3 of the space on the page is already used by a small number of very large posting list tuples. > > The loop within _bt_dedup_one_page() is very confusing in both v13 and > > v14 -- I couldn't figure out why the accounting worked like this: > I'll look at it. I'm currently working on merging my refactored version of _bt_dedup_one_page() with your v15 WAL-logging. This is a bit tricky. (I have finished merging the other WAL-logging stuff, though -- that was easy.) The general idea is that the loop in _bt_dedup_one_page() now explicitly operates with a "base" tuple, rather than *always* saving the prev tuple from the last loop iteration. We always have a "pending posting list", which won't be written-out as a posting list if it isn't possible to merge at least one existing page item. The "base" tuple doesn't change. "pagesaving" space accounting works in a way that doesn't care about whether or not the base tuple was already a posting list -- it saves the size of the IndexTuple without any existing posting list size, and calculates the contribution to the total size of the new posting list separately (heap TIDs from the original base tuple and subsequent tuples are counted together). This has a number of advantages: * The loop is a lot clearer now, and seems to have slightly better space utilization because of improved accounting (with or without the "if (pagesaving >= newitemsz) deduplicate = false;" thing). * I think that we're going to need to be disciplined about which tuple is the "base" tuple for correctness reasons -- we should always use the leftmost existing tuple to form a new posting list tuple. I am concerned about rare cases where we deduplicate tuples that are equal according to _bt_keep_natts_fast()/datum_image_eq() that nonetheless have different sizes (and are not bitwise equal). There are rare cases involving TOAST compression where that is just about possible (see the temp comments I added to _bt_keep_natts_fast() in the patch). * It's clearly faster, because there is far less palloc() overhead -- the "land" unlogged table test completes in about 95.5% of the time taken by v15 (I disabled "if (pagesaving >= newitemsz) deduplicate = false;" for both versions here, to keep it simple and fair). This also suggests that making _bt_dedup_one_page() do raw page adds and page deletes to the page in shared_buffers (i.e. don't use a temp buffer page) could pay off. As I went into at the start of this e-mail, unnecessarily doing expensive things like copying large posting lists around is a real concern. Even if it isn't truly useful for _bt_dedup_one_page() to operate in a very incremental fashion, incrementalism is probably still a good thing to aim for -- it seems to make deduplication faster in all cases. -- Peter Geoghegan
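(For illustration, the loop structure described above might be pictured like this. It is only a sketch: BTDedupState and the _bt_dedup_* helper signatures are assumptions based on the names mentioned in this thread, not the actual v16 code.)

    #include "postgres.h"
    #include "access/nbtree.h"

    /*
     * Condensed sketch of the "base tuple" + "pending posting list" loop
     * described above.  Illustrative only; state fields and helper
     * signatures are assumptions.
     */
    static void
    dedup_page_sketch(Relation rel, BTDedupState state, Page page, Page newpage,
                      int nkeyatts)
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        OffsetNumber minoff = P_FIRSTDATAKEY(opaque);
        OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
        OffsetNumber offnum;

        /* First non-pivot item becomes the initial base tuple */
        _bt_dedup_start_pending(state,
                                (IndexTuple) PageGetItem(page, PageGetItemId(page, minoff)),
                                minoff);

        for (offnum = OffsetNumberNext(minoff);
             offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            IndexTuple  itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

            if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
                _bt_dedup_save_htid(state, itup))
            {
                /*
                 * Duplicate of the base tuple whose heap TIDs fit in the
                 * pending posting list: absorb it and move on.
                 */
            }
            else
            {
                /*
                 * Not a duplicate (or the pending posting list is as large as
                 * we're willing to make it): flush the pending interval to
                 * newpage, forming a posting tuple only if it absorbed at
                 * least one item, then start a new pending interval with
                 * this tuple as its base.
                 */
                _bt_dedup_finish_pending(state, newpage);
                _bt_dedup_start_pending(state, itup, offnum);
            }
        }

        /* Flush the final pending interval */
        _bt_dedup_finish_pending(state, newpage);
    }

The point of the structure is that the base tuple and the pending heap TID array are fixed state that lives across iterations, so nothing needs to be palloc()'d until a posting tuple is actually formed.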
On Wed, Sep 18, 2019 at 10:43 AM Peter Geoghegan <pg@bowt.ie> wrote: > This also suggests that making _bt_dedup_one_page() do raw page adds > and page deletes to the page in shared_buffers (i.e. don't use a temp > buffer page) could pay off. As I went into at the start of this > e-mail, unnecessarily doing expensive things like copying large > posting lists around is a real concern. Even if it isn't truly useful > for _bt_dedup_one_page() to operate in a very incremental fashion, > incrementalism is probably still a good thing to aim for -- it seems > to make deduplication faster in all cases. I think that I forgot to mention that I am concerned that the kill_prior_tuple/LP_DEAD optimization could be applied less often because _bt_dedup_one_page() operates too aggressively. That is a big part of my general concern. Maybe I'm wrong about this -- who knows? I definitely think that LP_DEAD setting by _bt_check_unique() is generally a lot more important than LP_DEAD setting by the kill_prior_tuple optimization, and the patch won't affect unique indexes. Only very serious benchmarking can give us a clear answer, though. -- Peter Geoghegan
On Wed, Sep 18, 2019 at 10:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I'm currently working on merging my refactored version of
> _bt_dedup_one_page() with your v15 WAL-logging. This is a bit tricky.
> (I have finished merging the other WAL-logging stuff, though -- that
> was easy.)

I attach version 16. This revision merges your recent work on WAL logging with my recent work on simplifying _bt_dedup_one_page(). See my e-mail from earlier today for details.

Hopefully this will be a bit easier to work with when you go to make _bt_dedup_one_page() do raw PageIndexMultiDelete() + PageAddItem() calls against the page contained in a buffer directly (rather than using a temp version of the page in local memory in the style of _bt_split()). I find the loop within _bt_dedup_one_page() much easier to follow now. While I'm looking forward to seeing the PageIndexMultiDelete()/PageAddItem() approach that you come up with, the basic design of _bt_dedup_one_page() seems to be in much better shape today than it was a few weeks ago.

I am going to spend the next few days teaching _bt_dedup_one_page() about space utilization. I'll probably make it respect a fillfactor-style target. I've noticed that it is often too aggressive about filling a page, though less often it actually shows the opposite problem: it fails to use more than about 2/3 of the page for the same value, again and again (must be something to do with the exact width of the tuples). In general, _bt_dedup_one_page() should know a few things about what nbtsplitloc.c will do when the page is very likely to be split soon.

I'll also spend some more time working on the opclass infrastructure that we need to disable deduplication with datatypes where it is unsafe [1].

Other changes:

* qsort() is no longer used by BTreeFormPostingTuple() in v16 -- we can easily make sorting the array of heap TIDs the caller's responsibility. Since the heap TID column is sorted in ascending order among duplicates on a page, and since TIDs within individual posting lists are also sorted in ascending order, there is no need to re-sort. I added a new assertion to BTreeFormPostingTuple() that verifies that its caller actually gets it right.

* The new nbtpage.c/VACUUM code has been tweaked to minimize the changes required against master. Nothing significant, though.

It was easier to refactor the _bt_dedup_one_page() stuff by temporarily making nbtsort.c not use it. I didn't want to delay getting v16 to you, so I didn't take the time to fix-up nbtsort.c to use the new stuff. It's actually using its own old copy of stuff that it should get from nbtinsert.c in v16 -- it calls _bt_dedup_item_tid_sort(), not the new _bt_dedup_save_htid() function. I'll update it soon, though.

[1] https://www.postgresql.org/message-id/flat/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
--
Peter Geoghegan
Attachment
On Wed, Sep 18, 2019 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I attach version 16. This revision merges your recent work on WAL
> logging with my recent work on simplifying _bt_dedup_one_page(). See
> my e-mail from earlier today for details.

I attach version 17. This version has changes that are focussed on further polishing certain things, including fixing some minor bugs. It seemed worth creating a new version for that. (I didn't get very far with the space utilization stuff I talked about, so no changes there.)

Changes in v17:

* nbtsort.c now has a loop structure that closely matches _bt_dedup_one_page() (I put this off in v16). We now reuse most of the nbtinsert.c deduplication routines.

* Further simplification of the btree_xlog_dedup() loop. Recovery no longer relies on local variables to track the progress of deduplication -- it uses dedup state (the state managed by nbtinsert.c's dedup routines) instead. This is easier to follow.

* Reworked _bt_split() comments on posting list splits that coincide with page splits.

* Fixed memory leaks in recovery code by creating a dedicated memory context that gets reset regularly. The context is created in a new rmgr "startup" callback I created for the B-Tree rmgr. We already do this for both GIN and GiST. More specifically, the REDO code calls MemoryContextReset() against its dedicated memory context after every record is processed by REDO, no matter what. The MemoryContextReset() call usually won't have to actually free anything, but that's okay because the no-free case does almost no work. I think that it makes sense to keep things as simple as possible for memory management during recovery -- it's too easy for a new memory leak to get introduced when a small change is made to the nbtinsert.c routines later on.

* Optimize VACUUMing of posting lists: we now only allocate memory for an array of still-live posting list items when the array will actually be needed. It is only needed when there are tuples to remove from the posting list, because only then do we need to create a replacement posting list that lacks the heap TIDs that VACUUM needs to delete. It seemed like a really good idea to not allocate any memory in the common case where VACUUM doesn't need to change a posting list tuple at all. ginVacuumItemPointers() has exactly the same optimization.

* Fixed an accounting bug in the output of VACUUM VERBOSE by changing some code in nbtree.c. The tuples_removed and num_index_tuples fields in IndexBulkDeleteResult are reported as "index row versions" by VACUUM VERBOSE. Everything but the index pages stat works at the level of "index row versions", which should not be affected by the deduplication patch. Of course, deduplication only changes the physical representation of items in the index -- never the logical contents of the index. This is what GIN does already.

Another infrastructure thing that the patch needs to handle to be committable:

We still haven't added an "off" switch to deduplication, which seems necessary. I suppose that this should look like GIN's "fastupdate" storage parameter. It's not obvious how to do this in a way that's easy to work with, though. Maybe we could do something like copy GIN's GinGetUseFastUpdate() macro, but the situation with nbtree is actually quite different. There are two questions for nbtree when it comes to deduplication within an index: 1) Does the user want to use deduplication, because that will help performance? And 2) Is it safe/possible to use deduplication at all?
I think that we should probably stash this information (deduplication is both possible and safe) in the metapage. Maybe we can copy it over to our insertion scankey, just like the "heapkeyspace" field -- that information also comes from the metapage (it's based on the nbtree version). The "heapkeyspace" field is a bit ugly, so maybe we shouldn't go further by adding something similar, but I don't see any great alternative right now. -- Peter Geoghegan
Attachment
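(For illustration, the recovery memory context arrangement described in the v17 notes above, modeled on what GIN and GiST already do -- a sketch only; the names opCtx, btree_xlog_startup() and btree_xlog_cleanup() are assumed for the example, not necessarily what the patch uses.)

    #include "postgres.h"
    #include "utils/memutils.h"

    /* Dedicated context for temporary allocations made during B-Tree REDO */
    static MemoryContext opCtx = NULL;

    void
    btree_xlog_startup(void)
    {
        opCtx = AllocSetContextCreate(CurrentMemoryContext,
                                      "btree recovery temporary context",
                                      ALLOCSET_DEFAULT_SIZES);
    }

    void
    btree_xlog_cleanup(void)
    {
        MemoryContextDelete(opCtx);
        opCtx = NULL;
    }

    /*
     * In btree_redo(), every record would then be processed inside opCtx and
     * the context reset unconditionally afterwards, e.g.:
     *
     *      oldCtx = MemoryContextSwitchTo(opCtx);
     *      switch (info) { ... dispatch on record type ... }
     *      MemoryContextSwitchTo(oldCtx);
     *      MemoryContextReset(opCtx);
     */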
On Mon, Sep 23, 2019 at 5:13 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I attach version 17.

I attach a patch that applies on top of v17. It adds support for deduplication within unique indexes. Actually, this is a snippet of code that appeared in my prototype from August 5 (we need very little extra code for this now). Unique index support kind of looked like a bad idea at the time, but things have changed a lot since then.

I benchmarked this overnight using a custom pgbench-based test that used a Zipfian distribution, with a single-row SELECT and an UPDATE of pgbench_accounts per xact. I used regular/logged tables this time around, since WAL-logging is now fairly efficient. I added a separate low cardinality index on pgbench_accounts(abalance). A low cardinality index is the most interesting case for this patch, obviously, but it also serves to prevent all HOT updates, increasing bloat for both indexes. We want a realistic case that creates a lot of index bloat.

This wasn't a rigorous enough benchmark to present here in full, but the results were very encouraging. With reasonable client counts for the underlying hardware, we seem to have a small increase in TPS, and a small decrease in latency. There is a regression with 128 clients, when contention is ridiculously high (this is my home server, which only has 4 cores). More importantly:

* The low cardinality index is almost 3x smaller with the patch -- no surprises there.

* The read latency is where latency goes up, if it goes up at all. Whatever it is that might increase latency, it doesn't look like it's deduplication itself. Yeah, deduplication is expensive, but it's not nearly as expensive as a page split. (I'm talking about the immediate cost, not the bigger picture, though the bigger picture matters even more.)

* The growth in primary key size over time is the thing I find really interesting. The patch seems to really control the number of page splits over many hours with lots of non-HOT updates. I think that a timeline of days or weeks could be really interesting.

I am now about 75% convinced that adding deduplication to unique indexes is a good idea, at least as an option that is disabled by default. We're already doing well here, even though there has been no work on optimizing deduplication in unique indexes. Further optimizations seem quite possible, though. I'm mostly thinking about optimizing non-HOT updates by teaching nbtree some basic things about versioning with unique indexes.

For example, we could remember "recently dead" duplicates of the value we are about to insert (as part of an UPDATE statement) from within _bt_check_unique(). Then, when it looks like a page split may be necessary, we can hint to _bt_dedup_one_page(): "please just deduplicate the group of duplicates starting from this offset, which are duplicates of this new item I am inserting -- do not create a posting list that I will have to split, though". We already cache the binary search bounds established within _bt_check_unique() in insertstate, so perhaps we could reuse that to make this work.

The goal here is that the old/recently dead versions end up together in their own posting list (or maybe two posting lists), whereas our new/most current tuple is on its own. There is a very good chance that our transaction will commit, leaving somebody else to set the LP_DEAD bit on the posting list that contains those old versions. In short, we'd be making deduplication and opportunistic garbage collection cooperate closely.
This can work both ways -- maybe we should also teach _bt_vacuum_one_page() to cooperate with _bt_dedup_one_page(). That is, if we add the mechanism I just described in the last paragraph, maybe _bt_dedup_one_page() marks the posting list that is likely to get its LP_DEAD bit set soon with a new hint bit -- the LP_REDIRECT bit. Here, LP_REDIRECT means "somebody is probably going to set the LP_DEAD bit on this posting list tuple very soon". That way, if nobody actually does set the LP_DEAD bit, _bt_vacuum_one_page() still has options. If it goes to the heap and finds the latest version, and that that version is visible to any possible MVCC snapshot, that means that it's safe to kill all the other versions, even without the LP_DEAD bit set -- this is a unique index. So, it often gets to kill lots of extra garbage that it wouldn't get to kill, preventing page splits. The cost is pretty low: the risk that the single heap page check will be a wasted effort. (Of course, we still have to visit the heap pages of things that we go on to kill, to get the XIDs to generate recovery conflicts -- the important point is that we only need to visit one heap page in _bt_vacuum_one_page(), to *decide* if it's possible to do all this -- cases that don't benefit at all also don't pay very much.) I don't think that we need to do either of these two other things to justify committing the patch with unique index support. But, teaching nbtree a little bit about versioning like this could work rather well in practice, without it really mattering that it will get the wrong idea at times (e.g. when transactions abort a lot). This all seems promising as a family of techniques for unique indexes. It's worth doing extra work if it might delay a page split, since delaying can actually fully prevent page splits that are mostly caused by non-HOT updates. Most primary key indexes are serial PKs, or some kind of counter. Postgres should mostly do page splits for these kinds of primary keys indexes in the places that make sense based on the dataset, and not because of "write amplification". -- Peter Geoghegan
Attachment
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From
Anastasia Lubennikova
Date:
24.09.2019 3:13, Peter Geoghegan wrote:
> On Wed, Sep 18, 2019 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> I attach version 16. This revision merges your recent work on WAL
>> logging with my recent work on simplifying _bt_dedup_one_page(). See
>> my e-mail from earlier today for details.
> I attach version 17. This version has changes that are focussed on
> further polishing certain things, including fixing some minor bugs. It
> seemed worth creating a new version for that. (I didn't get very far
> with the space utilization stuff I talked about, so no changes there.)

Attached is v18. In this version _bt_dedup_one_page() is refactored so that:
- no temp page is used; all updates are applied to the original page.
- each posting tuple is WAL-logged separately.
This also allowed me to simplify btree_xlog_dedup() significantly.

> Another infrastructure thing that the patch needs to handle to be committable:
>
> We still haven't added an "off" switch to deduplication, which seems
> necessary. I suppose that this should look like GIN's "fastupdate"
> storage parameter. It's not obvious how to do this in a way that's
> easy to work with, though. Maybe we could do something like copy GIN's
> GinGetUseFastUpdate() macro, but the situation with nbtree is actually
> quite different. There are two questions for nbtree when it comes to
> deduplication within an index: 1) Does the user want to use
> deduplication, because that will help performance? And 2) Is it
> safe/possible to use deduplication at all?

I'll send another version with the dedup option soon.

> I think that we should probably stash this information (deduplication
> is both possible and safe) in the metapage. Maybe we can copy it over
> to our insertion scankey, just like the "heapkeyspace" field -- that
> information also comes from the metapage (it's based on the nbtree
> version). The "heapkeyspace" field is a bit ugly, so maybe we
> shouldn't go further by adding something similar, but I don't see any
> great alternative right now.

Why is it necessary to save this information anywhere but rel->rd_options, when we can easily access that field from _bt_findinsertloc() and _bt_load()?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
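(For illustration, the in-place approach described in the v18 notes above might look roughly like this for a single merged interval. This is a sketch under assumptions: the function name, the offset bookkeeping, and the dedup WAL record details are made up here, not taken from v18.)

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "access/xloginsert.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /*
     * Illustrative sketch: replace one run of duplicates with a single
     * posting tuple directly on the page in shared_buffers (no temp page),
     * WAL-logging just that one step.
     */
    static void
    dedup_write_interval_sketch(Relation rel, Buffer buf,
                                OffsetNumber *deletable, int ndeletable,
                                IndexTuple postingtup, OffsetNumber insertoff)
    {
        Page        page = BufferGetPage(buf);

        START_CRIT_SECTION();

        /* Remove the original duplicates, then add the merged posting tuple */
        PageIndexMultiDelete(page, deletable, ndeletable);
        if (PageAddItem(page, (Item) postingtup, IndexTupleSize(postingtup),
                        insertoff, false, false) == InvalidOffsetNumber)
            elog(PANIC, "deduplication failed to add posting list tuple");

        MarkBufferDirty(buf);

        if (RelationNeedsWAL(rel))
        {
            XLogRecPtr  recptr;

            XLogBeginInsert();
            XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
            /* main record data (the deduplicated interval) omitted here */
            recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP);     /* record type name assumed */
            PageSetLSN(page, recptr);
        }

        END_CRIT_SECTION();
    }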
On Wed, Sep 25, 2019 at 8:05 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Attached is v18. In this version bt_dedup_one_page() is refactored so that: > - no temp page is used, all updates are applied to the original page. > - each posting tuple wal logged separately. > This also allowed to simplify btree_xlog_dedup significantly. This looks great! Even if it isn't faster than using a temp page buffer, the flexibility seems like an important advantage. We can do things like have the _bt_dedup_one_page() caller hint that deduplication should start at a particular offset number. If that doesn't work out by the time the end of the page is reached (whatever "works out" may mean), then we can just start at the beginning of the page, and work through the items we skipped over initially. > > We still haven't added an "off" switch to deduplication, which seems > > necessary. I suppose that this should look like GIN's "fastupdate" > > storage parameter. > Why is it necessary to save this information somewhere but rel->rd_options, > while we can easily access this field from _bt_findinsertloc() and > _bt_load(). Maybe, but we also need to access a flag that says it's safe to use deduplication. Obviously deduplication is not safe for datatypes like numeric and text with a nondeterministic collation. The "is deduplication safe for this index?" mechanism will probably work by doing several catalog lookups. This doesn't seem like something we want to do very often, especially with a buffer lock held -- ideally it will be somewhere that's convenient to access. Do we want to do that separately, and have a storage parameter that says "I would like to use deduplication in principle, if it's safe"? Or, do we store both pieces of information together, and forbid setting the storage parameter to on when it's known to be unsafe for the underlying opclasses used by the index? I don't know. I think that you can start working on this without knowing exactly how we'll do those catalog lookups. What you come up with has to work with that before the patch can be committed, though. -- Peter Geoghegan
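(For illustration, the reloption side of this could take a GinGetUseFastUpdate()-style shape -- a sketch only; the BTOptions struct layout and the BtreeGetDoDedup() macro shown here are assumptions, not the patch's actual code.)

    #include "postgres.h"
    #include "utils/rel.h"

    /* Assumed parsed-reloptions struct for btree, for the sake of the example */
    typedef struct BTOptions
    {
        int32       vl_len_;        /* varlena header (do not touch directly!) */
        int         fillfactor;     /* standard btree fillfactor reloption */
        bool        deduplication;  /* use deduplication, where safe? */
    } BTOptions;

    #define BtreeGetDoDedup(relation) \
        ((relation)->rd_options ? \
         ((BTOptions *) (relation)->rd_options)->deduplication : true)

The "is it safe at all?" part is the harder half: it has to come from somewhere cheaper to consult than repeated catalog lookups, such as the metapage idea mentioned earlier or a flag cached with the relation.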
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From
Anastasia Lubennikova
Date:
25.09.2019 22:14, Peter Geoghegan wrote:
>
>>> We still haven't added an "off" switch to deduplication, which seems
>>> necessary. I suppose that this should look like GIN's "fastupdate"
>>> storage parameter.
>> Why is it necessary to save this information somewhere but rel->rd_options,
>> while we can easily access this field from _bt_findinsertloc() and
>> _bt_load().
> Maybe, but we also need to access a flag that says it's safe to use
> deduplication. Obviously deduplication is not safe for datatypes like
> numeric and text with a nondeterministic collation. The "is
> deduplication safe for this index?" mechanism will probably work by
> doing several catalog lookups. This doesn't seem like something we
> want to do very often, especially with a buffer lock held -- ideally
> it will be somewhere that's convenient to access.
>
> Do we want to do that separately, and have a storage parameter that
> says "I would like to use deduplication in principle, if it's safe"?
> Or, do we store both pieces of information together, and forbid
> setting the storage parameter to on when it's known to be unsafe for
> the underlying opclasses used by the index? I don't know.
>
> I think that you can start working on this without knowing exactly how
> we'll do those catalog lookups. What you come up with has to work with
> that before the patch can be committed, though.
>
Attached is v19.

* It adds a new btree reloption, "deduplication". I decided to refactor the code and move BtreeOptions into a separate structure, rather than adding a new btree-specific value to StdRdOptions. Now it can be set even for indexes that do not support deduplication. In that case it will be ignored. Should we add this check to option validation?

* By default deduplication is on for non-unique indexes and off for unique ones.

* New function _bt_dedup_is_possible() is intended to be a single place to perform all the checks. For now it's just a stub to ensure that it works. Is there a way to extract this from existing opclass information, or do we need to add a new opclass field? Have you already started this work? I recall there was another thread, but I didn't manage to find it.

* I also integrated into this version your latest patch that enables deduplication on unique indexes, since now it can be easily switched on/off.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
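(For illustration, a _bt_dedup_is_possible()-style check might eventually look something like this. Only a sketch: get_collation_isdeterministic() exists today, but the opclass-level "equality implies image equality" test is a placeholder for the infrastructure being discussed on the other thread.)

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "utils/lsyscache.h"
    #include "utils/rel.h"

    /* Placeholder for the opclass infrastructure discussed on the other thread */
    static bool
    opclass_equality_implies_image_equality(Relation index, int attno)
    {
        return true;            /* assumed safe, for the sake of the sketch */
    }

    /*
     * Sketch of a _bt_dedup_is_possible()-style check.  Deduplication relies
     * on datum_image_eq(), so it is only safe when "equal" always means
     * "bitwise interchangeable": that rules out text with a nondeterministic
     * collation and types like numeric (display scale).  Illustrative only.
     */
    static bool
    btree_dedup_is_possible_sketch(Relation index)
    {
        int         nkeyatts = IndexRelationGetNumberOfKeyAttributes(index);
        int         i;

        for (i = 0; i < nkeyatts; i++)
        {
            Oid         colloid = index->rd_indcollation[i];

            /* Nondeterministic collations treat distinct representations as equal */
            if (OidIsValid(colloid) && !get_collation_isdeterministic(colloid))
                return false;

            /* Opclass must promise that equality implies image equality */
            if (!opclass_equality_implies_image_equality(index, i))
                return false;
        }

        return true;
    }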
On Fri, Sep 27, 2019 at 9:43 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Attached is v19. Cool. > * By default deduplication is on for non-unique indexes and off for > unique ones. I think that it makes sense to enable deduplication by default -- even with unique indexes. It looks like deduplication can be very helpful with non-HOT updates. I have been benchmarking this using more or less standard pgbench at scale 200, with one big difference -- I also create an index on "pgbench_accounts (abalance)". This is a low cardinality index, which ends up about 3x smaller with the patch, as expected. It also makes all updates non-HOT updates, greatly increasing index bloat in the primary key of the accounts table -- this is what I found really interesting about this workload. The theory behind deduplication within unique indexes seems quite different to the cases we've focussed on so far -- that's why my working copy of the patch makes a few small changes to how _bt_dedup_one_page() works with unique indexes specifically (more on that later). With unique indexes, deduplication doesn't help by creating space -- it helps by creating *time* for garbage collection to run before the real "damage" is done -- it delays page splits. This is only truly valuable when page splits caused by non-HOT updates are delayed by so much that they're actually prevented entirely, typically because the _bt_vacuum_one_page() stuff can now happen before pages split, not after. In general, these page splits are bad because they degrade the B-Tree structure, more or less permanently (it's certainly permanent with this workload). Having a huge number of page splits *purely* because of non-HOT updates is particular bad -- it's just awful. I believe that this is the single biggest problem with the Postgres approach to versioned storage (we know that other DB systems have no primary key page splits with this kind of workload). Anyway, if you run this pgbench workload without rate-limiting, so that a patched Postgres does as much work as physically possible, the accounts table primary key (pgbench_accounts_pkey) at least grows at a slower rate -- the patch clearly beats master at the start of the benchmark/test (as measured by index size). As the clients are ramped up by my testing script, and as time goes on, eventually the size of the pgbench_accounts_pkey index "catches up" with master. The patch delays page splits, but ultimately the system as a whole cannot prevent the page splits altogether, since the server is being absolutely hammered by pgbench. Actually, the index is *exactly* the same size for both the master case and the patch case when we reach this "bloat saturation point". We can delay the problem, but we cannot prevent it. But what about a more realistic workload, with rate-limiting? When I add some rate limiting, so that the TPS/throughput is at about 50% of what I got the first time around (i.e. 50% of what is possible), or 15k TPS, it's very different. _bt_dedup_one_page() can now effectively cooperate with _bt_vacuum_one_page(). Now deduplication is able to "soak up all the extra garbage tuples" for long enough to delay and ultimately *prevent* almost all page splits. pgbench_accounts_pkey starts off at 428 MB for both master and patch (CREATE INDEX makes it that size). After about an hour, the index is 447 MB with the patch. The master case ends up with a pgbench_accounts_pkey size of 854 MB, though (this is very close to 857 MB, the "saturation point" index size from before). 
This is a very significant improvement, obviously -- the patch has an index that is ~52% of the size seen for the same index with the master branch! Here is how I changed _bt_dedup_one_page() for unique indexes to get this result: * We limit the size of posting lists to 5 heap TIDs in the checkingunique case. Right now, we will actually accept a checkingunique page split before we'll merge together items that result in a posting list with more heap TIDs than that (not sure about these details at all, though). * Avoid creating a new posting list that caller will have to split immediately anyway (this is based on details of _bt_dedup_one_page() caller's newitem tuple). (Not sure how much this customization contributes to this favorable test result -- maybe it doesn't make that much difference.) The goal here is for duplicates that are close together in both time and space to get "clumped together" into their own distinct, small-ish posting list tuples with no more than 5 TIDs. This is intended to help _bt_vacuum_one_page(), which is known to be a very important mechanism for indexes like our pgbench_accounts_pkey index (LP_DEAD bits are set very frequently within _bt_check_unique()). The general idea is to balance deduplication against LP_DEAD killing, and to increase spatial/temporal locality within these smaller posting lists. If we have one huge posting list for each value, then we can't set the LP_DEAD bit on anything at all, which is very bad. If we have a few posting lists that are not so big for each distinct value, we can often kill most of them within _bt_vacuum_one_page(), which is very good, and has minimal downside (i.e. we still get most of the benefits of aggressive deduplication). Interestingly, these non-HOT page splits all seem to "come in waves". I noticed this because I carefully monitored the benchmark/test case over time. The patch doesn't prevent the "waves of page splits" pattern, but it does make it much much less noticeable. > * New function _bt_dedup_is_possible() is intended to be a single place > to perform all the checks. Now it's just a stub to ensure that it works. > > Is there a way to extract this from existing opclass information, > or we need to add new opclass field? Have you already started this work? > I recall there was another thread, but didn't manage to find it. The thread is here: https://www.postgresql.org/message-id/flat/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com -- Peter Geoghegan
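(For illustration, the "no more than 5 heap TIDs per posting list in the checkingunique case" heuristic described above amounts to a check like the following inside the dedup loop -- a sketch; the constant and function names are assumptions.)

    #include "postgres.h"

    /*
     * Sketch of the checkingunique heuristic described above: keep posting
     * lists small in unique indexes, so that _bt_check_unique() can still
     * set LP_DEAD bits on most of the old versions.
     */
    #define BTDEDUP_UNIQUE_MAX_HTIDS    5

    static bool
    dedup_can_absorb(bool checkingunique, int pending_nhtids, int itup_nhtids)
    {
        if (checkingunique &&
            pending_nhtids + itup_nhtids > BTDEDUP_UNIQUE_MAX_HTIDS)
            return false;       /* flush the pending posting list instead */

        return true;
    }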
On Fri, Sep 27, 2019 at 7:02 PM Peter Geoghegan <pg@bowt.ie> wrote: > I think that it makes sense to enable deduplication by default -- even > with unique indexes. It looks like deduplication can be very helpful > with non-HOT updates. Attached is v20, which adds a custom strategy for the checkingunique (unique index) case to _bt_dedup_one_page(). It also makes deduplication the default for both unique and non-unique indexes. I simply altered your new BtreeDefaultDoDedup() macro from v19 to make nbtree use deduplication wherever it is safe to do so. This default may not be the best one in the end, though deduplication in unique indexes now looks very compelling. The new checkingunique heuristics added to _bt_dedup_one_page() were developed experimentally, based on pgbench tests. The general idea with the new checkingunique stuff is to make deduplication *extremely* lazy. We want to avoid making _bt_vacuum_one_page() garbage collection less effective by being too aggressive with deduplication -- workloads with lots of non-HOT-updates into unique indexes are greatly dependent on the LP_DEAD bit setting in _bt_check_unique(). At the same time, _bt_dedup_one_page() can be just as effective at delaying page splits as it is with non-unique indexes. I've found that my "regular pgbench, but with a low cardinality index on pgbench_accounts(abalance)" benchmark works best with the specific heuristics used in the patch, especially over many hours. I spent nearly 24 hours running the test at full speed (no throttling this time), at scale 500, and with very very aggressive autovacuum settings (autovacuum_vacuum_cost_delay=0ms, autovacuum_vacuum_scale_factor=0.02). Each run lasted one hour, with alternating runs of 4, 8, and 16 clients. Towards the end, the patch had about 5% greater throughput at lower client counts, and never seemed to be significantly slower (it was very slightly slower once or twice, but I think that that was just noise). More importantly, the indexes looked like this on master: bloated_abalance: 3017 MB pgbench_accounts_pkey: 2142 MB pgbench_branches_pkey: 1352 kB pgbench_tellers_pkey: 3416 kB And like this with the patch: bloated_abalance: 1015 MB pgbench_accounts_pkey: 1745 MB pgbench_branches_pkey: 296 kB pgbench_tellers_pkey: 888 kB * bloated_abalance is about 3x smaller here, as usual -- no surprises there. * pgbench_accounts_pkey is the most interesting case. You might think that it isn't that great that pgbench_accounts_pkey is 1745 MB with the patch, since it starts out at only 1071 MB (and would go back down to 1071 MB again if we were to do a REINDEX). However, you have to bear in mind that it takes a long time for it to get that big -- the growth over time is very important here. Even after the first run with 16 clients, it only reached 1160 MB -- that's an increase of ~8%. The master case had already reached 2142 MB ("bloat saturation point") by then, though. I could easily have stopped the benchmark there, or used rate-limiting, or excluded the 16 client case -- that would have allowed me to claim that the growth was under 10% for a workload where the master case has an index that doubles in size. On the other hand, if autovacuum wasn't configured to run very frequently, then the patch wouldn't look nearly this good. Deduplication helped autovacuum by "soaking up" the "recently dead" index tuples that cannot be killed right away. In short, the patch ameliorates weaknesses of the existing garbage collection mechanisms without changing them. 
The patch smoothed out the growth of pgbench_accounts_pkey over many hours. As I said, it was only 1160 MB after the first 3 hours/first 16 client run. It was 1356 MB after the second 16 client run (i.e. after running another round of one hour 4/8/16 client runs), finally finishing up at 1745 MB. So the growth in the size of pgbench_accounts_pkey for the patch was significantly improved, and the *rate* of growth over time was also improved.

The master branch had a terrible jerky growth in the size of pgbench_accounts_pkey. The master branch did mostly keep up at first (i.e. the size of pgbench_accounts_pkey wasn't too different at first). But once we got to 16 clients for the first time, after a couple of hours, pgbench_accounts_pkey almost doubled in size over a period of only 10 or 20 minutes! The index size *exploded* in a very short period of time, starting only a few hours into the benchmark. (Maybe we don't see anything like this with the patch because with the patch backends are more concerned about helping VACUUM, and less concerned about creating a mess that VACUUM must clean up. Not sure.)

* We also manage to make the small pgbench indexes (pgbench_branches_pkey and pgbench_tellers_pkey) over 4x smaller here (without doing anything to force more non-HOT updates on the underlying tables). This result for the two small indexes looks good, but I should point out that we still only fit ~15 or so tuples on each leaf page with the patch when everything is over -- far far less than the number that CREATE INDEX stored on the leaf pages immediately (it leaves 366 items on each leaf page). This is kind of an extreme case, because there is so much contention, but space utilization with the patch is actually very bad here. The master branch is very very very bad, though, so we're at least down to only a single "very" here. Progress.

Any thoughts on the approach taken for unique indexes within _bt_dedup_one_page() in v20? Obviously that stuff needs to be examined critically -- it's possible that it wouldn't do as well as it could or should with other workloads that I haven't thought about. Please take a look at the details.

--
Peter Geoghegan
Attachment
On Mon, Sep 30, 2019 at 7:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I've found that my "regular pgbench, but with a low cardinality index
> on pgbench_accounts(abalance)" benchmark works best with the specific
> heuristics used in the patch, especially over many hours.

I ran pgbench without the pgbench_accounts(abalance) index, and with slightly adjusted client counts -- you could say that this was a classic pgbench benchmark of v20 of the patch. Still scale 500, with single hour runs. Here are the results for each 1 hour run, with client counts of 8, 16, and 32, with two rounds of runs total:

master_1_run_8.out:  "tps = 25156.689415 (including connections establishing)"
patch_1_run_8.out:   "tps = 25135.472084 (including connections establishing)"
master_1_run_16.out: "tps = 30947.053714 (including connections establishing)"
patch_1_run_16.out:  "tps = 31225.044305 (including connections establishing)"
master_1_run_32.out: "tps = 29550.231969 (including connections establishing)"
patch_1_run_32.out:  "tps = 29425.011249 (including connections establishing)"
master_2_run_8.out:  "tps = 24678.792084 (including connections establishing)"
patch_2_run_8.out:   "tps = 24891.130465 (including connections establishing)"
master_2_run_16.out: "tps = 30878.930585 (including connections establishing)"
patch_2_run_16.out:  "tps = 30982.306091 (including connections establishing)"
master_2_run_32.out: "tps = 29555.453436 (including connections establishing)"
patch_2_run_32.out:  "tps = 29591.767136 (including connections establishing)"

This interlaced order is the same order that each 1 hour pgbench run actually ran in. The patch wasn't expected to do any better here -- it was expected to not be any slower for a workload that it cannot really help. Though the two small pgbench indexes do remain a lot smaller with the patch.

While a lot of work remains to validate the performance of the patch, this looks good to me.

--
Peter Geoghegan
On Mon, Sep 30, 2019 at 7:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v20, which adds a custom strategy for the checkingunique
> (unique index) case to _bt_dedup_one_page(). It also makes
> deduplication the default for both unique and non-unique indexes. I
> simply altered your new BtreeDefaultDoDedup() macro from v19 to make
> nbtree use deduplication wherever it is safe to do so. This default
> may not be the best one in the end, though deduplication in unique
> indexes now looks very compelling.

Attached is v21, which fixes some bitrot -- v20 of the patch was made totally unusable by today's commit 8557a6f1. Other changes:

* New datum_image_eq() patch fixes up datum_image_eq() to work with cstring/name columns, which we rely on. No need for a Valgrind suppression anymore. The suppression was only needed to paper over the fact that datum_image_eq() would not really work properly with cstring datums (the suppression was papering over a legitimate complaint, but we fix the underlying problem with 8557a6f1 and the v21-0001-* patch).

* New nbtdedup.c file added. This has all of the functions that dealt with deduplication and posting lists that were previously in nbtinsert.c and nbtutils.c. I think that this separation is somewhat cleaner.

* Additional tweaks to the custom checkingunique algorithm used by deduplication. This is based on further tuning from benchmarking. This is certainly not final yet.

* Greatly simplified the code for unique index LP_DEAD killing in _bt_check_unique(). This was pretty sloppy in v20 of the patch (it had two "goto" labels). Now it works with the existing loop conditions that advance to the next equal item on the page.

* Additional adjustments to the nbtree.h comments about the on-disk format.

Can you take a quick look at the first patch (the v21-0001-* patch), Anastasia? I would like to get that one out of the way soon.

--
Peter Geoghegan
Attachment
On Mon, Nov 4, 2019 at 11:52 AM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v21, which fixes some bitrot -- v20 of the patch was made > totally unusable by today's commit 8557a6f1. Other changes: There is more bitrot, so I attach v22. This also has some new changes centered around fixing particular issues with space utilization. These changes are: * nbtsort.c now intelligently considers the contribution of suffix truncation of posting list tuples when considering whether or not a leaf page is "full". I mean "full" in the sense that it has exceeded the soft limit (fillfactor-wise limit) on space utilization for the page (no change in how the hard limit in _bt_buildadd() works). We don't usually bother predicting the space saving from suffix truncation when considering split points, even in nbtsplitloc.c, but it's worth making an exception for posting lists (actually, this is the same exception that nbtsplitloc.c already had in much earlier versions of the patch). Posting lists are very often large enough to really make a big contribution to how balanced free space is. I now observe that weird cases where CREATE INDEX packs leaf pages too empty (or too full) are now all but eliminated. CREATE INDEX now does a pretty good job of respecting leaf fillfactor, while also allowing deduplication to be very effective (CREATE INDEX did neither of these two things in earlier versions of the patch). * Added "single value" strategy for retail insert deduplication -- this is closely related to nbtsplitloc.c's single value strategy. The general idea is that _bt_dedup_one_page() anticipates that a future "single value" page split is likely to occur, and therefore limits deduplication after two "1/3 of a page"-wide posting lists at the start of the page. It arranges for deduplication to leave a neat split point for nbtsplitloc.c to use when the time comes. In other words, the patch now allows "single value" page splits to leave leaf pages BTREE_SINGLEVAL_FILLFACTOR% full, just like v12/master. Leaving a small amount of free space on pages that are packed full of duplicates is always a good idea. Also, we no longer force page splits to leave pages 2/3 full (only two large posting lists plus a high key), which sometimes happened with v21. On balance, this change seems to slightly improve space utilization. In general, it's now unusual for retail insertions to get better space utilization than CREATE INDEX -- in that sense normality/balance has been restored in v22. Actually, I wrote the v22 changes by working through a list of weird space utilization issues from my personal notes. I'm pretty sure I've fixed all of those -- only nbtsplitloc.c's single value strategy wants to split at a point that leaves a heap TID in the new high key for the page, so that's the only thing we need to worry about within nbtdedup.c. * "deduplication" storage parameter now has psql completion. I intend to push the datum_image_eq() preparatory patch soon. I will also push a commit that makes _bt_keep_natts_fast() use datum_image_eq() separately. Anybody have an opinion on that? -- Peter Geoghegan
Attachment
On Fri, Nov 8, 2019 at 10:35 AM Peter Geoghegan <pg@bowt.ie> wrote: > There is more bitrot, so I attach v22. The patch has stopped applying once again, so I attach v23. One reason for the bitrot is that I pushed preparatory commits, including today's "Make _bt_keep_natts_fast() use datum_image_eq()" commit. Good to get that out of the way. Other changes: * Decided to go back to turning deduplication on by default with non-unique indexes, and off by default using unique indexes. The unique index stuff was regressed enough with INSERT-heavy workloads that I was put off, despite my initial enthusiasm for enabling deduplication everywhere. * Disabled deduplication in system catalog indexes by deeming it generally unsafe. I realized that it would be impossible to provide a way to disable deduplication in system catalog indexes if it was enabled at all. The reason for this is simple: in general, it's not possible to set storage parameters for system catalog indexes. While I think that deduplication should work with system catalog indexes on general principle, this is about an existing limitation. Deduplication in catalog indexes can be revisited if and when somebody figures out a way to make storage parameters work with system catalog indexes. * Basic user documentation -- this still needs work, but the basic shape is now in place. I think that we should outline how the feature works by describing the internals, including details of the data structures. This provides guidance to users on when they should disable or enable the feature. This is discussed in the existing chapter on B-Tree internals. This felt natural because it's similar to how GIN explains its compression related features -- the discussion of the storage parameters in the CREATE INDEX page of the docs links to a description of GIN internals from "66.4. Implementation [of GIN]". * nbtdedup.c "single value" strategy stuff now considers the contribution of the page high key when considering how to deduplicate such that nbtsplitloc.c's "single value" strategy has a usable split point that helps it to hit its target free space. Not a very important detail. It's nice to be consistent with the corresponding code within nbtsplitloc.c. * Worked through all remaining XXX/TODO/FIXME comments, except one: The one that talks about the need for opclass infrastructure to deal with cases like btree/numeric_ops, or text with a nondeterministic collation. The user docs now reference the BITWISE opclass stuff that we're discussing over on the other thread. That's the only really notable open item now IMV. -- Peter Geoghegan
Attachment
On Tue, Nov 12, 2019 at 6:22 PM Peter Geoghegan <pg@bowt.ie> wrote: > * Disabled deduplication in system catalog indexes by deeming it > generally unsafe. I (continue to) think that deduplication is a terrible name, because you're not getting rid of the duplicates. You are using a compressed representation of the duplicates. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Nov 13, 2019 at 11:33 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Nov 12, 2019 at 6:22 PM Peter Geoghegan <pg@bowt.ie> wrote: > > * Disabled deduplication in system catalog indexes by deeming it > > generally unsafe. > > I (continue to) think that deduplication is a terrible name, because > you're not getting rid of the duplicates. You are using a compressed > representation of the duplicates. "Deduplication" never means that you get rid of duplicates. According to Wikipedia's deduplication article: "Whereas compression algorithms identify redundant data inside individual files and encodes this redundant data more efficiently, the intent of deduplication is to inspect large volumes of data and identify large sections – such as entire files or large sections of files – that are identical, and replace them with a shared copy". This seemed like it fit what this patch does. We're concerned with a specific, simple kind of redundancy. Also: * From the user's point of view, we're merging together what they'd call duplicates. They don't really think of the heap TID as part of the key. * The term "compression" suggests a decompression penalty when reading, which is not the case here. * The term "compression" confuses the feature added by the patch with TOAST compression. Now we may have two very different varieties of compression in the same index. Can you suggest an alternative? -- Peter Geoghegan
On Wed, Nov 13, 2019 at 2:51 PM Peter Geoghegan <pg@bowt.ie> wrote: > "Deduplication" never means that you get rid of duplicates. According > to Wikipedia's deduplication article: "Whereas compression algorithms > identify redundant data inside individual files and encodes this > redundant data more efficiently, the intent of deduplication is to > inspect large volumes of data and identify large sections – such as > entire files or large sections of files – that are identical, and > replace them with a shared copy". Hmm. Well, maybe I'm just behind the times. But that same wikipedia article also says that deduplication works on large chunks "such as entire files or large sections of files" thus differentiating it from compression algorithms which work on the byte level, so it seems to me that what you are doing still sounds more like ad-hoc compression. > Can you suggest an alternative? My instinct is to pick a name that somehow involves compression and just put enough other words in there to make it clear e.g. duplicate value compression, or something of that sort. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Nov 15, 2019 at 5:16 AM Robert Haas <robertmhaas@gmail.com> wrote: > Hmm. Well, maybe I'm just behind the times. But that same wikipedia > article also says that deduplication works on large chunks "such as > entire files or large sections of files" thus differentiating it from > compression algorithms which work on the byte level, so it seems to me > that what you are doing still sounds more like ad-hoc compression. I see your point. One reason for my avoiding the word "compression" is that other DB systems that have something similar don't use the word compression either. Actually, they don't really call it *anything*. Posting lists are simply the way that secondary indexes work. The "Modern B-Tree techniques" book/survey paper mentions the idea of using a TID list in its "3.7 Duplicate Key Values" section, not in the two related sections that follow ("Bitmap Indexes", and "Data Compression"). That doesn't seem like a very good argument, now that I've typed it out. The patch applies deduplication/compression/whatever at the point where we'd otherwise have to split the page, unlike GIN. GIN eagerly maintains posting lists (doing in-place updates for most insertions seems pretty bad to me). My argument could reasonably be made about GIN, which really does consider posting lists the natural way to store duplicate tuples. I cannot really make that argument about nbtree with this patch, though -- delaying a page split by re-encoding tuples (changing their physical representation without changing their logical contents) justifies using the word "compression" in the name. > > Can you suggest an alternative? > > My instinct is to pick a name that somehow involves compression and > just put enough other words in there to make it clear e.g. duplicate > value compression, or something of that sort. Does anyone else want to weigh in on this? Anastasia? I will go along with whatever the consensus is. I'm very close to the problem we're trying to solve, which probably isn't helping me here. -- Peter Geoghegan
On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I haven't measured how these changes affect WAL size yet. > > Do you have any suggestions on how to automate testing of new WAL records? > > Is there any suitable place in regression tests? > > I don't know about the regression tests (I doubt that there is a > natural place for such a test), but I came up with a rough test case. > I more or less copied the approach that you took with the index build > WAL reduction patches, though I also figured out a way of subtracting > heapam WAL overhead to get a real figure. I attach the test case -- > note that you'll need to use the "land" database with this. (This test > case might need to be improved, but it's a good start.) Today I used a test script similar to the "nbtree_wal_test.sql" test script I posted on September 11th. I am concerned about the WAL overhead for cases that don't benefit from the patch (usually because they turn off deduplication altogether). The details of the index tested were different this time, though. I used an index that had the smallest possible tuple size: 16 bytes (this is the smallest possible size on 64-bit systems, but that's what almost everybody uses these days). So any index with one or two int4 columns (or one int8 column) will generally have 16 byte IndexTuples, at least when there are no NULLs in the index. In general, 16 byte wide tuples are very, very common. What I saw suggests that we will need to remove the new "postingoff" field from xl_btree_insert. (We can create a new XLog record for leaf page inserts that also need to split a posting list, without changing much else.) The way that *alignment* of WAL records affects these common 16 byte IndexTuple cases is the real problem. Adding "postingoff" to xl_btree_insert increases the WAL required for INSERT_LEAF records by two bytes (sizeof(OffsetNumber)), as you'd expect -- pg_waldump output shows that they're 66 bytes, whereas they're only 64 bytes on the master branch. That doesn't sound that bad, but once you consider the alignment of whole records, it's really an extra 8 bytes. That is totally unacceptable. The vast majority of nbtree WAL records are bound to be INSERT_LEAF records, so as things stand we have added (almost) 12.5% space overhead to nbtree for these common cases, which don't benefit. I haven't really looked into other types of WAL record just yet. The real-world overhead that we're adding to xl_btree_vacuum records is something that I will have to look into separately. I'm already pretty sure that adding two bytes to xl_btree_split is okay, though, because they're far less numerous than xl_btree_insert records, and aren't affected by alignment in the same way (they're already several hundred bytes in almost all cases). I also noticed something positive: The overhead of xl_btree_dedup WAL records seems to be very low with indexes that have hundreds of logical tuples for each distinct integer value. We don't seem to have a problem with "deduplication thrashing". -- Peter Geoghegan
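To make the alignment point concrete, here is a minimal standalone sketch (not taken from the patch) that applies the usual 8-byte MAXALIGN-style rounding to the record payload sizes quoted above; the 64 and 66 byte figures are the pg_waldump numbers from this email:

    #include <stdio.h>

    /* Round up to an 8 byte boundary, as MAXALIGN() does on typical 64-bit builds */
    #define ALIGN8(len) (((len) + 7UL) & ~7UL)

    int
    main(void)
    {
        unsigned long master = 64;       /* INSERT_LEAF record payload today */
        unsigned long patched = 64 + 2;  /* plus a two byte "postingoff" field */

        printf("master:  %lu bytes -> %lu after alignment\n", master, ALIGN8(master));
        printf("patched: %lu bytes -> %lu after alignment\n", patched, ALIGN8(patched));
        /* Prints 64 -> 64 and 66 -> 72: the 2 byte field really costs 8 bytes */
        return 0;
    }

That 64 -> 72 jump is where the (almost) 12.5% figure for INSERT_LEAF records comes from.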
On 11/13/19 11:51 AM, Peter Geoghegan wrote: > Can you suggest an alternative? Dupression -- Mark Dilger
On Fri, Nov 15, 2019 at 5:43 PM Mark Dilger <hornschnorter@gmail.com> wrote: > On 11/13/19 11:51 AM, Peter Geoghegan wrote: > > Can you suggest an alternative? > > Dupression This suggestion makes me feel better about "deduplication". -- Peter Geoghegan
On Sun, Sep 15, 2019 at 3:47 AM Oleg Bartunov <obartunov@postgrespro.ru> wrote: > Is it worth to make a provision to add an ability to control how > duplicates are sorted ? Duplicates will continue to be sorted based on TID, in effect. We want to preserve the ability to perform retail index tuple deletion. I believe that that will become important in the future. > If we speak about GIN, why not take into > account our experiments with RUM (https://github.com/postgrespro/rum) > ? FWIW, I think that it's confusing that RUM almost shares its name with the "RUM conjecture": http://daslab.seas.harvard.edu/rum-conjecture/ -- Peter Geoghegan
Moin, On 2019-11-16 01:04, Peter Geoghegan wrote: > On Fri, Nov 15, 2019 at 5:16 AM Robert Haas <robertmhaas@gmail.com> > wrote: >> Hmm. Well, maybe I'm just behind the times. But that same wikipedia >> article also says that deduplication works on large chunks "such as >> entire files or large sections of files" thus differentiating it from >> compression algorithms which work on the byte level, so it seems to me >> that what you are doing still sounds more like ad-hoc compression. > > I see your point. > > One reason for my avoiding the word "compression" is that other DB > systems that have something similar don't use the word compression > either. Actually, they don't really call it *anything*. Posting lists > are simply the way that secondary indexes work. The "Modern B-Tree > techniques" book/survey paper mentions the idea of using a TID list in > its "3.7 Duplicate Key Values" section, not in the two related > sections that follow ("Bitmap Indexes", and "Data Compression"). > > That doesn't seem like a very good argument, now that I've typed it > out. The patch applies deduplication/compression/whatever at the point > where we'd otherwise have to split the page, unlike GIN. GIN eagerly > maintains posting lists (doing in-place updates for most insertions > seems pretty bad to me). My argument could reasonably be made about > GIN, which really does consider posting lists the natural way to store > duplicate tuples. I cannot really make that argument about nbtree with > this patch, though -- delaying a page split by re-encoding tuples > (changing their physical representation without changing their logical > contents) justifies using the word "compression" in the name. > >> > Can you suggest an alternative? >> >> My instinct is to pick a name that somehow involves compression and >> just put enough other words in there to make it clear e.g. duplicate >> value compression, or something of that sort. > > Does anyone else want to weigh in on this? Anastasia? > > I will go along with whatever the consensus is. I'm very close to the > problem we're trying to solve, which probably isn't helping me here. I'm in favor of deduplication and not compression. Compression is a more generic term and can involve deduplication, but it doesn't have to (it could, for instance, just encode things in a more compact form). Deduplication, on the other hand, does not involve compression: it just means storing each thing only once, which happens to save space much as compression does. ZFS also follows this by having both deduplication (store the same blocks only once, with references) and compression (compress block contents, regardless of whether they are stored once or many times). So my vote is for deduplication (if I understand the thread correctly, this is what the code now does: instead of storing the exact same key many times, it is stored only once, with references or a count?). best regards, Tels
On Fri, Nov 15, 2019 at 5:02 PM Peter Geoghegan <pg@bowt.ie> wrote: > What I saw suggests that we will need to remove the new "postingoff" > field from xl_btree_insert. (We can create a new XLog record for leaf > page inserts that also need to split a posting list, without changing > much else.) Attached is v24. This revision doesn't fix the problem with xl_btree_insert record bloat, but it does fix the bitrot against the master branch that was caused by commit 50d22de9. (This patch has had a surprisingly large number of conflicts against the master branch recently.) Other changes: * The pageinspect patch has been cleaned up. I now propose that it be committed alongside the main patch. The big change here is that posting lists are represented as an array of TIDs within bt_page_items(), much like gin_leafpage_items(). Also added documentation that goes into the ways in which ctid can be used to encode information (arguably some of this should have been included with the Postgres 12 B-Tree work). * Basic tests that cover deduplication within unique indexes. We ought to have code coverage of the case where _bt_check_unique() has to step right (actually, we don't have that on the master branch either). -- Peter Geoghegan
On Mon, Nov 18, 2019 at 05:26:37PM -0800, Peter Geoghegan wrote: > Attached is v24. This revision doesn't fix the problem with > xl_btree_insert record bloat, but it does fix the bitrot against the > master branch that was caused by commit 50d22de9. (This patch has had > a surprisingly large number of conflicts against the master branch > recently.) Please note that I have moved this patch to next CF per this last update. Anastasia, the ball is waiting on your side of the field, as the CF entry is marked as waiting on author for some time now. -- Michael
On Mon, Nov 18, 2019 at 5:26 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v24. This revision doesn't fix the problem with > xl_btree_insert record bloat Attached is v25. This version: * Adds more documentation. * Adds a new GUC -- btree_deduplication. A new GUC seems necessary. Users will want to be able to configure the feature system-wide. A storage parameter won't let them do that -- only a GUC will. This also makes it easy to enable the feature with unique indexes. * Fixes the xl_btree_insert record bloat issue. * Fixes a smaller issue with VACUUM/xl_btree_vacuum record bloat. We shouldn't be using noticeably more WAL than before, at least in cases that don't use deduplication. These two items fix cases where that was possible. There is a new refactoring patch included with v25 that helps with the xl_btree_vacuum issue. This new patch removes unnecessary "pin scan" code used by B-Tree VACUUMs, which was effectively disabled by commit 3e4b7d87 without being removed. This is independently useful work that I planned on doing already, and it also cleans things up for VACUUM with posting list tuples. It reclaims some space within the xl_btree_vacuum record type that was wasted (we don't even use the lastBlockVacuumed field anymore), allowing us to use that space for new deduplication-related fields without increasing total WAL space. Anastasia: I hope to be able to commit the first patch before too long. It would be great if you could review that. -- Peter Geoghegan
On Tue, Nov 12, 2019 at 3:21 PM Peter Geoghegan <pg@bowt.ie> wrote: > * Decided to go back to turning deduplication on by default with > non-unique indexes, and off by default using unique indexes. > > The unique index stuff was regressed enough with INSERT-heavy > workloads that I was put off, despite my initial enthusiasm for > enabling deduplication everywhere. I have changed my mind about this again. I now think that it would make sense to treat deduplication within unique indexes as a separate feature that cannot be disabled by the GUC at all (though we'd probably still respect the storage parameter for debugging purposes). I have found that fixing the WAL record size issue has helped remove what looked like a performance penalty for deduplication (but was actually just a general regression). Also, I have found a way of selectively applying deduplication within unique indexes that seems to have no downside, and considerable upside. The new criteria/heuristic for unique indexes is very simple: If a unique index has an existing item that is a duplicate on the incoming item at the point that we might have to split the page, then apply deduplication. Otherwise (when the incoming item has no duplicates), don't apply deduplication at all -- just accept that we'll have to split the page. We already cache the bounds of our initial binary search in insert state, so we can reuse that information within _bt_findinsertloc() when considering deduplication in unique indexes. This heuristic makes sense because deduplication within unique indexes should only target leaf pages that cannot possibly receive new values. In many cases, the only reason why almost all primary key leaf pages can ever split is because of non-HOT updates whose new HOT chain needs a new, equal entry in the primary key. This is the case with your standard identity column/serial primary key, for example (only the rightmost page will have a page split due to the insertion of new logical rows -- every other variety of page split must be due to new physical tuples/versions). I imagine that it is possible for a leaf page to be a "mixture" of these two basic/general tendencies, but not for long. It really doesn't matter if we occasionally fail to delay a page split where that was possible, nor does it matter if we occasionally apply deduplication when that won't delay a split for very long -- pretty soon the page will split anyway. A split ought to separate the parts of the keyspace that exhibit each tendency. In general, we're only interested in delaying page splits in unique indexes *indefinitely*, since in effect that will prevent them *entirely*. (So the goal is *significantly* different to our general goal for deduplication -- it's about buying time for VACUUM to run or whatever, rather than buying space.) This heuristic quite noticeably keeps the TPC-C "old order" table's PK from bloating, since that was the only unique index that is really affected by non-HOT UPDATEs (i.e. the UPDATE queries that touch that table happen to not be HOT-safe in general, which is not the case for any other table). It doesn't regress anything else from TPC-C, since there really isn't a benefit for other tables. More importantly, the working/draft version of the patch will often avoid a huge amount of bloat in a pgbench-style workload that has an extra index on the pgbench_accounts table, to prevent HOT updates. The accounts primary key (pgbench_accounts_pkey) hardly grows at all with the patch, but grows 2x on master.
This 2x space saving seems to occur reliably, unless there is a lot of contention on individual *pages*, in which case the bloat can be delayed but not prevented. We get that 2x space saving with either uniformly distributed random updates on pgbench_accounts (i.e. the pgbench default), or with a skewed distribution that hashes the PRNG's value. Hashing like this simulates a workload where the skew isn't concentrated in one part of the key space (i.e. there is skew, but very popular values are scattered throughout the index evenly, rather than being concentrated together in just a few leaf pages). Can anyone think of an adversarial case that we may not do so well on with the new "only deduplicate within unique indexes when new item already has a duplicate" strategy? I'm having difficulty identifying some kind of worst case. -- Peter Geoghegan
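In rough C, the trigger described in this email boils down to something like the sketch below. All of the names here are illustrative only -- the actual patch reuses the binary search bounds cached in the insertion state, inside _bt_findinsertloc():

    #include <stdbool.h>

    /*
     * Sketch of the unique-index deduplication trigger.  Illustrative names,
     * not the patch's actual interfaces.
     */
    typedef struct UniqueInsertState
    {
        bool    page_is_full;            /* incoming tuple would force a page split */
        bool    incoming_has_duplicate;  /* cached binary search found an equal key */
    } UniqueInsertState;

    typedef enum LeafAction
    {
        ACTION_PLAIN_INSERT,
        ACTION_DEDUPLICATE,              /* try to make room by merging duplicates */
        ACTION_SPLIT_PAGE
    } LeafAction;

    static LeafAction
    choose_leaf_action(const UniqueInsertState *state)
    {
        if (!state->page_is_full)
            return ACTION_PLAIN_INSERT;

        /*
         * Only deduplicate when the incoming item already has a duplicate on
         * the page -- the signature of version churn from non-HOT updates.
         * Otherwise the page holds distinct values, a split is the right
         * long-term outcome, and there is no point in delaying it.
         */
        if (state->incoming_has_duplicate)
            return ACTION_DEDUPLICATE;

        return ACTION_SPLIT_PAGE;
    }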
On Tue, Dec 3, 2019 at 12:13 PM Peter Geoghegan <pg@bowt.ie> wrote: > The new criteria/heuristic for unique indexes is very simple: If a > unique index has an existing item that is a duplicate on the incoming > item at the point that we might have to split the page, then apply > deduplication. Otherwise (when the incoming item has no duplicates), > don't apply deduplication at all -- just accept that we'll have to > split the page. > the working/draft version of the patch will often avoid a huge amount of > bloat in a pgbench-style workload that has an extra index on the > pgbench_accounts table, to prevent HOT updates. The accounts primary > key (pgbench_accounts_pkey) hardly grows at all with the patch, but > grows 2x on master. I have numbers from my benchmark against my working copy of the patch, with this enhanced design for unique index deduplication. With an extra index on pgbench_accounts's abalance column (that is configured to not use deduplication for the test), and with the aid variable (i.e. UPDATEs on pgbench_accounts) configured to use skew, I have a variant of the standard pgbench TPC-B like benchmark. The pgbench script I used was as follows:

\set r random_gaussian(1, 100000 * :scale, 4.0)
\set aid abs(hash(:r)) % (100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

Results from interlaced 2 hour runs at pgbench scale 5,000 are as follows (shown in reverse chronological order):

master_2_run_16.out: "tps = 7263.948703 (including connections establishing)"
patch_2_run_16.out: "tps = 7505.358148 (including connections establishing)"
master_1_run_32.out: "tps = 9998.868764 (including connections establishing)"
patch_1_run_32.out: "tps = 9781.798606 (including connections establishing)"
master_1_run_16.out: "tps = 8812.269270 (including connections establishing)"
patch_1_run_16.out: "tps = 9455.476883 (including connections establishing)"

The patch comes out ahead in the first 2 hour run, with later runs looking like a more even match. I think that each run didn't last long enough to even out the effects of autovacuum, but this is really about index size rather than overall throughput, so it's not that important. (I need to get a large server to do further performance validation work, rather than just running overnight benchmarks on my main work machine like this.) The primary key index (pgbench_accounts_pkey) starts out at 10.45 GiB in size, and ends at 12.695 GiB in size with the patch. Whereas with master, it also starts out at 10.45 GiB, but finishes off at 19.392 GiB. Clearly this is a significant difference -- the index is only ~65% of its master-branch size with the patch. See attached tar archive with logs, and pg_buffercache output after each run. (The extra index on pgbench_accounts.abalance is pretty much the same size for patch/master, since deduplication was disabled for the patch runs.) And, as I said, I believe that we can make this unique index deduplication stuff an internal thing that isn't even documented (maybe a passing reference is appropriate when talking about general deduplication).
-- Peter Geoghegan
On Tue, Dec 3, 2019 at 12:13 PM Peter Geoghegan <pg@bowt.ie> wrote: > The new criteria/heuristic for unique indexes is very simple: If a > unique index has an existing item that is a duplicate on the incoming > item at the point that we might have to split the page, then apply > deduplication. Otherwise (when the incoming item has no duplicates), > don't apply deduplication at all -- just accept that we'll have to > split the page. We already cache the bounds of our initial binary > search in insert state, so we can reuse that information within > _bt_findinsertloc() when considering deduplication in unique indexes. Attached is v26, which adds this new criteria/heuristic for unique indexes. We now seem to consistently get good results with unique indexes. Other changes: * A commit message is now included for the main patch/commit. * The btree_deduplication GUC is now a boolean, since it is no longer up to the user to indicate when deduplication is appropriate in unique indexes (the new heuristic does that instead). The GUC now only affects non-unique indexes. * Simplified the user docs. They now only mention deduplication of unique indexes in passing, in line with the general idea that deduplication in unique indexes is an internal optimization. * Fixed bug that made backwards scans that touch posting lists fail to set LP_DEAD bits when that was possible (i.e. the kill_prior_tuple optimization wasn't always applied there with posting lists, for no good reason). Also documented the assumptions made by the new code in _bt_readpage()/_bt_killitems() -- if that was clearer in the first place, then the LP_DEAD/kill_prior_tuple bug might never have happened. * Fixed some memory leaks in nbtree VACUUM. Still waiting for some review of the first patch, to get it out of the way. Anastasia? -- Peter Geoghegan
On Thu, Dec 12, 2019 at 06:21:20PM -0800, Peter Geoghegan wrote: > On Tue, Dec 3, 2019 at 12:13 PM Peter Geoghegan <pg@bowt.ie> wrote: > > The new criteria/heuristic for unique indexes is very simple: If a > > unique index has an existing item that is a duplicate on the incoming > > item at the point that we might have to split the page, then apply > > deduplication. Otherwise (when the incoming item has no duplicates), > > don't apply deduplication at all -- just accept that we'll have to > > split the page. We already cache the bounds of our initial binary > > search in insert state, so we can reuse that information within > > _bt_findinsertloc() when considering deduplication in unique indexes. > > Attached is v26, which adds this new criteria/heuristic for unique > indexes. We now seem to consistently get good results with unique > indexes. In the past we tried to increase the number of cases where HOT updates can happen but were unable to. Would this help with non-HOT updates? Do we have any benchmarks where non-HOT updates cause slowdowns that we can test on this? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Tue, Dec 17, 2019 at 1:58 PM Bruce Momjian <bruce@momjian.us> wrote: > > Attached is v26, which adds this new criteria/heuristic for unique > > indexes. We now seem to consistently get good results with unique > > indexes. > > In the past we tried to increase the number of cases where HOT updates > can happen but were unable to. Right -- the WARM project. The Z-heap project won't change the fundamentals here. It isn't going to solve the fundamental problem of requiring that the index AM create a new set of physical index tuples in at least *some* cases. A heap tuple cannot be updated in-place when even one indexed column changes -- you're not much better off than you were with the classic heapam, because indexes get bloated in a way that wouldn't happen with Oracle. (Even still, Z-heap isn't sensitive to when and how opportunistic heap pruning takes place, and doesn't have the same issue with having to fit the heap tuple on the same page or create a new HOT chain. This will make things much better with some workloads.) > Would this help with non-HOT updates? Definitely, yes. The strategy used with unique indexes is specifically designed to avoid "unnecessary" page splits altogether -- it only makes sense because of the possibility of non-HOT UPDATEs with mostly-unchanged index tuples. Thinking about what's going on here from first principles is what drove the unique index deduplication design: With many real world unique indexes, the true reason behind most or all B-Tree page splits is "version churn". I view these page splits as a permanent solution to a temporary problem -- we *permanently* degrade the index structure in order to deal with a *temporary* burst in versions that need to be stored. That's really bad. Consider a classic pgbench workload, for example. The smaller indexes on the smaller tables (pgbench_tellers_pkey and pgbench_branches_pkey) have leaf pages that will almost certainly be split a few minutes in, even though the UPDATEs on the underlying tables never modify indexed columns (i.e. even though HOT is as effective as it possibly could be with this unthrottled workload). Actually, even the resulting split pages will themselves usually be split again, and maybe even once more after that. We started out with leaf pages that stored just under 370 items on each leaf page (with fillfactor 90 + 8KiB BLCKSZ), and end up with leaf pages that often have less than 50 items (sometimes as few as 10). Even though the "logical contents" of the index are *totally* unchanged. This could almost be considered pathological by users. Of course, it's easy to imagine a case where it matters a lot more than classic pgbench (pgbench_tellers_pkey and pgbench_branches_pkey are always small, so it's easy to see the effect, which is why I went with that example). For example, you could have a long running transaction, which would probably have the effect of significantly bloating even the large pgbench index (pgbench_accounts_pkey) -- typically you won't see that with classic pgbench until you do something to frustrate VACUUM (and opportunistic cleanup). (I have mostly been using non-HOT UPDATEs to test the patch, though.) In theory we could go even further than this by having some kind of version store for indexes, and using this to stash old versions rather than performing a page split. Then you wouldn't have any page splits in the pgbench indexes; VACUUM would eventually be able to return the index to its "pristine" state. 
The trade-off with that design would be that index scans would have to access two index pages for a while (a leaf page, plus its subsidiary old version page). Maybe we can actually go that far in the future -- there are various database research papers that describe designs like this (the designs described within these papers do things like determine whether a "version split" or a "value split" should be performed). What we have now is an incremental improvement, that doesn't have any apparent downside with unique indexes -- the way that deduplication is triggered for unique indexes is almost certain to be a win. When deduplication isn't triggered, everything works in the same way as before -- it's "zero overhead" for unique indexes that don't benefit. The design augments existing garbage collection mechanisms, particularly the way in which we set LP_DEAD bits within _bt_check_unique(). > Do we have any benchmarks where non-HOT updates cause slowdowns that we > can test on this? AFAICT, any workload that has lots of non-HOT updates will benefit at least a little bit -- indexes will finish up smaller, there will be higher throughput, and there will be a reduction in latency for queries. With the right distribution of values, it's not that hard to mostly control bloat in an index that doubles in size without the optimization, which is much more significant. I have already reported on this [1]. I've also been able to observe increases of 15%-20% in TPS with similar workloads (with commensurate reductions in query latency) more recently. This was with a simple gaussian distribution for pgbench_accounts.aid, and a non-unique index with deduplication enabled on pgbench_accounts.abalance. (The patch helps control the size of both indexes, especially the extra non-unique one.) [1] https://postgr.es/m/CAH2-WzkXHhjhmUYfVvu6afbojU97MST8RUT1U=hLd2W-GC5FNA@mail.gmail.com -- Peter Geoghegan
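As a back-of-the-envelope check on the "just under 370 items" figure mentioned above, the arithmetic works out as follows (assuming the usual 24 byte page header, a 16 byte nbtree special area, 4 byte line pointers, and 16 byte index tuples):

    #include <stdio.h>

    int
    main(void)
    {
        int     blcksz = 8192;        /* BLCKSZ */
        int     page_header = 24;     /* page header, MAXALIGN'd */
        int     special = 16;         /* nbtree special space */
        int     line_pointer = 4;     /* per-item line pointer */
        int     tuple = 16;           /* smallest possible IndexTuple */
        double  fillfactor = 0.90;    /* leaf fillfactor 90 */

        double  usable = (blcksz - page_header - special) * fillfactor;
        int     items = (int) (usable / (tuple + line_pointer));

        printf("~%d items per leaf page\n", items);   /* prints ~366 */
        return 0;
    }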
On Tue, Dec 17, 2019 at 03:30:33PM -0800, Peter Geoghegan wrote: > With many real world unique indexes, the true reason behind most or > all B-Tree page splits is "version churn". I view these page splits as > a permanent solution to a temporary problem -- we *permanently* > degrade the index structure in order to deal with a *temporary* burst > in versions that need to be stored. That's really bad. Yes, I was wondering why we would need to optimize duplicates in a unique index, but then remembered that it is a version problem. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Tue, Dec 17, 2019 at 5:18 PM Bruce Momjian <bruce@momjian.us> wrote: > On Tue, Dec 17, 2019 at 03:30:33PM -0800, Peter Geoghegan wrote: > > With many real world unique indexes, the true reason behind most or > > all B-Tree page splits is "version churn". I view these page splits as > > a permanent solution to a temporary problem -- we *permanently* > > degrade the index structure in order to deal with a *temporary* burst > > in versions that need to be stored. That's really bad. > > Yes, I was wondering why we would need to optimize duplicates in a unique > index, but then remembered that it is a version problem. The whole idea of deduplication in unique indexes is hard to explain. It just sounds odd. Also, it works using the same infrastructure as regular deduplication, while having rather different goals. Fortunately, it seems like we don't really have to tell users about it in order for them to see a benefit -- there will be no choice for them to make there (they just get it). The regular deduplication stuff isn't confusing at all, though. It has a noticeable though small downside, so it will be documented and configurable. (I'm optimistic that it can be enabled by default, because even with high cardinality non-unique indexes the downside is rather small -- we waste some CPU cycles just before a page is split.) -- Peter Geoghegan
On Thu, Dec 12, 2019 at 6:21 PM Peter Geoghegan <pg@bowt.ie> wrote: > Still waiting for some review of the first patch, to get it out of the > way. Anastasia? I plan to commit this first patch [1] in the next day or two, barring any objections. It's clear that the nbtree "pin scan" VACUUM code is totally unnecessary -- it really should have been fully removed by commit 3e4b7d87 back in 2016. [1] https://www.postgresql.org/message-id/flat/CAH2-WzkWLRDzCaxsGvA_pZoaix_2AC9S6%3D-D6JMLkQYhqrJuEg%40mail.gmail.com#daed349a71ff9d7ac726cc0e3e01a436 -- Peter Geoghegan
On Tue, Dec 17, 2019 at 7:27 PM Peter Geoghegan <pg@bowt.ie> wrote: > I plan to commit this first patch [1] in the next day or two, barring > any objections. I pushed this earlier today -- it became commit 9f83468b. Attached is v27, which fixes the bitrot against the master branch. Other changes: * Updated _bt_form_posting() to consistently MAXALIGN(). No behavioral changes here. The defensive SHORTALIGN()s we had in v26 should have been defensive MAXALIGN()s -- this has been fixed. We now also explain our precise assumptions around alignment. * Cleared up the situation around _bt_dedup_one_page()'s responsibilities as far as LP_DEAD items go. * Fixed a bug in 32 KiB BLCKSZ builds. We now apply an additional INDEX_SIZE_MASK cap on posting list tuple size. -- Peter Geoghegan
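To spell out why a 32 KiB BLCKSZ needs the extra cap: an index tuple's length is stored in the low 13 bits of t_info (INDEX_SIZE_MASK is 0x1FFF), so no posting list tuple can be allowed to grow past 8191 bytes regardless of page size. A rough sketch of the sizing rule, with the alignment handled more loosely than the real _bt_form_posting() does:

    #include <stddef.h>

    #define ALIGNOF_MAX         8      /* MAXIMUM_ALIGNOF on most platforms */
    #define MAXALIGN_SKETCH(l)  ((((size_t) (l)) + (ALIGNOF_MAX - 1)) & ~((size_t) (ALIGNOF_MAX - 1)))
    #define INDEX_SIZE_MASK     0x1FFF /* 13 bit size field in t_info */
    #define TID_SIZE            6      /* sizeof(ItemPointerData) */

    /* Rough size of a posting list tuple: key portion plus an array of heap TIDs */
    static size_t
    posting_tuple_size(size_t base_key_size, int nhtids)
    {
        return MAXALIGN_SKETCH(base_key_size + (size_t) nhtids * TID_SIZE);
    }

    /* Usable only if it fits both the page-based limit and t_info's size field */
    static int
    posting_tuple_size_ok(size_t size, size_t page_limit)
    {
        return size <= page_limit && size <= INDEX_SIZE_MASK;
    }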
On Thu, Dec 19, 2019 at 6:55 PM Peter Geoghegan <pg@bowt.ie> wrote: > I pushed this earlier today -- it became commit 9f83468b. Attached is > v27, which fixes the bitrot against the master branch. Attached is v28, which fixes bitrot from my recent commits to refactor VACUUM-related code in nbtpage.c. Other changes: * A big overhaul of the nbtree README changes -- "posting list splits" now becomes its own section. I tried to get the general idea across about posting lists in this new section without repeating myself too much. Posting list splits are probably the most subtle part of the overall design of the patch. Posting lists piggy-back on a standard atomic action (insertion into a leaf page, or leaf page split) on the one hand. On the other hand, they're a separate and independent step at the conceptual level. Hopefully the general idea comes across as clearly as possible. Some feedback on that would be good. * PageIndexTupleOverwrite() is now used for VACUUM's "updates", and has been taught to not unset an LP_DEAD bit that happens to already be set. As the comments added by my recent commit 4b25f5d0 now mention, it's important that VACUUM not unset LP_DEAD bits accidentally. VACUUM will falsely unset the BTP_HAS_GARBAGE page flag at times, which isn't ideal. Even still, unsetting LP_DEAD bits themselves is much worse (even though BTP_HAS_GARBAGE exists purely to hint that one or more LP_DEAD bits are set on the page). Maybe we should go further here, and reconsider whether or not VACUUM should *ever* unset BTP_HAS_GARBAGE. AFAICT, the only advantage of nbtree VACUUM clearing it is that doing so might save a backend a useless scan of the line pointer array to check for the LP_DEAD bits directly. But the backend will have to split the page when that happens anyway, which is a far greater cost. It's probably not even noticeable, since we're already doing lots of stuff with the page when it happens. The BTP_HAS_GARBAGE hint probably mattered back when the "getting tired" mechanism was used (i.e. prior to commit dd299df8). VACUUM sometimes had a choice to make about which page to use, so quickly getting an idea about LP_DEAD bits made a certain amount of sense...but that's not how it works anymore. (Granted, we still do it that way with pg_upgrade'd indexes from before Postgres 12, but I don't think that that needs to be given any weight now.) Thoughts on this? -- Peter Geoghegan
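For what it's worth, the LP_DEAD requirement is easy to picture from the caller's side. A sketch of what the overwrite must amount to (in the patch the preservation happens inside PageIndexTupleOverwrite() itself, so callers don't actually do this):

    #include "postgres.h"
    #include "access/itup.h"
    #include "storage/bufpage.h"
    #include "storage/itemid.h"

    /*
     * Sketch only: replace the tuple at "offnum" with a shrunken posting list
     * tuple without losing an LP_DEAD hint that is already set.
     */
    static void
    overwrite_preserving_dead_hint(Page page, OffsetNumber offnum,
                                   IndexTuple newtup, Size newsize)
    {
        bool    was_dead = ItemIdIsDead(PageGetItemId(page, offnum));

        if (!PageIndexTupleOverwrite(page, offnum, (Item) newtup, newsize))
            elog(ERROR, "failed to overwrite index tuple");

        /* The overwrite resets the line pointer's flags, so put the hint back */
        if (was_dead)
            ItemIdMarkDead(PageGetItemId(page, offnum));
    }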
On 04/01/2020 03:47, Peter Geoghegan wrote: > Attached is v28, which fixes bitrot from my recent commits to refactor > VACUUM-related code in nbtpage.c. I started to read through this gigantic patch. I got about 1/3 way through. I wrote minor comments directly in the attached patch file, search for "HEIKKI:". I wrote them as I read the patch from beginning to end, so it's possible that some of my questions are answered later in the patch. I didn't have the stamina to read through the whole patch yet, I'll continue later. One major design question here is about the LP_DEAD tuples. There's quite a lot of logic and heuristics and explanations related to unique indexes. To make them behave differently from non-unique indexes, to keep the LP_DEAD optimization effective. What if we had a separate LP_DEAD flag for every item in a posting list, instead? I think we wouldn't need to treat unique indexes differently from non-unique indexes, then. I tried to search this thread to see if that had been discussed already, but I didn't see anyone proposing that approach. Another important decision here is the on-disk format of these tuples. The format of IndexTuples on a b-tree page has become really complicated. The v12 changes to store TIDs in order did a lot of that, but this makes it even more complicated. I know there are strong backwards-compatibility reasons for the current format, but nevertheless, if we were to design this from scratch, what would the B-tree page and tuple format be like? - Heikki
On Wed, Jan 8, 2020 at 5:56 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > On 04/01/2020 03:47, Peter Geoghegan wrote: > > Attached is v28, which fixes bitrot from my recent commits to refactor > > VACUUM-related code in nbtpage.c. > > I started to read through this gigantic patch. Oh come on, it's not that big. :-) > I got about 1/3 way > through. I wrote minor comments directly in the attached patch file, > search for "HEIKKI:". I wrote them as I read the patch from beginning to > end, so it's possible that some of my questions are answered later in > the patch. I didn't have the stamina to read through the whole patch > yet, I'll continue later. Thanks for the review! Anything that you've written that I do not respond to directly can be assumed to have been accepted by me. I'll start with responses to the points that you raise in your patch that need a response Patch comments ============== * Furthermore, deduplication can be turned on or off as needed, or applied HEIKKI: When would it be needed? I believe that hardly anybody will want to turn off deduplication in practice. My point here is that we're flexible -- we're not maintaining posting lists like GIN. We're just deduplicating as and when needed. We can change our preference about that any time. Turning off deduplication won't magically undo past deduplications, of course, but everything mostly works in the same way when deduplication is on or off. We're just talking about an alternative physical representation of the same logical contents. * HEIKKI: How do LP_DEAD work on posting list tuples? Same as before, except that it applies to all TIDs in the tuple together (will mention this in commit message, though). Note that the fact that we delay deduplication also means that we delay merging the LP_DEAD bits. And we always prefer to remove LP_DEAD items. Finally, we refuse to do a posting list split when its LP_DEAD bit is set, so it's now possible to delete LP_DEAD bit set tuples a little early, before a page split has to be avoided -- see the new code and comments at the end of _bt_findinsertloc(). See also: my later response to your e-mail remarks on LP_DEAD bits, unique indexes, and space accounting. * HEIKKI: When is it [deduplication] not safe? With opclasses like btree/numeric_ops, where display scale messes things up. See this thread for more information on the infrastructure that we need for that: https://www.postgresql.org/message-id/flat/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com * HEIKKI: Why is it safe to read on version 3 indexes? Because unused space is set to zeros? Yes. Same applies to version 4 indexes that come from Postgres 12 -- users must REINDEX to call _bt_opclasses_support_dedup() and set metapage field, but we can rely on the new field being all zeroes before that happens. (It would be possible to teach pg_upgrade to set the field for compatible indexes from Postgres 12, but I don't want to bother with that. We probably cannot safely call _bt_opclasses_support_dedup() with a buffer lock held, so that seems like the only way.) * HEIKKI: Do we need it as a separate flag, isn't it always safe with version 4 indexes, and never with version 3? No, it isn't *always* safe with version 4 indexes, for reasons that have nothing to do with the on-disk representation (like the display scale issue, nondeterministic collations, etc). It really is a distinct condition. (Deduplication is never safe with version 3 indexes, obviously.) 
It occurs to me now that we probably don't even want to make the metapage field about deduplication (though that's what it says right now). Rather, it should be about supporting a general category of optimizations that include deduplication, and might also include prefix compression in the future. Note that whether or not we should actually apply these optimizations is always a separate question. * + * Non-pivot tuples complement pivot tuples, which only have key columns. HEIKKI: What does it mean that they complement pivot tuples? It means that all tuples are either pivot tuples, or are non-pivot tuples. * + * safely (index storage parameter separately indicates if deduplication is HEIKKI: Is there really an "index storage parameter" for that? What is that, something in the WITH clause? Yes, there is actually an index storage parameter named "deduplication" (something in the WITH clause). This is deliberately not named "btree_deduplication", the current name of the GUC. This exists to make the optimization controllable at the index level. (Though I should probably mention the GUC first in this code comment, or not even mention the less significant storage parameter.) * HEIKKI: How much memory does this [BTScanPosData.items array of width MaxBTreeIndexTuplesPerPage] need now? Should we consider pallocing this separately? But BTScanPosData isn't allocated on the stack anyway. * HEIKKI: Would it be more clear to have a separate struct for the posting list split case? (i.e. don't reuse xl_btree_insert) I doubt it, but I'm open to it. We don't do it that way in a number of existing cases. * HEIKKI: Do we only generate one posting list in one WAL record? I would assume it's better to deduplicate everything on the page, since we're modifying it anyway. You might be right about that. Let me get back to you on that. HEIKKI: Does this [xl_btree_vacuum WAL record] store a whole copy of the remaining posting list on an updated tuple? Wouldn't it be simpler and more space-efficient to store just the deleted TIDs? It would certainly be more space efficient in cases where we delete some but not all TIDs -- hard to know how much that matters. Don't think that it would be simpler, though. I have an open mind about this. I can try it the other way if you like. * HEIKKI: Do we ever do that? Do we ever set the LP_DEAD bit on a posting list tuple? As I said, we are able to set LP_DEAD bits on posting list tuples, if and only if all the TIDs are dead (i.e. if all-but-one TID is dead, it cannot be set). This limitation does not seem to matter in practice, in part because LP_DEAD bits can be set before we deduplicate -- that's another benefit of delaying deduplication until the point where we'd usually have to split the page. See also: my later response to your e-mail remarks on LP_DEAD bits, unique indexes, and space accounting. * HEIKKI: Well, it's optimized for that today, but if it [a posting list] was compressed, a btree would be useful in more situations... I agree, but I think that we should do compression by inventing a new type of leaf page that only stores TIDs, and use that when we do a single value mode split in nbtsplitloc.c. So we don't even use tuples at that point (except the high key), and we compress the entire page. That way, we don't have to worry about posting list splits and stuff like that, which seems like the best of both worlds. Maybe we can use a true bitmap on these special leaf pages. ... Now to answer the feedback from your actual e-mail ... 
E-mail ====== > One major design question here is about the LP_DEAD tuples. There's > quite a lot of logic and heuristics and explanations related to unique > indexes. The unique index stuff hasn't really been discussed on the thread until now. Those parts are all my work. > To make them behave differently from non-unique indexes, to > keep the LP_DEAD optimization effective. What if we had a separate > LP_DEAD flag for every item in a posting list, instead? I think we > wouldn't need to treat unique indexes differently from non-unique > indexes, then. I don't think that that's quite true -- it's not so much about LP_DEAD bits as it is about our *goals* with unique indexes. We have no reason to deduplicate other than to delay an immediate page split, so it isn't really about space efficiency. Having individual LP_DEAD bits for each TID wouldn't change the picture for _bt_dedup_one_page() -- I would still want a checkingunique flag there. But individual LP_DEAD bits would make a lot of other things much more complicated. Unique indexes are kind of special, in general. The thing that I prioritized keeping simple in the patch is page space accounting, particularly the nbtsplitloc.c logic, which doesn't need any changes to continue to work (it's also important for page space accounting to be simple within _bt_dedup_one_page()). I did teach nbtsplitloc.c to take posting lists from the firstright tuple into account, but only because they're often unusually large, making it a worthwhile optimization. Exactly the same thing could take place with non-key INCLUDE columns, but typically the extra payload is not very large, so I haven't bothered with that before now. If you had a "supplemental header" to store per-TID LP_DEAD bits, that would make things complicated for page space accounting. Even if it was only one byte, you'd have to worry about it taking up an entire extra MAXALIGN() quantum within _bt_dedup_one_page(). And then there is the complexity within _bt_killitems(), needed to make the kill_prior_tuple stuff work. I might actually try to do it that way if I thought that it would perform better, or be simpler than what I came up with. I doubt that, though. In summary: while it would be possible to have per-TID LP_DEAD bits, but I don't think it would be even remotely worth it. I can go into my doubts about the performance benefits if you want. Note also: I tend to think of the LP_DEAD bit setting within _bt_check_unique() as almost a separate optimization to the kill_prior_tuple stuff, even though they both involve LP_DEAD bits. The former is much more important than the latter. The kill_prior_tuple thing was severely regressed in Postgres 9.5 without anyone really noticing [1]. > Another important decision here is the on-disk format of these tuples. > The format of IndexTuples on a b-tree page has become really > complicated. The v12 changes to store TIDs in order did a lot of that, > but this makes it even more complicated. It adds two new functions: BTreeTupleIsPivot(), and BTreeTupleIsPosting(). This means that there are three basic kinds of B-Tree tuple layout. We can detect which kind any given tuple is in a low context way. The three possible cases are: * Pivot tuples. * Posting list tuples (non-pivot tuples that have at least two head TIDs). * Regular/small non-pivot tuples -- this representation has never changed in all the time I've worked on Postgres. 
You'll notice that there are lots of new assertions, including in places that don't have anything to do with the new code -- BTreeTupleIsPivot() and BTreeTupleIsPosting() assertions. I think that there is only really one wart here that tends to come up outside the nbtree.h code itself again and again: the fact that !heapkeyspace indexes may give false negatives when BTreeTupleIsPivot() is used. So any BTreeTupleIsPivot() assertion has to include some nearby heapkeyspace field to cover that case (or else make sure that the index is a v4+/heapkeyspace index in some other way). Note, however, that we can safely assert !BTreeTupleIsPivot() -- that won't produce spurious assertion failures with heapkeyspace indexes. Note also that the new BTreeTupleIsPosting() function works reliably on all B-Tree versions. The only future requirements that I can anticipate for the tuple format in are: 1. The need to support wider TIDs. (I am strongly of the opinion that this shouldn't work all that differently to what we have now.) 2. The need for a page-level prefix compression feature. This can work by requiring decompression code to assume that the common prefix for the page just isn't present. This seems doable within the confines of the current/proposed B-Tree tuple format. Though we still need to have a serious discussion about the future of TIDs in light of stuff like ZedStore. I think that fully logical table identifiers are worth supporting, but they had better behave pretty much like a TID within index access method code -- they better show temporal and spatial locality in about the same way TIDs do. They should be compared as generic integers, and accept reasonable limits on TID width. It should be possible to do cheap binary searches on posting lists in about the same way. > I know there are strong > backwards-compatibility reasons for the current format, but > nevertheless, if we were to design this from scratch, what would the > B-tree page and tuple format be like? That's a good question, but my answer depends on the scope of the question. If you define "from scratch" to mean "5 years ago", then I believe that it would be exactly the same as what we have now. I specifically anticipated the need to have posting list TIDs (existing v12 era comments in nbtree.h and amcheck things about posting lists). And what I came up with is almost the same as the GIN format, except that we have explicit pivot tuples (to make suffix truncation work), and use the 13th IndexTupleData header bit (INDEX_ALT_TID_MASK) in a way that makes it possible to store non-pivot tuples in a space-efficient way when they are all unique. A plain GIN tuple used an extra MAXALIGN() quantum to store an entry tree tuple that only has one TID. If, on the other hand, you're talking about a totally green field situation, then I would probably not use IndexTuple at all. I think that a representation that stores offsets right in the tuple header (so no separate varlena headers) has more advantages than disadvantages. It would make it easier to do both suffix truncation and prefix compression. It also makes it cheap to skip to the end of the tuple. In general, it would be nice if the IndexTupleData TID was less special, but that assumption is baked into a lot of code -- most of which is technically not in nbtree. We expect very specific things about the alignment of TIDs -- they are assumed to be 3 SHORTALIGN()'d uint16 fields. 
Within nbtree, we assume SHORTALIGN()'d access to the t_info field by IndexTupleSize() will be okay within btree_xlog_split(). I bet that there are a number of subtle assumptions about our use of IndexTupleData + ItemPointerData that we have no idea about right now. So changing it won't be that easy. As for page level stuff, I believe that we mostly do things the right way already. I would prefer it if the line pointer array was at the end of the page so that tuples could go at the start of the page, and be appending front to back (maybe the special area would still be at the end). That's a very general issue, though -- Andres says that that would help branch prediction, though I'm not sure of the details offhand. Questions ========= Finally, I have some specific questions for you about the patch: 1. How do you feel about the design of posting list splits, and my explanation of that design in the nbtree README? 2. How do you feel about the idea of stopping VACUUM from clearing the BTP_HAS_GARBAGE page level flag? I suspect that it's much better to have it falsely set than to have it falsely unset. The extra cost is that we do a useless extra call to _bt_vacuum_one_page(), but that's very cheap in the context of having to deal with a page that's full, that we might have to split (or deduplicate) anyway. But the extra benefit could perhaps be quite large. This question doesn't really have that much to do with deduplication. [1] https://www.postgresql.org/message-id/flat/CAH2-Wz%3DSfAKVMv1x9Jh19EJ8am8TZn9f-yECipS9HrrRqSswnA%40mail.gmail.com#b20ead9675225f12b6a80e53e19eed9d -- Peter Geoghegan
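To visualize the three layouts listed above, they amount to a two-bit classifier along these lines (placeholder flag names, and ignoring the !heapkeyspace false-negative caveat; the real macros are BTreeTupleIsPivot() and BTreeTupleIsPosting()):

    /*
     * Illustrative classifier for the three nbtree tuple layouts.  ALT_FORMAT
     * and IS_POSTING are placeholders, not the patch's actual flag bits.
     */
    typedef enum BTTupleKind
    {
        BT_REGULAR_NONPIVOT,    /* plain heap TID in t_tid; unchanged format */
        BT_PIVOT,               /* key columns only, possibly truncated */
        BT_POSTING              /* key plus a sorted list of two or more heap TIDs */
    } BTTupleKind;

    #define ALT_FORMAT  0x0001  /* placeholder: tuple uses an alternative layout */
    #define IS_POSTING  0x0002  /* placeholder: the alternative layout is a posting list */

    static BTTupleKind
    classify_tuple(unsigned int flags)
    {
        if ((flags & ALT_FORMAT) == 0)
            return BT_REGULAR_NONPIVOT;
        if (flags & IS_POSTING)
            return BT_POSTING;
        return BT_PIVOT;
    }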
On Wed, Jan 8, 2020 at 2:56 PM Peter Geoghegan <pg@bowt.ie> wrote: > Thanks for the review! Anything that you've written that I do not > respond to directly can be assumed to have been accepted by me. Here is a version with most of the individual changes you asked for -- this is v29. I just pushed a couple of small tweaks to nbtree.h, that you suggested I go ahead with immediately. v29 also refactors some of the "single value strategy" stuff in nbtdedup.c. This is code that anticipates the needs of nbtsplitloc.c's single value strategy -- deduplication is designed to work together with page splits/nbtsplitloc.c. Still, v29 doesn't resolve the following points you've raised, where I haven't reached a final opinion on what to do myself. These items are as follows (I'm quoting your modified patch file sent on January 8th here): * HEIKKI: Do we only generate one posting list in one WAL record? I would assume it's better to deduplicate everything on the page, since we're modifying it anyway. * HEIKKI: Does xl_btree_vacuum WAL record store a whole copy of the remaining posting list on an updated tuple? Wouldn't it be simpler and more space-efficient to store just the deleted TIDs? * HEIKKI: Would it be more clear to have a separate struct for the posting list split case? (i.e. don't reuse xl_btree_insert) v29 of the patch also doesn't change anything about how LP_DEAD bits work, apart from going into the LP_DEAD stuff in the commit message. This doesn't seem to be in the same category as the other three open items, since it seems like we disagree here -- that must be worked out through further discussion and/or benchmarking. -- Peter Geoghegan
On Fri, Jan 10, 2020 at 1:36 PM Peter Geoghegan <pg@bowt.ie> wrote: > * HEIKKI: Do we only generate one posting list in one WAL record? I > would assume it's better to deduplicate everything on the page, since > we're modifying it anyway. Still thinking about this one. > * HEIKKI: Does xl_btree_vacuum WAL record store a whole copy of the > remaining posting list on an updated tuple? Wouldn't it be simpler and > more space-efficient to store just the deleted TIDs? This probably makes sense. The btreevacuumposting() code that generates "updated" index tuples (tuples that VACUUM uses to replace existing ones when some but not all of the TIDs need to be removed) was derived from GIN's ginVacuumItemPointers(). That approach works well enough, but we can do better now. It shouldn't be that hard. My preferred approach is a little different to your suggested approach of storing the deleted TIDs directly. I would like to make it work by storing an array of uint16 offsets into a posting list, one array per "updated" tuple (with one offset per deleted TID within each array). These arrays (which must include an array size indicator at the start) can appear in the xl_btree_vacuum record, at the same place the patch currently puts a raw IndexTuple. They'd be equivalent to a raw IndexTuple -- the REDO routine would reconstruct the same raw posting list tuple on its own. This approach seems simpler, and is clearly very space efficient. This approach is similar to the approach used by REDO routines to handle posting list splits. Posting list splits must call _bt_swap_posting() on the primary, while the corresponding REDO routines also call _bt_swap_posting(). For space efficient "updates", we'd have to invent a sibling utility function -- we could call it _bt_delete_posting(), and put it next to _bt_swap_posting() within nbtdedup.c. How do you feel about that approach? (And how do you feel about the existing "REDO routines call _bt_swap_posting()" business that it's based on?) > * HEIKKI: Would it be more clear to have a separate struct for the > posting list split case? (i.e. don't reuse xl_btree_insert) I've concluded that this one probably isn't worthwhile. We'd have to carry a totally separate record on the stack within _bt_insertonpg(). If you feel strongly about it, I will reconsider. -- Peter Geoghegan
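For concreteness, the shape of one such per-tuple entry in the xl_btree_vacuum record might look roughly like this (a hypothetical struct, just to show the layout being proposed):

    #include <stdint.h>

    /*
     * Hypothetical sketch of one "update" entry in an xl_btree_vacuum record.
     * Only the positions of the TIDs to delete are logged; REDO re-reads the
     * existing posting list tuple and rebuilds it without those TIDs.
     */
    typedef struct xl_btree_update_sketch
    {
        uint16_t    ndeletedtids;   /* length of the array that follows */
        /* uint16_t deletetids[ndeletedtids] follows, sorted in ascending order */
    } xl_btree_update_sketch;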
On Fri, Jan 10, 2020 at 1:36 PM Peter Geoghegan <pg@bowt.ie> wrote: > Still, v29 doesn't resolve the following points you've raised, where I > haven't reached a final opinion on what to do myself. These items are > as follows (I'm quoting your modified patch file sent on January 8th > here): Still no progress on these items, but I am now posting v30. A new version seems warranted, because I now want to revive a patch from a couple of years back as part of the deduplication project -- it would be good to get feedback on that sooner rather than later. This is a patch that you [Heikki] are already familiar with -- the patch to speed up compactify_tuples() [1]. Sokolov Yura is CC'd here, since he is the original author. The deduplication patch is much faster with this in place. For example, with v30:

pg@regression:5432 [25216]=# create unlogged table foo(bar int4);
CREATE TABLE
pg@regression:5432 [25216]=# create index unlogged_foo_idx on foo(bar);
CREATE INDEX
pg@regression:5432 [25216]=# insert into foo select g from generate_series(1, 1000000) g, generate_series(1,10) i;
INSERT 0 10000000
Time: 17842.455 ms (00:17.842)

If I revert the "Bucket sort for compactify_tuples" commit locally, then the same insert statement takes 31.614 seconds! In other words, the insert statement is made ~77% faster by that commit alone. The improvement is stable and reproducible. Clearly there is a big compactify_tuples() bottleneck that comes from PageIndexMultiDelete(). The hot spot is quite visible with "perf top -e branch-misses". The compactify_tuples() patch stalled because it wasn't clear if it was worth the trouble at the time. It was originally written to address a much smaller PageRepairFragmentation() bottleneck in heap pruning. ISTM that deduplication alone is a good enough reason to commit this patch. I haven't really changed anything about the 2017/2018 patch -- I need to do more review of that. We probably don't need the qsort() inlining stuff (the bucket sort thing is the real win), but I included it in v30 all the same. Other changes in v30: * We now avoid extra _bt_compare() calls within _bt_check_unique() -- no need to call _bt_compare() once per TID (once per equal tuple is quite enough). This is a noticeable performance win, even though the change was originally intended to make the logic in _bt_check_unique() clearer. * Reduced the limit on the size of a posting list tuple to 1/6 of a page -- down from 1/3. This seems like a good idea on the grounds that it keeps our options open if we split a page full of duplicates due to UPDATEs rather than INSERTs (i.e. we split a page full of duplicates that isn't also the rightmost page among pages that store only those duplicates). A lower limit is more conservative, and yet doesn't cost us that much space. * Refined nbtsort.c/CREATE INDEX to work sensibly with non-standard fillfactor settings. This last item is a minor bugfix, really. [1] https://commitfest.postgresql.org/14/1138/ -- Peter Geoghegan
On Tue, Jan 14, 2020 at 6:08 PM Peter Geoghegan <pg@bowt.ie> wrote: > Still no progress on these items, but I am now posting v30. A new > version seems warranted, because I now want to revive a patch from a > couple of years back as part of the deduplication project -- it would > be good to get feedback on that sooner rather than later. Actually, I decided that this wasn't necessary -- I won't be touching compactify_tuples() at all (at least not right now). Deduplication doesn't even need to use PageIndexMultiDelete() in the attached revision of the patch, v31, so speeding up compactify_tuples() is no longer relevant. v31 simplifies everything quite a bit. This is something that I came up with more or less as a result of following Heikki's feedback. I found that reviving the v17 approach of using a temp page buffer in _bt_dedup_one_page() (much like _bt_split() always has) was a good idea. This approach was initially revived in order to make dedup WAL logging work on a whole-page basis -- Heikki suggested we do it that way, and so now we do. But this approach is also a lot faster in general, and has additional benefits besides that. When we abandoned the temp buffer approach back in September of last year, the unique index stuff was totally immature and unsettled, and it looked like a very incremental approach might make sense for unique indexes. It doesn't seem like a good idea now, though. In fact, I no longer even believe that a custom checkingunique/unique index strategy in _bt_dedup_one_page() is useful. That is also removed in v31, which will also make Heikki happy -- he expressed a separate concern about the extra complexity there. I've done a lot of optimization work since September, making these simplification possible now. The problems that I saw that justified the complexity seem to have gone away now. I'm pretty sure that the recent _bt_check_unique() posting list tuple _bt_compare() optimization is the biggest part of that. The checkingunique/unique index strategy in _bt_dedup_one_page() always felt overfit to my microbenchmarks, so I'm glad to be rid of it. Note that v31 changes nothing about how we think about deduplication in unique indexes in general, nor how it is presented to users. There is still special criteria around how deduplication is *triggered* in unique indexes. We continue to trigger a deduplication pass based on seeing a duplicate within _bt_check_unique() + _bt_findinsertloc() -- otherwise we never attempt deduplication in a unique index (same as before). Plus the GUC still doesn't affect unique indexes, unique index deduplication still isn't really documented in the user docs (it just gets a passing mention in B-Tree internals section), etc. In my opinion, the patch is now pretty close to being committable. I do have two outstanding open items for the patch, though. These items are: * We still need infrastructure that marks B-Tree opclasses as safe for deduplication, to avoid things like the numeric display scale problem, collations that are unsafe for deduplication because they're nondeterministic, etc. I talked to Anastasia about this over private e-mail recently. This is going well; I'm expecting a revision later this week. It will be based on all feedback to date over on the other thread [1] that we have for this part of the project. * Make VACUUM's WAL record more space efficient when it contains one or more "updates" to an existing posting list tuple. 
Currently, when VACUUM must delete some but not all TIDs from a posting list, we generate a new posting list tuple and dump it into the WAL stream -- the REDO routine simply overwrites the existing item with a version lacking the TIDs that have to go. This could be space inefficient with certain workloads, such as workloads where only one or two TIDs are deleted from a very large posting list tuple again and again. Heikki suggested I do something about this. I intend to at least research the problem, and can probably go ahead with implementing it without any trouble.

What nbtree VACUUM does in the patch right now is roughly the same as what GIN's VACUUM does for posting lists within posting tree pages -- see ginVacuumPostingTreeLeaf() (we're horribly inefficient about WAL logging when VACUUM'ing a GIN entry tree leaf page, which works differently, and isn't what I'm talking about -- see ginVacuumEntryPage()). We might as well do better than GIN/ginVacuumPostingTreeLeaf() here if we can.

The patch is pretty clever about minimizing the volume of WAL in all other contexts, managing to avoid any other case of what could be described as "WAL space amplification". Maybe we should do the same with the xl_btree_vacuum record just to be *consistent* about it.

[1] https://www.postgresql.org/message-id/flat/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
--
Peter Geoghegan
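To make the unique-index trigger condition described above concrete, here is a minimal sketch of a workload where a unique index accumulates duplicates (the table, index, and row counts are illustrative only, not taken from the patch or its tests):

CREATE TABLE accounts (id integer PRIMARY KEY, balance integer NOT NULL);
CREATE INDEX accounts_balance_idx ON accounts (balance);
INSERT INTO accounts SELECT g, 0 FROM generate_series(1, 1000000) g;
-- Updating an indexed column defeats HOT, so every updated row gets a
-- second accounts_pkey entry for its (unchanged) id until VACUUM runs.
-- Each of those insertions passes through _bt_check_unique(), which sees
-- the older version of the same id -- the duplicate-within-a-unique-index
-- condition that triggers a deduplication pass in the patch.
UPDATE accounts SET balance = balance + 1;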
On Tue, Jan 28, 2020 at 5:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
> In my opinion, the patch is now pretty close to being committable.

Attached is v32, which is even closer to being committable.

> I do have two outstanding open items for the patch, though. These items
> are:
>
> * We still need infrastructure that marks B-Tree opclasses as safe for
> deduplication, to avoid things like the numeric display scale problem,
> collations that are unsafe for deduplication because they're
> nondeterministic, etc.

No progress on this item for v32, though. It's now my only open item for this entire project. Getting very close.

> * Make VACUUM's WAL record more space efficient when it contains one
> or more "updates" to an existing posting list tuple.

* I've focussed on this item in v32 -- it has been closed out.

v32 doesn't explicitly WAL-log post-update index tuples during vacuuming of posting list tuples, making the WAL records a lot smaller in some cases. v32 represents the posting list TIDs that must be deleted instead. It does this in the most WAL-space-efficient manner possible: by storing an array of uint16 offsets for each "updated" posting list within xl_btree_vacuum records -- each entry in each array is an offset to remove (i.e. a TID that should not appear in the updated version of the tuple).

We use a new nbtdedup.c utility function for this, _bt_update_posting(). The new function is similar to its neighbor function, _bt_swap_posting(), which is the nbtdedup.c utility function used during posting list splits. Just like _bt_swap_posting(), we call _bt_update_posting() both during the initial action, and from the REDO routine that replays that action.

Performing vacuuming of posting list tuples this way seems to matter with larger databases that depend on deduplication to control bloat, though I haven't taken the time to figure out exactly how much it matters. I'm satisfied that this is worth having based on microbenchmarks that measure WAL volume using pg_waldump. One microbenchmark showed something like a 10x decrease in the size of all xl_btree_vacuum records taken together compared to v31.

I'm pretty sure that v32 makes it all but impossible for deduplication to write out more WAL than an equivalent case with deduplication disabled (I'm excluding FPIs here, of course -- full_page_writes=on cases will see significant benefits from reduced FPIs, simply by having fewer index pages). The per-leaf-page WAL record header accounts for a lot of the space overhead of xl_btree_vacuum records, and we naturally reduce that overhead when deduplicating, so we can now noticeably come out ahead when it comes to overall WAL volume. I wouldn't say that reducing WAL volume (other than FPIs) is actually a goal of this project, but it might end up happening anyway. Apparently Microsoft Azure PostgreSQL uses full_page_writes=off, so not everyone cares about the number of FPIs (everyone cares about raw record size, though).

* Removed the GUC that controls the use of deduplication in this new version, per discussion with Robert over on the "Enabling B-Tree deduplication by default" thread. Perhaps we can get by with only an index storage parameter. Let's defer this until after the Postgres 13 beta period is over, and we get feedback from testers.

* Turned the documentation on deduplication in the B-Tree internals chapter into a more general discussion of the on-disk format that covers deduplication.
Deduplication enhances this on-disk representation, and discussing it outside that wider context always felt awkward to me. Having this kind of discussion in the docs seems like a good idea anyway. -- Peter Geoghegan
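As a sketch of the kind of setup that exercises the xl_btree_vacuum path discussed above -- large posting lists from which VACUUM trims only a few TIDs at a time -- something like the following could be used; the table, index name, and proportions are made up for illustration, and the resulting WAL volume would then be compared with pg_waldump, as described in the mail:

CREATE TABLE dup (val integer);
CREATE INDEX dup_val_idx ON dup (val);
-- 100 distinct key values with 10,000 rows each: plenty of large
-- posting lists after deduplication.
INSERT INTO dup SELECT g % 100 FROM generate_series(1, 1000000) g;
-- Delete roughly 1% of the rows, scattered across all keys, so VACUUM
-- must remove a handful of TIDs from many large posting list tuples.
DELETE FROM dup WHERE random() < 0.01;
VACUUM dup;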
On Thu, Feb 6, 2020 at 6:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v32, which is even closer to being committable.

Attached is v33, which adds the last piece we need: opclass infrastructure that tells nbtree whether or not deduplication can be applied safely. This is based on work by Anastasia that was shared with me privately.

I may not end up committing 0001-* as a separate patch, but it makes sense to post it that way to make review easier -- this is supposed to be infrastructure that isn't just useful for the deduplication patch. 0001-* adds a new C function, _bt_allequalimage(), which only actually gets called within code added by 0002-* (i.e. the patch that adds the deduplication feature).

At this point, my main concern is that I might not have the API exactly right in a world where these new support functions are used by more than just the nbtree deduplication feature. I would like to get detailed review of the new opclass infrastructure stuff, and have asked for it directly, but I don't think that committing the patch needs to block on that.

I've now written a fair amount of documentation for both the feature and the underlying opclass infrastructure. It probably needs a bit more copy-editing, but I think that it's generally in fairly good shape. It might be a good idea for those who would like to review the opclass stuff to start with some of my btree.sgml changes, and work backwards -- the shape of the API itself is the important thing within the 0001-* patch.

New opclass proc
================

In general, supporting deduplication is the rule for B-Tree opclasses, rather than the exception. Most can use the generic btequalimagedatum() routine as their support function 4, which unconditionally indicates that deduplication is safe. There is a new test that tries to catch opclasses that omitted to do this. Here are the opr_sanity.out changes added by the first patch:

-- Almost all Btree opclasses can use the generic btequalimagedatum function
-- as their equalimage proc (support function 4). Look for opclasses that
-- don't do so; newly added Btree opclasses will usually be able to support
-- deduplication with little trouble.
SELECT amproc::regproc AS proc, opf.opfname AS opfamily_name,
       opc.opcname AS opclass_name, opc.opcintype::regtype AS opcintype
FROM pg_am am
JOIN pg_opclass opc ON opc.opcmethod = am.oid
JOIN pg_opfamily opf ON opc.opcfamily = opf.oid
LEFT JOIN pg_amproc ON amprocfamily = opf.oid AND
    amproclefttype = opcintype AND
    amprocnum = 4
WHERE am.amname = 'btree' AND
    amproc IS DISTINCT FROM 'btequalimagedatum'::regproc
ORDER BY amproc::regproc::text, opfamily_name, opclass_name;
       proc        |  opfamily_name   |   opclass_name   |    opcintype
-------------------+------------------+------------------+------------------
 bpchar_equalimage | bpchar_ops       | bpchar_ops       | character
 btnameequalimage  | text_ops         | name_ops         | name
 bttextequalimage  | text_ops         | text_ops         | text
 bttextequalimage  | text_ops         | varchar_ops      | text
                   | array_ops        | array_ops        | anyarray
                   | enum_ops         | enum_ops         | anyenum
                   | float_ops        | float4_ops       | real
                   | float_ops        | float8_ops       | double precision
                   | jsonb_ops        | jsonb_ops        | jsonb
                   | money_ops        | money_ops        | money
                   | numeric_ops      | numeric_ops      | numeric
                   | range_ops        | range_ops        | anyrange
                   | record_image_ops | record_image_ops | record
                   | record_ops       | record_ops       | record
                   | tsquery_ops      | tsquery_ops      | tsquery
                   | tsvector_ops     | tsvector_ops     | tsvector
(16 rows)

Those types/opclasses that you see here with a "proc" that is NULL cannot use deduplication under any circumstances -- they have no pg_amproc entry for B-Tree support function 4. The other four rows at the start (those with a non-NULL "proc") are for collatable types, where using deduplication is conditioned on not using a nondeterministic collation. The details are in the sgml docs for the second patch, where I go into the issue with numeric display scale, why nondeterministic collations disable the use of deduplication, etc.

Note that these "equalimage" procs don't take any arguments, which is a first for an index AM support function. Even still, we can take a collation at CREATE INDEX time using the standard PG_GET_COLLATION() mechanism. I suppose that it's a little bit odd to have no arguments but still call PG_GET_COLLATION() in certain support functions. Still, it works just fine, at least as far as the needs of deduplication are concerned.

Since using deduplication is supposed to pretty much be the norm from now on, it seemed like it might make sense to add a NOTICE about it during CREATE INDEX -- a notice letting the user know that it isn't being used due to a lack of opclass support:

regression=# create table foo(bar numeric);
CREATE TABLE
regression=# create index on foo(bar);
NOTICE: index "foo_bar_idx" cannot use deduplication
CREATE INDEX

Note that this NOTICE isn't seen with an INCLUDE index, since that's expected to not support deduplication.

I have a feeling that not everybody will like this, which is why I'm pointing it out.

Thoughts?

--
Peter Geoghegan
From: Anastasia Lubennikova
On 14.02.2020 05:57, Peter Geoghegan wrote: > Attached is v33, which adds the last piece we need: opclass > infrastructure that tells nbtree whether or not deduplication can be > applied safely. This is based on work by Anastasia that was shared > with me privately. Thank you for this work. I've looked through the patches and they seem to be ready for commit. I haven't yet read recent documentation and readme changes, so maybe I'll send some more feedback tomorrow. > New opclass proc > ================ > > In general, supporting deduplication is the rule for B-Tree opclasses, > rather than the exception. Most can use the generic > btequalimagedatum() routine as their support function 4, which > unconditionally indicates that deduplication is safe. There is a new > test that tries to catch opclasses that omitted to do this. Here is > the opr_sanity.out changes added by the first patch: > > -- Almost all Btree opclasses can use the generic btequalimagedatum function > -- as their equalimage proc (support function 4). Look for opclasses that > -- don't do so; newly added Btree opclasses will usually be able to support > -- deduplication with little trouble. > SELECT amproc::regproc AS proc, opf.opfname AS opfamily_name, > opc.opcname AS opclass_name, opc.opcintype::regtype AS opcintype > FROM pg_am am > JOIN pg_opclass opc ON opc.opcmethod = am.oid > JOIN pg_opfamily opf ON opc.opcfamily = opf.oid > LEFT JOIN pg_amproc ON amprocfamily = opf.oid AND > amproclefttype = opcintype AND > amprocnum = 4 > WHERE am.amname = 'btree' AND > amproc IS DISTINCT FROM 'btequalimagedatum'::regproc > ORDER BY amproc::regproc::text, opfamily_name, opclass_name; > proc | opfamily_name | opclass_name | opcintype > -------------------+------------------+------------------+------------------ > bpchar_equalimage | bpchar_ops | bpchar_ops | character > btnameequalimage | text_ops | name_ops | name > bttextequalimage | text_ops | text_ops | text > bttextequalimage | text_ops | varchar_ops | text > | array_ops | array_ops | anyarray > | enum_ops | enum_ops | anyenum > | float_ops | float4_ops | real > | float_ops | float8_ops | double precision > | jsonb_ops | jsonb_ops | jsonb > | money_ops | money_ops | money > | numeric_ops | numeric_ops | numeric > | range_ops | range_ops | anyrange > | record_image_ops | record_image_ops | record > | record_ops | record_ops | record > | tsquery_ops | tsquery_ops | tsquery > | tsvector_ops | tsvector_ops | tsvector > (16 rows) > Is there any specific reason, why we need separate btnameequalimage, bpchar_equalimage and bttextequalimage functions? As far as I see, they have the same implementation. > Since using deduplication is supposed to pretty much be the norm from > now on, it seemed like it might make sense to add a NOTICE about it > during CREATE INDEX -- a notice letting the user know that it isn't > being used due to a lack of opclass support: > > regression=# create table foo(bar numeric); > CREATE TABLE > regression=# create index on foo(bar); > NOTICE: index "foo_bar_idx" cannot use deduplication > CREATE INDEX > > Note that this NOTICE isn't seen with an INCLUDE index, since that's > expected to not support deduplication. > > I have a feeling that not everybody will like this, which is why I'm > pointing it out. > > Thoughts? I would simply move it to debug level for all cases. Since from user's perspective it doesn't differ that much from the case where deduplication is applicable in general, but not very efficient due to data distribution. 
I also noticed that this is not consistent with ALTER INDEX. For example, alter index idx_n set (deduplicate_items = true); won't show any message about deduplication.

I've tried several combinations with an index on a numeric column:

1)
postgres=# create index idx_nd on tbl (n) with (deduplicate_items = true);
NOTICE: index "idx_nd" cannot use deduplication
CREATE INDEX

Here the message seems appropriate. I don't think we should restrict creation of the index even when the deduplicate_items parameter is set explicitly; rather, we may warn the user that it won't be efficient.

2)
postgres=# create index idx_n on tbl (n) with (deduplicate_items = false);
NOTICE: index "idx_n" cannot use deduplication
CREATE INDEX

In this case the message seems slightly strange to me. Why should we show a notice about the fact that deduplication is not possible if that is exactly what was requested?

3)
postgres=# create index idx on tbl (n);
NOTICE: index "idx" cannot use deduplication

In my opinion, this message is too specific for default behavior. It exposes internal details without explanation and may look to the user like something went wrong.
On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Thank you for this work. I've looked through the patches and they seem > to be ready for commit. > I haven't yet read recent documentation and readme changes, so maybe > I'll send some more feedback tomorrow. Great. > Is there any specific reason, why we need separate btnameequalimage, > bpchar_equalimage and bttextequalimage functions? > As far as I see, they have the same implementation. Not really. This approach allows us to reverse the decision to enable deduplication in a point release, which is theoretically useful. OTOH, if that's so important, why not have many more support function 4 implementations (one per opclass)? I suspect that we would just disable deduplication in a hard-coded fashion if we needed to disable it due to some issue that transpired. For example, we could do this by modifying _bt_allequalimage() itself. > I would simply move it to debug level for all cases. Since from user's > perspective it doesn't differ that much from the case where > deduplication is applicable in general, but not very efficient due to > data distribution. I was more concerned about cases where the user would really like to use deduplication, but wants to make sure that it gets used. And doesn't want to install pageinspect to find out. > I also noticed that this is not consistent with ALTER index. For > example, alter index idx_n set (deduplicate_items =true); won't show any > message about deduplication. But that's a change in the user's preference. Not a change in whether or not it's safe in principle. > In my opinion, this message is too specific for default behavior. It > exposes internal details without explanation and may look to user like > something went wrong. You're probably right about that. I just wish that there was some way of showing the same information that was discoverable, and didn't require the use of pageinspect. If I make it a DEBUG1 message, then it cannot really be documented. -- Peter Geoghegan
On Wed, Feb 19, 2020 at 11:16 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: > > Thank you for this work. I've looked through the patches and they seem > > to be ready for commit. > > I haven't yet read recent documentation and readme changes, so maybe > > I'll send some more feedback tomorrow. > > Great. I should add: I plan to commit the patch within the next 7 days. I believe that the design of deduplication itself is solid; it has many more strengths than weaknesses. It works in a way that complements the existing approach to page splits. The optimization can easily be turned off (and easily turned back on again). contrib/amcheck can detect almost any possible form of corruption that could affect a B-Tree index that has posting list tuples. I have spent months microbenchmarking every little aspect of this patch in isolation. I've also spent a lot of time on conventional benchmarking. It seems quite possible that somebody won't like some aspect of the user interface. I am more than willing to work with other contributors on any issue in that area that comes to light. I don't see any point in waiting for other hackers to speak up before the patch is committed, though. Anastasia posted the first version of this patch in August of 2015, and there have been over 30 revisions of it since the project was revived in 2019. Everyone has been given ample opportunity to offer input. -- Peter Geoghegan
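For anyone who wants to exercise that amcheck coverage themselves, a minimal sketch follows; the index name is hypothetical, while bt_index_check() and its heapallindexed argument are the existing contrib/amcheck interface:

CREATE EXTENSION IF NOT EXISTS amcheck;
-- Verify the index structure, and also confirm that every heap tuple has
-- a matching index entry -- posting list tuples included.
SELECT bt_index_check('some_dedup_idx'::regclass, true);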
From: Anastasia Lubennikova
On 19.02.2020 22:16, Peter Geoghegan wrote:
> On Wed, Feb 19, 2020 at 8:14 AM Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> Thank you for this work. I've looked through the patches and they seem
>> to be ready for commit.
>> I haven't yet read recent documentation and readme changes, so maybe
>> I'll send some more feedback tomorrow.

The only thing I found is a typo in the comment:

+ int nhtids; /* Number of heap TIDs in nhtids array */

s/nhtids/htids

I don't think this patch really needs more nitpicking )

>> In my opinion, this message is too specific for default behavior. It
>> exposes internal details without explanation and may look to user like
>> something went wrong.
> You're probably right about that. I just wish that there was some way
> of showing the same information that was discoverable, and didn't
> require the use of pageinspect. If I make it a DEBUG1 message, then it
> cannot really be documented.

A user can discover this with a complex query to pg_index and pg_opclass. To simplify this, we can probably wrap this into a function or some field in pg_indexes. Anyway, I would wait for feedback from pre-release testers.
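For illustration, here is one version of the kind of catalog query being described -- a sketch only: it checks whether each opclass used by a given (hypothetical) index has a B-Tree support function 4 entry, and does not account for nondeterministic collations or INCLUDE columns:

SELECT i.indexrelid::regclass AS index,
       opc.opcname,
       p.amproc IS NOT NULL AS opclass_has_equalimage_proc
FROM pg_index i
JOIN pg_opclass opc ON opc.oid = ANY (i.indclass::oid[])
LEFT JOIN pg_amproc p
       ON p.amprocfamily = opc.opcfamily
      AND p.amproclefttype = opc.opcintype
      AND p.amprocrighttype = opc.opcintype
      AND p.amprocnum = 4
WHERE i.indexrelid = 'foo_bar_idx'::regclass;  -- hypothetical index name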
On Thu, Feb 20, 2020 at 7:38 AM Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > I don't think this patch really needs more nitpicking ) But when has that ever stopped it? :-) > User can discover this with a complex query to pg_index and pg_opclass. > To simplify this, we can probably wrap this into function or some field > in pg_indexes. A function isn't a real user interface, though -- it probably won't be noticed. I think that there is a good chance that it just won't matter. The number of indexes that won't be able to support deduplication will be very small in practice. The important exceptions are INCLUDE indexes and nondeterministic collations. These exceptions make sense intuitively, and will be documented as limitations of those other features. The numeric/float thing doesn't really make intuitive sense, and numeric is an important datatype. Still, numeric columns and float columns seem to rarely get indexed. That just leaves container type opclasses, like anyarray and jsonb. Nobody cares about indexing container types with a B-Tree index, with the possible exception of expression indexes on a jsonb column. I don't see a way around that, but it doesn't seem all that important. Again, applications are unlikely to have more than one or two of those. The *overall* space saving will probably be almost as good as if the limitation did not exist. > Anyway, I would wait for feedback from pre-release testers. Right -- let's delay making a final decision on it. Just like the decision to enable it by default. It will work this way in the committed version, but that isn't supposed to be the final word on it. -- Peter Geoghegan
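As a concrete illustration of those two exceptions (a sketch only -- the collation definition is the ICU example from the CREATE COLLATION documentation and requires an ICU-enabled build; the table and index names are made up):

-- A nondeterministic collation: an index on a column using it cannot use
-- deduplication, because values that compare equal under the collation
-- need not be bitwise identical.
CREATE COLLATION case_insensitive
    (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
CREATE TABLE t (a text, b integer);
CREATE INDEX t_a_ci_idx ON t (a COLLATE case_insensitive);
-- An INCLUDE index: also unable to use deduplication.
CREATE INDEX t_b_incl_idx ON t (b) INCLUDE (a);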
On Thu, Feb 20, 2020 at 10:58 AM Peter Geoghegan <pg@bowt.ie> wrote: > I think that there is a good chance that it just won't matter. The > number of indexes that won't be able to support deduplication will be > very small in practice. The important exceptions are INCLUDE indexes > and nondeterministic collations. These exceptions make sense > intuitively, and will be documented as limitations of those other > features. I wasn't clear about the implication of what I was saying here, which is: I will make the NOTICE a DEBUG1 message, and leave everything else as-is in the initial committed version. -- Peter Geoghegan
On Thu, Feb 20, 2020 at 12:59 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I wasn't clear about the implication of what I was saying here, which
> is: I will make the NOTICE a DEBUG1 message, and leave everything else
> as-is in the initial committed version.

Attached is v34, which has this change. My plan is to commit something very close to this on Wednesday morning (barring any objections).

Other changes:

* Now, equalimage functions take a pg_type OID argument, allowing us to reuse the same generic pg_proc-wise function across many of the operator classes from the core distribution.

* Rewrote the docs for equalimage functions in the 0001-* patch.

* Lots of copy-editing of the "Implementation" section of the B-Tree doc chapter, most of which is about deduplication specifically.

--
Peter Geoghegan
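To show what that pg_type OID argument looks like from an operator class author's point of view, here is a sketch of an opclass opting in to deduplication. The type, comparison function, and operator class are hypothetical; the support function 4 name and signature follow the v34 patch as posted and may differ from what is ultimately committed:

CREATE OPERATOR CLASS mytype_ops
    DEFAULT FOR TYPE mytype USING btree AS
        OPERATOR 1 < ,
        OPERATOR 2 <= ,
        OPERATOR 3 = ,
        OPERATOR 4 >= ,
        OPERATOR 5 > ,
        FUNCTION 1 mytype_cmp(mytype, mytype),
        -- Support function 4: the generic routine that reports
        -- deduplication as unconditionally safe for this opclass.
        FUNCTION 4 btequalimagedatum(oid);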
On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached is v34, which has this change. My plan is to commit something > very close to this on Wednesday morning (barring any objections). Pushed. I'm going to delay committing the pageinspect patch until tomorrow, since I haven't thought about that aspect of the project in a while. Seems like a good idea to go through it one more time, once it's clear that the buildfarm is stable. The buildfarm appears to be stable now, though there was an issue with a compiler warning earlier. I quickly pushed a fix for that, and can see that longfin is green/passing now. Thanks for sticking with this project, Anastasia. -- Peter Geoghegan
On 2020/02/27 7:43, Peter Geoghegan wrote:
> On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> Attached is v34, which has this change. My plan is to commit something
>> very close to this on Wednesday morning (barring any objections).
>
> Pushed.

Thanks for committing this nice feature! Here is one minor comment.

+ <primary><varname>deduplicate_items</varname></primary>
+ <secondary>storage parameter</secondary>

This should be

<primary><varname>deduplicate_items</varname> storage parameter</primary>

<secondary> for a reloption is necessary only when a GUC parameter with the same name as the reloption exists. So, for example, you can see that <secondary> is used for vacuum_cleanup_index_scale_factor but not for the buffering reloption.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
On Wed, Feb 26, 2020 at 10:03 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > Thanks for committing this nice feature! You're welcome! > Here is one minor comment. > > + <primary><varname>deduplicate_items</varname></primary> > + <secondary>storage parameter</secondary> > > This should be > > <primary><varname>deduplicate_items</varname> storage parameter</primary> I pushed a fix for this. Thanks -- Peter Geoghegan
On 2020-02-26 14:43:27 -0800, Peter Geoghegan wrote: > On Mon, Feb 24, 2020 at 4:54 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Attached is v34, which has this change. My plan is to commit something > > very close to this on Wednesday morning (barring any objections). > > Pushed. Congrats!
On Fri, Mar 6, 2020 at 11:00 AM Andres Freund <andres@anarazel.de> wrote: > > Pushed. > > Congrats! Thanks Andres! -- Peter Geoghegan