Thread: [WIP] [B-Tree] Retail IndexTuple deletion
Hi,

I have written code for quick index tuple deletion from a relation by heap tuple TID. The code relates to the "Retail IndexTuple deletion" enhancement of the btree index described on the PostgreSQL wiki [1]. Briefly, it includes three steps:
1. Key generation for the index tuple search.
2. Search of the index relation for the tuple matching the heap tuple TID.
3. Deletion of the tuple from the index relation.

Currently, index relation cleaning is performed by vacuum, which scans the whole index relation for dead entries sequentially, tuple by tuple. When the number of dead entries is not large, this simplistic and safe method can be significantly outperformed by retail deletion, which uses index scans to find the dead entries. It can also be used by distributed systems to reduce the cost of a global index vacuum.

Patch '0001-retail-indextuple-deletion' introduces the new function amtargetdelete() in the access method interface. Patch '0002-quick-vacuum-strategy' implements this function for an alternative strategy of lazy index vacuum, called 'Quick Vacuum'.

The code requires holding the DEAD tuple's storage until the scan key has been created. In this version I add a 'target_index_deletion_factor' option. If it is greater than 0, heap_page_prune() uses ItemIdMarkDead() instead of ItemIdSetDead() to set the DEAD flag while keeping the tuple storage. The next step is to develop a background worker which will collect pairs (tid, scankey) of DEAD tuples from heap_page_prune().

Here are a test description and some execution time measurements showing the benefit of these patches (N in the insert statement is the value shown in the first column of the results table):

Test:
-----
create table test_range(id serial primary key, value integer);
insert into test_range (value) select random()*1e7/10^N from generate_series(1, 1e7);
DELETE FROM test_range WHERE value=1;
VACUUM test_range;

Results:
--------
| n | t1, s  | t2, s  | speedup |
|---|---------------------------|
| 0 | 0.00003| 0.4476 | 1748.4  |
| 1 | 0.00006| 0.5367 | 855.99  |
| 2 | 0.0004 | 0.9804 | 233.99  |
| 3 | 0.0048 | 1.6493 | 34.576  |
| 4 | 0.5600 | 2.4854 | 4.4382  |
| 5 | 3.3300 | 3.9555 | 1.2012  |
| 6 | 17.700 | 5.6000 | 0.3164  |
|---|---------------------------|

In the table, t1 is the measured execution time of the lazy_vacuum_index() function with the Quick-Vacuum strategy; t2 is the measured execution time of the lazy_vacuum_index() function with the Lazy-Vacuum strategy.

Note: a guaranteed acceptable time for the index scans (used for quick deletion) will be achieved by storing equal-key index tuples in physical TID order [2], with patch [3].

[1] https://wiki.postgresql.org/wiki/Key_normalization#Retail_IndexTuple_deletion
[2] https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_attribute
[3] https://www.postgresql.org/message-id/CAH2-WzkVb0Kom%3DR%2B88fDFb%3DJSxZMFvbHVC6Mn9LJ2n%3DX%3DkS-Uw%40mail.gmail.com

--
Andrey Lepikhov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company
Attachment
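For reference, the ItemIdMarkDead()/ItemIdSetDead() distinction mentioned above comes down to whether the line pointer keeps its storage. A simplified sketch, modeled on src/include/storage/itemid.h:

#define ItemIdMarkDead(itemId) \
( \
	(itemId)->lp_flags = LP_DEAD \
)

#define ItemIdSetDead(itemId) \
( \
	(itemId)->lp_flags = LP_DEAD, \
	(itemId)->lp_off = 0, \
	(itemId)->lp_len = 0 \
)

ItemIdMarkDead() only flips the flag and leaves lp_off/lp_len intact, so heap_page_prune() (and later a background worker) can still read the tuple bytes to build an index scan key; ItemIdSetDead() truncates the item to a storage-less dead line pointer, after which PageRepairFragmentation() may reclaim the space.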
On Sun, Jun 17, 2018 at 9:39 PM, Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote: > I have written a code for quick indextuple deletion from an relation by heap > tuple TID. The code relate to "Retail IndexTuple deletion" enhancement of > btree index on postgresql wiki [1]. I knew that somebody would eventually read that Wiki page. :-) > Now, index relation cleaning performs by vacuum which scan all index > relation for dead entries sequentially, tuple-by-tuple. This simplistic and > safe method can be significantly surpasses in the cases, than a number of > dead entries is not large by retail deletions which uses index scans for > search dead entries. I assume that the lazy vacuum bulk delete thing is much faster for the situation where you have many dead tuples in the index. However, allowing a B-Tree index to accumulate so much bloat in the first place often has consequences that cannot be reversed with anything less than a REINDEX. It's true that "prevention is better than cure" for all types of bloat. However, this could perhaps be as much as 100x more important for B-Tree bloat, since we cannot place new tuples in any place that is convenient. It's very different to bloat in the heap, even at a high level. > The code demands hold DEAD tuple storage until scan key will be created. In > this version I add 'target_index_deletion_factor' option. If it more than 0, > heap_page_prune() uses ItemIdMarkDead() instead of ItemIdSetDead() function > for set DEAD flag and hold the tuple storage. Makes sense. > Next step is developing background worker which will collect pairs (tid, > scankey) of DEAD tuples from heap_page_prune() function. Makes sense. > | n | t1, s | t2, s | speedup | > |---|---------------------------| > | 0 | 0.00003| 0.4476 | 1748.4 | > | 1 | 0.00006| 0.5367 | 855.99 | > | 2 | 0.0004 | 0.9804 | 233.99 | > | 3 | 0.0048 | 1.6493 | 34.576 | > | 4 | 0.5600 | 2.4854 | 4.4382 | > | 5 | 3.3300 | 3.9555 | 1.2012 | > | 6 | 17.700 | 5.6000 | 0.3164 | > |---|---------------------------| > in the table, t1 - measured execution time of lazy_vacuum_index() function > by Quick-Vacuum strategy; t2 - measured execution time of > lazy_vacuum_index() function by Lazy-Vacuum strategy; The speedup looks promising. However, the real benefit should be in query performance, especially when we have heavy contention. Very eager retail index tuple deletion could help a lot there. It already makes sense to make autovacuum extremely aggressive in this case, to the point when it's running almost constantly. A more targeted cleanup process that can run much faster could do the same job, but be much more eager, and so be much more effective at *preventing* bloating of the key space [1][2]. > Note, guaranteed allowable time of index scans (used for quick deletion) > will be achieved by storing equal-key index tuples in physical TID order [2] > with patch [3]. I now have greater motivation to develop that patch into a real project. I bet that my heap-tid-sort patch will allow you to refine your interface when there are many logical duplicates: You can create one explicit scan key, but have a list of heap TIDs that need to be killed within the range of matching index tuples. That could be a lot more efficient in the event of many non-HOT updates, where most index tuple values won't actually change. You can sort the list of heap TIDs that you want to kill once, and then "merge" it with the tuples at the leaf level as they are matched/killed. 
It should be safe to avoid rechecking anything other than the heap TID values. [1] http://pgeoghegan.blogspot.com/2017/07/postgresql-index-bloat-microscope.html [2] https://www.postgresql.org/message-id/CAH2-Wzmf6intNY1ggiNzOziiO5Eq=DsXfeptODGxO=2j-i1NGQ@mail.gmail.com -- Peter Geoghegan
On Mon, Jun 18, 2018 at 4:59 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Note, guaranteed allowable time of index scans (used for quick deletion) > > will be achieved by storing equal-key index tuples in physical TID order [2] > > with patch [3]. > > I now have greater motivation to develop that patch into a real project. > > I bet that my heap-tid-sort patch will allow you to refine your > interface when there are many logical duplicates: You can create one > explicit scan key, but have a list of heap TIDs that need to be killed > within the range of matching index tuples. That could be a lot more > efficient in the event of many non-HOT updates, where most index tuple > values won't actually change. You can sort the list of heap TIDs that > you want to kill once, and then "merge" it with the tuples at the leaf > level as they are matched/killed. It should be safe to avoid > rechecking anything other than the heap TID values. > > [1] http://pgeoghegan.blogspot.com/2017/07/postgresql-index-bloat-microscope.html > [2] https://www.postgresql.org/message-id/CAH2-Wzmf6intNY1ggiNzOziiO5Eq=DsXfeptODGxO=2j-i1NGQ@mail.gmail.com Actually, once btree tids are sorted, you can continue tree descent all the way to the exact leaf page that contains the tuple to be deleted. Thus, the single-tuple interface ends up being quite OK. Sure, you can optimize things a bit by scanning a range, but only if vacuum is able to group keys in order to produce the optimized calls, and I don't see that terribly likely. So, IMHO the current interface may be quite enough.
On Mon, Jun 18, 2018 at 1:42 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > Actually, once btree tids are sorted, you can continue tree descent > all the way to the exact leaf page that contains the tuple to be > deleted. > > Thus, the single-tuple interface ends up being quite OK. Sure, you can > optimize things a bit by scanning a range, but only if vacuum is able > to group keys in order to produce the optimized calls, and I don't see > that terribly likely. Andrey talked about a background worker that did processing + index tuple deletion when handed off work by a user backend after it performs HOT pruning of a heap page. I consider that idea to be a good place to go with the patch, because in practice the big problem is workloads that suffer from so-called "write amplification", where most index tuples are created despite being "logically unnecessary" (e.g. one index among several prevents an UPDATE being HOT-safe, making inserts into most of the indexes "logically unnecessary"). I think that it's likely that only descending the tree once in order to kill multiple duplicate index tuples in one shot will in fact be *very* important (unless perhaps you assume that that problem is solved by something else, such as zheap). The mechanism that Andrey describes is rather unlike VACUUM as we know it today, but that's the whole point. -- Peter Geoghegan
On Sun, Jun 17, 2018 at 9:39 PM, Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote: > Patch '0001-retail-indextuple-deletion' introduce new function > amtargetdelete() in access method interface. Patch > '0002-quick-vacuum-strategy' implements this function for an alternative > strategy of lazy index vacuum, called 'Quick Vacuum'. My compiler shows the following warnings: /code/postgresql/root/build/../source/src/backend/access/nbtree/nbtree.c: In function ‘bttargetdelete’: /code/postgresql/root/build/../source/src/backend/access/nbtree/nbtree.c:1053:3: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation] if (needLock) ^~ /code/postgresql/root/build/../source/src/backend/access/nbtree/nbtree.c:1055:4: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’ npages = RelationGetNumberOfBlocks(irel); ^~~~~~ /code/postgresql/root/build/../source/src/backend/access/nbtree/nbtree.c:1074:3: warning: ‘blkno’ may be used uninitialized in this function [-Wmaybe-uninitialized] cleanup_block(info, stats, blkno); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ I think that they're both harmless, though. -- Peter Geoghegan
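A guess at the intended structure behind the first warning (the usual nbtree pattern around relation extension, as in btvacuumscan()); only the lock call should be conditional:

if (needLock)
	LockRelationForExtension(irel, ExclusiveLock);
npages = RelationGetNumberOfBlocks(irel);
if (needLock)
	UnlockRelationForExtension(irel, ExclusiveLock);

Whether bttargetdelete() needs the unlock right there depends on the surrounding code; the point is just that the RelationGetNumberOfBlocks() call is presumably meant to run unconditionally and should be indented (or braced) accordingly.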
On Mon, Jun 18, 2018 at 2:54 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Sun, Jun 17, 2018 at 9:39 PM, Andrey V. Lepikhov > <a.lepikhov@postgrespro.ru> wrote: >> Patch '0001-retail-indextuple-deletion' introduce new function >> amtargetdelete() in access method interface. Patch >> '0002-quick-vacuum-strategy' implements this function for an alternative >> strategy of lazy index vacuum, called 'Quick Vacuum'. > > My compiler shows the following warnings: Some real feedback: What we probably want to end up with here is new lazyvacuum.c code that does processing for one heap page (and associated indexes) that is really just a "partial" lazy vacuum. Though it won't do things like advance relfrozenxid, it will do pruning for the heap page, index tuple killing, and finally heap tuple killing. It will do all of these things reliably, just like traditional lazy vacuum. This will be what your background worker eventually uses. I doubt that you can use routines like index_beginscan() within bttargetdelete() at all. I think that you need something closer to _bt_doinsert() or _bt_pagedel(), that manages its own scan (your code should probably live in nbtpage.c). It does not make sense to teach external, generic routines like index_beginscan() about heap TID being an implicit final index attribute, which is one reason for this (I'm assuming that this patch relies on my own patch). Another reason is that you need to hold an exclusive buffer lock at the point that you identify the tuple to be killed, until the point that you actually kill it. You don't do that now. IOW, the approach you've taken in bttargetdelete() isn't quite correct because you imagine that it's okay to occasionally "lose" the index tuple that you originally found, and just move on. That needs to be 100% reliable, or else we'll end up with index tuples that point to the wrong heap tuples in rare cases with concurrent insertions. As I said, we want a "partial" lazy vacuum here, which must mean that it's reliable. Note that _bt_pagedel() actually calls _bt_search() when it deletes a page. Your patch will not be the first patch that makes nbtree vacuuming do an index scan. You should be managing your own insertion scan key, much like _bt_pagedel() does. If you use my patch, _bt_search() can be taught to look for a specific heap TID. Finally, doing things this way would let you delete multiple duplicates in one shot, as I described in an earlier e-mail. Only a single descent of the tree is needed to delete quite a few index tuples, provided that they all happen to be logical duplicates. Again, your background worker will take advantage of this. This code does not follow the Postgres style: > - else > + } > + else { > + if (rootoffnum != latestdead) > + heap_prune_record_unused(prstate, latestdead); > heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]); > + } > } Please be more careful about that. I find it very distracting. -- Peter Geoghegan
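For reference, the quoted hunk written in Postgres style (braces on their own lines) would look roughly like this:

else
{
	if (rootoffnum != latestdead)
		heap_prune_record_unused(prstate, latestdead);
	heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]);
}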
On Mon, Jun 18, 2018 at 4:05 PM, Peter Geoghegan <pg@bowt.ie> wrote: > Finally, doing things this way would let you delete multiple > duplicates in one shot, as I described in an earlier e-mail. Only a > single descent of the tree is needed to delete quite a few index > tuples, provided that they all happen to be logical duplicates. Again, > your background worker will take advantage of this. BTW, when you do this you should make sure that there is only one call to _bt_delitems_vacuum(), so that there aren't too many WAL records. Actually, that's not quite correct -- there should be one _bt_delitems_vacuum() call *per leaf page* per bttargetdelete() call, which is slightly different. There should rarely be more than one or two calls to _bt_delitems_vacuum() in total, because your background worker is only going to delete one heap page's duplicates per bttargetdelete() call, and because there will be locality/correlation with my TID order patch. -- Peter Geoghegan
On Mon, Jun 18, 2018 at 4:05 PM, Peter Geoghegan <pg@bowt.ie> wrote: > IOW, the approach you've taken in bttargetdelete() isn't quite correct > because you imagine that it's okay to occasionally "lose" the index > tuple that you originally found, and just move on. That needs to be > 100% reliable, or else we'll end up with index tuples that point to > the wrong heap tuples in rare cases with concurrent insertions. Attached patch adds a new amcheck check within bt_index_parent_check(). With the patch, heap TIDs are accumulated in a tuplesort and sorted at the tail end of verification (before optional heapallindexed verification runs). This will reliably detect the kind of corruption I noticed was possible with your patch. Note that the amcheck enhancement that went along with my heap-tid-btree-sort patch may not have detected this issue, which is why I wrote this patch -- the heap-tid-btree-sort amcheck stuff could detect duplicates, but only when all other attributes happened to be identical when comparing sibling index tuples (i.e. only when we must actually compare TIDs across sibling index tuples). If you add this check, I'm pretty sure that you can detect any possible problem. You should think about using this to debug your patch. I may get around to submitting this to a CF, but that isn't a priority right now. -- Peter Geoghegan
Attachment
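The idea of the check, stripped of the tuplesort machinery the patch actually uses, is just: collect every heap TID referenced by the index, sort, and complain about any value that appears twice, since a consistent B-Tree never has two index tuples pointing at the same heap tuple. A simplified in-memory sketch (function names are illustrative, not from the patch):

static int
tid_cmp(const void *a, const void *b)
{
	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
}

/* report any heap TID referenced by more than one index tuple */
static void
check_duplicate_heap_tids(ItemPointer tids, int ntids)
{
	int			i;

	qsort(tids, ntids, sizeof(ItemPointerData), tid_cmp);
	for (i = 1; i < ntids; i++)
	{
		if (ItemPointerCompare(&tids[i - 1], &tids[i]) == 0)
			ereport(ERROR,
					(errcode(ERRCODE_INDEX_CORRUPTED),
					 errmsg("heap TID (%u,%u) found in two index tuples",
							ItemPointerGetBlockNumber(&tids[i]),
							ItemPointerGetOffsetNumber(&tids[i]))));
	}
}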
On Tue, Jun 19, 2018 at 8:05 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Jun 18, 2018 at 2:54 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> On Sun, Jun 17, 2018 at 9:39 PM, Andrey V. Lepikhov >> <a.lepikhov@postgrespro.ru> wrote: >>> Patch '0001-retail-indextuple-deletion' introduce new function >>> amtargetdelete() in access method interface. Patch >>> '0002-quick-vacuum-strategy' implements this function for an alternative >>> strategy of lazy index vacuum, called 'Quick Vacuum'. Great! >> >> My compiler shows the following warnings: > > Some real feedback: > > What we probably want to end up with here is new lazyvacuum.c code > that does processing for one heap page (and associated indexes) that > is really just a "partial" lazy vacuum. Though it won't do things like > advance relfrozenxid, it will do pruning for the heap page, index > tuple killing, and finally heap tuple killing. It will do all of these > things reliably, just like traditional lazy vacuum. This will be what > your background worker eventually uses. I think that we do the partial lazy vacuum using visibility map even now. That does heap pruning, index tuple killing but doesn't advance relfrozenxid. Since this patch adds an ability to delete small amount of index tuples quickly, what I'd like to do with this patch is to invoke autovacuum more frequently, and do the target index deletion or the index bulk-deletion depending on amount of garbage and index size etc. That is, it might be better if lazy vacuum scans heap in ordinary way and then plans and decides a method of index deletion based on costs similar to what query planning does. Regards, -- Masahiko Sawada
On Tue, Jun 19, 2018 at 2:33 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I think that we do the partial lazy vacuum using visibility map even > now. That does heap pruning, index tuple killing but doesn't advance > relfrozenxid. Right, that's what I was thinking. Opportunistic HOT pruning isn't like vacuuming because it doesn't touch indexes. This patch adds an alternative strategy for conventional lazy vacuum that is also able to run a page at a time if needed. Perhaps page-at-a-time operation could later be used for doing something that is opportunistic in the same way that pruning is opportunistic, but it's too early to worry about that. > Since this patch adds an ability to delete small amount > of index tuples quickly, what I'd like to do with this patch is to > invoke autovacuum more frequently, and do the target index deletion or > the index bulk-deletion depending on amount of garbage and index size > etc. That is, it might be better if lazy vacuum scans heap in ordinary > way and then plans and decides a method of index deletion based on > costs similar to what query planning does. That seems to be what Andrey wants to do, though right now the prototype patch actually just always uses its alternative strategy while doing any kind of lazy vacuuming (some simple costing code is commented out right now). It shouldn't be too hard to add some costing to it. Once we do that, and once we polish the patch some more, we can do performance testing. Maybe that alone will be enough to make the patch worth committing; "opportunistic microvacuuming" can come later, if at all. -- Peter Geoghegan
On 19.06.2018 04:05, Peter Geoghegan wrote: > On Mon, Jun 18, 2018 at 2:54 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> On Sun, Jun 17, 2018 at 9:39 PM, Andrey V. Lepikhov >> <a.lepikhov@postgrespro.ru> wrote: >>> Patch '0001-retail-indextuple-deletion' introduce new function >>> amtargetdelete() in access method interface. Patch >>> '0002-quick-vacuum-strategy' implements this function for an alternative >>> strategy of lazy index vacuum, called 'Quick Vacuum'. >> >> My compiler shows the following warnings: > > Some real feedback: > > What we probably want to end up with here is new lazyvacuum.c code > that does processing for one heap page (and associated indexes) that > is really just a "partial" lazy vacuum. Though it won't do things like > advance relfrozenxid, it will do pruning for the heap page, index > tuple killing, and finally heap tuple killing. It will do all of these > things reliably, just like traditional lazy vacuum. This will be what > your background worker eventually uses. > It is final goal of the patch. > I doubt that you can use routines like index_beginscan() within > bttargetdelete() at all. I think that you need something closer to > _bt_doinsert() or _bt_pagedel(), that manages its own scan (your code > should probably live in nbtpage.c). It does not make sense to teach > external, generic routines like index_beginscan() about heap TID being > an implicit final index attribute, which is one reason for this (I'm > assuming that this patch relies on my own patch). Another reason is > that you need to hold an exclusive buffer lock at the point that you > identify the tuple to be killed, until the point that you actually > kill it. You don't do that now. > > IOW, the approach you've taken in bttargetdelete() isn't quite correct > because you imagine that it's okay to occasionally "lose" the index > tuple that you originally found, and just move on. That needs to be > 100% reliable, or else we'll end up with index tuples that point to > the wrong heap tuples in rare cases with concurrent insertions. As I > said, we want a "partial" lazy vacuum here, which must mean that it's > reliable. Note that _bt_pagedel() actually calls _bt_search() when it > deletes a page. Your patch will not be the first patch that makes > nbtree vacuuming do an index scan. You should be managing your own > insertion scan key, much like _bt_pagedel() does. If you use my patch, > _bt_search() can be taught to look for a specific heap TID. > Agree with this notes. Corrections will made in the next version of the patch. > Finally, doing things this way would let you delete multiple > duplicates in one shot, as I described in an earlier e-mail. Only a > single descent of the tree is needed to delete quite a few index > tuples, provided that they all happen to be logical duplicates. Again, > your background worker will take advantage of this. > It is very interesting idea. According to this, I plan to change bttargetdelete() interface as follows: IndexTargetDeleteStats* amtargetdelete(IndexTargetDeleteInfo *info, IndexTargetDeleteStats *stats, Datum *values, bool *isnull); where structure IndexTargetDeleteInfo contains a TID list of dead heap tuples. All index entries, corresponding to this list, may be deleted (or only some of it) by one call of amtargetdelete() function with single descent of the tree. 
> This code does not follow the Postgres style:
>
>> - else
>> + }
>> + else {
>> + if (rootoffnum != latestdead)
>> + heap_prune_record_unused(prstate, latestdead);
>>   heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]);
>> + }
>>   }
>
> Please be more careful about that. I find it very distracting.
>
Done

--
Andrey Lepikhov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company
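For what it's worth, the proposed interface might translate into something like the following. Only the amtargetdelete() signature above comes from the proposal itself; the field lists are guesses based on the discussion (heap and index relations, a possibly pre-sorted list of dead heap TIDs):

typedef struct IndexTargetDeleteInfo
{
	Relation	heapRelation;
	Relation	indexRelation;
	bool		isSorted;		/* is dead_tuples[] already in TID order? */
	int			num_dead_tuples;
	ItemPointer dead_tuples;	/* heap TIDs whose index entries are to be deleted */
} IndexTargetDeleteInfo;

typedef struct IndexTargetDeleteStats
{
	BlockNumber pages_deleted;
	double		tuples_removed;
} IndexTargetDeleteStats;

IndexTargetDeleteStats *
amtargetdelete(IndexTargetDeleteInfo *info,
			   IndexTargetDeleteStats *stats,
			   Datum *values, bool *isnull);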
Hi,

Based on your feedback, I have developed a second version of the patch. In this version:
1. The high-level functions index_beginscan() and index_rescan() are not used. Tree descent is done by _bt_search(); _bt_binsrch() is used for positioning within the page.
2. A TID list was introduced into the amtargetdelete() interface. Now only one tree descent is needed to delete all TIDs from the list that share the same scan key value - the logical-duplicates deletion problem.
3. Only one WAL record is emitted for index tuple deletion per leaf page per amtargetdelete() call.
4. VACUUM can pre-sort the TID list for a quicker duplicate search.

The background worker will come later.

On 19.06.2018 22:38, Peter Geoghegan wrote:
> On Tue, Jun 19, 2018 at 2:33 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> I think that we do the partial lazy vacuum using visibility map even
>> now. That does heap pruning, index tuple killing but doesn't advance
>> relfrozenxid.
>
> Right, that's what I was thinking. Opportunistic HOT pruning isn't
> like vacuuming because it doesn't touch indexes. This patch adds an
> alternative strategy for conventional lazy vacuum that is also able to
> run a page at a time if needed. Perhaps page-at-a-time operation could
> later be used for doing something that is opportunistic in the same
> way that pruning is opportunistic, but it's too early to worry about
> that.
>
>> Since this patch adds an ability to delete small amount
>> of index tuples quickly, what I'd like to do with this patch is to
>> invoke autovacuum more frequently, and do the target index deletion or
>> the index bulk-deletion depending on amount of garbage and index size
>> etc. That is, it might be better if lazy vacuum scans heap in ordinary
>> way and then plans and decides a method of index deletion based on
>> costs similar to what query planning does.
>
> That seems to be what Andrey wants to do, though right now the
> prototype patch actually just always uses its alternative strategy
> while doing any kind of lazy vacuuming (some simple costing code is
> commented out right now). It shouldn't be too hard to add some costing
> to it. Once we do that, and once we polish the patch some more, we can
> do performance testing. Maybe that alone will be enough to make the
> patch worth committing; "opportunistic microvacuuming" can come later,
> if at all.
>

--
Andrey Lepikhov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company
Attachment
On Fri, Jun 22, 2018 at 4:24 AM, Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote: > According to your feedback, i develop second version of the patch. > In this version: > 1. High-level functions index_beginscan(), index_rescan() not used. Tree > descent made by _bt_search(). _bt_binsrch() used for positioning on the > page. > 2. TID list introduced in amtargetdelete() interface. Now only one tree > descent needed for deletion all tid's from the list with equal scan key > value - logical duplicates deletion problem. > 3. Only one WAL record for index tuple deletion per leaf page per > amtargetdelete() call. Cool. What is this "race" code about? > + buffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(tid), RBM_NORMAL, NULL); > + LockBuffer(buffer, BUFFER_LOCK_SHARE); > + > + page = (Page) BufferGetPage(buffer); > + offnum = ItemPointerGetOffsetNumber(tid); > + lp = PageGetItemId(page, offnum); > + > + /* > + * VACUUM Races: someone already remove the tuple from HEAP. Ignore it. > + */ > + if (!ItemIdIsUsed(lp)) > + return NULL; It looks wrong -- why should another process have set the item as unused? And if we assume that that's possible at all, what's to stop a third process from actually reusing the item pointer before we arrive (at get_tuple_by_tid()), leaving us to find a tuple that is totally unrelated to the original tuple to be deleted? (Also, you're not releasing the buffer lock before you return.) > 4. VACUUM can sort TID list preliminary for more quick search of duplicates. This version of the patch prevents my own "force unique keys" patch from working, since you're not using my proposed new _bt_search()/_bt_binsrch()/_bt_compare() interface (you're not passing them a heap TID). It is essential that your patch be able to quickly reach any tuple that it needs to kill. Otherwise, the worst case performance is totally unacceptable; it will never be okay to go through 10%+ of the index to kill a single tuple hundreds or even thousands of times per VACUUM. It seems to me that doing this tid_list_search() binary search is pointless -- you should instead be relying on logical duplicates using their heap TID as a tie-breaker. Rather than doing a binary search within tid_list_search(), you should instead match the presorted heap TIDs at the leaf level against the sorted in-memory TID list. You know, a bit like a merge join. I suggest that you go even further than this: I think that you should just start distributing my patch as part of your patch series. You can change my code if you need to. I also suggest using "git format patch" with simple, short commit messages to produce patches. This makes it a lot easier to track the version of the patch, changes over time, etc. I understand why you'd hesitate to take ownership of my code (it has big problems!), but the reality is that all the problems that my patch has are also problems for your patch. One patch cannot get committed without the other, so they are already the same project. As a bonus, my patch will probably improve the best case performance for your patch, since multi-deletions will now have much better locality of access. -- Peter Geoghegan
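To make the "merge join" idea concrete: within a run of logical duplicates (equal key values), leaf-page tuples are in heap TID order under the proposed on-disk change, so a sorted kill list can be matched against the page in a single two-pointer pass. A rough sketch, with illustrative names (deadtids/deletable are not from either patch):

/*
 * Collect offsets of leaf-page index tuples whose heap TIDs appear in the
 * sorted list deadtids[0..ndead-1].  Assumes the relevant tuples on the page
 * are ordered by heap TID (i.e. a run of logical duplicates under the
 * heap-TID-as-tiebreaker patch).
 */
static int
match_deletable_items(Page page, ItemPointer deadtids, int ndead,
					  OffsetNumber *deletable)
{
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	OffsetNumber offnum = P_FIRSTDATAKEY(opaque);
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
	int			i = 0;
	int			ndeletable = 0;

	while (offnum <= maxoff && i < ndead)
	{
		IndexTuple	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
		int32		cmp = ItemPointerCompare(&itup->t_tid, &deadtids[i]);

		if (cmp == 0)
		{
			deletable[ndeletable++] = offnum;	/* points to a dead heap tuple */
			offnum = OffsetNumberNext(offnum);
			i++;
		}
		else if (cmp < 0)
			offnum = OffsetNumberNext(offnum);	/* index tuple's TID is smaller */
		else
			i++;				/* no index entry on this page for this dead TID */
	}
	return ndeletable;
}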
On Fri, Jun 22, 2018 at 12:43 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Jun 22, 2018 at 4:24 AM, Andrey V. Lepikhov > <a.lepikhov@postgrespro.ru> wrote: >> According to your feedback, i develop second version of the patch. >> In this version: >> 1. High-level functions index_beginscan(), index_rescan() not used. Tree >> descent made by _bt_search(). _bt_binsrch() used for positioning on the >> page. >> 2. TID list introduced in amtargetdelete() interface. Now only one tree >> descent needed for deletion all tid's from the list with equal scan key >> value - logical duplicates deletion problem. >> 3. Only one WAL record for index tuple deletion per leaf page per >> amtargetdelete() call. > > Cool. > > What is this "race" code about? I noticed another bug in your patch, when running a "wal_consistency_checking=all" smoke test. I do this simple, generic test for anything that touches WAL-logging, actually -- it's a good practice to adopt. I enable "wal_consistency_checking=all" on the installation, create a streaming replica with pg_basebackup (which also has "wal_consistency_checking=all"), and then run "make installcheck" against the primary. Here is what I see on the standby when I do this with v2 of your patch applied: 9524/2018-06-22 13:03:12 PDT LOG: entering standby mode 9524/2018-06-22 13:03:12 PDT LOG: consistent recovery state reached at 0/30000D0 9524/2018-06-22 13:03:12 PDT LOG: invalid record length at 0/30000D0: wanted 24, got 0 9523/2018-06-22 13:03:12 PDT LOG: database system is ready to accept read only connections 9528/2018-06-22 13:03:12 PDT LOG: started streaming WAL from primary at 0/3000000 on timeline 1 9524/2018-06-22 13:03:12 PDT LOG: redo starts at 0/30000D0 9524/2018-06-22 13:03:32 PDT FATAL: inconsistent page found, rel 1663/16384/1259, forknum 0, blkno 0 9524/2018-06-22 13:03:32 PDT CONTEXT: WAL redo at 0/3360B00 for Heap2/CLEAN: remxid 599 9523/2018-06-22 13:03:32 PDT LOG: startup process (PID 9524) exited with exit code 1 9523/2018-06-22 13:03:32 PDT LOG: terminating any other active server processes 9523/2018-06-22 13:03:32 PDT LOG: database system is shut down I haven't investigated this at all, but I assume that the problem is a simple oversight. The new ItemIdSetDeadRedirect() concept that you've introduced probably necessitates changes in both the WAL logging routines and the redo/recovery routines. You need to go make those changes. (By the way, I don't think you should be using the constant "3" with the ItemIdIsDeadRedirection() macro definition.) Let me know if you get stuck on this, or need more direction. -- Peter Geoghegan
On 23.06.2018 01:14, Peter Geoghegan wrote: > On Fri, Jun 22, 2018 at 12:43 PM, Peter Geoghegan <pg@bowt.ie> wrote: >> On Fri, Jun 22, 2018 at 4:24 AM, Andrey V. Lepikhov >> <a.lepikhov@postgrespro.ru> wrote: >>> According to your feedback, i develop second version of the patch. >>> In this version: >>> 1. High-level functions index_beginscan(), index_rescan() not used. Tree >>> descent made by _bt_search(). _bt_binsrch() used for positioning on the >>> page. >>> 2. TID list introduced in amtargetdelete() interface. Now only one tree >>> descent needed for deletion all tid's from the list with equal scan key >>> value - logical duplicates deletion problem. >>> 3. Only one WAL record for index tuple deletion per leaf page per >>> amtargetdelete() call. >> >> Cool. >> >> What is this "race" code about? > I introduce this check because keep in mind about another vacuum workers, which can make cleaning a relation concurrently. may be it is redundant. > I noticed another bug in your patch, when running a > "wal_consistency_checking=all" smoke test. I do this simple, generic > test for anything that touches WAL-logging, actually -- it's a good > practice to adopt. > > I enable "wal_consistency_checking=all" on the installation, create a > streaming replica with pg_basebackup (which also has > "wal_consistency_checking=all"), and then run "make installcheck" > against the primary. Here is what I see on the standby when I do this > with v2 of your patch applied: > > 9524/2018-06-22 13:03:12 PDT LOG: entering standby mode > 9524/2018-06-22 13:03:12 PDT LOG: consistent recovery state reached > at 0/30000D0 > 9524/2018-06-22 13:03:12 PDT LOG: invalid record length at 0/30000D0: > wanted 24, got 0 > 9523/2018-06-22 13:03:12 PDT LOG: database system is ready to accept > read only connections > 9528/2018-06-22 13:03:12 PDT LOG: started streaming WAL from primary > at 0/3000000 on timeline 1 > 9524/2018-06-22 13:03:12 PDT LOG: redo starts at 0/30000D0 > 9524/2018-06-22 13:03:32 PDT FATAL: inconsistent page found, rel > 1663/16384/1259, forknum 0, blkno 0 > 9524/2018-06-22 13:03:32 PDT CONTEXT: WAL redo at 0/3360B00 for > Heap2/CLEAN: remxid 599 > 9523/2018-06-22 13:03:32 PDT LOG: startup process (PID 9524) exited > with exit code 1 > 9523/2018-06-22 13:03:32 PDT LOG: terminating any other active server processes > 9523/2018-06-22 13:03:32 PDT LOG: database system is shut down > > I haven't investigated this at all, but I assume that the problem is a > simple oversight. The new ItemIdSetDeadRedirect() concept that you've > introduced probably necessitates changes in both the WAL logging > routines and the redo/recovery routines. You need to go make those > changes. (By the way, I don't think you should be using the constant > "3" with the ItemIdIsDeadRedirection() macro definition.) > > Let me know if you get stuck on this, or need more direction. > I was investigated the bug of the simple smoke test. You're right: make any manipulations with line pointer in heap_page_prune() without reflection in WAL record is no good idea. But this consistency problem arises even on clean PostgreSQL installation (without my patch) with ItemIdSetDead() -> ItemIdMarkDead() replacement. Byte-by-byte comparison of master and replay pages shows only 2 bytes difference in the tuple storage part of page. I don't stuck on yet, but good ideas are welcome. -- Andrey Lepikhov Postgres Professional: https://postgrespro.com The Russian Postgres Company
On Fri, Jun 22, 2018 at 8:24 PM, Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote: > Hi, > According to your feedback, i develop second version of the patch. > In this version: > 1. High-level functions index_beginscan(), index_rescan() not used. Tree > descent made by _bt_search(). _bt_binsrch() used for positioning on the > page. > 2. TID list introduced in amtargetdelete() interface. Now only one tree > descent needed for deletion all tid's from the list with equal scan key > value - logical duplicates deletion problem. > 3. Only one WAL record for index tuple deletion per leaf page per > amtargetdelete() call. > 4. VACUUM can sort TID list preliminary for more quick search of duplicates. > > Background worker will come later. > > Thank you for updating patches! Here is some comments for the latest patch. +static void +quick_vacuum_index(Relation irel, Relation hrel, + IndexBulkDeleteResult **overall_stats, + LVRelStats *vacrelstats) +{ (snip) + /* + * Collect statistical info + */ + lazy_cleanup_index(irel, *overall_stats, vacrelstats); +} I think that we should not call lazy_cleanup_index at the end of quick_vacuum_index because we call it multiple times during a lazy vacuum and index statistics can be changed during vacuum. We already call lazy_cleanup_index at the end of lazy_scan_heap. bttargetdelete doesn't delete btree pages even if pages become empty. I think we should do that. Otherwise empty page never be recycled. But please note that if we delete btree pages during bttargetdelete, recyclable pages might not be recycled. That is, if we choose the target deletion method every time then the deleted-but-not-recycled pages could never be touched, unless reaching vacuum_cleanup_index_scale_factor. So I think we need to either run bulk-deletion method or do cleanup index before btpo.xact wraparound. + ivinfo.indexRelation = irel; + ivinfo.heapRelation = hrel; + qsort((void *)vacrelstats->dead_tuples, vacrelstats->num_dead_tuples, sizeof(ItemPointerData), tid_comparator); I think the sorting vacrelstats->dead_tuples is not necessary because garbage TIDs are stored in a sorted order. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On 26.06.2018 15:31, Masahiko Sawada wrote: > On Fri, Jun 22, 2018 at 8:24 PM, Andrey V. Lepikhov > <a.lepikhov@postgrespro.ru> wrote: >> Hi, >> According to your feedback, i develop second version of the patch. >> In this version: >> 1. High-level functions index_beginscan(), index_rescan() not used. Tree >> descent made by _bt_search(). _bt_binsrch() used for positioning on the >> page. >> 2. TID list introduced in amtargetdelete() interface. Now only one tree >> descent needed for deletion all tid's from the list with equal scan key >> value - logical duplicates deletion problem. >> 3. Only one WAL record for index tuple deletion per leaf page per >> amtargetdelete() call. >> 4. VACUUM can sort TID list preliminary for more quick search of duplicates. >> >> Background worker will come later. >> >> > > Thank you for updating patches! Here is some comments for the latest patch. > > +static void > +quick_vacuum_index(Relation irel, Relation hrel, > + IndexBulkDeleteResult **overall_stats, > + LVRelStats *vacrelstats) > +{ > (snip) > + /* > + * Collect statistical info > + */ > + lazy_cleanup_index(irel, *overall_stats, vacrelstats); > +} > > I think that we should not call lazy_cleanup_index at the end of > quick_vacuum_index because we call it multiple times during a lazy > vacuum and index statistics can be changed during vacuum. We already > call lazy_cleanup_index at the end of lazy_scan_heap. > Ok > bttargetdelete doesn't delete btree pages even if pages become empty. > I think we should do that. Otherwise empty page never be recycled. But > please note that if we delete btree pages during bttargetdelete, > recyclable pages might not be recycled. That is, if we choose the > target deletion method every time then the deleted-but-not-recycled > pages could never be touched, unless reaching > vacuum_cleanup_index_scale_factor. So I think we need to either run > bulk-deletion method or do cleanup index before btpo.xact wraparound. > > + ivinfo.indexRelation = irel; > + ivinfo.heapRelation = hrel; > + qsort((void *)vacrelstats->dead_tuples, > vacrelstats->num_dead_tuples, sizeof(ItemPointerData), > tid_comparator); > Ok. I think caller of bttargetdelete() must decide when to make index cleanup. > I think the sorting vacrelstats->dead_tuples is not necessary because > garbage TIDs are stored in a sorted order. > Sorting was introduced because I keep in mind background worker and more flexible cleaning strategies, not only full tuple-by-tuple relation and block scan. Caller of bttargetdelete() can set info->isSorted to prevent sorting operation. > Regards, > > -- > Masahiko Sawada > NIPPON TELEGRAPH AND TELEPHONE CORPORATION > NTT Open Source Software Center > -- Andrey Lepikhov Postgres Professional: https://postgrespro.com The Russian Postgres Company
On 23.06.2018 00:43, Peter Geoghegan wrote: > On Fri, Jun 22, 2018 at 4:24 AM, Andrey V. Lepikhov > <a.lepikhov@postgrespro.ru> wrote: >> According to your feedback, i develop second version of the patch. >> In this version: >> 1. High-level functions index_beginscan(), index_rescan() not used. Tree >> descent made by _bt_search(). _bt_binsrch() used for positioning on the >> page. >> 2. TID list introduced in amtargetdelete() interface. Now only one tree >> descent needed for deletion all tid's from the list with equal scan key >> value - logical duplicates deletion problem. >> 3. Only one WAL record for index tuple deletion per leaf page per >> amtargetdelete() call. > > Cool. > > What is this "race" code about? > >> + buffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(tid), RBM_NORMAL, NULL); >> + LockBuffer(buffer, BUFFER_LOCK_SHARE); >> + >> + page = (Page) BufferGetPage(buffer); >> + offnum = ItemPointerGetOffsetNumber(tid); >> + lp = PageGetItemId(page, offnum); >> + >> + /* >> + * VACUUM Races: someone already remove the tuple from HEAP. Ignore it. >> + */ >> + if (!ItemIdIsUsed(lp)) >> + return NULL; > > It looks wrong -- why should another process have set the item as > unused? And if we assume that that's possible at all, what's to stop a > third process from actually reusing the item pointer before we arrive > (at get_tuple_by_tid()), leaving us to find a tuple that is totally > unrelated to the original tuple to be deleted? > > (Also, you're not releasing the buffer lock before you return.) > >> 4. VACUUM can sort TID list preliminary for more quick search of duplicates. > > This version of the patch prevents my own "force unique keys" patch > from working, since you're not using my proposed new > _bt_search()/_bt_binsrch()/_bt_compare() interface (you're not passing > them a heap TID). It is essential that your patch be able to quickly > reach any tuple that it needs to kill. Otherwise, the worst case > performance is totally unacceptable; it will never be okay to go > through 10%+ of the index to kill a single tuple hundreds or even > thousands of times per VACUUM. It seems to me that doing this > tid_list_search() binary search is pointless -- you should instead be > relying on logical duplicates using their heap TID as a tie-breaker. > Rather than doing a binary search within tid_list_search(), you should > instead match the presorted heap TIDs at the leaf level against the > sorted in-memory TID list. You know, a bit like a merge join. I agree with you. Binary search was developed in awaiting your patch. > > I suggest that you go even further than this: I think that you should > just start distributing my patch as part of your patch series. You can > change my code if you need to. Good. I am ready to start distributing your patch. At the beginning of the work I planned to make patch for physical TID ordering in the btree index. Your patch will make it much easier. I also suggest using "git format patch" > with simple, short commit messages to produce patches. This makes it a > lot easier to track the version of the patch, changes over time, etc. > Ok > I understand why you'd hesitate to take ownership of my code (it has > big problems!), but the reality is that all the problems that my patch > has are also problems for your patch. One patch cannot get committed > without the other, so they are already the same project. 
> As a bonus, my patch will probably improve the best case performance for your
> patch, since multi-deletions will now have much better locality of
> access.
>
I still believe that the patch for physical TID ordering in btree:
1) has its own value, not only for target deletion,
2) will require only a few local changes in my code,
and these patches can be developed independently.

I have prepared a third version of the patches. Summary:
1. Mask DEAD tuples on a page during consistency checking (see the comments for the mask_dead_tuples() function).
2. Physical TID ordering is still not used.
3. Index cleanup after each quick_vacuum_index() call was excluded.

--
Andrey Lepikhov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company
Attachment
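The masking in point 1 presumably amounts to something like the following (a rough sketch of the concept, not the code from the attachment): before the consistency check compares page images, blank out the storage that the patch now leaves behind for LP_DEAD items, so that primary and standby only need to agree on the line pointers themselves.

static void
mask_dead_tuples(Page page)
{
	OffsetNumber offnum;

	for (offnum = FirstOffsetNumber;
		 offnum <= PageGetMaxOffsetNumber(page);
		 offnum = OffsetNumberNext(offnum))
	{
		ItemId		itemid = PageGetItemId(page, offnum);

		/* zero out the tuple storage kept behind a dead line pointer */
		if (ItemIdIsDead(itemid) && ItemIdHasStorage(itemid))
			memset((char *) page + ItemIdGetOffset(itemid), MASK_MARKER,
				   ItemIdGetLength(itemid));
	}
}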
On Tue, Jun 26, 2018 at 3:31 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > bttargetdelete doesn't delete btree pages even if pages become empty. > I think we should do that. Otherwise empty page never be recycled. But > please note that if we delete btree pages during bttargetdelete, > recyclable pages might not be recycled. That is, if we choose the > target deletion method every time then the deleted-but-not-recycled > pages could never be touched, unless reaching > vacuum_cleanup_index_scale_factor. So I think we need to either run > bulk-deletion method or do cleanup index before btpo.xact wraparound. As you pointed out, we can certainly never fully delete or recycle half-dead pages using bttargetdelete. We already need to make some kind of compromise around page deletion, and it may not be necessary to insist that bttargetdelete does any kind of page deletion. I'm unsure of that, though. -- Peter Geoghegan
On Tue, Jun 26, 2018 at 11:40 PM, Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote: > I still believe that the patch for physical TID ordering in btree: > 1) has its own value, not only for target deletion, > 2) will require only a few local changes in my code, > and this patches can be developed independently. I want to be clear on something now: I just don't think that this patch has any chance of getting committed without something like my own patch to go with it. The worst case for your patch without that component is completely terrible. It's not really important for you to actually formally make it part of your patch, so I'm not going to insist on that or anything, but the reality is that my patch does not have independent value -- and neither does yours. I'm sorry if that sounds harsh, but this is a difficult, complicated project. It's better to be clear about this stuff earlier on. > I prepare third version of the patches. Summary: > 1. Mask DEAD tuples at a page during consistency checking (See comments for > the mask_dead_tuples() function). > 2. Still not using physical TID ordering. > 3. Index cleanup() after each quick_vacuum_index() call was excluded. How does this patch affect opportunistic pruning in particular? Not being able to immediately reclaim tuple space in the event of a dead hot chain that is marked LP_DEAD could hurt quite a lot, including with very common workloads, such as pgbench (pgbench accounts tuples are quite a lot wider than a raw item pointer, and opportunistic pruning is much more important than vacuuming). Is that going to be acceptable, do you think? Have you measured the effects? Can we do something about it, like make pruning behave differently when it's opportunistic? Are you aware of the difference between _bt_delitems_delete() and _bt_delitems_vacuum(), and the considerations for hot standby? I think that that's another TODO list item for this patch. -- Peter Geoghegan
On 28.06.2018 05:00, Peter Geoghegan wrote: > On Tue, Jun 26, 2018 at 11:40 PM, Andrey V. Lepikhov > <a.lepikhov@postgrespro.ru> wrote: >> I still believe that the patch for physical TID ordering in btree: >> 1) has its own value, not only for target deletion, >> 2) will require only a few local changes in my code, >> and this patches can be developed independently. > > I want to be clear on something now: I just don't think that this > patch has any chance of getting committed without something like my > own patch to go with it. The worst case for your patch without that > component is completely terrible. It's not really important for you to > actually formally make it part of your patch, so I'm not going to > insist on that or anything, but the reality is that my patch does not > have independent value -- and neither does yours. > As I wrote before in the last email, I will integrate TID sorting to my patches right now. Please, give me access to the last version of your code, if it possible. You can track the progress at https://github.com/danolivo/postgres git repository > I'm sorry if that sounds harsh, but this is a difficult, complicated > project. It's better to be clear about this stuff earlier on. Ok. It is clear now. > >> I prepare third version of the patches. Summary: >> 1. Mask DEAD tuples at a page during consistency checking (See comments for >> the mask_dead_tuples() function). >> 2. Still not using physical TID ordering. >> 3. Index cleanup() after each quick_vacuum_index() call was excluded. > > How does this patch affect opportunistic pruning in particular? Not > being able to immediately reclaim tuple space in the event of a dead > hot chain that is marked LP_DEAD could hurt quite a lot, including > with very common workloads, such as pgbench (pgbench accounts tuples > are quite a lot wider than a raw item pointer, and opportunistic > pruning is much more important than vacuuming). Is that going to be > acceptable, do you think? Have you measured the effects? Can we do > something about it, like make pruning behave differently when it's > opportunistic? This is the most "tasty" part of the work. I plan some experimental research on it at the end of patches developing (including TID sort) and parametrized opportunistic pruning for flexibility of switching between strategies on the fly. My current opinion on this question: we can develop flexible strategy based on parameters: free space at a block, frequency of UPDATE/DELETE queries, percent of DEAD tuples in a block/relation. Background cleaner, raised by heap_page_prune(), give an opportunity for using different ways for each block or relation. This technique should be able to configure from fully non-storage DEAD tuples+vacuum to all-storage DEAD tuples+target deletion by DB admin. > > Are you aware of the difference between _bt_delitems_delete() and > _bt_delitems_vacuum(), and the considerations for hot standby? I think > that that's another TODO list item for this patch. > Ok -- Andrey Lepikhov Postgres Professional: https://postgrespro.com The Russian Postgres Company
On Wed, Jun 27, 2018 at 12:10 PM, Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote: > I prepare third version of the patches. Summary: > 1. Mask DEAD tuples at a page during consistency checking (See comments for > the mask_dead_tuples() function). - ItemIdSetDead(lp); + if (target_index_deletion_factor > 0) + ItemIdMarkDead(lp); + else + ItemIdSetDead(lp); IIUC, you want to hold the storage of DEAD tuples to form the ScanKey which is required for the index scan in the second phase of quick vacuum strategy. To achieve that, you've only marked the tuple as DEAD during pruning. Hence, PageRepairFragmentation cannot claim the space for DEAD tuples since they still have storage associated with them. But, you've WAL-logged it as DEAD tuples having no storage. So, when the WAL record is replayed in standby(or crash recovery), the tuples will be marked as DEAD having no storage and their space may be reclaimed by PageRepairFragmentation. Now, if you do byte-by-byte comparison with wal_consistency tool, it may fail even for normal tuple as well. Please let me know if you feel the same way. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
On 29.06.2018 10:00, Kuntal Ghosh wrote: > On Wed, Jun 27, 2018 at 12:10 PM, Andrey V. Lepikhov > <a.lepikhov@postgrespro.ru> wrote: >> I prepare third version of the patches. Summary: >> 1. Mask DEAD tuples at a page during consistency checking (See comments for >> the mask_dead_tuples() function). > > - ItemIdSetDead(lp); > + if (target_index_deletion_factor > 0) > + ItemIdMarkDead(lp); > + else > + ItemIdSetDead(lp); > IIUC, you want to hold the storage of DEAD tuples to form the ScanKey > which is required for the index scan in the second phase of quick > vacuum strategy. To achieve that, you've only marked the tuple as DEAD > during pruning. Hence, PageRepairFragmentation cannot claim the space > for DEAD tuples since they still have storage associated with them. > But, you've WAL-logged it as DEAD tuples having no storage. So, when > the WAL record is replayed in standby(or crash recovery), the tuples > will be marked as DEAD having no storage and their space may be > reclaimed by PageRepairFragmentation. Now, if you do byte-by-byte > comparison with wal_consistency tool, it may fail even for normal > tuple as well. Please let me know if you feel the same way. > Thanks for your analysis. In this development version of the patch I expect the same prune() strategy on a master and standby (i.e. target_index_deletion_factor is equal for both). In this case storage of a DEAD tuple holds during replay or recovery in the same way. On some future step of development I plan to use more flexible prune() strategy. This will require to append a 'isDeadStorageHold' field to the WAL record. -- Regards, Andrey Lepikhov Postgres Professional: https://postgrespro.com The Russian Postgres Company
On Fri, Jun 29, 2018 at 11:04 AM, Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote: > On 29.06.2018 10:00, Kuntal Ghosh wrote: >> >> On Wed, Jun 27, 2018 at 12:10 PM, Andrey V. Lepikhov >> <a.lepikhov@postgrespro.ru> wrote: >>> >>> I prepare third version of the patches. Summary: >>> 1. Mask DEAD tuples at a page during consistency checking (See comments >>> for >>> the mask_dead_tuples() function). >> >> >> - ItemIdSetDead(lp); >> + if (target_index_deletion_factor > 0) >> + ItemIdMarkDead(lp); >> + else >> + ItemIdSetDead(lp); >> IIUC, you want to hold the storage of DEAD tuples to form the ScanKey >> which is required for the index scan in the second phase of quick >> vacuum strategy. To achieve that, you've only marked the tuple as DEAD >> during pruning. Hence, PageRepairFragmentation cannot claim the space >> for DEAD tuples since they still have storage associated with them. >> But, you've WAL-logged it as DEAD tuples having no storage. So, when >> the WAL record is replayed in standby(or crash recovery), the tuples >> will be marked as DEAD having no storage and their space may be >> reclaimed by PageRepairFragmentation. Now, if you do byte-by-byte >> comparison with wal_consistency tool, it may fail even for normal >> tuple as well. Please let me know if you feel the same way. >> > Thanks for your analysis. > In this development version of the patch I expect the same prune() strategy > on a master and standby (i.e. target_index_deletion_factor is equal for > both). > In this case storage of a DEAD tuple holds during replay or recovery in the > same way. Okay. That's acceptable for now. > On some future step of development I plan to use more flexible prune() > strategy. This will require to append a 'isDeadStorageHold' field to the WAL > record. That sounds interesting. I'll be waiting for your next patches. Few minor comments: + qsort((void *)vacrelstats->dead_tuples, vacrelstats->num_dead_tuples, sizeof(ItemPointerData), tid_comparator); + ivinfo.isSorted = true; My understanding is that vacrelstats->dead_tuples are already sorted based on their tids. I'm referring to the following part in lazy_scan_heap(), for (blk=0 to nblocks) { for (offset=1 to max offset) { if certain conditions are met store the tuple in vacrelstats->dead_tuples using lazy_record_dead_tuple(); } } So, you don't have to sort them again, right? + slot = MakeSingleTupleTableSlot(RelationGetDescr(hrel)); + econtext->ecxt_scantuple = slot; + ExecDropSingleTupleTableSlot(slot); You don't have to do this for every tuple. Before storing the tuple, you can just call MemoryContextReset(econtext->ecxt_per_tuple_memory); Perhaps, you can check IndexBuildHeapRangeScan for the details. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
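The pattern being suggested here, sketched out (based on how IndexBuildHeapRangeScan handles its per-tuple context; the loop body is elided and the loop bounds are only an assumption):

/* create the slot once, outside the loop */
slot = MakeSingleTupleTableSlot(RelationGetDescr(hrel));
econtext->ecxt_scantuple = slot;

for (i = 0; i < vacrelstats->num_dead_tuples; i++)
{
	/* free per-tuple allocations from the previous iteration */
	MemoryContextReset(econtext->ecxt_per_tuple_memory);

	/* ... fetch the dead tuple, store it in the slot, build the scan key ... */
}

/* drop the slot once, after the loop */
ExecDropSingleTupleTableSlot(slot);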
On Wed, Jun 27, 2018 at 12:10 PM, Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote:
>
> On 23.06.2018 00:43, Peter Geoghegan wrote:
>>
>> On Fri, Jun 22, 2018 at 4:24 AM, Andrey V. Lepikhov
>> <a.lepikhov@postgrespro.ru> wrote:
>>>
>>> According to your feedback, i develop second version of the patch.
>>> In this version:

I had a quick look at your patch; I have a few comments.

1.
+ if (use_quick_strategy)
+ quick_vacuum_index(Irel[i], onerel, vacrelstats);
+ else
+ lazy_vacuum_index(Irel[i],

I noticed that inside quick_vacuum_index we again check whether the index has an am routine for target deletion and, if not, call lazy_vacuum_index. I think we can have that check here itself?

2.
+ else
+ {
+ IndexBulkDeleteResult **stats;
+
+ lazy_vacuum_index(irel, stats, vacrelstats);
+ }

These stats will be lost. If you fix the first point then this issue will be solved too; otherwise, you might need to pass indexstats as a parameter to the quick_vacuum_index function.

3.
+}
+
+#include "access/nbtree.h"
+#include "catalog/index.h"
+#include "executor/executor.h"

You can move these header inclusions to the top of the file.

4. In many places the PG coding standard is not followed; a few examples:
a.
+ bool use_quick_strategy = (vacrelstats->num_dead_tuples/vacrelstats->old_live_tuples < target_index_deletion_factor);
+
space between operator and variable:
vacrelstats->num_dead_tuples/vacrelstats->old_live_tuples -> vacrelstats->num_dead_tuples / vacrelstats->old_live_tuples
b.
qsort((void *)vacrelstats -> qsort((void *) vacrelstats

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Thu, 28 Jun 2018, 8:37 Andrey V. Lepikhov <a.lepikhov@postgrespro.ru>:
On 28.06.2018 05:00, Peter Geoghegan wrote:
> On Tue, Jun 26, 2018 at 11:40 PM, Andrey V. Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> I still believe that the patch for physical TID ordering in btree:
>> 1) has its own value, not only for target deletion,
>> 2) will require only a few local changes in my code,
>> and this patches can be developed independently.
>
> I want to be clear on something now: I just don't think that this
> patch has any chance of getting committed without something like my
> own patch to go with it. The worst case for your patch without that
> component is completely terrible. It's not really important for you to
> actually formally make it part of your patch, so I'm not going to
> insist on that or anything, but the reality is that my patch does not
> have independent value -- and neither does yours.
>
As I wrote in my previous email, I will integrate TID sorting into my
patches right now. Please give me access to the latest version of your
code, if possible.
You can track the progress at the https://github.com/danolivo/postgres git
repository.
Peter is absolutely right, imho: tie-breaking by TID within index
ordering is inevitable for reliable performance of this patch.
With regards,
Sokolov Yura.
On 29.06.2018 14:07, Юрий Соколов wrote: > чт, 28 июн. 2018 г., 8:37 Andrey V. Lepikhov <a.lepikhov@postgrespro.ru > <mailto:a.lepikhov@postgrespro.ru>>: > > > > On 28.06.2018 05:00, Peter Geoghegan wrote: > > On Tue, Jun 26, 2018 at 11:40 PM, Andrey V. Lepikhov > > <a.lepikhov@postgrespro.ru <mailto:a.lepikhov@postgrespro.ru>> wrote: > >> I still believe that the patch for physical TID ordering in btree: > >> 1) has its own value, not only for target deletion, > >> 2) will require only a few local changes in my code, > >> and this patches can be developed independently. > > > > I want to be clear on something now: I just don't think that this > > patch has any chance of getting committed without something like my > > own patch to go with it. The worst case for your patch without that > > component is completely terrible. It's not really important for > you to > > actually formally make it part of your patch, so I'm not going to > > insist on that or anything, but the reality is that my patch does not > > have independent value -- and neither does yours. > > > As I wrote before in the last email, I will integrate TID sorting to my > patches right now. Please, give me access to the last version of your > code, if it possible. > You can track the progress at https://github.com/danolivo/postgres git > repository > > > Peter is absolutely right, imho: tie-breaking by TID within index > ordering is inevitable for reliable performance of this patch. > In the new version the patch [1] was used in cooperation with 'retail indextuple deletion' and 'quick vacuum strategy' patches (see '0004-Retail-IndexTuple-Deletion-with-TID-sorting-in-leaf-.patch'. To demonstrate the potential benefits, I did a test: CREATE TABLE test (id serial primary key, value integer, factor integer); INSERT INTO test (value, factor) SELECT random()*1e5, random()*1e3 FROM generate_series(1, 1e7); CREATE INDEX ON test(value); VACUUM; DELETE FROM test WHERE (factor = 1); VACUUM test; Execution time of last "VACUUM test;" command on my notebook was: with bulk deletion: 1.6 s; with Quick Vacuum Strategy: 5.2 s; with Quick Vacuum Strategy & TID sorting: 0.6 s. [1] https://www.postgresql.org/message-id/CAH2-WzkVb0Kom%3DR%2B88fDFb%3DJSxZMFvbHVC6Mn9LJ2n%3DX%3DkS-Uw%40mail.gmail.com > With regards, > Sokolov Yura. -- Andrey Lepikhov Postgres Professional: https://postgrespro.com The Russian Postgres Company
Attachment
On Mon, Jul 2, 2018 at 7:29 AM, Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote: > In the new version the patch [1] was used in cooperation with 'retail > indextuple deletion' and 'quick vacuum strategy' patches (see > '0004-Retail-IndexTuple-Deletion-with-TID-sorting-in-leaf-.patch'. Cool. I'm going to post a revised version of the unique key patch soon. I've found that it's slightly faster to use DESC ordering for the implicit heap TID attribute. Doing so also restores the old user-visible behavior for DROP dependency management, which allows me to remove all changes to the regression test output > Execution time of last "VACUUM test;" command on my notebook was: > > with bulk deletion: 1.6 s; > with Quick Vacuum Strategy: 5.2 s; > with Quick Vacuum Strategy & TID sorting: 0.6 s. I'm glad that you looked into this. You could make this faster still, by actually passing the lowest heap TID in the list of TIDs to kill to _bt_search() and _bt_binsrch(). You might have to work through several extra B-Tree leaf pages per bttargetdelete() call without this (you'll move right multiple times within bttargetdelete()). -- Peter Geoghegan
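[Editor's illustration] A minimal sketch of the selection step Peter describes: find the lowest heap TID in the batch of tuples to kill. How that TID is then threaded into _bt_search()/_bt_binsrch() depends on the interface of the unique-keys patch, so only the comparison step is shown, using the existing ItemPointerCompare().

#include "postgres.h"
#include "storage/itemptr.h"

/* Illustrative helper: return the smallest TID in a batch of target TIDs */
static ItemPointer
lowest_target_tid(ItemPointerData *tids, int ntids)
{
    ItemPointer lowest = &tids[0];
    int         i;

    for (i = 1; i < ntids; i++)
    {
        if (ItemPointerCompare(&tids[i], lowest) < 0)
            lowest = &tids[i];
    }
    return lowest;
}

With VACUUM's dead_tuples array the batch already arrives in ascending TID order, so the lowest TID is simply the first element; a helper like this only matters if the batch can arrive unsorted.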
On Mon, Jul 2, 2018 at 9:28 AM, Peter Geoghegan <pg@bowt.ie> wrote: >> Execution time of last "VACUUM test;" command on my notebook was: >> >> with bulk deletion: 1.6 s; >> with Quick Vacuum Strategy: 5.2 s; >> with Quick Vacuum Strategy & TID sorting: 0.6 s. > > I'm glad that you looked into this. You could make this faster still, > by actually passing the lowest heap TID in the list of TIDs to kill to > _bt_search() and _bt_binsrch(). You might have to work through several > extra B-Tree leaf pages per bttargetdelete() call without this (you'll > move right multiple times within bttargetdelete()). I should add: I think that this doesn't matter so much in your original test case with v1 of my patch, because you're naturally accessing the index tuples in almost the most efficient way already, since you VACUUM works its way from the start of the table until the end of the table. You'll definitely need to pass a heap TID to routines like _bt_search() once you start using my v2, though, since that puts the heap TIDs in DESC sort order. Otherwise, it'll be almost as slow as the plain "Quick Vacuum Strategy" case was. In general, the big idea with my patch is that heap TID is just another attribute. I am not "cheating" in any way; if it's not possible to descend the tree and arrive at the right leaf page without looking through several leaf pages, then my patch is broken. You might also use _bt_moveright() with my patch. That way, you can quickly detect that you need to move right immediately, without going through all the items on the page. This should only be an issue in the event of a concurrent page split, though. In my patch, I use _bt_moveright() in a special way for unique indexes: I need to start at the first leaf page a duplicate could be on for duplicate checking, but once that's over I want to "jump immediately" to the leaf page the index tuple actually needs to be inserted on. That's when _bt_moveright() is called. (Actually, that looks like it breaks unique index enforcement in the case of my patch, which I need to fix, but you might still do this.) -- Peter Geoghegan
On 03.07.2018 00:40, Peter Geoghegan wrote: > On Mon, Jul 2, 2018 at 9:28 AM, Peter Geoghegan <pg@bowt.ie> wrote: >>> Execution time of last "VACUUM test;" command on my notebook was: >>> >>> with bulk deletion: 1.6 s; >>> with Quick Vacuum Strategy: 5.2 s; >>> with Quick Vacuum Strategy & TID sorting: 0.6 s. >> >> I'm glad that you looked into this. You could make this faster still, >> by actually passing the lowest heap TID in the list of TIDs to kill to >> _bt_search() and _bt_binsrch(). You might have to work through several >> extra B-Tree leaf pages per bttargetdelete() call without this (you'll >> move right multiple times within bttargetdelete()). > > I should add: I think that this doesn't matter so much in your > original test case with v1 of my patch, because you're naturally > accessing the index tuples in almost the most efficient way already, > since you VACUUM works its way from the start of the table until the > end of the table. You'll definitely need to pass a heap TID to > routines like _bt_search() once you start using my v2, though, since > that puts the heap TIDs in DESC sort order. Otherwise, it'll be almost > as slow as the plain "Quick Vacuum Strategy" case was. > > In general, the big idea with my patch is that heap TID is just > another attribute. I am not "cheating" in any way; if it's not > possible to descend the tree and arrive at the right leaf page without > looking through several leaf pages, then my patch is broken. > > You might also use _bt_moveright() with my patch. That way, you can > quickly detect that you need to move right immediately, without going > through all the items on the page. This should only be an issue in the > event of a concurrent page split, though. In my patch, I use > _bt_moveright() in a special way for unique indexes: I need to start > at the first leaf page a duplicate could be on for duplicate checking, > but once that's over I want to "jump immediately" to the leaf page the > index tuple actually needs to be inserted on. That's when > _bt_moveright() is called. (Actually, that looks like it breaks unique > index enforcement in the case of my patch, which I need to fix, but > you might still do this.) > Done. Attachment contains an update for use v.2 of the 'Ensure nbtree leaf tuple keys are always unique' patch. Apply order: 1. 0001-Retail-IndexTuple-Deletion-Access-Method.patch - from previous email 2. 0002-Quick-vacuum-strategy.patch - from previous email 3. v2-0001-Ensure-nbtree-leaf-tuple-keys-are-always-unique.patch - from [1] 4. 0004-Retail-IndexTuple-Deletion-with-TID-sorting-in-leaf.patch [1] https://www.postgresql.org/message-id/CAH2-Wzm6D%3DKnV%2BP8bZE-ZtP4e%2BW64HtVTdOenqd1d7HjJL3xZQ%40mail.gmail.com -- Andrey Lepikhov Postgres Professional: https://postgrespro.com The Russian Postgres Company
Attachment
On Tue, Jul 3, 2018 at 5:17 AM, Andrey V. Lepikhov <a.lepikhov@postgrespro.ru> wrote: > Done. > Attachment contains an update for use v.2 of the 'Ensure nbtree leaf tuple > keys are always unique' patch. My v3 is still pending, but is now a lot better than v2. There were bugs in v2 that were fixed. One area that might be worth investigating is retail index tuple deletion performed within the executor in the event of non-HOT updates. Maybe LP_REDIRECT could be repurposed to mean "ghost record", at least in unique index tuples with no NULL values. The idea is that MVCC index scans can skip over those if they've already found a visible tuple with the same value. Also, when there was about to be a page split, they could be treated a little bit like LP_DEAD items. Of course, the ghost bit would have to be treated as a hint that could be "wrong" (e.g. because the transaction hasn't committed yet), so you'd have to go to the heap in the context of a page split, to double check. Also, you'd need heuristics that let you give up on this strategy when it didn't help. I think that this could work well enough for OLTP workloads, and might be more future-proof than doing it in VACUUM. Though, of course, it's still very complicated. -- Peter Geoghegan
On Fri, Jul 13, 2018 at 4:00 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, Jul 3, 2018 at 5:17 AM, Andrey V. Lepikhov > <a.lepikhov@postgrespro.ru> wrote: >> Done. >> Attachment contains an update for use v.2 of the 'Ensure nbtree leaf tuple >> keys are always unique' patch. > > My v3 is still pending, but is now a lot better than v2. There were > bugs in v2 that were fixed. > > One area that might be worth investigating is retail index tuple > deletion performed within the executor in the event of non-HOT > updates. Maybe LP_REDIRECT could be repurposed to mean "ghost record", > at least in unique index tuples with no NULL values. The idea is that > MVCC index scans can skip over those if they've already found a > visible tuple with the same value. I think that's a good idea. The overhead of marking it as ghost seems small and it would speed up index scans. If MVCC index scans have already found a visible tuples with the same value they can not only skip scanning but also kill them? If can, we can kill index tuples without checking the heap. > Also, when there was about to be a > page split, they could be treated a little bit like LP_DEAD items. Of > course, the ghost bit would have to be treated as a hint that could be > "wrong" (e.g. because the transaction hasn't committed yet), so you'd > have to go to the heap in the context of a page split, to double > check. Also, you'd need heuristics that let you give up on this > strategy when it didn't help. > > I think that this could work well enough for OLTP workloads, and might > be more future-proof than doing it in VACUUM. Though, of course, it's > still very complicated. Agreed. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Jul 19, 2018 at 4:29 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> One area that might be worth investigating is retail index tuple >> deletion performed within the executor in the event of non-HOT >> updates. Maybe LP_REDIRECT could be repurposed to mean "ghost record", >> at least in unique index tuples with no NULL values. The idea is that >> MVCC index scans can skip over those if they've already found a >> visible tuple with the same value. > > I think that's a good idea. The overhead of marking it as ghost seems > small and it would speed up index scans. If MVCC index scans have > already found a visible tuples with the same value they can not only > skip scanning but also kill them? If can, we can kill index tuples > without checking the heap. I think you're talking about making LP_REDIRECT marking in nbtree represent a "recently dead" hint: the deleting transaction has committed, and so we are 100% sure that the tuple is about to become garbage, but it cannot be LP_DEAD just yet because it needs to stay around for the benefit of at least one existing snapshot/long running transaction. That's a different idea to what I talked about, since it could perhaps work in a way that's much closer to LP_DEAD/kill_prior_tuple (no extra executor stuff is required). I'm not sure if your idea is better or worse than what I suggested, though. It would certainly be easier to implement. -- Peter Geoghegan
This v5 version of the patch is intended to use with v3-0001-Make-all-nbtree-index-tuples-have-unique-keys patch [1]. [1] https://www.postgresql.org/message-id/CAH2-WzkmTRXh%3DzyMAUHyG3%3DO-QQip6CJc2VyNijRO-vzgPxmoQ%40mail.gmail.com 13.07.2018 00:00, Peter Geoghegan пишет: > On Tue, Jul 3, 2018 at 5:17 AM, Andrey V. Lepikhov > <a.lepikhov@postgrespro.ru> wrote: >> Done. >> Attachment contains an update for use v.2 of the 'Ensure nbtree leaf tuple >> keys are always unique' patch. > My v3 is still pending, but is now a lot better than v2. There were > bugs in v2 that were fixed. > > One area that might be worth investigating is retail index tuple > deletion performed within the executor in the event of non-HOT > updates. Maybe LP_REDIRECT could be repurposed to mean "ghost record", > at least in unique index tuples with no NULL values. The idea is that > MVCC index scans can skip over those if they've already found a > visible tuple with the same value. Also, when there was about to be a > page split, they could be treated a little bit like LP_DEAD items. Of > course, the ghost bit would have to be treated as a hint that could be > "wrong" (e.g. because the transaction hasn't committed yet), so you'd > have to go to the heap in the context of a page split, to double > check. Also, you'd need heuristics that let you give up on this > strategy when it didn't help. > > I think that this could work well enough for OLTP workloads, and might > be more future-proof than doing it in VACUUM. Though, of course, it's > still very complicated. > -- --- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Attachment
13.07.2018 00:00, Peter Geoghegan wrote: > One area that might be worth investigating is retail index tuple > deletion performed within the executor in the event of non-HOT > updates. I will try to use this idea. For now, I have already developed a background worker (in an experimental state). It cleans index and heap relations with the retail deletion approach, using a "soft" strategy with conditional buffer locks. Benchmarks on the pgbench test show the same execution time as PostgreSQL without the worker, while 90% of the dead tuples in the index and heap relations are cleaned without a vacuum. My main effort now goes into tuning the worker. --- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
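[Editor's illustration] A sketch of what a "soft" strategy with conditional buffer locks might look like; this is an assumed shape, not the actual worker code. The function name is made up, while the buffer-manager calls are the existing ones.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* Illustrative only: try to clean one heap block without waiting for a lock */
static bool
try_clean_block(Relation heaprel, BlockNumber blkno)
{
    Buffer  buf = ReadBuffer(heaprel, blkno);

    if (!ConditionalLockBufferForCleanup(buf))
    {
        /* somebody is using the page; give up and retry the block later */
        ReleaseBuffer(buf);
        return false;
    }

    /* ... collect dead tuple TIDs, delete matching index tuples, prune ... */

    UnlockReleaseBuffer(buf);
    return true;
}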
On Sat, Jul 21, 2018 at 3:11 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, Jul 19, 2018 at 4:29 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> One area that might be worth investigating is retail index tuple >>> deletion performed within the executor in the event of non-HOT >>> updates. Maybe LP_REDIRECT could be repurposed to mean "ghost record", >>> at least in unique index tuples with no NULL values. The idea is that >>> MVCC index scans can skip over those if they've already found a >>> visible tuple with the same value. >> >> I think that's a good idea. The overhead of marking it as ghost seems >> small and it would speed up index scans. If MVCC index scans have >> already found a visible tuples with the same value they can not only >> skip scanning but also kill them? If can, we can kill index tuples >> without checking the heap. > > I think you're talking about making LP_REDIRECT marking in nbtree > represent a "recently dead" hint: the deleting transaction has > committed, and so we are 100% sure that the tuple is about to become > garbage, but it cannot be LP_DEAD just yet because it needs to stay > around for the benefit of at least one existing snapshot/long running > transaction. > > That's a different idea to what I talked about, since it could perhaps > work in a way that's much closer to LP_DEAD/kill_prior_tuple (no extra > executor stuff is required). I understood. Thank you for explanation! Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Hi, I wrote a background worker (hcleaner) to demonstrate an application of Retail IndexTuple deletion (see the attached patch). Like autovacuum, it uses the concept of one launcher and many workers, but one worker corresponds to one database.

Short description:
The backend collects dirty block numbers in a hash table at the point in the code immediately after the heap_page_prune() call. The backend sends a package of dirty block numbers (not one-by-one!) over a socket at the end of the transaction, or when the hash table is full.
The launcher forwards block numbers to the corresponding workers.
A worker collects dead tuples from a block, cleans the index relations, and then cleans the heap block. It uses conditional locking with a waiting-list approach if the heap block is busy.

hcleaner has experimental status, but make check-world passes.
For benchmarking I used a Xiaomi notebook with an Intel Core i5 8th-gen processor.

BENCHMARK
----------
test: pgbench -i -s 100 && pgbench -c 25 -j 8 -M prepared -P 60 -T 3600
autovacuum = off

master:
-------
number of transactions actually processed: 6373215
latency average = 14.122 ms
latency stddev = 9.458 ms
tps = 1770.291436 (including connections establishing)
tps = 1770.293191 (excluding connections establishing)

VACUUM verbose pgbench_accounts:
--------------------------------
INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 237496 row versions
DETAIL: CPU: user: 4.67 s, system: 0.27 s, elapsed: 8.05 s
INFO: "pgbench_accounts": removed 237496 row versions in 167652 pages
DETAIL: CPU: user: 7.54 s, system: 3.40 s, elapsed: 26.10 s
INFO: index "pgbench_accounts_pkey" now contains 10000000 row versions in 27422 pages
DETAIL: 237496 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pgbench_accounts": found 165275 removable, 10000000 nonremovable row versions in 167840 out of 167840 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 6373796 There were 82282 unused item pointers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 20.33 s, system: 5.88 s, elapsed: 51.38 s.

patched:
--------
number of transactions actually processed: 6338593
latency average = 14.199 ms
latency stddev = 13.988 ms
tps = 1760.685922 (including connections establishing)
tps = 1760.688038 (excluding connections establishing)

VACUUM verbose pgbench_accounts:
--------------------------------
INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 1804 row versions
DETAIL: CPU: user: 1.84 s, system: 0.05 s, elapsed: 3.34 s
INFO: "pgbench_accounts": removed 1804 row versions in 1468 pages
DETAIL: CPU: user: 0.06 s, system: 0.03 s, elapsed: 1.42 s
INFO: index "pgbench_accounts_pkey" now contains 10000000 row versions in 27422 pages
DETAIL: 1618 index row versions were removed. 0 index pages have been deleted, 0 are currently reusable. CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "pgbench_accounts": found 168561 removable, 10000000 nonremovable row versions in 169466 out of 169466 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 6339174 There were 75478 unused item pointers. Skipped 0 pages due to buffer pins, 0 frozen pages. 0 pages are entirely empty. CPU: user: 12.27 s, system: 4.03 s, elapsed: 31.43 s.

-- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Attachment
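[Editor's illustration] A sketch of the backend-side collection step described above; this is an assumed shape, not the hcleaner code. The hash table keyed by (relation, block) uses the ordinary dynahash API, and sending the batch over the socket is left out.

#include "postgres.h"
#include "storage/block.h"
#include "utils/hsearch.h"

typedef struct DirtyBlockKey
{
    Oid         relid;
    BlockNumber blkno;
} DirtyBlockKey;

static HTAB *dirty_blocks = NULL;

/*
 * Illustrative: remember a pruned block; the accumulated batch would be
 * shipped to the launcher at end of transaction or when the table fills up.
 */
static void
remember_dirty_block(Oid relid, BlockNumber blkno)
{
    DirtyBlockKey key;
    bool        found;

    if (dirty_blocks == NULL)
    {
        HASHCTL ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(DirtyBlockKey);
        ctl.entrysize = sizeof(DirtyBlockKey);
        dirty_blocks = hash_create("dirty heap blocks", 1024, &ctl,
                                   HASH_ELEM | HASH_BLOBS);
    }

    key.relid = relid;
    key.blkno = blkno;
    (void) hash_search(dirty_blocks, &key, HASH_ENTER, &found);
}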
On Tue, Aug 7, 2018 at 12:19 AM, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote: > I wrote a background worker (hcleaner) to demonstrate application of Retail > IndexTuple deletion (see patch at attachment). > Like Autovacuum it utilizes concept of one launcher and many workers. But > one worker correspond to one database. > > Short description: > Backend collects dirty block numbers by a hash table at the point in code > immediately after heap_page_prune() call. Backend send a package of dirty > block numbers (not one-by-one!) by socket at the end of transaction or if > hash table is full. > Launcher transfers block numbers to correspond workers. > Worker collects dead tuples from a block, clean index relations, clean heap > block. It uses conditional locking with waiting list approach if heap block > are busy. > > hcleaner has experimental status, but make check-world passed How does this affect ordinary opportunistic pruning? -- Peter Geoghegan
09.08.2018 05:19, Peter Geoghegan wrote: > On Tue, Aug 7, 2018 at 12:19 AM, Andrey Lepikhov > <a.lepikhov@postgrespro.ru> wrote: >> I wrote a background worker (hcleaner) to demonstrate application of Retail >> IndexTuple deletion (see patch at attachment). >> Like Autovacuum it utilizes concept of one launcher and many workers. But >> one worker correspond to one database. >> >> Short description: >> Backend collects dirty block numbers by a hash table at the point in code >> immediately after heap_page_prune() call. Backend send a package of dirty >> block numbers (not one-by-one!) by socket at the end of transaction or if >> hash table is full. >> Launcher transfers block numbers to correspond workers. >> Worker collects dead tuples from a block, clean index relations, clean heap >> block. It uses conditional locking with waiting list approach if heap block >> are busy. >> >> hcleaner has experimental status, but make check-world passed > > How does this affect ordinary opportunistic pruning? >
As far as I understand, the background worker does not affect ordinary opportunistic pruning. For a dirty block it does cleanup work similar to vacuum (if it can acquire the conditional lock). I tried two approaches:
1. In the executor: TIDs of deleted/updated tuples are collected directly in the executor and sent to the background worker, and the worker runs heap_page_prune() itself. This is not a good variant, because each backend performs opportunistic pruning anyway, and that is already significant work for detecting dead tuples. Moreover, in this case the worker competes with the backends for pruning and is heavily loaded when many backends exist.
2. In heap_page_prune_opt(): the backend collects the ids of dirty blocks immediately after pruning and sends them to the worker. In this case the backend performs all the work of detecting dead tuples. This lets the worker skip cleaning a block during periods of heavy load and clean it later. Since the locks are conditional, we do not hamper the backends' work during high-load periods and we use idle time for relation cleaning in a soft manner.
Variant No. 2 looks better, and I use it now. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
On Tue, Aug 7, 2018 at 4:19 PM, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote: > Hi, > I wrote a background worker (hcleaner) to demonstrate application of Retail > IndexTuple deletion (see patch at attachment). The patch doesn't seem to have the hcleaner code. Could you share it? Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
15.08.2018 12:17, Masahiko Sawada wrote: > On Tue, Aug 7, 2018 at 4:19 PM, Andrey Lepikhov > <a.lepikhov@postgrespro.ru> wrote: >> Hi, >> I wrote a background worker (hcleaner) to demonstrate application of Retail >> IndexTuple deletion (see patch at attachment). > > The patch doesn't seem to have the hcleaner code. Could you share it?
I appreciate you pointing out my mistake. The attachment contains the full patch set: index tuple retail deletion, ordering of b-tree tuples by TID (provided by Peter Geoghegan), and the background cleaner.
In this version of the background worker you can see the state of hcleaner in the 'pg_stat_progress_cleaner' view, like VACUUM.
Unlike the previous version, hcleaner checks whether a block is present in memory before cleanup (see the RBM_NORMAL_NO_READ mode in the ReadBufferExtended() call) and does not read blocks from disk storage (except on shutdown, after a SIGTERM signal is caught).
For a feature demonstration you can use a simple test (autovacuum = off):

pgbench -i -s 1 &&
psql -c $"CREATE INDEX pgbench_accounts_ext ON public.pgbench_accounts USING btree (abalance);" &&
pgbench -t <n> -c 20 -j 8 -f test.pgb

where test.pgb is:
-------
\set aid random(1, 100000 * :scale)
\set delta random(-5000, 5000)
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
-------

My workstation shows:

     |        master          |        patched         |
 n   | HEAP size | INDEX size | HEAP size | INDEX size |
-----|-----------|------------|-----------|------------|
 2e3 | 13 MB     | 2.3 MB     | 13 MB     | 2.3 MB     |
 2e4 | 14 MB     | 2.7 MB     | 13 MB     | 2.7 MB     |
 2e5 | 14 MB     | 8.0 MB     | 14 MB     | 4.8 MB     |
 2e6 | 61 MB     | 58 MB      | 14 MB     | 6.7 MB     |

where HEAP size is the size of the 'pgbench_accounts' relation and INDEX size is the size of the 'pgbench_accounts_ext' index relation.
This demonstrates the relation bloat problem and the influence of hcleaner in an exaggerated manner.
One remaining problem is modification of the regression tests, because hcleaner makes the physical order of tuples in relations unpredictable.
> > Regards, > > -- > Masahiko Sawada > NIPPON TELEGRAPH AND TELEPHONE CORPORATION > NTT Open Source Software Center > -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Attachment
Hi, I prepared the next version of the background worker (cleaner) based on the retail index tuple deletion patch. This version shows stable behavior on the regression tests and pgbench workloads.

In this version:
1. Only AccessShareLock is acquired for cleanup of heap and index relations.
2. A somewhat 'aggressive' cleanup strategy is introduced - conditional cleanup locks are not used.
3. Only in-memory blocks are cleaned.
4. The cleaner calls heap_page_prune() before cleaning a block.

Benchmarks
---------
Two factors were evaluated: performance (tps) and relation bloat.

Before each test, some rarefaction of pgbench_accounts was modeled by deleting 10% of the tuples in each block.
I also tested normal and Gaussian distributions of queries on the pgbench_accounts relation.
Autovacuum uses default settings.

Script:
pgbench -i -s 10
psql -c $"DELETE FROM pgbench_accounts WHERE (random() < 0.1);"
psql -c $"VACUUM;"
psql -c $"CREATE INDEX pgbench_accounts_ext ON public.pgbench_accounts USING btree (abalance);" &&
pgbench -T 3600 -c 32 -j 8 -M prepared -P 600

NORMAL distribution:
average tps = 1045 (cleaner); = 1077 (autovacuum)

Relations size at the end of the test, MB:
pgbench_accounts: 128 (cleaner); 128 (autovacuum)
pgbench_branches: 0.1 (cleaner); 2.1 (autovacuum)
pgbench_tellers: 0.4 (cleaner); 2.8 (autovacuum)
pgbench_accounts_pkey: 21 (cleaner); 43 (autovacuum)
pgbench_accounts_ext: 48 (cleaner); 56 (autovacuum)

Gaussian distribution:
average tps = 213 (cleaner); = 213 (autovacuum)

Relations size at the end of the test, MB:
pgbench_accounts: 128 (cleaner); 128 (autovacuum)
pgbench_accounts_ext: 22 (cleaner); 29 (autovacuum)

Conclusions
-----------
1. For retail index tuple deletion purposes I replaced ItemIdSetDead() with ItemIdMarkDead() in heap_page_prune_execute(). As a consequence, when every block of a relation is 100% full we get more or less bloat of the heap and index. When the blocks already have free space, the cleaner can delay heap and index bloat without a vacuum.
2. The cleaner works fine when access frequency is skewed across relation blocks.
3. The cleaner does not cause a decrease in performance.

-- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Attachment
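[Editor's note] For readers following conclusion 1: the two macros differ roughly as below (paraphrasing src/include/storage/itemid.h). ItemIdMarkDead() keeps the tuple body available for building the index scan key, while ItemIdSetDead() lets PageRepairFragmentation() reclaim the space.

/* paraphrased from src/include/storage/itemid.h */
#define ItemIdSetDead(itemId) \
( \
    (itemId)->lp_flags = LP_DEAD, \
    (itemId)->lp_off = 0, \
    (itemId)->lp_len = 0 \
)

#define ItemIdMarkDead(itemId) \
( \
    (itemId)->lp_flags = LP_DEAD \
)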
And the background worker patch is attached.

05.09.2018 15:25, Andrey Lepikhov wrote:
> Hi, I prepared the next version of the background worker (cleaner) based on the retail index tuple deletion patch. This version shows stable behavior on the regression tests and pgbench workloads.
>
> In this version:
> 1. Only AccessShareLock is acquired for cleanup of heap and index relations.
> 2. A somewhat 'aggressive' cleanup strategy is introduced - conditional cleanup locks are not used.
> 3. Only in-memory blocks are cleaned.
> 4. The cleaner calls heap_page_prune() before cleaning a block.
>
> Benchmarks
> ---------
> Two factors were evaluated: performance (tps) and relation bloat.
>
> Before each test, some rarefaction of pgbench_accounts was modeled by deleting 10% of the tuples in each block.
> I also tested normal and Gaussian distributions of queries on the pgbench_accounts relation.
> Autovacuum uses default settings.
>
> Script:
> pgbench -i -s 10
> psql -c $"DELETE FROM pgbench_accounts WHERE (random() < 0.1);"
> psql -c $"VACUUM;"
> psql -c $"CREATE INDEX pgbench_accounts_ext ON public.pgbench_accounts USING btree (abalance);" &&
> pgbench -T 3600 -c 32 -j 8 -M prepared -P 600
>
> NORMAL distribution:
> average tps = 1045 (cleaner); = 1077 (autovacuum)
>
> Relations size at the end of the test, MB:
> pgbench_accounts: 128 (cleaner); 128 (autovacuum)
> pgbench_branches: 0.1 (cleaner); 2.1 (autovacuum)
> pgbench_tellers: 0.4 (cleaner); 2.8 (autovacuum)
> pgbench_accounts_pkey: 21 (cleaner); 43 (autovacuum)
> pgbench_accounts_ext: 48 (cleaner); 56 (autovacuum)
>
> Gaussian distribution:
> average tps = 213 (cleaner); = 213 (autovacuum)
>
> Relations size at the end of the test, MB:
> pgbench_accounts: 128 (cleaner); 128 (autovacuum)
> pgbench_accounts_ext: 22 (cleaner); 29 (autovacuum)
>
> Conclusions
> -----------
> 1. For retail index tuple deletion purposes I replaced ItemIdSetDead() with ItemIdMarkDead() in heap_page_prune_execute(). As a consequence, when every block of a relation is 100% full we get more or less bloat of the heap and index. When the blocks already have free space, the cleaner can delay heap and index bloat without a vacuum.
> 2. The cleaner works fine when access frequency is skewed across relation blocks.
> 3. The cleaner does not cause a decrease in performance.

-- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Attachment
Here is the next version of the amtargetdelete() interface and its implementation for nbtree, devoted to the problem of retail index tuple deletion. It uses the 'v5-0001-Make-nbtree-indexes-have-unique-keys-in-tuples' patch [1] to delete all logical duplicates in one tree descent. [1] https://www.postgresql.org/message-id/CAH2-WzkfK=JVHjkd17TLDvsFb6psduTt5WYiT8dg+-UFc+rSSQ@mail.gmail.com -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Attachment
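[Editor's illustration] To make "all logical duplicates in one tree descent" concrete, here is a purely illustrative merge of a sorted target-TID list against a run of duplicates on one leaf page, assuming the duplicates share the scan-key value so their on-page order is by heap TID (as under the unique-keys patch). It ignores the page high key and all locking/WAL details, and is not taken from the patch.

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufpage.h"
#include "storage/itemptr.h"
#include "storage/off.h"

/* Illustrative only: mark matching index tuples dead on a single leaf page */
static void
mark_targets_on_leaf(Page page, ItemPointerData *targets, int ntargets)
{
    OffsetNumber offnum = FirstOffsetNumber;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    int         i = 0;

    while (offnum <= maxoff && i < ntargets)
    {
        ItemId      itemid = PageGetItemId(page, offnum);
        IndexTuple  itup = (IndexTuple) PageGetItem(page, itemid);
        int32       cmp = ItemPointerCompare(&itup->t_tid, &targets[i]);

        if (cmp < 0)
            offnum = OffsetNumberNext(offnum);      /* not a target, move on */
        else if (cmp == 0)
        {
            ItemIdMarkDead(itemid);                 /* or collect for deletion */
            i++;
            offnum = OffsetNumberNext(offnum);
        }
        else
            i++;                /* target not on this page (or already gone) */
    }
}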
Here is the v6 version of quick vacuum, which utilizes the amtargetdelete() interface for retail index tuple deletion. It is now simpler and more concise. It must be applied after Retail-IndexTuple-Deletion-Access-Method.patch.

BENCHMARKS:
-----------

Initial script:
pgbench -i -s $scale # initial DB generation
"CREATE INDEX pgbench_accounts_ext ON public.pgbench_accounts USING btree (abalance);" # additional index

Comparison with lazy vacuum:

script:
"DELETE FROM pgbench_accounts WHERE (random() < $factor);" # delete a part of the tuples for comparing the cleaning strategies
"VACUUM pgbench_accounts;" # vacuum time checked with the bash command 'date +%s%N | cut -b1-13'

Results (time in ms):

        | $scale=10   | $scale=100    |
$factor | QVAC | LVAC | QVAC  | LVAC  |
 1E-6   | -    | -    | 284   | 979   |
 1E-5   | 78   | 144  | 288   | 1423  |
 1E-4   | 72   | 280  | 388   | 3304  |
 1E-3   | 189  | 609  | 2294  | 6029  |
 1E-2   | 443  | 783  | 54232 | 67884 |
 1E-1   | 1593 | 1237 | 83092 | 86104 |

where QVAC is forced use of quick vacuum and LVAC is lazy vacuum for index cleanup. $factor determines the number of vacuumed tuples; for example, $scale=10, $factor=1E-1 -> 100000 tuples vacuumed.

So, the quick strategy can be used effectively in a vacuum process for up to 1-2% of DEAD tuples in a relation.

-- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Attachment
> On Fri, Sep 21, 2018 at 5:52 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote: > > The v6 version of quick vacuum, which utilizes the amtargetdelete() > interface for retail indextuple deletion. > Now it is more simple and laconic. > It must be applied after Retail-IndexTuple-Deletion-Access-Method.patch. > > BENCHMARKS: > ----------- > > Initial script: > pgbench -i -s $scale # initial DB generation > "CREATE INDEX pgbench_accounts_ext ON public.pgbench_accounts USING > btree (abalance);" # additional index > > Comparison with lazy vacuum: > > script: > "DELETE FROM pgbench_accounts WHERE (random() < $factor);" # delete a > part of tuples for cleaning strategies comparison > "VACUUM pgbench_accounts;" # check time of vacuum process by bash 'date > +%s%N | cut -b1-13' command > > Results: > | $scale=10 | $scale=100 | > $factor| QVAC | LVAC | QVAC | LVAC | > 1E-6 | - | - | 284 | 979 | > 1E-5 | 78 | 144 | 288 | 1423 | > 1E-4 | 72 | 280 | 388 | 3304 | > 1E-3 | 189 | 609 | 2294 | 6029 | > 1E-2 | 443 | 783 | 54232| 67884| > 1E-1 | 1593 | 1237 | 83092| 86104| > > where QVAC - forced use of quick vacuum; LVAC - use lazy vacuum for > index cleanup. $factor corresponds a number of vacuumed tuples. For > example, $scale=10, $factor=1E-1 -> 100000 tuples vacuumed. Time > measured in ms. > > So, quick strategy can be used in a vacuum process effectively up to > 1-2% of DEAD tuples in a relation. Hi, Unfortunately, this patch doesn't compile anymore: index.c: In function ‘htid2IndexDatum’: index.c:4172:2: error: too few arguments to function ‘MakeSingleTupleTableSlot’ TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(hrel)); ^ Also I'm a bit confused about the current status of collaboration between this patch and [1], one is tightly depends on another or not? Does it makes sense to have only one corresponding CF item then? For now I'll move this one to the next CF. [1]: https://www.postgresql.org/message-id/flat/CAH2-WzkVb0Kom%3DR%2B88fDFb%3DJSxZMFvbHVC6Mn9LJ2n%3DX%3DkS-Uw%40mail.gmail.com
On 2018-11-29 14:27:50 +0100, Dmitry Dolgov wrote: > > On Fri, Sep 21, 2018 at 5:52 AM Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote: > > > > The v6 version of quick vacuum, which utilizes the amtargetdelete() > > interface for retail indextuple deletion. > > Now it is more simple and laconic. > > It must be applied after Retail-IndexTuple-Deletion-Access-Method.patch. > > > > BENCHMARKS: > > ----------- > > > > Initial script: > > pgbench -i -s $scale # initial DB generation > > "CREATE INDEX pgbench_accounts_ext ON public.pgbench_accounts USING > > btree (abalance);" # additional index > > > > Comparison with lazy vacuum: > > > > script: > > "DELETE FROM pgbench_accounts WHERE (random() < $factor);" # delete a > > part of tuples for cleaning strategies comparison > > "VACUUM pgbench_accounts;" # check time of vacuum process by bash 'date > > +%s%N | cut -b1-13' command > > > > Results: > > | $scale=10 | $scale=100 | > > $factor| QVAC | LVAC | QVAC | LVAC | > > 1E-6 | - | - | 284 | 979 | > > 1E-5 | 78 | 144 | 288 | 1423 | > > 1E-4 | 72 | 280 | 388 | 3304 | > > 1E-3 | 189 | 609 | 2294 | 6029 | > > 1E-2 | 443 | 783 | 54232| 67884| > > 1E-1 | 1593 | 1237 | 83092| 86104| > > > > where QVAC - forced use of quick vacuum; LVAC - use lazy vacuum for > > index cleanup. $factor corresponds a number of vacuumed tuples. For > > example, $scale=10, $factor=1E-1 -> 100000 tuples vacuumed. Time > > measured in ms. > > > > So, quick strategy can be used in a vacuum process effectively up to > > 1-2% of DEAD tuples in a relation. > > Hi, > > Unfortunately, this patch doesn't compile anymore: > > index.c: In function ‘htid2IndexDatum’: > index.c:4172:2: error: too few arguments to function ‘MakeSingleTupleTableSlot’ > TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(hrel)); > ^ > > Also I'm a bit confused about the current status of collaboration between this > patch and [1], one is tightly depends on another or not? Does it makes sense > to have only one corresponding CF item then? For now I'll move this one to > the next CF. > > [1]: https://www.postgresql.org/message-id/flat/CAH2-WzkVb0Kom%3DR%2B88fDFb%3DJSxZMFvbHVC6Mn9LJ2n%3DX%3DkS-Uw%40mail.gmail.com Given this patch has not been updated since Dmitry's message above, I'm marking this as returned with feedback. Greetings, Andres Freund