Re: Improve eviction algorithm in ReorderBuffer - Mailing list pgsql-hackers
From | Shubham Khanna |
---|---|
Subject | Re: Improve eviction algorithm in ReorderBuffer |
Date | |
Msg-id | CAHv8RjJbd-EVs=YJLL6=XAA0ETbK-r8Zm5subP0CAuafOMbj1A@mail.gmail.com Whole thread Raw |
In response to | Re: Improve eviction algorithm in ReorderBuffer (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: Improve eviction algorithm in ReorderBuffer
|
List | pgsql-hackers |
On Fri, Jan 26, 2024 at 2:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > > > The individual transactions shouldn't cross > > > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to > > > > > > maintain the lists: "...splitting it into two lists: transactions > > > > > > consuming 5% < and 5% >= of the memory limit, and checking the 5% >= > > > > > > list preferably.". In the previous sentence, what did you mean by > > > > > > transactions consuming 5% >= of the memory limit? I got the impression > > > > > > that you are saying to maintain them in a separate transaction list > > > > > > which doesn't seems to be the case. > > > > > > > > > > I wanted to mean that there are three lists in total: the first one > > > > > maintain the transactions consuming more than 10% of > > > > > logical_decoding_work_mem, > > > > > > > > > > > > > How can we have multiple transactions in the list consuming more than > > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization > > > > before any xact reaches logical_decoding_work_mem? > > > > > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions > > > consuming more than 6.4MB are added to the list. So for example, it's > > > possible that the list has three transactions each of which are > > > consuming 10MB while the total memory usage in the reorderbuffer is > > > still 30MB (less than logical_decoding_work_mem). > > > > > > > Thanks for the clarification. I misunderstood the list to have > > transactions greater than 70.4 MB (64 + 6.4) in your example. But one > > thing to note is that maintaining these lists by default can also have > > some overhead unless the list of open transactions crosses a certain > > threshold. > > > > On further analysis, I realized that the approach discussed here might > not be the way to go. The idea of dividing transactions into several > subgroups is to divide a large number of entries into multiple > sub-groups so we can reduce the complexity to search for the > particular entry. Since we assume that there are no big differences in > entries' sizes within a sub-group, we can pick the entry to evict in > O(1). However, what we really need to avoid here is that we end up > increasing the number of times to evict entries because serializing an > entry to the disk is more costly than searching an entry on memory in > general. > > I think that it's no problem in a large-entries subgroup but when it > comes to the smallest-entries subgroup, like for entries consuming > less than 5% of the limit, it could end up evicting many entries. For > example, there would be a huge difference between serializing 1 entry > consuming 5% of the memory limit and serializing 5000 entries > consuming 0.001% of the memory limit. Even if we can select 5000 > entries quickly, I think the latter would be slower in total. The more > subgroups we create, the more the algorithm gets complex and the > overheads could cause. So I think we need to search for the largest > entry in order to minimize the number of evictions anyway. > > Looking for data structures and algorithms, I think binaryheap with > some improvements could be promising. I mentioned before why we cannot > use the current binaryheap[1]. The missing pieces are efficient ways > to remove the arbitrary entry and to update the arbitrary entry's key. > The current binaryheap provides binaryheap_remove_node(), which is > O(log n), but it requires the entry's position in the binaryheap. We > can know the entry's position just after binaryheap_add_unordered() > but it might be changed after heapify. Searching the node's position > is O(n). So the improvement idea is to add a hash table to the > binaryheap so that it can track the positions for each entry so that > we can remove the arbitrary entry in O(log n) and also update the > arbitrary entry's key in O(log n). This is known as the indexed > priority queue. I've attached the patch for that (0001 and 0002). > > That way, in terms of reorderbuffer, we can update and remove the > transaction's memory usage in O(log n) (in worst case and O(1) in > average) and then pick the largest transaction in O(1). Since we might > need to call ReorderBufferSerializeTXN() even in non-streaming case, > we need to maintain the binaryheap anyway. I've attached the patch for > that (0003). > > Here are test script for many sub-transactions case: > > create table test (c int); > create or replace function testfn (cnt int) returns void as $$ > begin > for i in 1..cnt loop > begin > insert into test values (i); > exception when division_by_zero then > raise notice 'caught error'; > return; > end; > end loop; > end; > $$ > language plpgsql; > select pg_create_logical_replication_slot('s', 'test_decoding'); > select testfn(50000); > set logical_decoding_work_mem to '4MB'; > select count(*) from pg_logical_slot_peek_changes('s', null, null)"; > > and here are results: > > * HEAD: 16877.281 ms > * HEAD w/ patches (0001 and 0002): 655.154 ms > > There is huge improvement in a many-subtransactions case. I have run the same test and found around 12.53x improvement(the median of five executions): HEAD | HEAD+ v2-0001+ v2-0002 + v2-0003 patch 29197ms | 2329ms I had also run the regression test that you had shared at [1], there was a very very slight dip in this case around it takes around 0.31x more time: HEAD | HEAD + v2-0001+ v2-0002 + v2-0003 patch 4459ms | 4473ms The machine has Total Memory of 755.536 GB, 120 CPUs and RHEL 7 Operating System. Also find the detailed info of the performance machine attached. [1] - https://www.postgresql.org/message-id/CAD21AoB-7mPpKnLmBNfzfavG8AiTwEgAdVMuv%3DjzmAp9ex7eyQ%40mail.gmail.com Thanks and Regards, Shubham Khanna.
Attachment
pgsql-hackers by date: