Re: Improve eviction algorithm in ReorderBuffer - Mailing list pgsql-hackers
From | Masahiko Sawada
---|---
Subject | Re: Improve eviction algorithm in ReorderBuffer
Date |
Msg-id | CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg@mail.gmail.com
In response to | Re: Improve eviction algorithm in ReorderBuffer (Amit Kapila <amit.kapila16@gmail.com>)
Responses | Re: Improve eviction algorithm in ReorderBuffer; Re: Improve eviction algorithm in ReorderBuffer
List | pgsql-hackers
On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > The individual transactions shouldn't cross
> > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to
> > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
> > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > transactions consuming 5% >= of the memory limit? I got the impression
> > > > > that you are saying to maintain them in a separate transaction list
> > > > > which doesn't seems to be the case.
> > > >
> > > > I wanted to mean that there are three lists in total: the first one
> > > > maintain the transactions consuming more than 10% of
> > > > logical_decoding_work_mem,
> > > >
> > >
> > > How can we have multiple transactions in the list consuming more than
> > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > before any xact reaches logical_decoding_work_mem?
> >
> > Well, suppose logical_decoding_work_mem is set to 64MB, transactions
> > consuming more than 6.4MB are added to the list. So for example, it's
> > possible that the list has three transactions each of which are
> > consuming 10MB while the total memory usage in the reorderbuffer is
> > still 30MB (less than logical_decoding_work_mem).
> >
>
> Thanks for the clarification. I misunderstood the list to have
> transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> thing to note is that maintaining these lists by default can also have
> some overhead unless the list of open transactions crosses a certain
> threshold.
>

On further analysis, I realized that the approach discussed here might not be the way to go. The point of dividing transactions into several sub-groups is to split a large number of entries into smaller sets so that we can reduce the complexity of searching for a particular entry: since we assume there are no big differences in entry sizes within a sub-group, we can pick an entry to evict in O(1). However, what we really need to avoid is increasing the number of evictions, because serializing an entry to disk is generally more costly than searching for an entry in memory. That is not a problem for a sub-group of large entries, but for the sub-group of the smallest entries, say entries consuming less than 5% of the limit, we could end up evicting many entries. For example, there would be a huge difference between serializing one entry consuming 5% of the memory limit and serializing 5000 entries each consuming 0.001% of it: even if we can select those 5000 entries quickly, the latter would be slower in total. And the more sub-groups we create, the more complex the algorithm becomes and the more overhead it adds. So I think we need to search for the largest entry anyway, in order to minimize the number of evictions.

Looking at data structures and algorithms, I think a binaryheap with some improvements could be promising. I mentioned before why we cannot use the current binaryheap[1].
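To make the direction concrete, here is a rough standalone sketch of an indexed max-heap. It is only an illustration, not the attached patches and not PostgreSQL's binaryheap API: HeapEntry, update_size(), and the fixed-size pos[] array are made-up names, and the patches use a hash table rather than an array to track positions. The gaps in the current binaryheap that the patches fill are described below.

/*
 * A max-heap of (id, size) entries plus a pos[] map from id to the
 * entry's current slot in the heap array.  Because pos[] is kept in
 * sync while sifting, an arbitrary entry's key can be updated in
 * O(log n) and the largest entry is always at slot 0.
 */
#include <stdio.h>

#define MAX_ENTRIES 1024

typedef struct HeapEntry
{
    int     id;     /* stands in for a transaction */
    size_t  size;   /* stands in for txn->total_size */
} HeapEntry;

static HeapEntry heap[MAX_ENTRIES];
static int pos[MAX_ENTRIES];    /* id -> slot in heap[], or -1 if absent */
static int nentries = 0;

/* swap two heap slots and keep the position map in sync */
static void
swap_slots(int a, int b)
{
    HeapEntry tmp = heap[a];
    heap[a] = heap[b];
    heap[b] = tmp;
    pos[heap[a].id] = a;
    pos[heap[b].id] = b;
}

static void
sift_up(int i)
{
    while (i > 0 && heap[(i - 1) / 2].size < heap[i].size)
    {
        swap_slots(i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

static void
sift_down(int i)
{
    for (;;)
    {
        int largest = i;
        int l = 2 * i + 1;
        int r = 2 * i + 2;

        if (l < nentries && heap[l].size > heap[largest].size)
            largest = l;
        if (r < nentries && heap[r].size > heap[largest].size)
            largest = r;
        if (largest == i)
            break;
        swap_slots(i, largest);
        i = largest;
    }
}

/* Insert "id" or update its key, in O(log n). */
static void
update_size(int id, size_t newsize)
{
    if (pos[id] == -1)
    {
        int i = nentries++;
        heap[i].id = id;
        heap[i].size = newsize;
        pos[id] = i;
        sift_up(i);
    }
    else
    {
        int i = pos[id];
        size_t oldsize = heap[i].size;

        heap[i].size = newsize;
        if (newsize > oldsize)
            sift_up(i);
        else
            sift_down(i);
    }
}

int
main(void)
{
    for (int i = 0; i < MAX_ENTRIES; i++)
        pos[i] = -1;

    update_size(1, 100);
    update_size(2, 4000);
    update_size(3, 250);
    update_size(2, 50);     /* txn 2 shrinks, e.g. after spilling */
    printf("largest: id=%d size=%zu\n", heap[0].id, heap[0].size);
    return 0;
}

In the reorderbuffer case, the id would stand for a transaction and the size for its accumulated memory usage, so picking the eviction victim is just a read of slot 0. Removing an arbitrary entry would work the same way: move the last slot into the removed position, fix pos[], and sift.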
The missing pieces in the current binaryheap are efficient ways to remove an arbitrary entry and to update an arbitrary entry's key. binaryheap_remove_node() is O(log n), but it requires the entry's position in the heap. We know the position just after binaryheap_add_unordered(), but it can change after heapify, and searching for a node's position is O(n). So the improvement is to add a hash table to the binaryheap that tracks the position of each entry, so that we can remove an arbitrary entry in O(log n) and also update an arbitrary entry's key in O(log n). This is known as an indexed priority queue. I've attached patches for that (0001 and 0002).

That way, in terms of the reorderbuffer, we can update and remove a transaction's memory usage in O(log n) (worst case; O(1) on average) and then pick the largest transaction in O(1). Since we might need to call ReorderBufferSerializeTXN() even in the non-streaming case, we need to maintain the binaryheap anyway. I've attached a patch for that (0003).

Here is a test script for the many-subtransactions case:

create table test (c int);
create or replace function testfn (cnt int) returns void as $$
begin
  for i in 1..cnt loop
    begin
      insert into test values (i);
    exception when division_by_zero then
      raise notice 'caught error';
      return;
    end;
  end loop;
end;
$$ language plpgsql;
select pg_create_logical_replication_slot('s', 'test_decoding');
select testfn(50000);
set logical_decoding_work_mem to '4MB';
select count(*) from pg_logical_slot_peek_changes('s', null, null);

and here are the results:

* HEAD: 16877.281 ms
* HEAD w/ patches (0001 and 0002): 655.154 ms

There is a huge improvement in the many-subtransactions case.

Finally, note that memory counter updates can happen frequently, since we update the counter for every change. So even though each binaryheap update is O(log n), it could add significant overhead if it happens very often. One idea is to batch the memory counter updates where possible (a rough sketch of that idea follows at the end of this message). I've attached a patch for that (0004). I'll benchmark the overhead for normal cases.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
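As a rough illustration of the batching idea mentioned above (a sketch only, not necessarily what the 0004 patch does; TxnAccounting, MEM_UPDATE_THRESHOLD, flush_mem_delta(), and account_change() are made-up names): accumulate a per-transaction delta and fold it into the heap only once it grows past a threshold.

#include <stddef.h>

/* Stands in for the O(log n) indexed-heap update from the earlier sketch. */
static void
update_size(int id, size_t newsize)
{
    (void) id;
    (void) newsize;
}

/* Hypothetical threshold: only deltas at least this large touch the heap. */
#define MEM_UPDATE_THRESHOLD (64 * 1024)

typedef struct TxnAccounting
{
    int     id;             /* stands in for the transaction */
    long    heap_size;      /* memory usage currently reflected in the heap */
    long    pending_delta;  /* accumulated changes not yet pushed to the heap */
} TxnAccounting;

/* Fold the pending delta into the heap. */
static void
flush_mem_delta(TxnAccounting *txn)
{
    if (txn->pending_delta == 0)
        return;
    txn->heap_size += txn->pending_delta;
    txn->pending_delta = 0;
    update_size(txn->id, (size_t) txn->heap_size);
}

/* Called for every change; the heap is touched only occasionally. */
static void
account_change(TxnAccounting *txn, long delta)
{
    txn->pending_delta += delta;
    if (txn->pending_delta >= MEM_UPDATE_THRESHOLD ||
        txn->pending_delta <= -MEM_UPDATE_THRESHOLD)
        flush_mem_delta(txn);
}

int
main(void)
{
    TxnAccounting txn = {.id = 1, .heap_size = 0, .pending_delta = 0};

    account_change(&txn, 100);          /* small change: heap not touched */
    account_change(&txn, 128 * 1024);   /* large change: delta flushed */
    return 0;
}

An open question with this kind of batching is when the pending deltas must be flushed (for example, before picking the largest transaction) so that the heap ordering does not get too stale; that is part of what the benchmark of normal cases can help evaluate.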