Re: Improve eviction algorithm in ReorderBuffer - Mailing list pgsql-hackers
From: vignesh C
Subject: Re: Improve eviction algorithm in ReorderBuffer
Msg-id: CALDaNm107ed2XWz+2uAKd8prExKZEn1-hWa3n_P4ZCZvC7zs+A@mail.gmail.com
In response to: Re: Improve eviction algorithm in ReorderBuffer (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses: Re: Improve eviction algorithm in ReorderBuffer
List: pgsql-hackers
On Tue, 30 Jan 2024 at 13:37, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jan 26, 2024 at 5:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > > >
> > > > > > > The individual transactions shouldn't cross
> > > > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to
> > > > > > > maintain the lists: "...splitting it into two lists: transactions
> > > > > > > consuming 5% < and 5% >= of the memory limit, and checking the 5% >=
> > > > > > > list preferably.". In the previous sentence, what did you mean by
> > > > > > > transactions consuming 5% >= of the memory limit? I got the impression
> > > > > > > that you are saying to maintain them in a separate transaction list
> > > > > > > which doesn't seems to be the case.
> > > > > >
> > > > > > I wanted to mean that there are three lists in total: the first one
> > > > > > maintain the transactions consuming more than 10% of
> > > > > > logical_decoding_work_mem,
> > > > > >
> > > > >
> > > > > How can we have multiple transactions in the list consuming more than
> > > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization
> > > > > before any xact reaches logical_decoding_work_mem?
> > > >
> > > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions
> > > > consuming more than 6.4MB are added to the list.
> > > > So for example, it's
> > > > possible that the list has three transactions each of which are
> > > > consuming 10MB while the total memory usage in the reorderbuffer is
> > > > still 30MB (less than logical_decoding_work_mem).
> > > >
> > >
> > > Thanks for the clarification. I misunderstood the list to have
> > > transactions greater than 70.4 MB (64 + 6.4) in your example. But one
> > > thing to note is that maintaining these lists by default can also have
> > > some overhead unless the list of open transactions crosses a certain
> > > threshold.
> > >
> >
> > On further analysis, I realized that the approach discussed here might
> > not be the way to go. The idea of dividing transactions into several
> > subgroups is to divide a large number of entries into multiple
> > sub-groups so we can reduce the complexity to search for the
> > particular entry. Since we assume that there are no big differences in
> > entries' sizes within a sub-group, we can pick the entry to evict in
> > O(1). However, what we really need to avoid here is that we end up
> > increasing the number of times to evict entries because serializing an
> > entry to the disk is more costly than searching an entry on memory in
> > general.
> >
> > I think that it's no problem in a large-entries subgroup but when it
> > comes to the smallest-entries subgroup, like for entries consuming
> > less than 5% of the limit, it could end up evicting many entries. For
> > example, there would be a huge difference between serializing 1 entry
> > consuming 5% of the memory limit and serializing 5000 entries
> > consuming 0.001% of the memory limit. Even if we can select 5000
> > entries quickly, I think the latter would be slower in total. The more
> > subgroups we create, the more the algorithm gets complex and the
> > overheads could cause. So I think we need to search for the largest
> > entry in order to minimize the number of evictions anyway.
> >
> > Looking for data structures and algorithms, I think binaryheap with
> > some improvements could be promising. I mentioned before why we cannot
> > use the current binaryheap[1]. The missing pieces are efficient ways
> > to remove the arbitrary entry and to update the arbitrary entry's key.
> > The current binaryheap provides binaryheap_remove_node(), which is
> > O(log n), but it requires the entry's position in the binaryheap. We
> > can know the entry's position just after binaryheap_add_unordered()
> > but it might be changed after heapify. Searching the node's position
> > is O(n). So the improvement idea is to add a hash table to the
> > binaryheap so that it can track the positions for each entry so that
> > we can remove the arbitrary entry in O(log n) and also update the
> > arbitrary entry's key in O(log n). This is known as the indexed
> > priority queue. I've attached the patch for that (0001 and 0002).
> >
> > That way, in terms of reorderbuffer, we can update and remove the
> > transaction's memory usage in O(log n) (in worst case and O(1) in
> > average) and then pick the largest transaction in O(1). Since we might
> > need to call ReorderBufferSerializeTXN() even in non-streaming case,
> > we need to maintain the binaryheap anyway.
>
> Since if the number of transactions being decoded is small, updating
> max-heap for each memory counter update could lead to some
> regressions, I've measured it with the case where updating memory
> counter happens frequently:
>
> setup script:
> create table test (c int);
> select pg_create_logical_replication_slot('s', 'test_decoding');
> insert into test select generate_series(1, 8000000);
>
> benchmark script:
> set work_mem to '3GB';
> set logical_decoding_work_mem to '5GB';
> select count(*) from pg_logical_slot_peek_changes('s', null, null);
>
> Here are results (the median of five executions):
>
> * HEAD
> 5274.765 ms
>
> * HEAD + 0001-0003 patch
> 5532.203 ms
>
> There were approximately 5% performance regressions.
>
> An improvement idea is that we use two strategies for updating
> max-heap depending on the number of transactions. That is, if the
> number of transactions being decoded is small, we add a transaction to
> max-heap by binaryheap_add_unordered(), which is O(1), and heapify it
> just before picking the largest transactions, which is O(n). That way,
> we can minimize the overhead of updating the memory counter. Once the
> number of transactions being decoded exceeds the threshold, say 1024,
> we use another strategy. We call binaryheap_update_up/down() when
> updating the memory counter to preserve heap property, which is O(log
> n), and pick the largest transaction in O(1). This strategy minimizes
> the cost of picking the largest transactions instead of paying some
> costs to update the memory counters.
>
> I've experimented with this idea and run the same tests:
>
> * HEAD + new patches (0001 - 0003)
> 5277.524 ms
>
> The number looks good. I've attached these patches. Feedback is very welcome.

Few comments:
1) Here we are changing memtrack_state to REORDER_BUFFER_MEM_TRACK_NORMAL
immediately once the size is less than REORDE_BUFFER_MEM_TRACK_THRESHOLD.
In this scenario we will end up rebuilding the heap many times if
transactions keep getting added and removed. How about waiting until
txn_heap falls below 95% of REORDE_BUFFER_MEM_TRACK_THRESHOLD before
switching back, to avoid rebuilding the heap repeatedly in this scenario?

+	{
+		Assert(rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP);
+
+		/*
+		 * If the number of transactions gets lowered than the threshold, switch
+		 * to the state where we heapify the max-heap right before picking the
+		 * largest transaction while doing nothing for memory counter update.
+		 */
+		if (binaryheap_size(rb->txn_heap) < REORDE_BUFFER_MEM_TRACK_THRESHOLD)
+			rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NORMAL;
	}

2) I felt the init variable is not needed; we can directly check
txn->size instead, like it is done in the else case:

+	bool init = (txn->size == 0);
+
	txn->size += sz;
	rb->size += sz;

	/* Update the total size in the top transaction. */
	toptxn->total_size += sz;
+
+	/* Update the transaction in the max-heap */
+	if (init)
+	{
+		/* Add the transaction to the max-heap */
+		if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NORMAL)
+			binaryheap_add_unordered(rb->txn_heap, PointerGetDatum(txn));
+		else
+			binaryheap_add(rb->txn_heap, PointerGetDatum(txn));
+	}
+	else if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
+	{
+		/*
+		 * If we're maintaining max-heap even while updating the memory counter,
+		 * we reflect the updates to the max-heap.
+		 */
+		binaryheap_update_up(rb->txn_heap, PointerGetDatum(txn));
+	}

3) We could add some comments for this:

+typedef enum ReorderBufferMemTrackState
+{
+	REORDER_BUFFER_MEM_TRACK_NORMAL,
+	REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+

4) This should be added to typedefs.list:

+typedef enum ReorderBufferMemTrackState
+{
+	REORDER_BUFFER_MEM_TRACK_NORMAL,
+	REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+

5) A few typos:

5.a) enlareable should be enlargeable:
[PATCH v2 1/4] Make binaryheap enlareable.

5.b) subtranasctions should be subtransactions:
On the other hand, when the number of transactions being decoded is
fairly large, such as when a transaction has many subtranasctions,

5.c) evaludate should be evaluate:
XXX: updating the transaction's memory counter and the max-heap is now
O(log n), so we need to evaludate it. If there are some regression, we

5.d) pudate should be update:
It could be too expensive to pudate max-heap while preserving the heap
property each time the transaction's memory counter is updated, as it

5.e) frquently should be frequently:
could happen very frquently. So when the number of transactions being
decoded is small, we add the transactions to max-heap but don't

6) This should be added to typedefs.list:

+/*
+ * Struct for A hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type key;
+	char		status;
+	int			idx;
+} bh_nodeidx_entry;

Regards,
Vignesh