Re: Improve eviction algorithm in ReorderBuffer - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Improve eviction algorithm in ReorderBuffer
Msg-id: 787cd90b-48ff-4c4a-a63e-a2b3f494385d@enterprisedb.com
In response to: Re: Improve eviction algorithm in ReorderBuffer (Masahiko Sawada <sawada.mshk@gmail.com>)
List: pgsql-hackers
On 2/26/24 07:46, Masahiko Sawada wrote:
> On Sat, Feb 24, 2024 at 1:29 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> ...
>>
>> overall design
>> --------------
>>
>> As for the design, I agree with the approach of using a binaryheap to
>> track transactions by size. When going over the thread history,
>> describing the initial approach with only keeping "large" transactions
>> above some threshold (e.g. 10%), I was really concerned that'll either
>> lead to abrupt changes in behavior (when transactions move just around
>> the 10%), or won't help with many common cases (with most transactions
>> being below the limit).
>>
>> I was going to suggest some sort of "binning" - keeping lists for
>> transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and
>> evicting transactions from a list, i.e. based on approximate size. But
>> if the indexed binary heap seems to be cheap enough, I think it's a
>> better solution.
>
> I've also considered the binning idea. But it was not clear to me how
> it works well in a case where all transactions belong to the
> particular class. For example, if we need to free up 1MB memory, we
> could end up evicting 2000 transactions consuming 50 bytes instead of
> 100 transactions consuming 1000 bytes, resulting in that we end up
> serializing more transactions. Also, I'm concerned about the cost of
> maintaining the binning lists.
>

I don't think the list maintenance would be very costly - in particular,
the lists would not need to be sorted by size. You're right that in some
extreme cases we might evict the smallest transactions in the list, but
I think on average we'd evict transactions of average size, which seems
OK for this use case.

Anyway, I don't think we need to be distracted by this. I mentioned it
merely to show it was considered, but the heap seems to work well
enough, and in the end is even simpler because the complexity is hidden
outside reorderbuffer.
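For readers following along, the "binning" alternative sketched above can be illustrated in a few lines of plain C. This is not PostgreSQL code and all names are hypothetical: transactions are grouped into power-of-two size classes, and eviction picks from the largest non-empty class, giving approximate largest-first eviction without a heap.

```c
#include <stddef.h>

#define NUM_CLASSES 32

/* Hypothetical sketch of the binning idea: count of txns per size class. */
typedef struct TxnBins
{
	int		counts[NUM_CLASSES];
} TxnBins;

/* Map a transaction size to its class: <1kB -> 0, 1-2kB -> 1, 2-4kB -> 2, ... */
static int
size_class(size_t size)
{
	int		cls = 0;

	while (size >= ((size_t) 1024 << cls) && cls < NUM_CLASSES - 1)
		cls++;
	return cls;
}

static void
bins_add(TxnBins *bins, size_t size)
{
	bins->counts[size_class(size)]++;
}

/* Return the class to evict from (largest non-empty), or -1 if all empty. */
static int
bins_pick_victim(const TxnBins *bins)
{
	for (int cls = NUM_CLASSES - 1; cls >= 0; cls--)
		if (bins->counts[cls] > 0)
			return cls;
	return -1;
}
```

As the discussion notes, if every transaction falls into the same class this degenerates to arbitrary eviction within that class, which is one reason the indexed binary heap won out.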
>>
>> The one thing I'm a bit concerned about is the threshold used to start
>> using binary heap - these thresholds with binary decisions may easily
>> lead to a "cliff" and robustness issues, i.e. abrupt change in behavior
>> with significant runtime change (e.g. you add/remove one transaction and
>> the code takes a much more expensive path). The value (1024) seems
>> rather arbitrary, I wonder if there's something to justify that choice.
>
> True. 1024 seems small to me. In my environment, I started to see a
> big difference from around 40000 transactions. But it varies depending
> on environments and workloads.
>
> I think that this performance problem we're addressing doesn't
> normally happen as long as all transactions being decoded are
> top-level transactions. Otherwise, we also need to improve
> ReorderBufferLargestStreamableTopTXN(). Given this fact, I think
> max_connections = 1024 is a possible value in some systems, and I've
> observed such systems sometimes. On the other hand, I've observed
> > 5000 in just a few cases, and having more than 5000 transactions in
> ReorderBuffer seems unlikely to happen without subtransactions. I
> think we can say it's an extreme case, the number is still an
> arbitrary number though.
>
> Or probably we can compute the threshold based on max_connections,
> e.g., max_connections * 10. That way, we can ensure that users won't
> incur the max-heap maintenance costs as long as they don't use
> subtransactions.
>

Tying this to max_connections seems like an interesting option. It'd
make this adaptive to the system. I haven't thought about the exact
value (m_c * 10), but it seems better than an arbitrary hard-coded
value.

>>
>> In any case, I agree it'd be good to have some dampening factor, to
>> reduce the risk of thrashing because of adding/removing a single
>> transaction to the decoding.
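The two ideas above - deriving the threshold from max_connections and adding a dampening factor - can be combined in a small sketch. This is illustrative plain C, not the patch's actual code: hysteresis (here ~10% below the threshold) is one way to implement the dampening, so that adding or removing a single transaction near the boundary does not flip the code path back and forth.

```c
#include <stdbool.h>

/* Hypothetical state: are we currently on the heap-maintaining path? */
typedef struct EvictionState
{
	bool	use_heap;
} EvictionState;

/* Proposed in the thread: scale the threshold with max_connections. */
static int
heap_threshold(int max_connections)
{
	return max_connections * 10;
}

/*
 * Decide the code path with ~10% hysteresis: switch to the heap above the
 * threshold, but only switch back once well below it.
 */
static void
update_path(EvictionState *st, int ntxns, int max_connections)
{
	int		thresh = heap_threshold(max_connections);

	if (!st->use_heap && ntxns > thresh)
		st->use_heap = true;
	else if (st->use_heap && ntxns < thresh - thresh / 10)
		st->use_heap = false;
}
```

With max_connections = 1024 the threshold is 10240, and once the heap path is entered, the transaction count must drop below 9216 before the cheap path is taken again, avoiding the "cliff" behavior described above.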
>>
>>
>> related stuff / GenerationContext
>> ---------------------------------
>>
>> It's not the fault of this patch, but this reminds me I have some doubts
>> about how the eviction interferes with using the GenerationContext for
>> some of the data. I suspect we can easily get into a situation where we
>> evict the largest transaction, but that doesn't actually reduce the
>> memory usage at all, because the memory context blocks are shared with
>> some other transactions and don't get 100% empty (so we can't release
>> them). But it's actually worse, because GenerationContext does not even
>> reuse this memory. So do we even gain anything by the eviction?
>>
>> When the earlier patch versions also considered age of the transaction,
>> to try evicting the older ones first, I think that was interesting. I
>> think we may want to do something like this even with the binary heap.
>
> Thank you for raising this issue. This is one of the highest priority
> items in my backlog. We've seen cases where the logical decoding uses
> much more memory than logical_decoding_work_mem value[1][2] (e.g. it
> used 4GB memory even though the logical_decoding_work_mem was 256kB).
> I think that the problem would still happen even with this improvement
> on the eviction.
>
> I believe these are separate problems we can address, and evicting
> large transactions first would still be the right strategy. We might
> want to improve how we store changes in memory contexts. For example,
> it might be worth having a per-transaction memory context so that we
> can actually free memory blocks by the eviction. We can discuss it in
> a separate thread.
>

Yes, I think using a per-transaction context for large transactions
might work. I don't think we want too many contexts, so we'd start with
the shared context, and then at some point (when the transaction exceeds
say 5% of the memory limit) we'd switch it to a separate one. But that's
a matter for a separate patch, so let's discuss it elsewhere.
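The switchover policy described above can be sketched as follows. This is plain C with hypothetical names, not an actual patch: a transaction's changes start in the shared context, and once its accumulated size crosses a fraction of the memory limit (the 5% figure is the one floated above, purely illustrative), the caller would migrate it to a dedicated context that can be freed wholesale on eviction.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical per-transaction bookkeeping. */
typedef struct TxnMemState
{
	size_t	size;			/* bytes accumulated by this transaction */
	bool	own_context;	/* already migrated to a dedicated context? */
} TxnMemState;

/*
 * Account for newly added change data; return true exactly once, when the
 * transaction first exceeds 5% of the limit and should be switched to its
 * own context (in PostgreSQL terms, the caller would create a separate
 * memory context for it, so eviction can free whole blocks).
 */
static bool
txn_should_switch(TxnMemState *txn, size_t added, size_t work_mem_limit)
{
	txn->size += added;
	if (!txn->own_context && txn->size > work_mem_limit / 20)	/* 5% */
	{
		txn->own_context = true;
		return true;
	}
	return false;
}
```

Keeping small transactions in the shared context limits the number of contexts, while large transactions - the ones eviction targets - get memory that can actually be released, addressing the GenerationContext fragmentation concern raised above.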
>>
>> For example, a system may be doing a lot of eviction / spilling with
>> logical_decoding_work_mem=64MB, but setting 128MB may completely
>> eliminate that. Of course, if there are large transactions, this may not
>> be possible (the GUC would have to exceed RAM). But I don't think that's
>> very common, the incidents that I've observed were often resolved by
>> bumping the logical_decoding_work_mem by a little bit.
>>
>> I wonder if there's something we might do to help users to tune this. We
>> should be able to measure the "peak" memory usage (how much memory we'd
>> need to not spill), so maybe we could log that as a WARNING, similarly
>> to checkpoints - there we only log "checkpoints too frequent, tune WAL
>> limits", but perhaps we might do more here? Or maybe we could add the
>> watermark to the system catalog?
>
> Interesting ideas.
>
> The statistics such as spill_count shown in pg_stat_replication_slots
> view could already give hints to users to increase the
> logical_decoding_work_mem. In addition to that, it's an interesting
> idea to have the high water mark in the view.
>

The spill statistics are useful, but I'm not sure they can answer the
main question: how high would the memory limit need to be to not spill?
Maybe there's something we can measure / log to help with this.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company