Re: Improve eviction algorithm in ReorderBuffer - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Improve eviction algorithm in ReorderBuffer
Msg-id: 23e03725-8c77-431d-9f53-51709260ffeb@enterprisedb.com
In response to: Re: Improve eviction algorithm in ReorderBuffer (vignesh C <vignesh21@gmail.com>)
Responses: Re: Improve eviction algorithm in ReorderBuffer
List: pgsql-hackers
Hi,

I did a basic review and testing of this patch today. Overall I think the patch is in very good shape - I agree with the tradeoffs it makes, and I like the approach in general. I do have a couple minor comments about the code, and then maybe a couple thoughts about the approach.

First, some comments - I'll put them here, but I also kept them in "review" commits, because that makes it easier to show the exact place in the code the comment is about.

1) binaryheap_allocate got a new "indexed" argument, but the comment is not updated to document it.

2) I think it's preferable to use descriptive argument names for bh_set_node. I don't think there's a good reason to keep them short.

3) In a couple places we have code like this:

    if (heap->bh_indexed)
        bh_nodeidx_delete(heap->bh_nodeidx, result);

Maybe it'd be better to move the if condition into bh_nodeidx_delete, so that it can be called unconditionally.

4) Could we check the "found" flag in bh_set_node, somehow? I mean, we either expect to find the node (update of an already tracked transaction) or not (when inserting it). The life cycle may be non-trivial (node added, updated and removed, ...), so it would be a useful assert, I think.

5) Do we actually need the various mem_freed local variables in a couple places, when we expect the value to be equal to txn->size (there's even an assert enforcing that)?

6) ReorderBufferCleanupTXN has a comment about maybe not using the same threshold both to enable & disable usage of the binaryheap. I agree with that, otherwise we could easily end up "thrashing" if we add/remove transactions right around the threshold. I think 90-95% of the enable threshold would work fine for disabling the heap.

7) The code disabling the binaryheap (based on the threshold) is copied in a couple places, perhaps it should be a separate function called from those places.

8) Similarly to (3), maybe ReorderBufferTXNMemoryUpdate should do the memory size check internally, to make the calls simpler.
9) The ReorderBufferChangeMemoryUpdate / ReorderBufferTXNMemoryUpdate split is maybe not very clear. It's not clear to me why it's divided like this, or why we can't simply call ReorderBufferTXNMemoryUpdate directly.


performance
-----------

I did some benchmarks, to see the behavior in simple good/bad cases (see the attached scripts.tgz). "large" is one large transaction inserting 1M rows, "small" is 64k single-row inserts, and "subxacts" is the original case with ~100k subxacts. Finally, "subxacts-small" is many transactions with 128 subxacts each (the main transactions are concurrent).

The results are pretty good, I think:

    test             master   patched
    ----------------------------------------
    large              2587      2459    95%
    small               956       856    89%
    subxacts         138915      2911     2%
    subxacts-small    13632     13187    97%

This is timing (ms) with logical_decoding_work_mem=4MB. I also tried with 64MB, where the subxact timing goes way down, but the overall conclusions do not change.

I was a bit surprised I haven't seen any clear regression, but in the end that's a good thing, right? There are a couple results in this thread showing ~10% regression, but I've been unable to reproduce those. Perhaps the newer patch versions fix that, I guess.

Anyway, I think that at some point we'd have to accept that some cases may see a slight regression. I think that's inherent for almost any heuristic - there's always going to be some rare case that defeats it. What's important is that the case needs to be rare and/or the impact very limited. And I think that's true here.


overall design
--------------

As for the design, I agree with the approach of using a binaryheap to track transactions by size. When going over the thread history and the initial approach of only keeping "large" transactions above some threshold (e.g. 10%), I was really concerned that it would either lead to abrupt changes in behavior (when transactions move just around the 10%), or wouldn't help with many common cases (with most transactions being below the limit).
I was going to suggest some sort of "binning" - keeping lists for transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and evicting transactions from a list, i.e. based on approximate size. But if the indexed binary heap is cheap enough, I think it's a better solution.

The one thing I'm a bit concerned about is the threshold used to start using the binary heap - thresholds with binary decisions may easily lead to a "cliff" and robustness issues, i.e. an abrupt change in behavior with a significant runtime change (e.g. you add/remove one transaction and the code takes a much more expensive path). The value (1024) seems rather arbitrary, I wonder if there's something to justify that choice.

In any case, I agree it'd be good to have some dampening factor, to reduce the risk of thrashing because of adding/removing a single transaction to the decoding.


related stuff / GenerationContext
---------------------------------

It's not the fault of this patch, but this reminds me I have some doubts about how the eviction interferes with using the GenerationContext for some of the data. I suspect we can easily get into a situation where we evict the largest transaction, but that doesn't actually reduce the memory usage at all, because the memory context blocks are shared with some other transactions and don't get 100% empty (so we can't release them). But it's actually worse, because GenerationContext does not even reuse this memory. So do we even gain anything by the eviction?

When the earlier patch versions also considered the age of the transaction, to try evicting the older ones first, I think that was interesting. I think we may want to do something like this even with the binary heap.
related stuff / increase of logical_decoding_work_mem
-----------------------------------------------------

One of the "alternatives to spilling" suggested in the thread was to enable streaming, but I think there's often a much more efficient alternative - increase the amount of memory, so that we don't actually need to spill. For example, a system may be doing a lot of eviction / spilling with logical_decoding_work_mem=64MB, but setting it to 128MB may completely eliminate that. Of course, if there are large transactions, this may not be possible (the GUC would have to exceed RAM). But I don't think that's very common - the incidents I've observed were often resolved by bumping logical_decoding_work_mem up a little bit.

I wonder if there's something we might do to help users tune this. We should be able to measure the "peak" memory usage (how much memory we'd need to not spill), so maybe we could log that as a WARNING, similarly to checkpoints - there we only log "checkpoints too frequent, tune WAL limits", but perhaps we might do more here? Or maybe we could add the watermark to a system catalog?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company