Re: Improve eviction algorithm in ReorderBuffer - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Improve eviction algorithm in ReorderBuffer
Date
Msg-id 23e03725-8c77-431d-9f53-51709260ffeb@enterprisedb.com
Whole thread Raw
In response to Re: Improve eviction algorithm in ReorderBuffer  (vignesh C <vignesh21@gmail.com>)
Responses Re: Improve eviction algorithm in ReorderBuffer
List pgsql-hackers
Hi,

I did a basic review and testing of this patch today. Overall I think
the patch is in very good shape - I agree with the tradeoffs it makes,
and I like the approach in general. I do have a couple minor comments
about the code, and then maybe a couple thoughts about the approach.


First, some comments - I'll put them here, but I also kept them in
"review" commits, because that makes it easier to show the exact place
in the code the comment is about.

1) binaryheap_allocate got a new "indexed" argument, but the comment is
not updated to document it

2) I think it's preferable to use descriptive argument names for
bh_set_node. I don't think there's a good reason to keep it short.

3) In a couple places we have code like this:

    if (heap->bh_indexed)
        bh_nodeidx_delete(heap->bh_nodeidx, result);

Maybe it'd be better to have the if condition in bh_nodeidx_delete, so
that it can be called without it.

4) Could we check the "found" flag in bh_set_node, somehow? I mean, we
either expect to find the node (update of already tracked transaction)
or not (when inserting it). The life cycle may be non-trivial (node
added, updated and removed, ...), would be useful assert I think.

5) Do we actually need the various mem_freed local variables in a couple
places, when we expect the value to be equal to txn->size (there's even
assert enforcing that)?

6) ReorderBufferCleanupTXN has a comment about maybe not using the same
threshold both to enable & disable usage of the binaryheap. I agree with
that, otherwise we could easily end up "trashing" if we add/remove
transactions right around the threshold. I think 90-95% for disabling
the heap would work fine.

7) The code disabling binaryheap (based on the threshold) is copied in a
couple places, perhaps it should be a separate function called from
those places.

8) Similarly to (3), maybe ReorderBufferTXNMemoryUpdate should do the
memory size check internally, to make the calls simpler.

9) The ReorderBufferChangeMemoryUpdate / ReorderBufferTXNMemoryUpdate
split maybe not very clear. It's not clear to me why it's divided like
this, or why we can't simply call ReorderBufferTXNMemoryUpdate directly.


performance
-----------

I did some benchmarks, to see the behavior in simple good/bad cases (see
the attached scripts.tgz). "large" is one large transaction inserting 1M
rows, small is 64k single-row inserts, and subxacts is the original case
with ~100k subxacts. Finally, subxacts-small is many transactions with
128 subxacts each (the main transactions are concurrent).

The results are pretty good, I think:

             test        master     patched
  -----------------------------------------------------
            large          2587        2459       95%
            small           956         856       89%
         subxacts        138915        2911        2%
   subxacts-small         13632       13187       97%

This is timing (ms) with logical_work_mem=4MB. I also tried with 64MB,
where the subxact timing goes way down, but the overall conclusions do
not change.

I was a bit surprised I haven't seen any clear regression, but in the
end that's a good thing, right? There's a couple results in this thread
showing ~10% regression, but I've been unable to reproduce those.
Perhaps the newer patch versions fix that, I guess.

Anyway, I think that at some point we'd have to accept that some cases
may have slight regression. I think that's inherent for almost any
heuristics - there's always going to be some rare case that defeats it.
What's important is that the case needs to be rare and/or the impact
very limited. And I think that's true here.


overall design
--------------

As for the design, I agree with the approach of using a binaryheap to
track transactions by size. When going over the thread history,
describing the initial approach with only keeping "large" transactions
above some threshold (e.g. 10%), I was really concerned that'll either
lead to abrupt changes in behavior (when transactions move just around
the 10%), or won't help with many common cases (with most transactions
being below the limit).

I was going to suggest some sort of "binning" - keeping lists for
transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and
evicting transactions from a list, i.e. based on approximate size. But
if the indexed binary heap seems to be cheap enough, I think it's a
better solution.

The one thing I'm a bit concerned about is the threshold used to start
using binary heap - these thresholds with binary decisions may easily
lead to a "cliff" and robustness issues, i.e. abrupt change in behavior
with significant runtime change (e.g. you add/remove one transaction and
the code takes a much more expensive path). The value (1024) seems
rather arbitrary, I wonder if there's something to justify that choice.

In any case, I agree it'd be good to have some dampening factor, to
reduce the risk of trashing because of adding/removing a single
transaction to the decoding.


related stuff / GenerationContext
---------------------------------

It's not the fault of this patch, but this reminds me I have some doubts
about how the eviction interferes with using the GenerationContext for
some of the data. I suspect we can easily get into a situation where we
evict the largest transaction, but that doesn't actually reduce the
memory usage at all, because the memory context blocks are shared with
some other transactions and don't get 100% empty (so we can't release
them). But it's actually worse, because GenerationContext does not even
reuse this memory. So do we even gain anything by the eviction?

When the earlier patch versions also considered age of the transaction,
to try evicting the older ones first, I think that was interesting. I
think we may want to do something like this even with the binary heap.


related stuff / increase of logical_decoding_work_mem
-----------------------------------------------------

In the thread, one of the "alternatives to spilling" suggested in the
thread was to enable streaming, but I think there's often a much more
efficient alternative - increase the amount of memory, so that we don't
actually need to spill.

For example, a system may be doing a lot of eviction / spilling with
logical_decoding_work_mem=64MB, but setting 128MB may completely
eliminate that. Of course, if there are large transactions, this may not
be possible (the GUC would have to exceed RAM). But I don't think that's
very common, the incidents that I've observed were often resolved by
bumping the logical_decoding_work_mem by a little bit.

I wonder if there's something we might do to help users to tune this. We
should be able to measure the "peak" memory usage (how much memory we'd
need to not spill), so maybe we could log that as a WARNING, similarly
to checkpoints - there we only log "checkpoints too frequent, tune WAL
limits", but perhaps we might do more here?  Or maybe we could add the
watermark to the system catalog?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

pgsql-hackers by date:

Previous
From: Давыдов Виталий
Date:
Subject: Re: Slow catchup of 2PC (twophase) transactions on replica in LR
Next
From: Давыдов Виталий
Date:
Subject: Re: Slow catchup of 2PC (twophase) transactions on replica in LR