Re: Improve eviction algorithm in ReorderBuffer - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Improve eviction algorithm in ReorderBuffer
Date
Msg-id 787cd90b-48ff-4c4a-a63e-a2b3f494385d@enterprisedb.com
Whole thread Raw
In response to Re: Improve eviction algorithm in ReorderBuffer  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Improve eviction algorithm in ReorderBuffer
List pgsql-hackers

On 2/26/24 07:46, Masahiko Sawada wrote:
> On Sat, Feb 24, 2024 at 1:29 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>...
>>
>> overall design
>> --------------
>>
>> As for the design, I agree with the approach of using a binaryheap to
>> track transactions by size. When going over the thread history,
>> describing the initial approach with only keeping "large" transactions
>> above some threshold (e.g. 10%), I was really concerned that'll either
>> lead to abrupt changes in behavior (when transactions move just around
>> the 10%), or won't help with many common cases (with most transactions
>> being below the limit).
>>
>> I was going to suggest some sort of "binning" - keeping lists for
>> transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and
>> evicting transactions from a list, i.e. based on approximate size. But
>> if the indexed binary heap seems to be cheap enough, I think it's a
>> better solution.
> 
> I've also considered the binning idea. But it was not clear to me how
> it works well in a case where all transactions belong to the
> particular class. For example, if we need to free up 1MB memory, we
> could end up evicting 2000 transactions consuming 50 bytes instead of
> 100 transactions consuming 1000 bytes, resulting in that we end up
> serializing more transactions. Also, I'm concerned about the cost of
> maintaining the binning lists.
> 

I don't think the list maintenance would be very costly - in particular,
the lists would not need to be sorted by size. You're right in some
extreme cases we might evict the smallest transactions in the list. I
think on average we'd evict transactions with average size, which seems
OK for this use case.

Anyway, I don't think we need to be distracted with this. I mentioned it
merely to show it was considered, but the heap seems to work well
enough, and in the end is even simpler because the complexity is hidden
outside reorderbuffer.

>>
>> The one thing I'm a bit concerned about is the threshold used to start
>> using binary heap - these thresholds with binary decisions may easily
>> lead to a "cliff" and robustness issues, i.e. abrupt change in behavior
>> with significant runtime change (e.g. you add/remove one transaction and
>> the code takes a much more expensive path). The value (1024) seems
>> rather arbitrary, I wonder if there's something to justify that choice.
> 
> True. 1024 seems small to me. In my environment, I started to see a
> big difference from around 40000 transactions. But it varies depending
> on environments and workloads.
> 
> I think that this performance problem we're addressing doesn't
> normally happen as long as all transactions being decoded are
> top-level transactions. Otherwise, we also need to improve
> ReorderBufferLargestStreamableTopTXN(). Given this fact, I think
> max_connections = 1024 is a possible value in some systems, and I've
> observed such systems sometimes. On the other hand, I've observed >
> 5000 in just a few cases, and having more than 5000 transactions in
> ReorderBuffer seems unlikely to happen without subtransactions. I
> think we can say it's an extreme case, the number is still an
> arbitrary number though.
> 
> Or probably we can compute the threshold based on max_connections,
> e.g., max_connections * 10. That way, we can ensure that users won't
> incur the max-heap maintenance costs as long as they don't use
> subtransactions.
> 

Tying this to max_connections seems like an interesting option. It'd
make this adaptive to a system. I haven't thought about the exact value
(m_c * 10), but it seems better than arbitrary hard-coded values.

>>
>> In any case, I agree it'd be good to have some dampening factor, to
>> reduce the risk of trashing because of adding/removing a single
>> transaction to the decoding.
>>
>>
>> related stuff / GenerationContext
>> ---------------------------------
>>
>> It's not the fault of this patch, but this reminds me I have some doubts
>> about how the eviction interferes with using the GenerationContext for
>> some of the data. I suspect we can easily get into a situation where we
>> evict the largest transaction, but that doesn't actually reduce the
>> memory usage at all, because the memory context blocks are shared with
>> some other transactions and don't get 100% empty (so we can't release
>> them). But it's actually worse, because GenerationContext does not even
>> reuse this memory. So do we even gain anything by the eviction?
>>
>> When the earlier patch versions also considered age of the transaction,
>> to try evicting the older ones first, I think that was interesting. I
>> think we may want to do something like this even with the binary heap.
> 
> Thank you for raising this issue. This is one of the highest priority
> items in my backlog. We've seen cases where the logical decoding uses
> much more memory than logical_decoding_work_mem value[1][2] (e.g. it
> used 4GB memory even though the logical_decoding_work_mem was 256kB).
> I think that the problem would still happen even with this improvement
> on the eviction.
> 
> I believe these are separate problems we can address, and evicting
> large transactions first would still be the right strategy. We might
> want to improve how we store changes in memory contexts. For example,
> it might be worth having per-transaction memory context so that we can
> actually free memory blocks by the eviction. We can discuss it in a
> separate thread.
> 

Yes, I think using per-transaction context for large transactions might
work. I don't think we want too many contexts, so we'd start with the
shared context, and then at some point (when the transaction exceeds say
5% of the memory limit) we'd switch it to a separate one.

But that's a matter for a separate patch, so let's discuss elsewhere.

>>
>> For example, a system may be doing a lot of eviction / spilling with
>> logical_decoding_work_mem=64MB, but setting 128MB may completely
>> eliminate that. Of course, if there are large transactions, this may not
>> be possible (the GUC would have to exceed RAM). But I don't think that's
>> very common, the incidents that I've observed were often resolved by
>> bumping the logical_decoding_work_mem by a little bit.
>>
>> I wonder if there's something we might do to help users to tune this. We
>> should be able to measure the "peak" memory usage (how much memory we'd
>> need to not spill), so maybe we could log that as a WARNING, similarly
>> to checkpoints - there we only log "checkpoints too frequent, tune WAL
>> limits", but perhaps we might do more here?  Or maybe we could add the
>> watermark to the system catalog?
> 
> Interesting ideas.
> 
> The statistics such as spill_count shown  in pg_stat_replication_slots
> view could already give hints to users to increase the
> logical_decoding_work_mem. In addition to that, it's an interesting
> idea to have the high water mark in the view.
> 

The spill statistics are useful, but I'm not sure it can answer the main
question:

    How high would the memory limit need to be to not spill?

Maybe there's something we can measure / log to help with this.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Josef Šimánek
Date:
Subject: Re: [PATCH] Add --syntax to postgres for SQL syntax checking
Next
From: Richard Guo
Date:
Subject: Re: "type with xxxx does not exist" when doing ExecMemoize()