Re: Using per-transaction memory contexts for storing decoded tuples - Mailing list pgsql-hackers
From: torikoshia
Subject: Re: Using per-transaction memory contexts for storing decoded tuples
Date:
Msg-id: 162ff6202f3c12b5769f59c3eb2bc0ca@oss.nttdata.com
List: pgsql-hackers
On 2024-09-12 07:32, Masahiko Sawada wrote:

Thanks a lot for working on this!

> Hi all,
>
> We have several reports that logical decoding uses much more memory
> than logical_decoding_work_mem [1][2][3]. For instance, in one of the
> reports [1], even though users set logical_decoding_work_mem to
> '256MB', a walsender process was killed by OOM because it was using
> more than 4GB of memory.
>
> As per the discussion in these threads so far, what happened was that
> there was huge memory fragmentation in rb->tup_context.
> rb->tup_context uses a GenerationContext with 8MB memory blocks. We
> cannot free a memory block until all memory chunks in the block are
> freed. If there is a long-running transaction making changes, its
> changes could be spread across many memory blocks, and we end up not
> being able to free those blocks unless the long-running transaction
> is evicted or completed. Since we don't account for fragmentation,
> block header size, or chunk header size in per-transaction memory
> usage (i.e. txn->size), rb->size could be less than
> logical_decoding_work_mem while the actual allocated memory in the
> context is much higher than logical_decoding_work_mem.
>
> We can reproduce this issue with the attached patch
> rb_excessive_memory_reproducer.patch (not intended for commit), which
> adds memory usage reporting and includes a test. After running the
> TAP test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
> usage report in the server logs like the following:
>
> LOG: reorderbuffer memory: logical_decoding_work_mem=268435456
> rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264
>
> This means that logical decoding allocated 1GB of memory in spite of
> logical_decoding_work_mem being 256MB.
>
> One idea to deal with this problem is to use
> MemoryContextMemAllocated() as the reorderbuffer's memory usage
> instead of txn->total_size.
> That is, we evict transactions until the value returned by
> MemoryContextMemAllocated() gets lower than
> logical_decoding_work_mem. However, this could end up evicting a lot
> of (possibly all) transactions, because the transaction whose decoded
> tuple data are spread across memory context blocks could be evicted
> only after all other, larger transactions are evicted (note that we
> evict transactions from largest to smallest). Therefore, if we want to
> do that, we would need to change the eviction algorithm, for example
> to LSN order instead of transaction size order. Furthermore, the
> reorderbuffer's changes that are counted in txn->size (and therefore
> rb->size too) are stored in different memory contexts depending on
> their kind. For example, decoded tuples are stored in rb->context,
> ReorderBufferChange entries are stored in rb->change_context, and
> snapshot data are stored in builder->context. So we would need to sort
> out which data is stored in which memory context and which memory
> contexts should be accounted for in the reorderbuffer's memory usage,
> which could be a large amount of work.
>
> Another idea is to have a memory context for storing decoded tuples
> per transaction. That way, we can ensure that all memory blocks of the
> context are freed when the transaction is evicted or completed. I
> think that this idea would be simpler and worth considering, so I
> attached the PoC patch, use_tup_context_per_ctx_v1.patch. Since the
> decoded tuple data is created individually when the corresponding WAL
> record is decoded but is released collectively (e.g., at transaction
> eviction), the bump memory context would fit this case best. One
> exception is that we immediately free the decoded tuple data if the
> transaction is already aborted. However, in this case, I think it is
> possible to skip the WAL decoding in the first place.

I haven't read the patch yet, but it seems a reasonable approach.
> One thing we need to consider is that the number of transaction
> entries and memory contexts in the reorderbuffer could be very high,
> since it has entries for both top-level transactions and
> sub-transactions. To deal with that, the patch changes things so that
> the decoded tuples of a subtransaction are stored in its top-level
> transaction's tuple context. Since we always evict sub-transactions
> and their top-level transaction at the same time, I think it would
> not be a problem. Checking for performance degradation due to using
> many memory contexts would still have to be done.

Yeah, and I imagine there would be cases where the current
implementation shows better performance, such as when there are no
long-running transactions. But compared to unexpected memory bloat
and OOM kills, I think it is far better.

> Even with this idea, we would still have a mismatch between the
> actual amount of allocated memory and rb->size, but it would be much
> better than today. Something like the first idea would be necessary
> to address this mismatch, and we can discuss it in a separate thread.
>
> Regards,
>
> [1] https://www.postgresql.org/message-id/CAMnUB3oYugXCBLSkih%2BqNsWQPciEwos6g_AMbnz_peNoxfHwyw%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/17974-f8c9d353a62f414d%40postgresql.org
> [3] https://www.postgresql.org/message-id/DB9PR07MB71808AC6C7770AF2FB36B95BCB252%40DB9PR07MB7180.eurprd07.prod.outlook.com

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation