Re: Using per-transaction memory contexts for storing decoded tuples - Mailing list pgsql-hackers
From: torikoshia
Subject: Re: Using per-transaction memory contexts for storing decoded tuples
Date:
Msg-id: 162ff6202f3c12b5769f59c3eb2bc0ca@oss.nttdata.com
List: pgsql-hackers
On 2024-09-12 07:32, Masahiko Sawada wrote:

Thanks a lot for working on this!

> Hi all,
>
> We have several reports that logical decoding uses much more memory
> than logical_decoding_work_mem [1][2][3]. For instance, in one of the
> reports [1], even though users set logical_decoding_work_mem to
> '256MB', a walsender process was killed by OOM because it was using
> more than 4GB of memory.
>
> As per the discussion in these threads so far, what happened was that
> there was huge memory fragmentation in rb->tup_context.
> rb->tup_context uses a GenerationContext with 8MB memory blocks. We
> cannot free a memory block until all memory chunks in the block are
> freed. If there is a long-running transaction making changes, its
> changes could be spread across many memory blocks, and we end up not
> being able to free those blocks unless the long-running transaction
> is evicted or completed. Since we don't account for fragmentation,
> block header size, or chunk header size in per-transaction memory
> usage (i.e. txn->size), rb->size could be less than
> logical_decoding_work_mem while the actual allocated memory in the
> context is much higher than logical_decoding_work_mem.
>
> We can reproduce this issue with the attached patch
> rb_excessive_memory_reproducer.patch (not intended for commit), which
> adds memory usage reporting and includes a test. After running the
> TAP test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
> usage report in the server logs like the following:
>
> LOG: reorderbuffer memory: logical_decoding_work_mem=268435456
> rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264
>
> This means that logical decoding allocated 1GB of memory in spite of
> logical_decoding_work_mem being 256MB.
>
> One idea to deal with this problem is to use
> MemoryContextMemAllocated() as the reorderbuffer's memory usage
> instead of txn->total_size.
> That is, we evict transactions until the value returned by
> MemoryContextMemAllocated() gets lower than
> logical_decoding_work_mem. However, this could end up evicting a lot
> of (possibly all) transactions, because the transaction whose decoded
> tuple data are spread across memory context blocks could be evicted
> only after all other, larger transactions are evicted (note that we
> evict transactions from largest to smallest). Therefore, if we want to
> do that, we would need to change the eviction algorithm, for example
> to LSN order instead of transaction size order. Furthermore, the
> reorderbuffer's changes that are counted in txn->size (and therefore
> rb->size too) are stored in different memory contexts depending on
> their kind. For example, decoded tuples are stored in rb->context,
> ReorderBufferChange entries are stored in rb->change_context, and
> snapshot data are stored in builder->context. So we would need to sort
> out which data is stored in which memory context and which memory
> contexts should be accounted for in the reorderbuffer's memory usage,
> which could be a large amount of work.
>
> Another idea is to have a memory context for storing decoded tuples
> per transaction. That way, we can ensure that all memory blocks of the
> context are freed when the transaction is evicted or completed. I
> think that this idea would be simpler and worth considering, so I
> attached the PoC patch, use_tup_context_per_ctx_v1.patch. Since the
> decoded tuple data is created individually when the corresponding WAL
> record is decoded but is released collectively (e.g., at transaction
> eviction), the bump memory context would fit this case best. One
> exception is that we immediately free the decoded tuple data if the
> transaction is already aborted. However, in this case, I think it is
> possible to skip the WAL decoding in the first place.

I haven't read the patch yet, but it seems a reasonable approach.
> One thing we need to consider is that the number of transaction
> entries and memory contexts in the reorderbuffer could be very high,
> since it has entries for both top-level transactions and
> sub-transactions. To deal with that, the patch changes things so that
> the decoded tuples of a subtransaction are stored in its top-level
> transaction's tuple context. Since we always evict sub-transactions
> and their top-level transaction at the same time, I think it would
> not be a problem. Checking for performance degradation due to using
> many memory contexts would still have to be done.

Yeah, and I imagine there would be cases where the current
implementation shows better performance, such as when there are no
long-running transactions. But compared to unexpected memory bloat
and OOM kills, I think it is far better.

> Even with this idea, we would still have a mismatch between the
> actual amount of allocated memory and rb->size, but it would be much
> better than today. Something like the first idea would be necessary
> to address this mismatch, and we can discuss it in a separate thread.
>
> Regards,
>
> [1] https://www.postgresql.org/message-id/CAMnUB3oYugXCBLSkih%2BqNsWQPciEwos6g_AMbnz_peNoxfHwyw%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/17974-f8c9d353a62f414d%40postgresql.org
> [3] https://www.postgresql.org/message-id/DB9PR07MB71808AC6C7770AF2FB36B95BCB252%40DB9PR07MB7180.eurprd07.prod.outlook.com

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation