Thread: Re: Using per-transaction memory contexts for storing decoded tuples

Re: Using per-transaction memory contexts for storing decoded tuples

From
Amit Kapila
Date:
On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> We have several reports that logical decoding uses memory much more
> than logical_decoding_work_mem[1][2][3]. For instance in one of the
> reports[1], even though users set logical_decoding_work_mem to
> '256MB', a walsender process was killed by OOM because of using more
> than 4GB memory.
>
> As per the discussion in these threads so far, what happened was that
> there was huge memory fragmentation in rb->tup_context.
> rb->tup_context uses GenerationContext with 8MB memory blocks. We
> cannot free memory blocks until all memory chunks in the block are
> freed. If there is a long-running transaction making changes, its
> changes could be spread across many memory blocks and we end up not
> being able to free memory blocks unless the long-running transaction
> is evicted or completed. Since we don't account fragmentation, block
> header size, nor chunk header size into per-transaction memory usage
> (i.e. txn->size), rb->size could be less than
> logical_decoding_work_mem but the actual allocated memory in the
> context is much higher than logical_decoding_work_mem.
>

It is not clear to me how the fragmentation happens. Is it because of
some interleaving transactions that have already ended but whose
memory has not been released? Can we try reducing the size of the 8MB
memory blocks? The comment atop the allocation says: "XXX the
allocation sizes used below pre-date generation context's block
growing code.  These values should likely be benchmarked and set to
more suitable values.", so do we need some tuning here?
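
For reference, the allocation that comment sits atop looks roughly
like this in ReorderBufferAllocate() (SLAB_LARGE_BLOCK_SIZE being 8MB):

    rb->tup_context = GenerationContextCreate(new_ctx,
                                              "Tuples",
                                              SLAB_LARGE_BLOCK_SIZE,
                                              SLAB_LARGE_BLOCK_SIZE,
                                              SLAB_LARGE_BLOCK_SIZE);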

> We can reproduce this issue with the attached patch
> rb_excessive_memory_reproducer.patch (not intended to commit) that
> adds a memory usage reporting and includes the test. After running the
> tap test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
> usage report in the server logs like follows:
>
> LOG:  reorderbuffer memory: logical_decoding_work_mem=268435456
> rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264
>
> Which means that the logical decoding allocated 1GB memory in spite of
> logical_decoding_work_mem being 256MB.
>
> One idea to deal with this problem is that we use
> MemoryContextMemAllocated() as the reorderbuffer's memory usage
> instead of txn->total_size. That is, we evict transactions until the
> value returned by MemoryContextMemAllocated() gets lower than
> logical_decoding_work_mem. However, it could end up evicting a lot of
> (possibly all) transactions because the transaction whose decoded
> tuples data are spread across memory context blocks could be evicted
> after all other larger transactions are evicted (note that we evict
> transactions from largest to smallest). Therefore, if we want to do
> that, we would need to change the eviction algorithm to for example
> LSN order instead of transaction size order. Furthermore,
> reorderbuffer's changes that are counted in txn->size (and therefore
> rb->size too) are stored in different memory contexts depending on the
> kinds. For example, decoded tuples are stored in rb->context,
> ReorderBufferChange are stored in rb->change_context, and snapshot
> data are stored in builder->context. So we would need to sort out
> which data is stored into which memory context and which memory
> context should be accounted for in the reorderbuffer's memory usage.
> Which could be a large amount of work.
>
> Another idea is to have memory context for storing decoded tuples per
> transaction. That way, we can ensure that all memory blocks of the
> context are freed when the transaction is evicted or completed. I
> think that this idea would be simpler and worth considering, so I
> attached the PoC patch, use_tup_context_per_ctx_v1.patch. Since the
> decoded tuple data is created individually when the corresponding WAL
> record is decoded but is released collectively when it is released
> (e.g., transaction eviction), the bump memory context would fit the
> best this case. One exception is that we immediately free the decoded
> tuple data if the transaction is already aborted. However, in this
> case, I think it is possible to skip the WAL decoding in the first
> place.
>
> One thing we need to consider is that the number of transaction
> entries and memory contexts in the reorderbuffer could be very high
> since it has entries for both top-level transaction entries and
> sub-transactions. To deal with that, the patch changes that decoded
> tuples of a subtransaction are stored into its top-level transaction's
> tuple context.
>

Won't that impact the calculation for ReorderBufferLargestTXN(), which
can select either a subtransaction or a top-level xact?

> We always evict sub-transactions and its top-level
> transaction at the same time, I think it would not be a problem.
> Checking performance degradation due to using many memory contexts
> would have to be done.
>

The generation context was introduced in commit a4ccc1ce, which claims
that it significantly reduced logical decoding's memory usage and
improved its performance. Won't changing it require us to validate all
the cases that led us to use the generation context in the first
place?

--
With Regards,
Amit Kapila.



Re: Using per-transaction memory contexts for storing decoded tuples

From
Masahiko Sawada
Date:
On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > We have several reports that logical decoding uses memory much more
> > than logical_decoding_work_mem[1][2][3]. For instance in one of the
> > reports[1], even though users set logical_decoding_work_mem to
> > '256MB', a walsender process was killed by OOM because of using more
> > than 4GB memory.
> >
> > As per the discussion in these threads so far, what happened was that
> > there was huge memory fragmentation in rb->tup_context.
> > rb->tup_context uses GenerationContext with 8MB memory blocks. We
> > cannot free memory blocks until all memory chunks in the block are
> > freed. If there is a long-running transaction making changes, its
> > changes could be spread across many memory blocks and we end up not
> > being able to free memory blocks unless the long-running transaction
> > is evicted or completed. Since we don't account fragmentation, block
> > header size, nor chunk header size into per-transaction memory usage
> > (i.e. txn->size), rb->size could be less than
> > logical_decoding_work_mem but the actual allocated memory in the
> > context is much higher than logical_decoding_work_mem.
> >
>
> It is not clear to me how the fragmentation happens. Is it because of
> some interleaving transactions which are even ended but the memory
> corresponding to them is not released?

In a generation context, a memory block can be freed only when all the
memory chunks on it have been freed. Therefore, the individual tuple
buffers are already pfree()'d but the underlying memory blocks are not
freed.
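
To illustrate with a made-up layout: if a block still has even one
live chunk,

    block: [ freed | freed | live | freed | ... ]

the whole 8MB block stays allocated even though most of its space has
already been pfree()'d.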

> Can we try reducing the size of
> 8MB memory blocks? The comment atop allocation says: "XXX the
> allocation sizes used below pre-date generation context's block
> growing code.  These values should likely be benchmarked and set to
> more suitable values.", so do we need some tuning here?

Reducing the size of the 8MB memory block would be one solution, and
could be preferable as it would be back-patchable. It would mitigate
the problem but not resolve it. I agree with trying to reduce it and
doing some benchmark tests. If that makes the problem reasonably less
likely to happen, it would be a good solution.
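
Just as a sketch of that direction (the actual numbers would be
decided by the benchmarks, and we could instead keep a larger
maxBlockSize and rely on the block-growing code the XXX comment
mentions), the change would be something like this in
ReorderBufferAllocate(), where SLAB_DEFAULT_BLOCK_SIZE is 8kB:

    rb->tup_context = GenerationContextCreate(new_ctx,
                                              "Tuples",
                                              SLAB_DEFAULT_BLOCK_SIZE,
                                              SLAB_DEFAULT_BLOCK_SIZE,
                                              SLAB_DEFAULT_BLOCK_SIZE);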

>
> > We can reproduce this issue with the attached patch
> > rb_excessive_memory_reproducer.patch (not intended to commit) that
> > adds a memory usage reporting and includes the test. After running the
> > tap test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
> > usage report in the server logs like follows:
> >
> > LOG:  reorderbuffer memory: logical_decoding_work_mem=268435456
> > rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264
> >
> > Which means that the logical decoding allocated 1GB memory in spite of
> > logical_decoding_work_mem being 256MB.
> >
> > One idea to deal with this problem is that we use
> > MemoryContextMemAllocated() as the reorderbuffer's memory usage
> > instead of txn->total_size. That is, we evict transactions until the
> > value returned by MemoryContextMemAllocated() gets lower than
> > logical_decoding_work_mem. However, it could end up evicting a lot of
> > (possibly all) transactions because the transaction whose decoded
> > tuples data are spread across memory context blocks could be evicted
> > after all other larger transactions are evicted (note that we evict
> > transactions from largest to smallest). Therefore, if we want to do
> > that, we would need to change the eviction algorithm to for example
> > LSN order instead of transaction size order. Furthermore,
> > reorderbuffer's changes that are counted in txn->size (and therefore
> > rb->size too) are stored in different memory contexts depending on the
> > kinds. For example, decoded tuples are stored in rb->context,
> > ReorderBufferChange are stored in rb->change_context, and snapshot
> > data are stored in builder->context. So we would need to sort out
> > which data is stored into which memory context and which memory
> > context should be accounted for in the reorderbuffer's memory usage.
> > Which could be a large amount of work.
> >
> > Another idea is to have memory context for storing decoded tuples per
> > transaction. That way, we can ensure that all memory blocks of the
> > context are freed when the transaction is evicted or completed. I
> > think that this idea would be simpler and worth considering, so I
> > attached the PoC patch, use_tup_context_per_ctx_v1.patch. Since the
> > decoded tuple data is created individually when the corresponding WAL
> > record is decoded but is released collectively when it is released
> > (e.g., transaction eviction), the bump memory context would fit the
> > best this case. One exception is that we immediately free the decoded
> > tuple data if the transaction is already aborted. However, in this
> > case, I think it is possible to skip the WAL decoding in the first
> > place.
> >
> > One thing we need to consider is that the number of transaction
> > entries and memory contexts in the reorderbuffer could be very high
> > since it has entries for both top-level transaction entries and
> > sub-transactions. To deal with that, the patch changes that decoded
> > tuples of a subtransaction are stored into its top-level transaction's
> > tuple context.
> >
>
> Won't that impact the calculation for ReorderBufferLargestTXN() which
> can select either subtransaction or top-level xact?

Yeah, I missed that we could evict only sub-transactions when choosing
the largest transaction. We need to find a better solution.

>
> > We always evict sub-transactions and its top-level
> > transaction at the same time, I think it would not be a problem.
> > Checking performance degradation due to using many memory contexts
> > would have to be done.
> >
>
> The generation context has been introduced in commit a4ccc1ce which
> claims that it has significantly reduced logical decoding's memory
> usage and improved its performance. Won't changing it requires us to
> validate all the cases which led us to use generation context in the
> first place?

Agreed. Will investigate the thread and check the cases.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Using per-transaction memory contexts for storing decoded tuples

From
Amit Kapila
Date:
On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > We have several reports that logical decoding uses memory much more
> > > than logical_decoding_work_mem[1][2][3]. For instance in one of the
> > > reports[1], even though users set logical_decoding_work_mem to
> > > '256MB', a walsender process was killed by OOM because of using more
> > > than 4GB memory.
> > >
> > > As per the discussion in these threads so far, what happened was that
> > > there was huge memory fragmentation in rb->tup_context.
> > > rb->tup_context uses GenerationContext with 8MB memory blocks. We
> > > cannot free memory blocks until all memory chunks in the block are
> > > freed. If there is a long-running transaction making changes, its
> > > changes could be spread across many memory blocks and we end up not
> > > being able to free memory blocks unless the long-running transaction
> > > is evicted or completed. Since we don't account fragmentation, block
> > > header size, nor chunk header size into per-transaction memory usage
> > > (i.e. txn->size), rb->size could be less than
> > > logical_decoding_work_mem but the actual allocated memory in the
> > > context is much higher than logical_decoding_work_mem.
> > >
> >
> > It is not clear to me how the fragmentation happens. Is it because of
> > some interleaving transactions which are even ended but the memory
> > corresponding to them is not released?
>
> In a generation context, we can free a memory block only when all
> memory chunks there are freed. Therefore, individual tuple buffers are
> already pfree()'ed but the underlying memory blocks are not freed.
>

I understood this part but didn't understand the cases leading to this
problem. For example, if there is a large (and only) transaction in
the system that allocates many buffers for change records during
decoding, in the end, it should free memory for all the buffers
allocated in the transaction. So, wouldn't that free all the memory
chunks corresponding to the memory blocks allocated? My guess was that
we couldn't free all the chunks because there were small interleaving
transactions that allocated memory but didn't free it before the large
transaction ended.

> > Can we try reducing the size of
> > 8MB memory blocks? The comment atop allocation says: "XXX the
> > allocation sizes used below pre-date generation context's block
> > growing code.  These values should likely be benchmarked and set to
> > more suitable values.", so do we need some tuning here?
>
> Reducing the size of the 8MB memory block would be one solution and
> could be better as it could be back-patchable. It would mitigate the
> problem but would not resolve it. I agree to try reducing it and do
> some benchmark tests. If it reasonably makes the problem less likely
> to happen, it would be a good solution.
>

makes sense.

--
With Regards,
Amit Kapila.



Re: Using per-transaction memory contexts for storing decoded tuples

From
Masahiko Sawada
Date:
On Tue, Sep 17, 2024 at 2:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > We have several reports that logical decoding uses memory much more
> > > > than logical_decoding_work_mem[1][2][3]. For instance in one of the
> > > > reports[1], even though users set logical_decoding_work_mem to
> > > > '256MB', a walsender process was killed by OOM because of using more
> > > > than 4GB memory.
> > > >
> > > > As per the discussion in these threads so far, what happened was that
> > > > there was huge memory fragmentation in rb->tup_context.
> > > > rb->tup_context uses GenerationContext with 8MB memory blocks. We
> > > > cannot free memory blocks until all memory chunks in the block are
> > > > freed. If there is a long-running transaction making changes, its
> > > > changes could be spread across many memory blocks and we end up not
> > > > being able to free memory blocks unless the long-running transaction
> > > > is evicted or completed. Since we don't account fragmentation, block
> > > > header size, nor chunk header size into per-transaction memory usage
> > > > (i.e. txn->size), rb->size could be less than
> > > > logical_decoding_work_mem but the actual allocated memory in the
> > > > context is much higher than logical_decoding_work_mem.
> > > >
> > >
> > > It is not clear to me how the fragmentation happens. Is it because of
> > > some interleaving transactions which are even ended but the memory
> > > corresponding to them is not released?
> >
> > In a generation context, we can free a memory block only when all
> > memory chunks there are freed. Therefore, individual tuple buffers are
> > already pfree()'ed but the underlying memory blocks are not freed.
> >
>
> I understood this part but didn't understand the cases leading to this
> problem. For example, if there is a large (and only) transaction in
> the system that allocates many buffers for change records during
> decoding, in the end, it should free memory for all the buffers
> allocated in the transaction. So, wouldn't that free all the memory
> chunks corresponding to the memory blocks allocated? My guess was that
> we couldn't free all the chunks because there were small interleaving
> transactions that allocated memory but didn't free it before the large
> transaction ended.

We haven't actually checked with the person who reported the problem,
so this is just a guess, but I think there were concurrent
transactions, including long-running INSERT transactions. For example,
suppose a transaction that inserts 10 million rows is running at the
same time as many OLTP-like (short) transactions. The scenario I
thought of is that each 8MB generation context block contains 1MB of
the large insert transaction's changes, and the other 7MB contains
short OLTP transaction changes. If there are just 256 such blocks,
then even after all the short transactions have completed, the
generation context will keep 2GB of memory allocated until we decode
the commit record of the large transaction, but rb->size will say only
256MB.
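
To spell out the arithmetic of that scenario:

    256 blocks * 8MB allocated per block                 = 2,048MB actually held
    256 blocks * 1MB of live large-transaction data each =   256MB counted in rb->size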

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Using per-transaction memory contexts for storing decoded tuples

From
Masahiko Sawada
Date:
On Tue, Sep 17, 2024 at 2:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Can we try reducing the size of
> > > 8MB memory blocks? The comment atop allocation says: "XXX the
> > > allocation sizes used below pre-date generation context's block
> > > growing code.  These values should likely be benchmarked and set to
> > > more suitable values.", so do we need some tuning here?
> >
> > Reducing the size of the 8MB memory block would be one solution and
> > could be better as it could be back-patchable. It would mitigate the
> > problem but would not resolve it. I agree to try reducing it and do
> > some benchmark tests. If it reasonably makes the problem less likely
> > to happen, it would be a good solution.
> >
>
> makes sense.

I've done some benchmark tests for three different code bases with
different test cases. In short, reducing the generation memory context
block size to 8kB seems to be promising; it mitigates the problem
while keeping a similar performance.

Here are three code bases that I used:

* head: current head code.
* per-tx-bump: the proposed idea (with a slight change: each sub- and
top-level transaction has its own bump memory context to store decoded
tuples; a rough sketch follows this list).
* 8kb-mem-block: same as head except for changing the generation
memory block size from 8MB to 8kB.
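
To make the per-tx-bump variant concrete (a rough sketch, not
necessarily the exact patch; the txn->tup_context field name is made
up here), each top-level transaction gets something like:

    txn->tup_context = BumpContextCreate(rb->context,
                                         "txn tuples",
                                         SLAB_DEFAULT_BLOCK_SIZE,
                                         SLAB_DEFAULT_BLOCK_SIZE,
                                         SLAB_LARGE_BLOCK_SIZE);

and decoded tuples of the transaction and of its sub-transactions are
allocated from it instead of rb->tup_context, so that a single
MemoryContextDelete(txn->tup_context) at eviction or cleanup returns
all of its blocks at once.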

And here are test cases and results:

1. Memory usage check

I've run the test that I shared before and checked the maximum amount
of memory allocated in the reorderbuffer context shown by
MemoryContextMemAllocated(). Here are results:

head: 2.1GB (while rb->size showing 43MB)
per-tx-bump: 50MB (while rb->size showing 43MB)
8kb-mem-block: 54MB (while rb->size showing 43MB)

I've confirmed that the excessive memory usage issue didn't happen in
the per-tx-bump and 8kb-mem-block cases.

2. Decoding many sub transactions

IIUC this kind of workload was a trigger for inventing the Generation
Context for logical decoding[1]. A single top-level transaction has 1M
sub-transactions, each of which inserts a tuple. Here are the results:

head: 31694.163 ms (00:31.694)
per-tx-bump: 32661.752 ms (00:32.662)
8kb-mem-block: 31834.872 ms (00:31.835)

The head and 8kb-mem-block cases showed similar results, whereas I see
a bit of regression with per-tx-bump. I think this is because of the
overhead of creating and deleting a memory context for each
sub-transaction.

3. Decoding a big transaction

The next test case I did is to decode a single big transaction that
inserts 10M rows. I set logical_decoding_work_mem large enough to
avoid spilling behavior. Here are results:

head: 19859.113 ms (00:19.859)
per-tx-bump: 19422.308 ms (00:19.422)
8kb-mem-block: 19923.600 ms (00:19.924)

There were no big differences. FYI, I also checked the maximum memory
usage for this test case:

head: 1.53GB
per-tx-bump: 1.4GB
8kb-mem-block: 1.53GB

The per-tx-bump case used a bit less memory, probably thanks to the
bump memory contexts.

4. Decoding many short transactions.

The last test case I did is to decode a bunch of short pgbench
transactions (10k transactions). Here are results:

head: 31694.163 ms (00:31.694)
per-tx-bump: 32661.752 ms (00:32.662)
8kb-mem-block: 31834.872 ms (00:31.835)

I can see a similar trend to test case #2 above.

Overall, reducing the generation context memory block size to 8kB
seems promising. And using a bump memory context per transaction
didn't bring as much performance improvement as I expected in these
cases.

Regards,

[1] https://www.postgresql.org/message-id/flat/20160706185502.1426.28143@wrigleys.postgresql.org

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Using per-transaction memory contexts for storing decoded tuples

From
David Rowley
Date:
On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've done some benchmark tests for three different code bases with
> different test cases. In short, reducing the generation memory context
> block size to 8kB seems to be promising; it mitigates the problem
> while keeping a similar performance.

Did you try any sizes between 8KB and 8MB?  A 1000x reduction seems
like quite a large jump. There is additional overhead from having more
blocks: it means more malloc() work and more free() work when deleting
a context. It would be nice to see some numbers for all powers of 2
between 8KB and 8MB.  I imagine the returns diminish as the block size
is reduced further.

Another alternative idea would be to defragment transactions with a
large number of changes after they grow to some predefined size.
Defragmentation would just be a matter of performing
palloc/memcpy/pfree for each change. If that's all done at once, all
the changes for that transaction would be contiguous in memory. If
you're smart about what the trigger point is for performing the
defragmentation then I imagine there's not much risk of performance
regression for the general case.  For example, you might only want to
trigger it when MemoryContextMemAllocated() for the generation context
exceeds logical_decoding_work_mem by some factor and only do it for
transactions where the size of the changes exceeds some threshold.
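
To be concrete, an untested sketch of the idea (the function is
invented for illustration, only INSERT tuples are handled, and the
trigger logic is left out; field names are as I remember them from
reorderbuffer.h):

    static void
    ReorderBufferDefragTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
    {
        dlist_iter  iter;

        dlist_foreach(iter, &txn->changes)
        {
            ReorderBufferChange *change =
                dlist_container(ReorderBufferChange, node, iter.cur);
            HeapTuple   oldtup;
            HeapTuple   newtup;
            Size        len;

            if (change->action != REORDER_BUFFER_CHANGE_INSERT ||
                change->data.tp.newtuple == NULL)
                continue;

            /* re-copy the tuple so surviving chunks pack into fresh blocks */
            oldtup = change->data.tp.newtuple;
            len = HEAPTUPLESIZE + oldtup->t_len;
            newtup = (HeapTuple) MemoryContextAlloc(rb->tup_context, len);
            memcpy(newtup, oldtup, len);
            newtup->t_data = (HeapTupleHeader) ((char *) newtup + HEAPTUPLESIZE);
            change->data.tp.newtuple = newtup;
            pfree(oldtup);
        }
    }

Once every still-live chunk has been re-copied like this, the old,
mostly-empty blocks have no live chunks left and can be given back.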

> Overall, reducing the generation context memory block size to 8kB
> seems to be promising. And using the bump memory context per
> transaction didn't bring performance improvement than I expected in
> these cases.

Having a bump context per transaction would cause a malloc() every
time you create a new context and a free() each time you delete the
context when cleaning up the reorder buffer for the transaction.
GenerationContext has a "freeblock" field that it'll populate instead
of freeing a block. That block will be reused next time a new block is
required.  For truly FIFO workloads that never need an oversized
block, all new blocks will come from the freeblock field once the
first block becomes unused. See the comments in GenerationFree(). I
expect this is why bump contexts are slower than the generation
context for your short transaction workload.

David



Re: Using per-transaction memory contexts for storing decoded tuples

From
Fujii Masao
Date:

On 2024/09/19 8:53, Masahiko Sawada wrote:
> On Tue, Sep 17, 2024 at 2:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Mon, Sep 16, 2024 at 10:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>
>>> On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>
>>>> Can we try reducing the size of
>>>> 8MB memory blocks? The comment atop allocation says: "XXX the
>>>> allocation sizes used below pre-date generation context's block
>>>> growing code.  These values should likely be benchmarked and set to
>>>> more suitable values.", so do we need some tuning here?
>>>
>>> Reducing the size of the 8MB memory block would be one solution and
>>> could be better as it could be back-patchable. It would mitigate the
>>> problem but would not resolve it. I agree to try reducing it and do
>>> some benchmark tests. If it reasonably makes the problem less likely
>>> to happen, it would be a good solution.
>>>
>>
>> makes sense.
> 
> I've done some benchmark tests for three different code bases with
> different test cases. In short, reducing the generation memory context
> block size to 8kB seems to be promising; it mitigates the problem
> while keeping a similar performance.

Sounds good!

I believe the memory bloat issue in the reorder buffer should be
considered a bug. Since this solution isn’t too invasive, I think
it’s worth considering back-patching it to older versions.

Then, if we find a better approach, we can apply it to v18 or later.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Using per-transaction memory contexts for storing decoded tuples

From
Amit Kapila
Date:
On Thu, Sep 19, 2024 at 6:46 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I've done some benchmark tests for three different code bases with
> > different test cases. In short, reducing the generation memory context
> > block size to 8kB seems to be promising; it mitigates the problem
> > while keeping a similar performance.
>
> Did you try any sizes between 8KB and 8MB?  1000x reduction seems
> quite large a jump. There is additional overhead from having more
> blocks. It means more malloc() work and more free() work when deleting
> a context. It would be nice to see some numbers with all powers of 2
> between 8KB and 8MB.  I imagine the returns are diminishing as the
> block size is reduced further.
>

Good idea.

> Another alternative idea would be to defragment transactions with a
> large number of changes after they grow to some predefined size.
> Defragmentation would just be a matter of performing
> palloc/memcpy/pfree for each change. If that's all done at once, all
> the changes for that transaction would be contiguous in memory. If
> you're smart about what the trigger point is for performing the
> defragmentation then I imagine there's not much risk of performance
> regression for the general case.  For example, you might only want to
> trigger it when MemoryContextMemAllocated() for the generation context
> exceeds logical_decoding_work_mem by some factor and only do it for
> transactions where the size of the changes exceeds some threshold.
>

After collecting the changes that exceed 'logical_decoding_work_mem',
one can choose to stream the transaction and free the changes to avoid
hitting this problem; however, we could use that or some other
constant to decide the point of defragmentation. The other point we
need to think about with this idea is whether we actually need any
defragmentation at all, which will depend on whether there are
concurrent transactions being decoded. This would require benchmarking
to see the performance impact.

--
With Regards,
Amit Kapila.



Re: Using per-transaction memory contexts for storing decoded tuples

From
David Rowley
Date:
On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I've done other benchmarking tests while changing the memory block
> sizes from 8kB to 8MB. I measured the execution time of logical
> decoding of one transaction that inserted 10M rows. I set
> logical_decoding_work_mem large enough to avoid spilling behavior. In
> this scenario, we allocate many memory chunks while decoding the
> transaction and resulting in calling more malloc() in smaller memory
> block sizes. Here are results (an average of 3 executions):

I was interested in seeing the memory consumption with the test that
was causing an OOM due to the GenerationBlock fragmentation.  You saw
bad results with 8MB blocks and good results with 8KB blocks. The
measure that's interesting here is the MemoryContextMemAllocated() for
the GenerationContext in question.

> The fact that we're using rb->size and txn->size to choose the
> transactions to evict could make this idea less attractive. Even if we
> defragment transactions, rb->size and txn->size don't change.
> Therefore, it doesn't mean we can avoid evicting transactions due to
> full of logical_decoding_work_mem, but just mean the amount of
> allocated memory might have been reduced.

I had roughly imagined that you'd add an extra field to store
txn->size as of the last defragmentation and only defrag the
transaction when ReorderBufferTXN->size is, say, double the size it
was when the changes were last defragmented. Of course, if the first
defragmentation was enough to drop MemoryContextMemAllocated() below
the concerning threshold above logical_decoding_work_mem then further
defrags wouldn't be done, at least until such time as
MemoryContextMemAllocated() became concerning again.  If you didn't
want to spend a full Size variable on the defragmentation size, you
could just store a uint8 recording which power of 2
ReorderBufferTXN->size was at when it was last defragmented. There are
plenty of holes in that struct that could be filled without enlarging
it.
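
i.e. something like this (field and threshold entirely hypothetical):

    /* last_defrag_log2 is a uint8 squeezed into an existing hole in
     * ReorderBufferTXN */
    if (MemoryContextMemAllocated(rb->tup_context, false) >
        (Size) logical_decoding_work_mem * 1024 * 2 &&
        txn->size > ((Size) 1 << txn->last_defrag_log2) * 2)
    {
        ReorderBufferDefragTXN(rb, txn);    /* the hypothetical defrag step */
        txn->last_defrag_log2 = (uint8) pg_ceil_log2_64(txn->size);
    }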

In general, it's a bit annoying to have to code around this
GenerationContext fragmentation issue. Unfortunately, AllocSet could
also suffer from fragmentation issues if lots of chunks end up on
freelists that are never reused, so using another context type might
just create a fragmentation issue for a different workload.

David



Re: Using per-transaction memory contexts for storing decoded tuples

From
Amit Kapila
Date:
On Thu, Sep 19, 2024 at 10:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Sep 18, 2024 at 8:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Sep 19, 2024 at 6:46 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > >
> > > On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > I've done some benchmark tests for three different code bases with
> > > > different test cases. In short, reducing the generation memory context
> > > > block size to 8kB seems to be promising; it mitigates the problem
> > > > while keeping a similar performance.
> > >
> > > Did you try any sizes between 8KB and 8MB?  1000x reduction seems
> > > quite large a jump. There is additional overhead from having more
> > > blocks. It means more malloc() work and more free() work when deleting
> > > a context. It would be nice to see some numbers with all powers of 2
> > > between 8KB and 8MB.  I imagine the returns are diminishing as the
> > > block size is reduced further.
> > >
> >
> > Good idea.
>
> Agreed.
>
> I've done other benchmarking tests while changing the memory block
> sizes from 8kB to 8MB. I measured the execution time of logical
> decoding of one transaction that inserted 10M rows. I set
> logical_decoding_work_mem large enough to avoid spilling behavior. In
> this scenario, we allocate many memory chunks while decoding the
> transaction and resulting in calling more malloc() in smaller memory
> block sizes. Here are results (an average of 3 executions):
>
> 8kB: 19747.870 ms
> 16kB: 19780.025 ms
> 32kB: 19760.575 ms
> 64kB: 19772.387 ms
> 128kB: 19825.385 ms
> 256kB: 19781.118 ms
> 512kB: 19808.138 ms
> 1MB: 19757.640 ms
> 2MB: 19801.429 ms
> 4MB: 19673.996 ms
> 8MB: 19643.547 ms
>
> Interestingly, there were no noticeable differences in the execution
> time. I've checked the number of allocated memory blocks in each case
> and more blocks are allocated in smaller block size cases. For
> example, when the logical decoding used the maximum memory (about
> 1.5GB), we allocated about 80k blocks in 8kb memory block size case
> and 80 blocks in 8MB memory block cases.
>

What exactly do these test results mean? Do you want to prove that
there is no regression by using smaller block sizes?

--
With Regards,
Amit Kapila.



Re: Using per-transaction memory contexts for storing decoded tuples

From
Amit Kapila
Date:
On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I've done other benchmarking tests while changing the memory block
> > sizes from 8kB to 8MB. I measured the execution time of logical
> > decoding of one transaction that inserted 10M rows. I set
> > logical_decoding_work_mem large enough to avoid spilling behavior. In
> > this scenario, we allocate many memory chunks while decoding the
> > transaction and resulting in calling more malloc() in smaller memory
> > block sizes. Here are results (an average of 3 executions):
>
> I was interested in seeing the memory consumption with the test that
> was causing an OOM due to the GenerationBlock fragmentation.
>

+1. That test will be helpful.

> > The fact that we're using rb->size and txn->size to choose the
> > transactions to evict could make this idea less attractive. Even if we
> > defragment transactions, rb->size and txn->size don't change.
> > Therefore, it doesn't mean we can avoid evicting transactions due to
> > full of logical_decoding_work_mem, but just mean the amount of
> > allocated memory might have been reduced.
>
> I had roughly imagined that you'd add an extra field to store
> txn->size when the last fragmentation was done and only defrag the
> transaction when the ReorderBufferTXN->size is, say, double the last
> size when the changes were last defragmented. Of course, if the first
> defragmentation was enough to drop MemoryContextMemAllocated() below
> the concerning threshold above logical_decoding_work_mem then further
> defrags wouldn't be done, at least, until such times as the
> MemoryContextMemAllocated() became concerning again.  If you didn't
> want to a full Size variable for the defragmentation size, you could
> just store a uint8 to store which power of 2 ReorderBufferTXN->size
> was when it was last defragmented. There are plenty of holds in that
> struct that could be filled without enlarging the struct.
>
> In general, it's a bit annoying to have to code around this
> GenerationContext fragmentation issue.
>

Right, and I am also slightly afraid that this may cause some
regression in other cases where defrag wouldn't help.

--
With Regards,
Amit Kapila.



Re: Using per-transaction memory contexts for storing decoded tuples

From
Masahiko Sawada
Date:
On Thu, Sep 19, 2024 at 10:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > I've done other benchmarking tests while changing the memory block
> > > sizes from 8kB to 8MB. I measured the execution time of logical
> > > decoding of one transaction that inserted 10M rows. I set
> > > logical_decoding_work_mem large enough to avoid spilling behavior. In
> > > this scenario, we allocate many memory chunks while decoding the
> > > transaction and resulting in calling more malloc() in smaller memory
> > > block sizes. Here are results (an average of 3 executions):
> >
> > I was interested in seeing the memory consumption with the test that
> > was causing an OOM due to the GenerationBlock fragmentation.
> >
>
> +1. That test will be helpful.

Sure. Here are results of peak memory usage in bytes reported by
MemoryContextMemAllocated() (when rb->size shows 43MB):

  8kB:    52,371,328
 16kB:    52,887,424
 32kB:    53,428,096
 64kB:    55,099,264
128kB:    86,163,328
256kB:   149,340,032
512kB:   273,334,144
  1MB:   523,419,520
  2MB: 1,021,493,120
  4MB: 1,984,085,888
  8MB: 2,130,886,528

Probably we can increase the size to 64kB?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Using per-transaction memory contexts for storing decoded tuples

From
David Rowley
Date:
On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > In general, it's a bit annoying to have to code around this
> > GenerationContext fragmentation issue.
>
> Right, and I am also slightly afraid that this may not cause some
> regression in other cases where defrag wouldn't help.

Yeah, that's certainly a possibility. I was hoping that
MemoryContextMemAllocated() being much larger than logical_work_mem
could only happen when there is fragmentation, but certainly, you
could be wasting effort trying to defrag transactions where the
changes all arrive in WAL consecutively and there is no
defragmentation. It might be some other large transaction that's
causing the context's allocations to be fragmented. I don't have any
good ideas on how to avoid wasting effort on non-problematic
transactions. Maybe there's something that could be done if we knew
the LSN of the first and last change and the gap between the LSNs was
much larger than the WAL space used for this transaction. That would
likely require tracking way more stuff than we do now, however.

With the smaller blocks idea, I'm a bit concerned that using smaller
blocks could cause regressions on systems that are better at releasing
memory back to the OS after free() as no doubt malloc() would often be
slower on those systems. There have been some complaints recently
about glibc being a bit too happy to keep hold of memory after free()
and I wondered if that was the reason why the small block test does
not cause much of a performance regression. I wonder how the small
block test would look on Mac, FreeBSD or Windows. I think it would be
risky to assume that all is well with reducing the block size after
testing on a single platform.

David



Re: Using per-transaction memory contexts for storing decoded tuples

From
Amit Kapila
Date:
On Sun, Sep 22, 2024 at 11:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 20 Sept 2024 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > > In general, it's a bit annoying to have to code around this
> > > GenerationContext fragmentation issue.
> >
> > Right, and I am also slightly afraid that this may not cause some
> > regression in other cases where defrag wouldn't help.
>
> Yeah, that's certainly a possibility. I was hoping that
> MemoryContextMemAllocated() being much larger than logical_work_mem
> could only happen when there is fragmentation, but certainly, you
> could be wasting effort trying to defrag transactions where the
> changes all arrive in WAL consecutively and there is no
> defragmentation. It might be some other large transaction that's
> causing the context's allocations to be fragmented. I don't have any
> good ideas on how to avoid wasting effort on non-problematic
> transactions. Maybe there's something that could be done if we knew
> the LSN of the first and last change and the gap between the LSNs was
> much larger than the WAL space used for this transaction. That would
> likely require tracking way more stuff than we do now, however.
>

With more information tracking, we could skip some non-problematic
transactions, but it would still be difficult to be confident that we
haven't harmed many cases, because only a few interleaving small
transactions are needed to make the memory non-contiguous. We can try
to think of ideas for implementing defragmentation in our code if we
can first prove that smaller block sizes cause problems.

> With the smaller blocks idea, I'm a bit concerned that using smaller
> blocks could cause regressions on systems that are better at releasing
> memory back to the OS after free() as no doubt malloc() would often be
> slower on those systems. There have been some complaints recently
> about glibc being a bit too happy to keep hold of memory after free()
> and I wondered if that was the reason why the small block test does
> not cause much of a performance regression. I wonder how the small
> block test would look on Mac, FreeBSD or Windows. I think it would be
> risky to assume that all is well with reducing the block size after
> testing on a single platform.
>

Good point. We need extensive testing on different platforms, as you
suggest, to verify whether smaller block sizes cause any regressions.

--
With Regards,
Amit Kapila.



Re: Using per-transaction memory contexts for storing decoded tuples

From
Amit Kapila
Date:
On Fri, Sep 20, 2024 at 10:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Sep 19, 2024 at 10:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Sep 20, 2024 at 5:13 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > >
> > > On Fri, 20 Sept 2024 at 05:03, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > I've done other benchmarking tests while changing the memory block
> > > > sizes from 8kB to 8MB. I measured the execution time of logical
> > > > decoding of one transaction that inserted 10M rows. I set
> > > > logical_decoding_work_mem large enough to avoid spilling behavior. In
> > > > this scenario, we allocate many memory chunks while decoding the
> > > > transaction and resulting in calling more malloc() in smaller memory
> > > > block sizes. Here are results (an average of 3 executions):
> > >
> > > I was interested in seeing the memory consumption with the test that
> > > was causing an OOM due to the GenerationBlock fragmentation.
> > >
> >
> > +1. That test will be helpful.
>
> Sure. Here are results of peak memory usage in bytes reported by
> MemoryContextMemAllocated() (when rb->size shows 43MB):
>
> 8kB:       52,371,328
> 16kB:     52,887,424
> 32kB:     53,428,096
> 64kB:     55,099,264
> 128kB:   86,163,328
> 256kB: 149,340,032
> 512kB: 273,334,144
> 1MB:    523,419,520
> 2MB: 1,021,493,120
> 4MB: 1,984,085,888
> 8MB: 2,130,886,528
>
> Probably we can increase the size to 64kB?
>

Yeah, but before deciding on a particular size, we need more testing
on different platforms as suggested by David.

--
With Regards,
Amit Kapila.



Re: Using per-transaction memory contexts for storing decoded tuples

From
Masahiko Sawada
Date:
On Thu, Sep 19, 2024 at 10:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Sep 19, 2024 at 10:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Sep 18, 2024 at 8:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Sep 19, 2024 at 6:46 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > > >
> > > > On Thu, 19 Sept 2024 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > I've done some benchmark tests for three different code bases with
> > > > > different test cases. In short, reducing the generation memory context
> > > > > block size to 8kB seems to be promising; it mitigates the problem
> > > > > while keeping a similar performance.
> > > >
> > > > Did you try any sizes between 8KB and 8MB?  1000x reduction seems
> > > > quite large a jump. There is additional overhead from having more
> > > > blocks. It means more malloc() work and more free() work when deleting
> > > > a context. It would be nice to see some numbers with all powers of 2
> > > > between 8KB and 8MB.  I imagine the returns are diminishing as the
> > > > block size is reduced further.
> > > >
> > >
> > > Good idea.
> >
> > Agreed.
> >
> > I've done other benchmarking tests while changing the memory block
> > sizes from 8kB to 8MB. I measured the execution time of logical
> > decoding of one transaction that inserted 10M rows. I set
> > logical_decoding_work_mem large enough to avoid spilling behavior. In
> > this scenario, we allocate many memory chunks while decoding the
> > transaction and resulting in calling more malloc() in smaller memory
> > block sizes. Here are results (an average of 3 executions):
> >
> > 8kB: 19747.870 ms
> > 16kB: 19780.025 ms
> > 32kB: 19760.575 ms
> > 64kB: 19772.387 ms
> > 128kB: 19825.385 ms
> > 256kB: 19781.118 ms
> > 512kB: 19808.138 ms
> > 1MB: 19757.640 ms
> > 2MB: 19801.429 ms
> > 4MB: 19673.996 ms
> > 8MB: 19643.547 ms
> >
> > Interestingly, there were no noticeable differences in the execution
> > time. I've checked the number of allocated memory blocks in each case
> > and more blocks are allocated in smaller block size cases. For
> > example, when the logical decoding used the maximum memory (about
> > 1.5GB), we allocated about 80k blocks in 8kb memory block size case
> > and 80 blocks in 8MB memory block cases.
> >
>
> What exactly do these test results mean? Do you want to prove that
> there is no regression by using smaller block sizes?

Yes, there was no noticeable performance regression, at least in this
test scenario.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com