Re: Logical Replica ReorderBuffer Size Accounting Issues - Mailing list pgsql-bugs

From Masahiko Sawada
Subject Re: Logical Replica ReorderBuffer Size Accounting Issues
Date
Msg-id CAD21AoBSw3F1F9jFFtmNxaOJnXorL51wMQkxQpSPEQbGdv_0rQ@mail.gmail.com
In response to RE: Logical Replica ReorderBuffer Size Accounting Issues  ("Wei Wang (Fujitsu)" <wangw.fnst@fujitsu.com>)
Responses RE: Logical Replica ReorderBuffer Size Accounting Issues  ("Wei Wang (Fujitsu)" <wangw.fnst@fujitsu.com>)
List pgsql-bugs
On Tue, May 23, 2023 at 1:11 PM Wei Wang (Fujitsu)
<wangw.fnst@fujitsu.com> wrote:
>
> On Thu, May 9, 2023 at 22:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Tue, May 9, 2023 at 6:06 PM Wei Wang (Fujitsu)
> > > > I think there are two separate issues. One is a pure memory accounting
> > > > issue: since the reorderbuffer accounts for the memory usage by
> > > > calculating actual tuple sizes etc., it includes neither the chunk
> > > > header size nor fragmentation within blocks. So I can understand why
> > > > the output of MemoryContextStats(rb->context) could be two or three
> > > > times higher than logical_decoding_work_mem and doesn't match rb->size
> > > > in some cases.
> > > >
> > > > However it cannot explain the original issue that the memory usage
> > > > (reported by MemoryContextStats(rb->context)) reached 5GB in spite of
> > > > logical_decoding_work_mem being 256MB, which seems like a memory leak
> > > > bug or a case where we ignore the memory limit.
> > >
> > > Yes, I agree that the chunk header size or fragmentation within blocks may
> > > cause the allocated space to be larger than the accounted space. However, since
> > > these spaces are very small (please refer to [1] and [2]), I also don't think
> > > this is the cause of the original issue in this thread.
> > >
> > > I think that the cause of the original issue in this thread is the
> > > implementation of 'Generational allocator'.
> > > Please consider the following user scenario:
> > > The parallel execution of different transactions led to very fragmented and
> > > mixed-up WAL records for those transactions. Later, when walsender serially
> > > decodes the WAL, different transactions' chunks were stored on a single block
> > > in rb->tup_context. However, when a transaction ends, the chunks related to
> > > this transaction on the block will be marked as free instead of being actually
> > > released. The block will only be released when all chunks in the block are
> > > free. In other words, the block will only be released when all transactions
> > > occupying the block have ended. As a result, the chunks allocated by some
> > > ended transactions are not being released on many blocks for a long time.
> > > Then this issue occurred. I think this also explains why parallel execution is more
> > > likely to trigger this issue compared to serial execution of transactions.
> > > Please also refer to the analysis details of code in [3].
> >
> > After some investigation, I don't think the implementation of
> > the generation allocator is problematic, but I agree that your scenario is
> > likely to explain the original issue. Especially, the output of
> > MemoryContextStats() shows:
> >
> >           Tuples: 4311744512 total in 514 blocks (12858943 chunks);
> > 6771224 free (12855411 chunks); 4304973288 used
> >
> > First, since the total memory allocation was 4311744512 bytes in 514
> > blocks, we can see there were no special blocks in the context (8MB *
> > 514 = 4311744512 bytes). Second, it shows that most chunks were
> > free (12858943 chunks vs. 12855411 chunks) but most memory was used
> > (4311744512 bytes vs. 4304973288 bytes), which means that there were
> > some in-use chunks at the tail of each block, i.e. most blocks
> > were fragmented. I've attached another test to reproduce this
> > behavior. In this test, the memory usage reaches up to almost 4GB.
> >
> > One idea to deal with this issue is to choose the block sizes
> > carefully while measuring the performance as the comment shows:
> >
> >     /*
> >      * XXX the allocation sizes used below pre-date generation context's block
> >      * growing code.  These values should likely be benchmarked and set to
> >      * more suitable values.
> >      */
> >     buffer->tup_context = GenerationContextCreate(new_ctx,
> >                                                   "Tuples",
> >                                                   SLAB_LARGE_BLOCK_SIZE,
> >                                                   SLAB_LARGE_BLOCK_SIZE,
> >                                                   SLAB_LARGE_BLOCK_SIZE);
> >
> > For example, when I used SLAB_DEFAULT_BLOCK_SIZE (8kB), the maximum memory
> > usage was about 17MB in the test.
>
> Thanks for your idea.
> I did some tests as you suggested. I think the modification mentioned above can
> work around this issue in the test 002_rb_memory_2.pl on [1] (To reach the size
> of large transactions, I set logical_decoding_work_mem to 1MB). But the test
> reproduce.sh on [2] still reproduces this issue.

Yes, that's because the above modification doesn't fix the memory
accounting issue; it only reduces memory bloat in some (extremely bad)
cases. Without the modification, the maximum actual memory usage could
easily reach tens of times logical_decoding_work_mem (e.g. 4GB vs.
256MB as originally reported). Since the reorderbuffer still doesn't
account for memory fragmentation, chunk headers, and so on, the actual
memory usage can still reach several times logical_decoding_work_mem.
In my environment, with the reproducer.sh you shared, the total actual
memory usage reached about 430MB with logical_decoding_work_mem set to
256MB. A similar issue would probably still happen even if we used
another type of memory allocator such as AllocSet. If we want to
guarantee that the reorderbuffer's memory usage never exceeds
logical_decoding_work_mem, we would need to change how the
reorderbuffer uses and accounts for memory, which I guess would
require much work.
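
To illustrate the effect, here is a small self-contained simulation
(this is not PostgreSQL code; the transaction count, chunk size, and
block layout are made-up assumptions): chunks of many interleaved
transactions fill 8MB blocks in decode order, and although almost all
transactions commit and free their chunks, every block still contains
a few live chunks and therefore cannot be released.

/*
 * Toy model of the retention problem: chunks from NXACTS interleaved
 * transactions fill fixed-size blocks in decode order, and a block can
 * only be released once every chunk on it has been freed.  This is NOT
 * PostgreSQL code; all sizes and counts are illustrative assumptions.
 */
#include <stdio.h>

#define NXACTS      64                  /* interleaved transactions */
#define NCHANGES    200000              /* decoded changes, assigned round-robin */
#define CHUNK_SIZE  320                 /* tuple data plus an assumed chunk header */
#define BLOCK_SIZE  (8 * 1024 * 1024)   /* 8MB, like SLAB_LARGE_BLOCK_SIZE */
#define CHUNKS_PER_BLOCK (BLOCK_SIZE / CHUNK_SIZE)
#define NBLOCKS ((NCHANGES + CHUNKS_PER_BLOCK - 1) / CHUNKS_PER_BLOCK)

int
main(void)
{
    int     used[NBLOCKS] = {0};    /* chunks handed out from each block */
    int     freed[NBLOCKS] = {0};   /* chunks freed again on each block */
    long    live_blocks = 0;
    long    live_bytes = 0;

    for (int i = 0; i < NCHANGES; i++)
    {
        int     b = i / CHUNKS_PER_BLOCK;   /* chunks fill blocks sequentially */

        used[b]++;
        if (i % NXACTS != 0)
            freed[b]++;                 /* every xact except xact 0 has committed */
        else
            live_bytes += CHUNK_SIZE;   /* xact 0 is still in progress */
    }

    /* A block is only returned once all of its chunks are free. */
    for (int b = 0; b < NBLOCKS; b++)
        if (freed[b] < used[b])
            live_blocks++;

    printf("live data: ~%ld kB, but %ld blocks (~%ld MB) are still held\n",
           live_bytes / 1024, live_blocks,
           live_blocks * (long) (BLOCK_SIZE / (1024 * 1024)));
    return 0;
}

In this toy model, roughly 1MB of still-needed data keeps about 64MB
of blocks allocated, which is the same shape as the 4GB vs. 256MB
behavior reported at the beginning of this thread.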

> It seems that this modification
> will fix a subset of use cases, but the issue still occurs for other use cases.
>
> I think that the size of a block may lead to differences in the number of
> transactions stored on the block. For example, before the modification, a block
> could store some changes of 10 transactions, but after the modification, a block
> may only store some changes of 3 transactions. I think this means that once
> these three transactions are committed, this block will be actually released.
> As a result, the probability of the block being actually released is increased
> after the modification.

In addition to that, I think the block size also affects how much
memory can be wasted by fragmentation: the larger the blocks, the more
fragmentation can accumulate within each block.
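
As a rough back-of-the-envelope sketch of that point (purely
illustrative numbers: it reuses the 514-block figure from the
MemoryContextStats output quoted above and assumes each block stays
pinned by a single live chunk), the memory that one live chunk can
keep allocated is bounded by the block size:

/*
 * Back-of-the-envelope comparison of worst-case retention per block size.
 * Purely illustrative: assumes the 514 blocks from the stats above are
 * each kept alive by a single chunk of a still-running transaction.
 */
#include <stdio.h>

int
main(void)
{
    long long   pinned_blocks = 514;
    long long   large_block = 8LL * 1024 * 1024;    /* SLAB_LARGE_BLOCK_SIZE, 8MB */
    long long   small_block = 8LL * 1024;           /* SLAB_DEFAULT_BLOCK_SIZE, 8kB */

    printf("8MB blocks: up to %lld MB kept allocated\n",
           pinned_blocks * large_block / (1024 * 1024));    /* 4112 MB */
    printf("8kB blocks: up to %lld kB kept allocated\n",
           pinned_blocks * small_block / 1024);             /* 4112 kB */
    return 0;
}

That bound shrinks by three orders of magnitude when going from 8MB to
8kB blocks, which is consistent with the drop to about 17MB observed
above with SLAB_DEFAULT_BLOCK_SIZE.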

> Additionally, I think that the parallelism of the test
> reproduce.sh is higher than that of the test 002_rb_memory_2.pl, which is also
> the reason why this modification only fixed the issue in the test
> 002_rb_memory_2.pl.

Could you elaborate on why higher parallelism could affect this memory
accounting issue more?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


