Re: Logical Replica ReorderBuffer Size Accounting Issues - Mailing list pgsql-bugs

From Masahiko Sawada
Subject Re: Logical Replica ReorderBuffer Size Accounting Issues
Date
Msg-id CAD21AoAa17DCruz4MuJ_5Q_-JOp5FmZGPLDa=M9d+Qzzg8kiBw@mail.gmail.com
In response to RE: Logical Replica ReorderBuffer Size Accounting Issues  ("Wei Wang (Fujitsu)" <wangw.fnst@fujitsu.com>)
Responses RE: Logical Replica ReorderBuffer Size Accounting Issues
List pgsql-bugs
On Tue, May 9, 2023 at 6:06 PM Wei Wang (Fujitsu)
<wangw.fnst@fujitsu.com> wrote:
>
> On Thu, May 9, 2023 at 9:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > On Fri, Jan 13, 2023 at 8:17 PM wangw.fnst@fujitsu.com
> > <wangw.fnst@fujitsu.com> wrote:
> > >
> > > On Thu, Jan 12, 2023 at 21:02 PM Alex Richman <alexrichman@onesignal.com>
> > wrote:
> > > > On Thu, 12 Jan 2023 at 10:44, wangw.fnst@fujitsu.com
> > > > <wangw.fnst@fujitsu.com> wrote:
> > > > > I think parallelism doesn't affect this problem. Because for a walsender, I
> > > > > think it will always read the wal serially in order. Please let me know if I'm
> > > > > missing something.
> > > > I suspect it's more about getting enough changes into the WAL quickly
> > enough
> > > > for walsender to not spend any time idle.  I suppose you could stack the deck
> > > > towards this by first disabling the subscription, doing the updates to spool a
> > > > bunch of changes in the WAL, then enabling the subscription again.  Perhaps
> > > > there is also some impact in the WAL records interleaving from the
> > concurrent
> > > > updates and making more work for the reorder buffer.
> > > > The servers I am testing on are quite beefy, so it might be a little harder to
> > > > generate sufficient load if you're testing locally on a laptop or something.
> > > >
> > > > > And I tried to use the table structure and UPDATE statement you said. But
> > > > > unfortunately I didn't catch 1GB or unexpected (I mean a lot size beyond
> > > > 256MB)
> > > > > usage in rb->tup_context. Could you please help me to confirm my test?
> > Here
> > > > is
> > > > > my test details:
> > > > Here's test scripts that replicate it for me: [1]
> > > > This is on 15.1, installed on debian-11, running on GCP n2-highmem-80
> > (IceLake)
> > > > /w 24x Local SSD in raid0.
> > >
> > > Thanks for the details you shared.
> > >
> > > Yes, I think you are right. I think I reproduced this problem as you suggested
> > > (Update the entire table in parallel). And I can reproduce this problem on both
> > > current HEAD and REL_15_1. The memory used in rb->tup_context can reach
> > 350M
> > > in HEAD and reach 600MB in REL_15_1.
> > >
> > > Here are my steps to reproduce:
> > > 1. Apply the attached diff patch to add some logs for confirmation.
> > > 2. Use the attached reproduction script to reproduce the problem.
> > > 3. Confirm the debug log that is output to the log file pub.log.
> > >
> > > After doing some research, I agree with the idea you mentioned before. I think
> > > this problem is caused by the implementation of the 'Generational allocator' or
> > > the way we use its API.
> >
> > I think there are two separate issues. One is a pure memory accounting
> > issue: since the reorderbuffer accounts for memory usage by calculating
> > the actual tuple size etc., it includes neither the chunk header size
> > nor the fragmentation within blocks. So I can understand why the output
> > of MemoryContextStats(rb->context) can be two or three times higher
> > than logical_decoding_work_mem and not match rb->size in some cases.
> >
> > However, it cannot explain the original issue that the memory usage
> > (reported by MemoryContextStats(rb->context)) reached 5GB in spite of
> > logical_decoding_work_mem being 256MB, which seems like a memory leak
> > bug or a case where we ignore the memory limit.
>
> Yes, I agree that the chunk header size or fragmentation within blocks may
> cause the allocated space to be larger than the accounted space. However, since
> these overheads are very small (please refer to [1] and [2]), I also don't think
> this is the cause of the original issue in this thread.
>
> I think that the cause of the original issue in this thread is the
> implementation of 'Generational allocator'.
> Please consider the following user scenario:
> The parallel execution of different transactions leads to very fragmented and
> interleaved WAL records for those transactions. Later, when the walsender
> serially decodes the WAL, chunks belonging to different transactions are
> stored on a single block in rb->tup_context. However, when a transaction ends,
> the chunks related to this transaction on the block are marked as free instead
> of being actually released. The block is only released when all chunks in the
> block are free, in other words, when all transactions occupying the block have
> ended. As a result, the chunks allocated by completed transactions remain
> unreleased on many blocks for a long time, and this issue occurs. I think this
> also explains why parallel execution triggers this issue more readily than
> serial execution of transactions.
> Please also refer to the analysis details of code in [3].

After some investigation, I don't think the implementation of the
generation allocator is problematic, but I agree that your scenario is
likely to explain the original issue. In particular, the output of
MemoryContextStats() shows:

          Tuples: 4311744512 total in 514 blocks (12858943 chunks);
6771224 free (12855411 chunks); 4304973288 used

First, since the total memory allocation was 4311744512 bytes in 514
blocks, we can see there were no special (oversized) blocks in the
context (514 * 8MB = 4311744512 bytes). Second, it shows that most
chunks were free (12855411 of 12858943 chunks) but most of the memory
was still counted as used (4304973288 of 4311744512 bytes), which means
that there were some in-use chunks at the tail of each block, i.e. most
blocks were fragmented. I've attached another test to reproduce this
behavior. In this test, the memory usage reaches up to almost 4GB.

One idea to deal with this issue is to choose the block sizes
carefully while measuring the performance, as this comment in
reorderbuffer.c suggests:

    /*
     * XXX the allocation sizes used below pre-date generation context's block
     * growing code.  These values should likely be benchmarked and set to
     * more suitable values.
     */
    buffer->tup_context = GenerationContextCreate(new_ctx,
                                                  "Tuples",
                                                  SLAB_LARGE_BLOCK_SIZE,
                                                  SLAB_LARGE_BLOCK_SIZE,
                                                  SLAB_LARGE_BLOCK_SIZE);

For example, when I used SLAB_DEFAULT_BLOCK_SIZE (8kB), the maximum
memory usage was about 17MB in the test.

>
> > My colleague Drew
> > Callahan helped me investigate this issue and we found out that in
> > ReorderBufferIterTXNInit() we restore all changes of the top
> > transaction as well as all subtransactions but we don't respect the
> > memory limit when restoring changes. It's normally not a problem since
> > each decoded change is not typically large and we restore up to 4096
> > changes. However, if the transaction has many subtransactions, we
> > restore 4096 changes for each subtransaction at once, temporarily
> > consuming much memory. This behavior can explain the symptoms of the
> > original issue, but one thing I'm unsure about is that Alex reported
> > that replacing MemoryContextAlloc() in ReorderBufferGetChange() with
> > malloc() resolved the issue. I think it can resolve/mitigate the
> > former issue (the memory accounting issue) but IIUC it cannot resolve
> > the latter one. There might be other factors in this problem. Anyway,
> > I've attached a simple reproducer that works as a tap test for
> > test_decoding.
> >
> > I think we need to discuss two things separately and should deal with
> > at least the latter problem.
>
> Yes, I agree that this is another separate issue we need to discuss.

If this scenario had happened, MemoryContextStats() would have shown
fewer free chunks than in-use chunks.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

