Re: Logical Replica ReorderBuffer Size Accounting Issues - Mailing list pgsql-bugs

From: Alex Richman
Subject: Re: Logical Replica ReorderBuffer Size Accounting Issues
Date:
Msg-id: CAMnUB3pGWcUL08fWB4QmO0+2yNBBckXq=ndyLoGAU+V_2WQaCg@mail.gmail.com
In response to: RE: Logical Replica ReorderBuffer Size Accounting Issues ("wangw.fnst@fujitsu.com" <wangw.fnst@fujitsu.com>)
Responses: RE: Logical Replica ReorderBuffer Size Accounting Issues ("wangw.fnst@fujitsu.com" <wangw.fnst@fujitsu.com>)
List: pgsql-bugs
On Tue, 10 Jan 2023 at 11:22, wangw.fnst@fujitsu.com <wangw.fnst@fujitsu.com> wrote:
> In summary, with the commit c6e0fe1f2a in master, the additional space
> allocated in the context is reduced. But I think this size difference seems to
> be inconsistent with what you meet. So I think the issues you meet seems not to
> be caused by the problem improved by this commit on master. How do you think?
Agreed - I see a few different places where rb->size can disagree with the allocation size, but nothing that would produce a delta of 200KB vs 7GiB.  I think the issue lies somewhere within the allocator itself (more below).
 
> If possible, could you please share which version of PG the issue occurs on,
> and could you please also try to reproduce the problem on master?
We run 15.1-1 in prod; I have been trying to reproduce the issue on that version as well.


So far I have a partial reproduction of the issue: populate a table with schema (id UUID PRIMARY KEY, data JSONB) with a few million rows, then run updates against it (I ran 16 of the following concurrently, each acting on 1/16th of the rows):
UPDATE test SET data = data || '{"test_0": "1", "test_1": "1", "test_2": "1", "test_3": "1", "test_4": "1", "test_5": "1", "test_6": "1", "test_7": "1", "test_8": "1", "test_9": "1", "test_a": "1", "test_b": "1", "test_c": "1", "test_d": "1", "test_e": "1", "test_f": "1"}' #- '{test_0}';
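For reference, the setup was along these lines (a sketch; the table name and row count match my test, while the seed document is illustrative - each of the 16 sessions restricted the UPDATE above to its own 1/16th of the ids):

CREATE TABLE test (id UUID PRIMARY KEY, data JSONB);

-- Populate with a few million rows of small JSONB documents
-- (gen_random_uuid() is in core from PG 13 onwards).
INSERT INTO test (id, data)
SELECT gen_random_uuid(), '{"test_0": "1"}'::jsonb
FROM generate_series(1, 5000000);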
This does cause walsender memory to grow to ~1GiB, even with logical_decoding_work_mem set to 256MB.  However it is not a perfect reproduction of the issue we see in prod: here rb->size does eventually reach 256MB and transactions start streaming, so walsender memory does not climb to the levels we see in prod.
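For anyone trying to reproduce this, one way to watch the Tuples context grow is to dump the walsenders' memory context stats into the server log (a sketch, assuming PG 14+ and a superuser; the pids come from pg_stat_activity):

-- Log the memory context breakdown of each walsender backend.
SELECT pid, pg_log_backend_memory_contexts(pid)
FROM pg_stat_activity
WHERE backend_type = 'walsender';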

I believe the updates in prod are large numbers of single-row updates, which may be the relevant difference, but I am having trouble generating enough update volume on the test systems to simulate it.  For a sense of scale, those update statements in prod produce ~6 million WAL records per minute according to pg_stat_statements.
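(That figure is based on the wal_records counter in pg_stat_statements, available since PG 13, sampled a minute apart and diffed; roughly this query:)

SELECT queryid, calls, wal_records, wal_bytes
FROM pg_stat_statements
ORDER BY wal_records DESC
LIMIT 5;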


To tie off the reorder buffer size question, I built a custom 15.1 from source with a patch that passes the tuple alloc_size from ReorderBufferGetTupleBuf through to ReorderBufferChangeSize, so that rb->size is accounted on the allocated size rather than on t_len.  This had no notable effect on rb->size relative to the process RSS, so I agree that the issue lies deeper, within the Generation memory context rather than the reorder buffer accounting.


Suspicious of GenerationAlloc, I then patched it to log its decision making, and found that it was disproportionately failing to allocate space from the freeblock or the keeper block, such that almost every Tuples context allocation was malloc'ing a new block.  I don't fully understand the structure of the allocator, but based on the logging, freeblock was consistently NULL and the keeper block had GenerationBlockFreeBytes() = 0.  I lack the postgres codebase understanding to investigate much further.  Incidentally, there is a fixup comment [1] which suggests the generation context sizes may be out of date.

Further to the memory allocation strangeness: I noticed there is a lot of heap fragmentation within the walsender processes.  After the spike in the Tuples memory context has returned to normal, the RSS of the process remains at its peak for ~10-20 minutes.  Inspecting core dumps of these processes with core_analyzer shows that all of that memory is actually free but is not being reclaimed because of heap fragmentation.  Perhaps fragmentation within the memory context, from the allocation sizes versus the block size used by GenerationAlloc, is also a factor.

Thanks,
- Alex.
 
