Thread: Re: Logical replication 'invalid memory alloc request size 1585837200' after upgrading to 17.5

On Mon, May 19, 2025 at 8:08 PM Duncan Sands
<duncan.sands@deepbluecap.com> wrote:
>
> PostgreSQL v17.5 (Ubuntu 17.5-1.pgdg24.04+1); Ubuntu 24.04.2 LTS (kernel
> 6.8.0); x86-64
>
> Good morning from DeepBlueCapital.  Soon after upgrading to 17.5 from 17.4, we
> started seeing logical replication failures with publisher errors like this:
>
>    ERROR:  invalid memory alloc request size 1196493216
>
> (the exact size varies).  Here is a typical log extract from the publisher:
>
> 2025-05-19 10:30:14 CEST [1348336-465] remote_production_user@blue DEBUG:
> 00000: write FB03/349DEF90 flush FB03/349DEF90 apply FB03/349DEF90 reply_time
> 2025-05-19 10:30:07.467048+02
> 2025-05-19 10:30:14 CEST [1348336-466] remote_production_user@blue LOCATION:
>   ProcessStandbyReplyMessage, walsender.c:2431
> 2025-05-19 10:30:14 CEST [1348336-467] remote_production_user@blue DEBUG:
> 00000: skipped replication of an empty transaction with XID: 207637565
> 2025-05-19 10:30:14 CEST [1348336-468] remote_production_user@blue CONTEXT:
> slot "jnb_production", output plugin "pgoutput", in the commit callback,
> associated LSN FB03/349FF938
> 2025-05-19 10:30:14 CEST [1348336-469] remote_production_user@blue LOCATION:
>   pgoutput_commit_txn, pgoutput.c:629
> 2025-05-19 10:30:14 CEST [1348336-470] remote_production_user@blue DEBUG:
> 00000: UpdateDecodingStats: updating stats 0x5ae1616c17a8 0 0 0 0 1 0 1 191
> 2025-05-19 10:30:14 CEST [1348336-471] remote_production_user@blue LOCATION:
>   UpdateDecodingStats, logical.c:1943
> 2025-05-19 10:30:14 CEST [1348336-472] remote_production_user@blue DEBUG:
> 00000: found top level transaction 207637519, with catalog changes
> 2025-05-19 10:30:14 CEST [1348336-473] remote_production_user@blue LOCATION:
>   SnapBuildCommitTxn, snapbuild.c:1150
> 2025-05-19 10:30:14 CEST [1348336-474] remote_production_user@blue DEBUG:
> 00000: adding a new snapshot and invalidations to 207616976 at FB03/34A1AAE0
> 2025-05-19 10:30:14 CEST [1348336-475] remote_production_user@blue LOCATION:
>   SnapBuildDistributeSnapshotAndInval, snapbuild.c:915
> 2025-05-19 10:30:14 CEST [1348336-476] remote_production_user@blue ERROR:
> XX000: invalid memory alloc request size 1196493216
>
> If I'm reading it right, things go wrong on the publisher while preparing the
> message, i.e. it's not a subscriber problem.
>

Right, I also think so.

> This particular instance was triggered by a large number of catalog
> invalidations: I dumped what I think is the relevant WAL with "pg_waldump -s
> FB03/34A1AAE0 -p 17/main/ --xid=207637519" and the output was a single long line:
>
...
...
>
> While it is long, it doesn't seem to merit allocating anything like 1GB of
> memory.  So I'm guessing that postgres is miscalculating the required size somehow.
>

We fixed a bug in commit 4909b38af0 to distribute invalidations at
transaction end to avoid data loss in certain cases, which could cause
such a problem. What I am wondering is that even prior to that commit,
we would eventually end up allocating the required memory for all of a
transaction's invalidations because of the repalloc in
ReorderBufferAddInvalidations, so why does it matter with this commit?
One possibility is that we now need such allocations for multiple
in-progress transactions. I'll think more about this. It would be
helpful if you could share more details about the workload, or, if
possible, a testcase or script with which we can reproduce this problem.

--
With Regards,
Amit Kapila.



On Wed, May 21, 2025 at 11:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 19, 2025 at 8:08 PM Duncan Sands
> <duncan.sands@deepbluecap.com> wrote:
> >
> > While it is long, it doesn't seem to merit allocating anything like 1GB of
> > memory.  So I'm guessing that postgres is miscalculating the required size somehow.
> >
>
> We fixed a bug in commit 4909b38af0 to distribute invalidations at
> transaction end to avoid data loss in certain cases, which could cause
> such a problem. What I am wondering is that even prior to that commit,
> we would eventually end up allocating the required memory for all of a
> transaction's invalidations because of the repalloc in
> ReorderBufferAddInvalidations, so why does it matter with this commit?
> One possibility is that we now need such allocations for multiple
> in-progress transactions.
>

I think the problem here is that when we are distributing
invalidations to a concurrent transaction, in addition to queuing the
invalidations as a change, we also copy the distributed invalidations
along with the original transaction's invalidations via repalloc in
ReorderBufferAddInvalidations. So, when there are many in-progress
transactions, each would try to copy all its accumulated invalidations
to the remaining in-progress transactions. This could lead to such an
increase in allocation request size. However, after queuing the
change, we don't need to copy it along with the original transaction's
invalidations. This is because the copy is only required when we don't
process any changes in cases like ReorderBufferForget(). I have
analyzed all such cases, and my analysis is as follows:

ReorderBufferForget()
------------------------------
It is okay not to perform the invalidations that we got from other
concurrent transactions during ReorderBufferForget. This is because
ReorderBufferForget executes invalidations when we skip the
transaction being decoded, as it is not from a database of interest.
So, we execute them only to invalidate the shared catalogs (see the
comment at the caller of ReorderBufferForget). It is sufficient to
execute such invalidations in the source transaction only, because the
transaction being skipped wouldn't have loaded anything into the shared
catalogs.

ReorderBufferAbort()
-----------------------------
ReorderBufferAbort() processes invalidations when it has already
streamed some changes. Whenever it streamed a change, it would have
processed the concurrent transactions' invalidation messages that
happened before the statement that led to streaming. That should be
sufficient for us.

Consider the following variant of the original case that required the
distribution of invalidations:
1) S1: CREATE TABLE d(data text not null);
2) S1: INSERT INTO d VALUES('d1');
3) S2: BEGIN; INSERT INTO d VALUES('d2');
4) S1: ALTER PUBLICATION pb ADD TABLE d;
5) S2: INSERT INTO unrelated_tab VALUES(1);
6) S2: ROLLBACK;
7) S2: INSERT INTO d VALUES('d3');
8) S1: INSERT INTO d VALUES('d4');

The problem with the sequence is that the insert from 3) could be
decoded *after* 4), in step 5), due to streaming, and that to decode the
insert (which happened before the ALTER) the catalog snapshot and
cache state are from *before* the ALTER TABLE. Because the transaction
started in 3) doesn't modify any catalogs, no invalidations are
executed after decoding it. The result could be that the cache looks
like it did at 3), not like after 4). However, this won't create a
problem, because while streaming at 5) we would execute the
invalidations from S1 due to the change queued as a
REORDER_BUFFER_CHANGE_INVALIDATION message in ReorderBufferAddInvalidations.

ReorderBufferInvalidate
--------------------------------
The reasoning is the same as for ReorderBufferForget(), as it executes
invalidations for the same purpose, but via a different function so as
to avoid cleaning up the buffer at the end.

XLOG_XACT_INVALIDATIONS
-------------------------------------------
While processing XLOG_XACT_INVALIDATIONS, we don't need invalidations
accumulated from other xacts because this is a special case to execute
invalidations from a particular command (DDL) in a transaction. It
won't build any cache, so it can't create any invalid state.
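
To make the size blow-up concrete, here is a toy model of the behaviour
described at the top of this mail (plain Python, deliberately not
PostgreSQL code; the per-message size, transaction count, and
invalidation counts are all invented). It contrasts "copy the
distributed invalidations into every in-progress transaction's array"
with "only queue them as changes":

# Toy model of how per-transaction invalidation arrays can grow when each
# ending transaction distributes its accumulated invalidations into the
# arrays of all other in-progress transactions (the repalloc in
# ReorderBufferAddInvalidations), versus only queuing them as changes.
# All numbers are made up for illustration.

MSG_SIZE = 16        # bytes per invalidation message (invented)
IN_PROGRESS = 10     # transactions being decoded at any one time
COMMITS = 25         # transactions that end during the run
OWN_INVALS = 100     # invalidations each ending transaction generates itself

def largest_request(copy_into_arrays):
    """Largest single invalidation-array allocation seen, in bytes."""
    arrays = [0] * IN_PROGRESS
    largest = OWN_INVALS * MSG_SIZE
    for i in range(COMMITS):
        committer = i % IN_PROGRESS
        accumulated = arrays[committer] + OWN_INVALS
        largest = max(largest, accumulated * MSG_SIZE)
        for other in range(IN_PROGRESS):
            if other == committer:
                continue
            if copy_into_arrays:
                # The receiver's array is reallocated to hold its own
                # messages plus everything the committer accumulated.
                arrays[other] += accumulated
                largest = max(largest, arrays[other] * MSG_SIZE)
            # Either way the messages are also queued as a change; the
            # queued change does not grow the receiver's array.
        arrays[committer] = 0    # slot reused by a fresh transaction
    return largest

print("copy into arrays:", largest_request(True), "bytes")   # far past 1GB
print("queue only:      ", largest_request(False), "bytes")  # stays at 1600

With copying, the accumulated counts roughly double with every
transaction that ends, so in this model the request size races past the
1GB palloc limit after a couple of dozen busy transactions; with
queue-only it stays bounded by each transaction's own invalidations.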

--
With Regards,
Amit Kapila.



Hi Amit and Shlok, thanks for thinking about this issue.  We are working on 
reproducing it in our test environment.  Since it seems likely to be related to 
our primary database being very busy with lots of concurrency and large 
transactions, we are starting by creating a streaming replication copy of our 
primary server (this copy to run 17.5, with the primary on 17.4), with the idea 
of then doing logical replication from the standby to see if we hit the same 
issue.  If so, that gives something to poke at, and we can move on to something 
better from there.

Best wishes, Duncan.

On 21/05/2025 07:48, Amit Kapila wrote:
> On Mon, May 19, 2025 at 8:08 PM Duncan Sands
> <duncan.sands@deepbluecap.com> wrote:
>>
>> PostgreSQL v17.5 (Ubuntu 17.5-1.pgdg24.04+1); Ubuntu 24.04.2 LTS (kernel
>> 6.8.0); x86-64
>>
>> Good morning from DeepBlueCapital.  Soon after upgrading to 17.5 from 17.4, we
>> started seeing logical replication failures with publisher errors like this:
>>
>>     ERROR:  invalid memory alloc request size 1196493216
>>
>> (the exact size varies).  Here is a typical log extract from the publisher:
>>
>> 2025-05-19 10:30:14 CEST [1348336-465] remote_production_user@blue DEBUG:
>> 00000: write FB03/349DEF90 flush FB03/349DEF90 apply FB03/349DEF90 reply_time
>> 2025-05-19 10:30:07.467048+02
>> 2025-05-19 10:30:14 CEST [1348336-466] remote_production_user@blue LOCATION:
>>    ProcessStandbyReplyMessage, walsender.c:2431
>> 2025-05-19 10:30:14 CEST [1348336-467] remote_production_user@blue DEBUG:
>> 00000: skipped replication of an empty transaction with XID: 207637565
>> 2025-05-19 10:30:14 CEST [1348336-468] remote_production_user@blue CONTEXT:
>> slot "jnb_production", output plugin "pgoutput", in the commit callback,
>> associated LSN FB03/349FF938
>> 2025-05-19 10:30:14 CEST [1348336-469] remote_production_user@blue LOCATION:
>>    pgoutput_commit_txn, pgoutput.c:629
>> 2025-05-19 10:30:14 CEST [1348336-470] remote_production_user@blue DEBUG:
>> 00000: UpdateDecodingStats: updating stats 0x5ae1616c17a8 0 0 0 0 1 0 1 191
>> 2025-05-19 10:30:14 CEST [1348336-471] remote_production_user@blue LOCATION:
>>    UpdateDecodingStats, logical.c:1943
>> 2025-05-19 10:30:14 CEST [1348336-472] remote_production_user@blue DEBUG:
>> 00000: found top level transaction 207637519, with catalog changes
>> 2025-05-19 10:30:14 CEST [1348336-473] remote_production_user@blue LOCATION:
>>    SnapBuildCommitTxn, snapbuild.c:1150
>> 2025-05-19 10:30:14 CEST [1348336-474] remote_production_user@blue DEBUG:
>> 00000: adding a new snapshot and invalidations to 207616976 at FB03/34A1AAE0
>> 2025-05-19 10:30:14 CEST [1348336-475] remote_production_user@blue LOCATION:
>>    SnapBuildDistributeSnapshotAndInval, snapbuild.c:915
>> 2025-05-19 10:30:14 CEST [1348336-476] remote_production_user@blue ERROR:
>> XX000: invalid memory alloc request size 1196493216
>>
>> If I'm reading it right, things go wrong on the publisher while preparing the
>> message, i.e. it's not a subscriber problem.
>>
> 
> Right, I also think so.
> 
>> This particular instance was triggered by a large number of catalog
>> invalidations: I dumped what I think is the relevant WAL with "pg_waldump -s
>> FB03/34A1AAE0 -p 17/main/ --xid=207637519" and the output was a single long line:
>>
> ...
> ...
>>
>> While it is long, it doesn't seem to merit allocating anything like 1GB of
>> memory.  So I'm guessing that postgres is miscalculating the required size somehow.
>>
> 
> We fixed a bug in commit 4909b38af0 to distribute invalidations at
> transaction end to avoid data loss in certain cases, which could cause
> such a problem. What I am wondering is that even prior to that commit,
> we would eventually end up allocating the required memory for all of a
> transaction's invalidations because of the repalloc in
> ReorderBufferAddInvalidations, so why does it matter with this commit?
> One possibility is that we now need such allocations for multiple
> in-progress transactions. I'll think more about this. It would be
> helpful if you could share more details about the workload, or, if
> possible, a testcase or script with which we can reproduce this problem.
> 




On Wed, May 21, 2025 at 4:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, May 21, 2025 at 11:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, May 19, 2025 at 8:08 PM Duncan Sands
> > <duncan.sands@deepbluecap.com> wrote:
> > >
> > > While it is long, it doesn't seem to merit allocating anything like 1GB of
> > > memory.  So I'm guessing that postgres is miscalculating the required size somehow.
> > >
> >
> > We fixed a bug in commit 4909b38af0 to distribute invalidations at
> > transaction end to avoid data loss in certain cases, which could cause
> > such a problem. What I am wondering is that even prior to that commit,
> > we would eventually end up allocating the required memory for all of a
> > transaction's invalidations because of the repalloc in
> > ReorderBufferAddInvalidations, so why does it matter with this commit?
> > One possibility is that we now need such allocations for multiple
> > in-progress transactions.
> >
>
> I think the problem here is that when we are distributing
> invalidations to a concurrent transaction, in addition to queuing the
> invalidations as a change, we also copy the distributed invalidations
> along with the original transaction's invalidations via repalloc in
> ReorderBufferAddInvalidations. So, when there are many in-progress
> transactions, each would try to copy all its accumulated invalidations
> to the remaining in-progress transactions. This could lead to such an
> increase in allocation request size.

I agree with this analysis.

> However, after queuing the
> change, we don't need to copy it along with the original transaction's
> invalidations. This is because the copy is only required when we don't
> process any changes in cases like ReorderBufferForget().

It seems that we also use the accumulated invalidation messages after
replaying or concurrently aborting a transaction, via
ReorderBufferExecuteInvalidations(). Do we need to consider such cases
too?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Wed, 21 May 2025 at 17:18, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear hackers,
>
> > I think the problem here is that when we are distributing
> > invalidations to a concurrent transaction, in addition to queuing the
> > invalidations as a change, we also copy the distributed invalidations
> > along with the original transaction's invalidations via repalloc in
> > ReorderBufferAddInvalidations. So, when there are many in-progress
> > transactions, each would try to copy all its accumulated invalidations
> > to the remaining in-progress transactions. This could lead to such an
> > increase in allocation request size. However, after queuing the
> > change, we don't need to copy it along with the original transaction's
> > invalidations. This is because the copy is only required when we don't
> > process any changes in cases like ReorderBufferForget(). I have
> > analyzed all such cases, and my analysis is as follows:
>
> Based on the analysis, I created a PoC which avoids the repalloc().
> Invalidation messages distributed by SnapBuildDistributeSnapshotAndInval() are
> not added to the transaction's invalidation list, only queued as changes, so the
> repalloc can be skipped. Also, the function distributes only the messages in the
> list, so received messages won't be sent on again.
>
> For now, a patch for PG17 is created for testing purposes. Duncan, can you apply
> it and confirm whether it solves the issue?
>
Hi,

I was able to reproduce the issue with the following test:

1. First begin 9 concurrent txns (BEGIN; INSERT into t1 values(11);).
2. In the 10th concurrent txn: perform 1000 DDLs (ALTER PUBLICATION ADD/DROP TABLE).
3. For each of the 9 concurrent txns, perform:
    i. 1000 DDLs
    ii. COMMIT;
    iii. BEGIN; INSERT into t1 values(11);
4. Perform steps 2 and 3 in a loop.

These steps reproduced the error:
2025-05-22 19:03:35.111 JST [63150] sub1 ERROR:  invalid memory alloc
request size 1555752832
2025-05-22 19:03:35.111 JST [63150] sub1 STATEMENT:  START_REPLICATION
SLOT "sub1" LOGICAL 0/0 (proto_version '4', streaming 'parallel',
origin 'any', publication_names '"pub1"')
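
To give an idea of the workload shape without the attachment (the
attached script is the authoritative reproducer), a rough Python sketch
is below. It assumes, hypothetically, a publisher with a table t1, a
publication pub1 on t1 with an active logical subscriber, and a second
table t2 (not initially in pub1) that can be repeatedly added to and
dropped from the publication; it needs psycopg2 and an adjusted DSN:

import psycopg2

DSN = "dbname=postgres"      # adjust to point at the publisher
N_SESSIONS = 10              # 9 long-lived txns + 1 "DDL" session
N_DDL = 1000                 # DDLs per transaction
LOOPS = 10                   # how many times to repeat steps 2 and 3

conns = [psycopg2.connect(DSN) for _ in range(N_SESSIONS)]
for c in conns:
    c.autocommit = False     # each execute() below runs inside a txn
cursors = [c.cursor() for c in conns]

def begin_with_insert(cur):
    # steps 1 and 3(iii): open a transaction that has changed t1
    cur.execute("INSERT INTO t1 VALUES (11)")

def run_ddls(cur):
    # steps 2 and 3(i): a burst of publication DDL, i.e. lots of
    # catalog invalidations in a single transaction
    for _ in range(N_DDL // 2):
        cur.execute("ALTER PUBLICATION pub1 ADD TABLE t2")
        cur.execute("ALTER PUBLICATION pub1 DROP TABLE t2")

for cur in cursors[:9]:
    begin_with_insert(cur)   # step 1: nine open transactions

for _ in range(LOOPS):       # step 4: loop
    run_ddls(cursors[9])     # step 2: DDL burst in the tenth session
    conns[9].commit()
    for i in range(9):       # step 3: roll the nine sessions over
        run_ddls(cursors[i])
        conns[i].commit()
        begin_with_insert(cursors[i])

(This only shows the shape of the workload; table names, loop counts,
and timings will need tuning to hit the error.)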

I have also attached the test script for the same.
Also, I tried to run the test with Kuroda-san's patch and it did not
reproduce the issue.

Thanks and Regards,
Shlok Kyal

On Thu, May 22, 2025 at 6:29 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Amit, Sawada-san,
>
> > Good point. After replaying the transaction, it doesn't matter because
> > we would have already executed the required invalidations while
> > processing REORDER_BUFFER_CHANGE_INVALIDATION messages. However, for
> > the concurrent abort case it could matter. See my analysis for the
> > same below:
> >
> > Simulation of concurrent abort
> > ------------------------------------------
> > 1) S1: CREATE TABLE d(data text not null);
> > 2) S1: INSERT INTO d VALUES('d1');
> > 3) S2: BEGIN; INSERT INTO d VALUES('d2');
> > 4) S2: INSERT INTO unrelated_tab VALUES(1);
> > 5) S1: ALTER PUBLICATION pb ADD TABLE d;
> > 6) S2: INSERT INTO unrelated_tab VALUES(2);
> > 7) S2: ROLLBACK;
> > 8) S2: INSERT INTO d VALUES('d3');
> > 9) S1: INSERT INTO d VALUES('d4');
>
> > The problem with the sequence is that the insert from 3) could be
> > decoded *after* 5), in step 6), due to streaming, and that to decode
> > the insert (which happened before the ALTER) the catalog snapshot and
> > cache state are from *before* the ALTER TABLE. Because the transaction
> > started in 3) doesn't actually modify any catalogs, no invalidations
> > are executed after decoding it. Now assume that while decoding the
> > insert from 4) we detect a concurrent abort: then the distributed
> > invalidations won't be executed, and if we don't have the accumulated
> > messages in txn->invalidations, the invalidations from step 5) won't
> > be performed either. Data loss can then occur in steps 8 and 9. This
> > is just a theory, so I could be missing something.
>
> I verified whether this is real, and succeeded in reproducing it. See the
> appendix for the detailed steps.
>
> > If the above turns out to be a problem, one idea for fixing it is that
> > for the concurrent abort case (both during streaming and for prepared
> > transaction's processing), we still check all the remaining changes
> > and process only the changes related to invalidations. This has to be
> > done before the current txn changes are freed via
> > ReorderBufferResetTXN->ReorderBufferTruncateTXN.
>
> I roughly implemented that part; PSA the updated version. One concern is whether we
> should consider the case where executing invalidations can cause an ereport(ERROR).
> If that happens, the walsender will exit at that point.
>

But, in the catch part, we are already executing invalidations:
...
/* make sure there's no cache pollution */
ReorderBufferExecuteInvalidations(txn->ninvalidations, txn->invalidations);
...

So, the behaviour should be the same.

--
With Regards,
Amit Kapila.



Dear Hayato Kuroda, thank you so much for working on this problem.  Your patch 
PG17-0001-Avoid-distributing-invalidation-messages-several-tim.patch solves the 
issue for me.  Without it I get an invalid memory alloc request error within 
about twenty minutes.  With your patch, 24 hours have passed with no errors.

Best wishes, Duncan.

On 21/05/2025 13:48, Hayato Kuroda (Fujitsu) wrote:
> Dear hackers,
> 
>> I think the problem here is that when we are distributing
>> invalidations to a concurrent transaction, in addition to queuing the
>> invalidations as a change, we also copy the distributed invalidations
>> along with the original transaction's invalidations via repalloc in
>> ReorderBufferAddInvalidations. So, when there are many in-progress
>> transactions, each would try to copy all its accumulated invalidations
>> to the remaining in-progress transactions. This could lead to such an
>> increase in allocation request size. However, after queuing the
>> change, we don't need to copy it along with the original transaction's
>> invalidations. This is because the copy is only required when we don't
>> process any changes in cases like ReorderBufferForget(). I have
>> analyzed all such cases, and my analysis is as follows:
> 
> Based on the analysis, I created a PoC which avoids the repalloc().
> Invalidation messages distributed by SnapBuildDistributeSnapshotAndInval() are
> not added to the transaction's invalidation list, only queued as changes, so the
> repalloc can be skipped. Also, the function distributes only the messages in the
> list, so received messages won't be sent on again.
>
> For now, a patch for PG17 is created for testing purposes. Duncan, can you apply
> it and confirm whether it solves the issue?
> 
> Best regards,
> Hayato Kuroda
> FUJITSU LIMITED
>