Re: BUG #18942: walsender memory allocation failure adding snapshot and invalidations to logical replica w/PG 16.9 - Mailing list pgsql-bugs
From: Masahiko Sawada
Subject: Re: BUG #18942: walsender memory allocation failure adding snapshot and invalidations to logical replica w/PG 16.9
Date:
Msg-id: CAD21AoD0Ex4AoGzHqSyTv=Bn9f2_y+eg9m6Kb_wwGxrQz4KKzA@mail.gmail.com
In response to: Re: BUG #18942: walsender memory allocation failure adding snapshot and invalidations to logical replica w/PG 16.9 (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses: Re: BUG #18942: walsender memory allocation failure adding snapshot and invalidations to logical replica w/PG 16.9
List: pgsql-bugs
On Fri, May 30, 2025 at 1:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, May 30, 2025 at 12:53 PM PG Bug reporting form
> <noreply@postgresql.org> wrote:
> >
> > The following bug has been logged on the website:
> >
> > Bug reference:      18942
> > Logged by:          Laurence Parry
> > Email address:      greenreaper@hotmail.com
> > PostgreSQL version: 16.9
> > Operating system:   Debian Linux bookworm (customised 6.11.10 kernel)
> > Description:
> >
> > We run a logical replication chain for the purpose of offloading certain
> > expensive queries, backup generation, and disaster recovery. Initially PG
> > 16.8 from the PostgreSQL project's DEB repo was used for both the primary and
> > the replica. The replica was upgraded to PG 16.9 without incident.
> > Performance of our primary was degraded following its own upgrade a few days
> > later. Logs showed the walsender cycling due to a failure to allocate 1.5GB
> > of memory:
> > ERROR,XX000,"invalid memory alloc request size
> > 1627040720",,,,,,"START_REPLICATION SLOT ""replica"" LOGICAL 298B/CB420338
> > (proto_version '4', streaming 'parallel', origin 'any', publication_names
> > '""replica""', binary 'true')",,"repalloc,
> > mcxt.c:1487","replica","walsender",,0"
> > This occurred a few seconds after:
> > "logical decoding found consistent point at 298B/CABD75E0","Logical decoding
> > will begin using saved snapshot.",,,,,"START_REPLICATION SLOT ""replica""
> > LOGICAL 298B/CB420338 (proto_version '4', streaming 'parallel', origin
> > 'any', publication_names '""replica""', binary 'true')",,"SnapBuildRestore,
> > snapbuild.c:2043","replica","walsender",,0
> > DEBUG3 log level gave these messages immediately prior to the memory
> > allocation failure (same microsecond):
> > "UpdateDecodingStats: updating stats 0x5610b1380120 0 0 0 1 1 264 1
> > 621710620",,,,,,,,"UpdateDecodingStats, logical.c:1926"
> > "found top level transaction 2650000463, with catalog
> > changes",,,,,,,,"SnapBuildCommitTxn, snapbuild.c:1136"
> > "adding a new snapshot and invalidations to 2650000421 at
> > 298B/CB45EAA8",,,,,,,,"SnapBuildDistributeSnapshotAndInval, snapbuild.c:901"
> > [MemoryContextAllocExtended, seemingly used via palloc/repalloc in e.g.
> > ReorderBufferAddInvalidations, normally only allows allocations of 1GB - this
> > does not appear to be end-user tuneable.
> > <https://github.com/postgres/postgres/blob/6aa33afe6da1ce5e6a9c68900727db4621f899cd/src/backend/utils/mmgr/mcxt.c#L1251>
> > ]
> > While debugging I ran out of time (SSD) to fix it and had to drop the
> > replica subscription and VACUUM FULL the category tables (bloated 20-60x,
> > likely due to retained temp/unlogged tables). I saved log messages and the
> > first two WAL files containing the highlighted transactions.
> > Today, after reindexing larger tables concurrently on the same 16.9 primary
> > to free up space, I tried to rebuild the replica on 17.5 using the command
> > originally used to build it [create subscription replica connection
> > 'postgresql://app:pass@hostname:port/dbname?sslmode=verify-ca&connect_timeout=5&application_name=replica'
> > publication replica with (binary, streaming = parallel)], and a copy of the
> > schema from the 16.9 replica. Almost immediately I ran into the same memory
> > allocation issue, but for 1.4GB instead ("invalid memory alloc request size
> > 1512643104"), suggesting that the problem relates to our workload/schema
> > rather than some transitory error.
> > I wonder if this issue relates to "Fix data loss in logical replication."
> > <https://github.com/postgres/postgres/commit/4909b38af034fa4d2c67c5c71fd8509f870c1695>
> > from April, which was back-patched to 16
> > <https://github.com/postgres/postgres/commit/9a2f8b4f01d5a40b8b080f6b108ae314a6730cec>
> > and triggers the "adding a new snapshot and invalidations" debug message
> > immediately prior to the memory allocation failure. The WAL dump contains
> > many long and repetitive catcache mentions.
> > The workload includes both temporary and unlogged tables being created and
> > destroyed on a regular basis (with a lifetime of ~15-30 minutes for the
> > unlogged tables, and roughly ~5,000 present at any time), used for generated
> > gallery listings and search results. Probably not an ideal architecture;
> > it's not something we can quickly change.
> > The unlogged tables are not included in the publication, though a table
> > registering them is. Exclusion of these cache tables and reduction of the
> > resulting disk load on the HDD-based replica is one of the reasons logical
> > replication is preferable to physical replication for our use-case,
> > although we have used the latter before.
> > A log of the walsender from an attempt to re-enable replication is available
> > at:
> > https://www.dropbox.com/scl/fi/bhhots6haujnqv74y88ut/logical-walsender.csv.xz?rlkey=jqd5ezihuuc0oyzigz2z1gb7a&st=lvrlknb7&dl=1
> > XID 2650000463 had 92 changes spilt to disk and then one other change spilt
> > to disk with a different stats number:
> > "adding a new snapshot and invalidations to 2650000463 at 298B/CB45DEB0"
> > "spill 92 changes in XID 2650000463 to disk"
> > "UpdateDecodingStats: updating stats 0x5610b1380120 1 1 621710356 0 0 0 0 0"
> > [next two with temporary position 4/4693920]
> > "spill 1 changes in XID 2650000463 to disk"
> > "UpdateDecodingStats: updating stats 0x5610b1380120 0 1 264 0 0 0 0 0"
> > "UpdateDecodingStats: updating stats 0x5610b1380120 0 0 0 1 1 264 1
> > 621710620"
> > The output of waldump from 298B/CB3868B0 / 2650000420 to 298B/CB46C470 /
> > 2650000466 is available at
> > https://www.dropbox.com/scl/fi/88zbmtikbwrkjoqtg5y4c/logical-waldump.txt.xz?rlkey=r1fry9gfzlzntyc84hip6dl6y&e=1&st=ztm97lu4&dl=1
> > Modified server variables from the primary:
> > https://www.dropbox.com/scl/fi/0b3b00vr1hgm577t9sihj/logical-variables.txt?rlkey=7hxmbjzh5sk9d2ko6633mgkcj&st=n73e4y3p&dl=0
> > If any other information is desired, please let me know.
> >
> > Regards,
>
> Thank you for the report.
>
> It's likely facing the same problem as the one[1] reported before.
> We're discussing the right fix for the problem on that thread.
>

I pushed the fix for this issue yesterday[1].

Regards,

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d87d07b7ad3b782cb74566cd771ecdb2823adf6a

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
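
For context on the bracketed note about the 1GB cap: ordinary palloc()/repalloc() requests are limited to MaxAllocSize (0x3fffffff bytes, i.e. 1 GiB - 1, from src/include/utils/memutils.h), and anything larger fails with the exact "invalid memory alloc request size" error quoted in the report. Below is a minimal standalone sketch of that check, not the actual mcxt.c code; sketch_repalloc() and its fprintf()/exit() error path are illustrative stand-ins for the server's repalloc() and elog(ERROR, ...).

/*
 * Minimal standalone sketch (not PostgreSQL source) of the size cap behind
 * the "invalid memory alloc request size" error discussed above.  The
 * MaxAllocSize constant and AllocSizeIsValid() macro mirror
 * src/include/utils/memutils.h.
 */
#include <stdio.h>
#include <stdlib.h>

typedef size_t Size;

#define MaxAllocSize            ((Size) 0x3fffffff)  /* 1 GiB - 1 */
#define AllocSizeIsValid(size)  ((Size) (size) <= MaxAllocSize)

/* Stand-in for repalloc(): ordinary (non-"huge") requests above the cap fail. */
static void *
sketch_repalloc(void *ptr, Size size)
{
    if (!AllocSizeIsValid(size))
    {
        /* In the walsender this is an ERROR, so the decoding cycle restarts. */
        fprintf(stderr, "invalid memory alloc request size %zu\n", size);
        exit(EXIT_FAILURE);
    }
    return realloc(ptr, size);
}

int
main(void)
{
    void *buf = malloc(1024);

    /* The ~1.5GB request from the report exceeds MaxAllocSize and is rejected. */
    buf = sketch_repalloc(buf, (Size) 1627040720);

    free(buf);
    return 0;
}

Only the "huge" allocation variants (e.g. requests made with the MCXT_ALLOC_HUGE flag through MemoryContextAllocExtended) may exceed this cap; the failing call in the report came through ordinary repalloc (mcxt.c:1487), so the ~1.5GB and ~1.4GB requests were rejected outright.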