Thread: BUG #18484: "Cannot enlarge string buffer" during parallel execution of prepared statement/partitioning

The following bug has been logged on the website:

Bug reference:      18484
Logged by:          Michael Banck
Email address:      michael.banck@netapp.com
PostgreSQL version: 16.3
Operating system:   Debian bookworm
Description:

A customer reported to us that a query/prepared statement on their database
aborted with

ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 1073741812 bytes by 32 more
bytes.

The underlying database schema was migrated from Oracle ("no problems
there") and the base table is two-level partitioned: range-partitioned
into one partition per day (currently 135), and each of those is
list-partitioned into around 50 sub-partitions. Each partition has
around 20 indexes.

This leads to around 20k tables and 400k indexes in the schema.

The query consists of a simple SELECT on the table with a lot of OR
conditions like

((((A_FOO = $1) OR (B_FOO = $2)) AND ((START_TIME <= $3) AND (END_TIME >
$4))) 
OR (((A_FOO = $5) OR (B_FOO = $6)) AND ((START_TIME <= $7) AND (END_TIME >
$8))) 
OR (((A_FOO = $9) OR (B_FOO = $10)) AND ((START_TIME <= $11) AND (END_TIME >
$12))) 
OR (((A_FOO = $13) OR (B_FOO = $14)) AND ((START_TIME <= $15) AND (END_TIME
> $16)) 

Overall there are around 100 parameters. The explain plan shows around 15k
Parallel Seq Scans on the various partitions and is around a million lines
long (in order to get the explain, max_connections and
max_locks_per_transaction need to be bumped up considerably from the
defaults).

They shared the schema with us and we were able to reproduce the error on
the empty schema.  If we set max_parallel_workers_per_gather=0, the prepared
statement executes fine. Also, testing this on 17beta1 did not throw the
error.

The backtrace is:

#0  errfinish (filename=0x55feb1d28100 "./build/../src/common/stringinfo.c",
lineno=303, funcname=0x55feb1d28170 <__func__.0> "enlargeStringInfo") at
./build/../src/backend/utils/error/elog.c:480
#1  0x000055feb16c7292 in enlargeStringInfo (str=str@entry=0x7fff222fe4b0,
needed=32) at ./build/../src/common/stringinfo.c:303
#2  0x000055feb1b9c666 in appendStringInfo (str=str@entry=0x7fff222fe4b0,
fmt=fmt@entry=0x55feb1c9e3fa " :varlevelsup %u") at
./build/../src/common/stringinfo.c:110
#3  0x000055feb18e1810 in _outVar (node=0x560304a6c420, str=0x7fff222fe4b0)
at ./build/../src/backend/nodes/outfuncs.funcs.c:110
#4  outNode (str=0x7fff222fe4b0, obj=0x560304a6c420) at
./build/../src/backend/nodes/outfuncs.switch.c:31
#5  0x000055feb18e4626 in _outList (str=0x7fff222fe4b0, node=0x560304a6c3d8)
at ./build/../src/backend/nodes/outfuncs.c:294
#6  0x000055feb18e0b4a in outNode (str=<optimized out>, obj=<optimized out>)
at ./build/../src/backend/nodes/outfuncs.c:843
#7  0x000055feb18e528c in _outOpExpr (str=str@entry=0x7fff222fe4b0,
node=node@entry=0x560304a6c390) at
./build/../src/backend/nodes/outfuncs.funcs.c:239
#8  0x000055feb18e187d in outNode (str=0x7fff222fe4b0, obj=0x560304a6c390)
at ./build/../src/backend/nodes/outfuncs.switch.c:58
#9  0x000055feb18e4626 in _outList (str=0x7fff222fe4b0, node=0x560304a6c348)
at ./build/../src/backend/nodes/outfuncs.c:294
#10 0x000055feb18e0b4a in outNode (str=<optimized out>, obj=<optimized out>)
at ./build/../src/backend/nodes/outfuncs.c:843
#11 0x000055feb18f6a81 in _outSeqScan (str=str@entry=0x7fff222fe4b0,
node=node@entry=0x560304a6a2e8) at
./build/../src/backend/nodes/outfuncs.funcs.c:4610
#12 0x000055feb18e119b in outNode (str=0x7fff222fe4b0, obj=0x560304a6a2e8)
at ./build/../src/backend/nodes/outfuncs.switch.c:961
#13 0x000055feb18e4626 in _outList (str=0x7fff222fe4b0, node=0x5602d5976a10)
at ./build/../src/backend/nodes/outfuncs.c:294
#14 0x000055feb18e0b4a in outNode (str=<optimized out>, obj=<optimized out>)
at ./build/../src/backend/nodes/outfuncs.c:843
#15 0x000055feb18f5dc5 in _outAppend (str=0x7fff222fe4b0,
node=0x5602d50a0598) at ./build/../src/backend/nodes/outfuncs.funcs.c:4485
#16 0x000055feb18e11eb in outNode (str=0x7fff222fe4b0, obj=0x5602d50a0598)
at ./build/../src/backend/nodes/outfuncs.switch.c:946
#17 0x000055feb18f50fd in _outPlannedStmt (str=str@entry=0x7fff222fe4b0,
node=node@entry=0x56030f2d1da8) at
./build/../src/backend/nodes/outfuncs.funcs.c:4360
#18 0x000055feb18e122b in outNode (str=0x7fff222fe4b0, obj=0x56030f2d1da8)
at ./build/../src/backend/nodes/outfuncs.switch.c:934
#19 0x000055feb18fc9e4 in nodeToString (obj=obj@entry=0x56030f2d1da8) at
./build/../src/backend/nodes/outfuncs.c:889
#20 0x000055feb18741f7 in ExecSerializePlan (estate=0x5602749b9010,
plan=<optimized out>) at
./build/../src/backend/executor/execParallel.c:218
#21 ExecInitParallelPlan (planstate=0x5602749b9620, estate=0x5602749b9010,
sendParams=0x0, nworkers=2, tuples_needed=-1) at
./build/../src/backend/executor/execParallel.c:633
#22 0x000055feb188b051 in ExecGather (pstate=0x5602749b9448) at
./build/../src/backend/executor/nodeGather.c:171
#23 0x000055feb18a1866 in ExecProcNode (node=0x5602749b9448) at
./build/../src/include/executor/executor.h:273
#24 ExecSort (pstate=0x5602749b9240) at
./build/../src/backend/executor/nodeSort.c:149
#25 0x000055feb1870c05 in ExecProcNode (node=0x5602749b9240) at
./build/../src/include/executor/executor.h:273
#26 ExecutePlan (execute_once=<optimized out>, dest=0x55feb2b3f6d8,
direction=<optimized out>, numberTuples=0, sendTuples=true,
operation=CMD_SELECT, use_parallel_mode=<optimized out>, 
    planstate=0x5602749b9240, estate=0x5602749b9010) at
./build/../src/backend/executor/execMain.c:1670
#27 standard_ExecutorRun (queryDesc=0x55feb2b4ddd0, direction=<optimized
out>, count=0, execute_once=<optimized out>) at
./build/../src/backend/executor/execMain.c:365
#28 0x000055feb1a18c9e in PortalRunSelect
(portal=portal@entry=0x55feb2bdcf88, forward=forward@entry=true, count=0,
count@entry=9223372036854775807, dest=dest@entry=0x55feb2b3f6d8)
    at ./build/../src/backend/tcop/pquery.c:924
#29 0x000055feb1a1a130 in PortalRun (portal=portal@entry=0x55feb2bdcf88,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=false,
run_once=run_once@entry=true, dest=dest@entry=0x55feb2b3f6d8, 
    altdest=altdest@entry=0x55feb2b3f6d8, qc=<optimized out>) at
./build/../src/backend/tcop/pquery.c:768
#30 0x000055feb1825abc in ExecuteQuery (pstate=<optimized out>,
stmt=<optimized out>, intoClause=<optimized out>, params=<optimized out>,
dest=<optimized out>, qc=<optimized out>)
    at ./build/../src/backend/commands/prepare.c:258
#31 0x000055feb1a1b748 in standard_ProcessUtility (pstmt=0x55feb2afc048, 
    queryString=0x55feb2afab30 "execute t (\n'262010000000001',
'262010000000001', '20100621131707',
'20090420140642',\n'262010000000002','262010000000002', '20090420140642',
'20081209084155', \n'262010000000003',  '262010000000003', "...,
readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
queryEnv=0x0, dest=<optimized out>, qc=<optimized out>) at
./build/../src/backend/tcop/utility.c:757
#32 0x000055feb1a19bb1 in PortalRunUtility
(portal=portal@entry=0x55feb2bdce80, pstmt=0x55feb2afc048,
isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=true, 
    dest=dest@entry=0x55feb2b3f6d8, qc=qc@entry=0x7fff222fe9d0) at
./build/../src/backend/tcop/pquery.c:1158
#33 0x000055feb1a19f77 in FillPortalStore
(portal=portal@entry=0x55feb2bdce80, isTopLevel=isTopLevel@entry=true) at
./build/../src/backend/tcop/pquery.c:1031
#34 0x000055feb1a1a2bf in PortalRun (portal=portal@entry=0x55feb2bdce80,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true, dest=dest@entry=0x55feb2afc150, 
    altdest=altdest@entry=0x55feb2afc150, qc=<optimized out>) at
./build/../src/backend/tcop/pquery.c:763
#35 0x000055feb1a164c5 in exec_simple_query (query_string=<optimized out>)
at ./build/../src/backend/tcop/postgres.c:1272
#36 0x000055feb1a17209 in PostgresMain (dbname=<optimized out>,
username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4654
#37 0x000055feb198f33c in BackendRun (port=0x55feb2b4e7f0) at
./build/../src/backend/postmaster/postmaster.c:4464
#38 BackendStartup (port=0x55feb2b4e7f0) at
./build/../src/backend/postmaster/postmaster.c:4192
#39 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1782
#40 0x000055feb1990328 in PostmasterMain (argc=argc@entry=5,
argv=argv@entry=0x55feb2ab7ef0) at
./build/../src/backend/postmaster/postmaster.c:1466
#41 0x000055feb16c763d in main (argc=5, argv=0x55feb2ab7ef0) at
./build/../src/backend/main/main.c:198

So I guess the problem is somewhere here:

ExecInitParallelPlan - src/backend/executor/execParallel.c:633
        /* Fix up and serialize plan to be sent to workers. */
        pstmt_data = ExecSerializePlan(planstate->plan, estate);

ExecSerializePlan - src/backend/executor/execParallel.c:218
        /* Return serialized copy of our dummy PlannedStmt. */
        return nodeToString(pstmt);

I agree that the overall schema architecture should be rethought and far
fewer partitions used (and they are looking into this), but as the error
only happens with a parallel plan, I am reporting it here in case there
is an easy fix, or at least maybe we can abort earlier with a better
error message?


On Wed, 29 May 2024 at 04:42, PG Bug reporting form
<noreply@postgresql.org> wrote:
> ERROR:  out of memory
> DETAIL:  Cannot enlarge string buffer containing 1073741812 bytes by 32 more
> bytes.

> I agree that the overall schema architecture should be rethought and much
> fewer partitions be used (and they are looking into this), but as the error
> does only happen with a parallel plan, I am reporting it here in case there
> is an easy fix for this, or at least maybe we can abort earlier with a
> better error message?

I think we should at least write a better error message.  "out of
memory" is misleading. I've attached a patch with some proposed better
wording.

The above will become:
ERROR:  string buffer exceeds maximum allowed length (1073741823)
DETAIL:  Cannot enlarge string buffer containing 1073741812 bytes by 32 more

I also see a few other places that report
ERRCODE_PROGRAM_LIMIT_EXCEEDED and display "out of memory". Namely:

pg_do_encoding_conversion
perform_default_encoding_conversion
pvsnprintf

We should probably consider fixing those too.

I understand this isn't the level of fix you'd like to see, but it
should at least stop people from thinking that installing more RAM
might help.

David

> On 30 May 2024, at 03:26, David Rowley <dgrowleyml@gmail.com> wrote:

> I think we should at least write a better error message.  "out of
> memory" is misleading. I've attached a patch with some proposed better
> wording.
>
> The above will become:
> ERROR:  string buffer exceeds maximum allowed length (1073741823)
> DETAIL:  Cannot enlarge string buffer containing 1073741812 bytes by 32 more

Should we use a more human readable unit than bytes in the errmsg as well?

--
Daniel Gustafsson




On Thu, 30 May 2024 at 18:25, Daniel Gustafsson <daniel@yesql.se> wrote:
>
> > On 30 May 2024, at 03:26, David Rowley <dgrowleyml@gmail.com> wrote:
> > The above will become:
> > ERROR:  string buffer exceeds maximum allowed length (1073741823)
> > DETAIL:  Cannot enlarge string buffer containing 1073741812 bytes by 32 more
>
> Should we use a more human readable unit than bytes in the errmsg as well?

I wondered about that too. There were a couple of reasons I didn't. 1)
It's easy to see that 1073741812 + 32 is greater than 1073741823. If
we divided those down by 1024 a few times then that calculation can no
longer be done. 2) I wanted to use MaxAllocSize rather than hard wire
anything into the string.  That left just doing something like
MaxAllocSize / 1024 / 1024. If you did that with the limit size and
the current string size, it would likely just turn into the same
number and make the complaint confusing. e.g:

ERROR:  string buffer exceeds maximum allowed length (1023MB)
DETAIL:  Cannot enlarge string buffer containing 1023MB by 32 bytes more

or

ERROR:  string buffer exceeds maximum allowed length (1023MB)
DETAIL:  Cannot enlarge string buffer containing 1023MB by 0MB more

I know the bytes are hard to read, but I just don't feel either of the
above are easier for a user to process. The first one gives them 2
different units that they need to convert between and the 2nd is
confusing because being unable to enlarge by zero seems like a strange
limitation.

David




> On 30 May 2024, at 11:10, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 30 May 2024 at 18:25, Daniel Gustafsson <daniel@yesql.se> wrote:
>>
>>> On 30 May 2024, at 03:26, David Rowley <dgrowleyml@gmail.com> wrote:
>>> The above will become:
>>> ERROR:  string buffer exceeds maximum allowed length (1073741823)
>>> DETAIL:  Cannot enlarge string buffer containing 1073741812 bytes by 32 more
>>
>> Should we use a more human readable unit than bytes in the errmsg as well?
>
> I wondered about that too. There were a couple of reasons I didn't. 1)
> It's easy to see that 1073741812 + 32 is greater than 1073741823. If
> we divided those down by 1024 a few times then that calculation can no
> longer be done. 2) I wanted to use MaxAllocSize rather than hard wire
> anything into the string.  That left just doing something like
> MaxAllocSize / 1024 / 1024. If you did that with the limit size and
> the current string size, it would likely just turn into the same
> number and make the complaint confusing. e.g:
>
> ERROR:  string buffer exceeds maximum allowed length (1023MB)
> DETAIL:  Cannot enlarge string buffer containing 1023MB by 32 bytes more
>
> or
>
> ERROR:  string buffer exceeds maximum allowed length (1023MB)
> DETAIL:  Cannot enlarge string buffer containing 1023MB by 0MB more
>
> I know the bytes are hard to read, but I just don't feel either of the
> above are easier for a user to process. The first one gives them 2
> different units that they need to convert between

Fair enough, mixing units probably won't be more readable.  The middle
ground might be a bit much:

ERROR:  string buffer exceeds maximum allowed length (1023MB)
DETAIL:  Cannot enlarge string buffer containing 1023MB (1073741823 bytes) by 32 bytes more

Did you omit bytes from the errmsg on purpose?  "length" of a string buffer can
be interpreted as the number of characters, and that wouldn't be true for wide
chars.

> and the 2nd is
> confusing because being unable to enlarge by zero seems like a strange
> limitation.

Yeah, that would no doubt be a source of endless bug reports and Stack Overflow
searches.

--
Daniel Gustafsson




On Thu, 30 May 2024 at 21:27, Daniel Gustafsson <daniel@yesql.se> wrote:
> ERROR:  string buffer exceeds maximum allowed length (1023MB)
> DETAIL:  Cannot enlarge string buffer containing 1023MB (1073741823 bytes) by 32 bytes more

That's better than the two alternatives I showed. The problem now is
that the actual limit (0x3fffffff) is 1 byte less than 1024MB, so
showing 1023MB is debatable: it's just 1 byte in a ~billion short.  I
kinda think we could avoid any future revisits from people concerned
about which direction we round by simply not rounding and showing
bytes.  It's not as if it's a message we expect
many people to see, but if they do see it, it's probably going to be
analyzed quite a bit. Having the highest level of detail for that
seems like a good idea.  I don't think it should leave any questions
other than "Why is Postgres trying to build such a long
string?". Of course, fixing that one is harder.

> Did you omit bytes from the errmsg on purpose?  "length" of a string buffer can
> be interpreted as the number of characters, and that wouldn't be true for wide
> chars.

Not on purpose. I do think (%zu bytes) would be better than (%zu).  I
only thought about that while typing "MB" in the 2nd email.

David



> On 30 May 2024, at 11:55, David Rowley <dgrowleyml@gmail.com> wrote:

> It's not as if it's a message we expect
> many people to see, but if they do see it, it's probably going to be
> analyzed quite a bit. Having the highest level of detail for that
> seems like a good idea.  I don't think it should leave any questions
> other than "Why is Postgres trying to build such a long
> string?".

Good point.

> Not on purpose. I do think (%zu bytes) would be better than (%zu).  I
> only thought about that while typing "MB" in the 2nd email.

+1

--
Daniel Gustafsson




Daniel Gustafsson <daniel@yesql.se> writes:
>> On 30 May 2024, at 11:55, David Rowley <dgrowleyml@gmail.com> wrote:
>> Not on purpose. I do think (%zu bytes) would be better than (%zu).  I
>> only thought about that while typing "MB" in the 2nd email.

> +1

+1 for "(%zu bytes)", and also for not rounding off.  This is a case
where that won't make it more intelligible.

            regards, tom lane



On Fri, 31 May 2024 at 03:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> +1 for "(%zu bytes)", and also for not rounding off.  This is a case
> where that won't make it more intelligible.

Here's a revised patch with "(%zu bytes)".

Unsure if this needs to wait until we branch or if it's fine to sneak
into v17.  I'll leave it til after the branch unless anyone feels
differently.

David

David Rowley <dgrowleyml@gmail.com> writes:
> Unsure if this needs to wait until we branch or if it's fine to sneak
> into v17.  I'll leave it til after the branch unless anyone feels
> differently.

While I'd not fight against committing now, I'd lean to waiting
for v18.  It's not a bug fix, and it's changing translatable
messages at a time when the translators don't need extra work.

            regards, tom lane



> On 31 May 2024, at 03:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> ..and it's changing translatable messages at a time when the translators don't need extra work.

This is a compelling argument for waiting for v18 I think.

--
Daniel Gustafsson