BUG #18484: "Cannot enlarge string buffer" during parallel execution of prepared statement/partitioning - Mailing list pgsql-bugs

From: PG Bug reporting form
Subject: BUG #18484: "Cannot enlarge string buffer" during parallel execution of prepared statement/partitioning
Date:
Msg-id: 18484-3e357ade5fe50e61@postgresql.org
List: pgsql-bugs
The following bug has been logged on the website:

Bug reference:      18484
Logged by:          Michael Banck
Email address:      michael.banck@netapp.com
PostgreSQL version: 16.3
Operating system:   Debian bookworm
Description:

A customer reported to us that a query/prepared statement on their database
aborted with

ERROR:  out of memory
DETAIL:  Cannot enlarge string buffer containing 1073741812 bytes by 32 more
bytes.

The underlying database schema was migrated from Oracle ("no problems
there"). The base table is two-level partitioned: one range partition per
day (currently 135), each of which is list-partitioned into around 50
sub-partitions. Each partition has around 20 indexes.

This leads to around 20k tables and 400k indexes in the schema.
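
For illustration, a minimal sketch of that layout (all table, column and
partition names here are made up; the real schema was only shared with us
privately):

CREATE TABLE calls (
    a_foo      text,
    b_foo      text,
    start_time timestamp,
    end_time   timestamp,
    region_id  integer          -- assumed list-partitioning key
) PARTITION BY RANGE (start_time);

-- one range partition per day ...
CREATE TABLE calls_20240601 PARTITION OF calls
    FOR VALUES FROM ('2024-06-01') TO ('2024-06-02')
    PARTITION BY LIST (region_id);

-- ... split into ~50 list sub-partitions, each carrying ~20 indexes
CREATE TABLE calls_20240601_r01 PARTITION OF calls_20240601
    FOR VALUES IN (1);
CREATE INDEX ON calls_20240601_r01 (a_foo, start_time);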

The query consists of a simple SELECT on the table with a long chain of OR
conditions like

((((A_FOO = $1) OR (B_FOO = $2)) AND ((START_TIME <= $3) AND (END_TIME >
$4))) 
OR (((A_FOO = $5) OR (B_FOO = $6)) AND ((START_TIME <= $7) AND (END_TIME >
$8))) 
OR (((A_FOO = $9) OR (B_FOO = $10)) AND ((START_TIME <= $11) AND (END_TIME >
$12))) 
OR (((A_FOO = $13) OR (B_FOO = $14)) AND ((START_TIME <= $15) AND (END_TIME
> $16)) 
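
As a rough sketch, using the made-up table from above (the real statement
has far more branches and different literal values), the prepared statement
looks something like:

PREPARE t AS
SELECT * FROM calls
WHERE ((a_foo = $1 OR b_foo = $2) AND start_time <= $3 AND end_time > $4)
   OR ((a_foo = $5 OR b_foo = $6) AND start_time <= $7 AND end_time > $8);

EXECUTE t ('262010000000001', '262010000000001',
           '2010-06-21 13:17:07', '2009-04-20 14:06:42',
           '262010000000002', '262010000000002',
           '2009-04-20 14:06:42', '2008-12-09 08:41:55');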

Overall there are around 100 parameters. The EXPLAIN plan shows around 15k
Parallel Seq Scans on the various partitions and is around a million lines
long (in order to get the EXPLAIN at all, max_connections and
max_locks_per_transaction need to be bumped up considerably from the
defaults).
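
For reference, something along these lines was needed before the plan could
even be produced (the values are illustrative, not a recommendation; both
settings require a server restart):

ALTER SYSTEM SET max_locks_per_transaction = 4096;
ALTER SYSTEM SET max_connections = 500;

The shared lock table has room for max_locks_per_transaction *
(max_connections + max_prepared_transactions) locks, and planning against
~20k relations takes a lock on each of them, which is why both knobs come
into play.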

They shared the schema with us and we were able to reproduce the error on
the empty schema.  If we set max_parallel_workers_per_gather=0, the prepared
statement executes fine. Also, testing this on 17beta1 did not throw the
error.
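
The session-level workaround, shown here for completeness:

SET max_parallel_workers_per_gather = 0;  -- no parallel plan, so the plan
                                          -- is never serialized for workers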

The backtrace is:

#0  errfinish (filename=0x55feb1d28100 "./build/../src/common/stringinfo.c",
lineno=303, funcname=0x55feb1d28170 <__func__.0> "enlargeStringInfo") at
./build/../src/backend/utils/error/elog.c:480
#1  0x000055feb16c7292 in enlargeStringInfo (str=str@entry=0x7fff222fe4b0,
needed=32) at ./build/../src/common/stringinfo.c:303
#2  0x000055feb1b9c666 in appendStringInfo (str=str@entry=0x7fff222fe4b0,
fmt=fmt@entry=0x55feb1c9e3fa " :varlevelsup %u") at
./build/../src/common/stringinfo.c:110
#3  0x000055feb18e1810 in _outVar (node=0x560304a6c420, str=0x7fff222fe4b0)
at ./build/../src/backend/nodes/outfuncs.funcs.c:110
#4  outNode (str=0x7fff222fe4b0, obj=0x560304a6c420) at
./build/../src/backend/nodes/outfuncs.switch.c:31
#5  0x000055feb18e4626 in _outList (str=0x7fff222fe4b0, node=0x560304a6c3d8)
at ./build/../src/backend/nodes/outfuncs.c:294
#6  0x000055feb18e0b4a in outNode (str=<optimized out>, obj=<optimized out>)
at ./build/../src/backend/nodes/outfuncs.c:843
#7  0x000055feb18e528c in _outOpExpr (str=str@entry=0x7fff222fe4b0,
node=node@entry=0x560304a6c390) at
./build/../src/backend/nodes/outfuncs.funcs.c:239
#8  0x000055feb18e187d in outNode (str=0x7fff222fe4b0, obj=0x560304a6c390)
at ./build/../src/backend/nodes/outfuncs.switch.c:58
#9  0x000055feb18e4626 in _outList (str=0x7fff222fe4b0, node=0x560304a6c348)
at ./build/../src/backend/nodes/outfuncs.c:294
#10 0x000055feb18e0b4a in outNode (str=<optimized out>, obj=<optimized out>)
at ./build/../src/backend/nodes/outfuncs.c:843
#11 0x000055feb18f6a81 in _outSeqScan (str=str@entry=0x7fff222fe4b0,
node=node@entry=0x560304a6a2e8) at
./build/../src/backend/nodes/outfuncs.funcs.c:4610
#12 0x000055feb18e119b in outNode (str=0x7fff222fe4b0, obj=0x560304a6a2e8)
at ./build/../src/backend/nodes/outfuncs.switch.c:961
#13 0x000055feb18e4626 in _outList (str=0x7fff222fe4b0, node=0x5602d5976a10)
at ./build/../src/backend/nodes/outfuncs.c:294
#14 0x000055feb18e0b4a in outNode (str=<optimized out>, obj=<optimized out>)
at ./build/../src/backend/nodes/outfuncs.c:843
#15 0x000055feb18f5dc5 in _outAppend (str=0x7fff222fe4b0,
node=0x5602d50a0598) at ./build/../src/backend/nodes/outfuncs.funcs.c:4485
#16 0x000055feb18e11eb in outNode (str=0x7fff222fe4b0, obj=0x5602d50a0598)
at ./build/../src/backend/nodes/outfuncs.switch.c:946
#17 0x000055feb18f50fd in _outPlannedStmt (str=str@entry=0x7fff222fe4b0,
node=node@entry=0x56030f2d1da8) at
./build/../src/backend/nodes/outfuncs.funcs.c:4360
#18 0x000055feb18e122b in outNode (str=0x7fff222fe4b0, obj=0x56030f2d1da8)
at ./build/../src/backend/nodes/outfuncs.switch.c:934
#19 0x000055feb18fc9e4 in nodeToString (obj=obj@entry=0x56030f2d1da8) at
./build/../src/backend/nodes/outfuncs.c:889
#20 0x000055feb18741f7 in ExecSerializePlan (estate=0x5602749b9010,
plan=<optimized out>) at
./build/../src/backend/executor/execParallel.c:218
#21 ExecInitParallelPlan (planstate=0x5602749b9620, estate=0x5602749b9010,
sendParams=0x0, nworkers=2, tuples_needed=-1) at
./build/../src/backend/executor/execParallel.c:633
#22 0x000055feb188b051 in ExecGather (pstate=0x5602749b9448) at
./build/../src/backend/executor/nodeGather.c:171
#23 0x000055feb18a1866 in ExecProcNode (node=0x5602749b9448) at
./build/../src/include/executor/executor.h:273
#24 ExecSort (pstate=0x5602749b9240) at
./build/../src/backend/executor/nodeSort.c:149
#25 0x000055feb1870c05 in ExecProcNode (node=0x5602749b9240) at
./build/../src/include/executor/executor.h:273
#26 ExecutePlan (execute_once=<optimized out>, dest=0x55feb2b3f6d8,
direction=<optimized out>, numberTuples=0, sendTuples=true,
operation=CMD_SELECT, use_parallel_mode=<optimized out>, 
    planstate=0x5602749b9240, estate=0x5602749b9010) at
./build/../src/backend/executor/execMain.c:1670
#27 standard_ExecutorRun (queryDesc=0x55feb2b4ddd0, direction=<optimized
out>, count=0, execute_once=<optimized out>) at
./build/../src/backend/executor/execMain.c:365
#28 0x000055feb1a18c9e in PortalRunSelect
(portal=portal@entry=0x55feb2bdcf88, forward=forward@entry=true, count=0,
count@entry=9223372036854775807, dest=dest@entry=0x55feb2b3f6d8)
    at ./build/../src/backend/tcop/pquery.c:924
#29 0x000055feb1a1a130 in PortalRun (portal=portal@entry=0x55feb2bdcf88,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=false,
run_once=run_once@entry=true, dest=dest@entry=0x55feb2b3f6d8, 
    altdest=altdest@entry=0x55feb2b3f6d8, qc=<optimized out>) at
./build/../src/backend/tcop/pquery.c:768
#30 0x000055feb1825abc in ExecuteQuery (pstate=<optimized out>,
stmt=<optimized out>, intoClause=<optimized out>, params=<optimized out>,
dest=<optimized out>, qc=<optimized out>)
    at ./build/../src/backend/commands/prepare.c:258
#31 0x000055feb1a1b748 in standard_ProcessUtility (pstmt=0x55feb2afc048, 
    queryString=0x55feb2afab30 "execute t (\n'262010000000001',
'262010000000001', '20100621131707',
'20090420140642',\n'262010000000002','262010000000002', '20090420140642',
'20081209084155', \n'262010000000003',  '262010000000003', "...,
readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
queryEnv=0x0, dest=<optimized out>, qc=<optimized out>) at
./build/../src/backend/tcop/utility.c:757
#32 0x000055feb1a19bb1 in PortalRunUtility
(portal=portal@entry=0x55feb2bdce80, pstmt=0x55feb2afc048,
isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=true, 
    dest=dest@entry=0x55feb2b3f6d8, qc=qc@entry=0x7fff222fe9d0) at
./build/../src/backend/tcop/pquery.c:1158
#33 0x000055feb1a19f77 in FillPortalStore
(portal=portal@entry=0x55feb2bdce80, isTopLevel=isTopLevel@entry=true) at
./build/../src/backend/tcop/pquery.c:1031
#34 0x000055feb1a1a2bf in PortalRun (portal=portal@entry=0x55feb2bdce80,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true, dest=dest@entry=0x55feb2afc150, 
    altdest=altdest@entry=0x55feb2afc150, qc=<optimized out>) at
./build/../src/backend/tcop/pquery.c:763
#35 0x000055feb1a164c5 in exec_simple_query (query_string=<optimized out>)
at ./build/../src/backend/tcop/postgres.c:1272
#36 0x000055feb1a17209 in PostgresMain (dbname=<optimized out>,
username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4654
#37 0x000055feb198f33c in BackendRun (port=0x55feb2b4e7f0) at
./build/../src/backend/postmaster/postmaster.c:4464
#38 BackendStartup (port=0x55feb2b4e7f0) at
./build/../src/backend/postmaster/postmaster.c:4192
#39 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1782
#40 0x000055feb1990328 in PostmasterMain (argc=argc@entry=5,
argv=argv@entry=0x55feb2ab7ef0) at
./build/../src/backend/postmaster/postmaster.c:1466
#41 0x000055feb16c763d in main (argc=5, argv=0x55feb2ab7ef0) at
./build/../src/backend/main/main.c:198

So I guess the problem is somewhere here:

ExecInitParallelPlan - src/backend/executor/execParallel.c:633
        /* Fix up and serialize plan to be sent to workers. */
        pstmt_data = ExecSerializePlan(planstate->plan, estate);

ExecSerializePlan - src/backend/executor/execParallel.c:218
        /* Return serialized copy of our dummy PlannedStmt. */
        return nodeToString(pstmt);

Presumably the plan serialized by nodeToString() for all those partition
scans exceeds the MaxAllocSize cap (just under 1 GB) on a single StringInfo
buffer, which a non-parallel plan never hits because nothing gets
serialized. I agree that the overall schema architecture should be
rethought and far fewer partitions used (and they are looking into this),
but as the error only happens with a parallel plan, I am reporting it here
in case there is an easy fix, or at least maybe we can abort earlier with a
better error message?

