
BUG #15923: Prepared statements take way too much memory.

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15923
Logged by:          Daniel Migowski
Email address:      dmigowski@ikoffice.de
PostgreSQL version: 11.2
Operating system:   Windows, Linux, doesn't matter
Description:

Hello,

Prepared statements take too much memory. This is a bug report I already
filed a few years ago as #14726
(https://www.postgresql.org/message-id/20170702090956.1469.41812%40wrigleys.postgresql.org),
but now I have proof and can provide a test case for you to simply
verify my assumptions.

I have a large query like https://explain.depesz.com/s/gN2, which results in
30MB query plans. 

To verify that this is true, I wrote a small script that prepares this query
(a simple SELECT * FROM myNotSoSimpleFatView) 250 times:

DO $$
DECLARE i int4;
BEGIN
    FOR i IN 1..250 LOOP
        RAISE NOTICE 'This is prepared statement %', i;
        EXECUTE 'PREPARE testxy_' || i || ' AS SELECT *, ''test_' || i || ''' FROM vw_report_salesinvoice WHERE salesinvoice_id = $1;';
    END LOOP;
END $$;

To reproduce, just insert your favorite view and primary key name after the
FROM and have a look at the memory consumption for yourself.

31540 postgres  20   0 1414452 976,9m  26816 R  99,6 12,2   0:11.11 postgres
  after a few queries
31540 postgres  20   0 2480276 1,903g  26816 R  99,9 24,3   0:23.10 postgres
  after 66 queries
31540 postgres  20   0 3824908 3,097g  26816 R  99,9 39,6   0:38.07 postgres
 after 100 queries
31540 postgres  20   0 5727036 4,786g  26816 R  99,9 61,2   0:59.04 postgres
 after 160 queries
...
31540 postgres  20   0 8646140 7,351g  19712 S   0,0 94,0   1:31.81 postgres
 after 250 queries     <- WTF, 7.5 whopping gigs of RAM just for a few
prepared statements? That's about 45MB for each query!!

PostgreSQL crashes regularly on my customers' servers, because I use automatic
prepared statements for queries that are run often. At least I thought that
would be a good idea.

Please note that this bug also affects others (for example
https://github.com/rails/rails/issues/14645, where they just stopped using
prepared statements altogether as a solution to their crashes). Most users
of an ORM enable prepared queries, not seeing that PostgreSQL just
isn't capable of handling them.

I have the problem on 9.5, and testing this on 11.2 still shows the same
behaviour. Please, please, someone have a look at where all that memory goes
and I will immediately roll out a new version to all of my customers. I love
your database and the stability, but this is definitely a point where you
can drastically improve!

Best and kindest regards,
Daniel Migowski


Re: BUG #15923: Prepared statements take way too much memory.

From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> Prepared Statements take too much memory.

I'm not really sure what you expect us to do about this.  If you
PREPARE a bunch of statements and never DEALLOCATE them, then yes
they're going to consume memory; maybe a lot of memory if the
plans are complex.

The only thing we could do to reduce the memory consumption is to
throw away the plans, which kind of defeats the purpose of a
prepared statement, wouldn't you say?  And even holding onto enough
information to re-create the plan later would consume some amount
of memory, so you will still have a problem if you keep making
prepared statements and never release them.

            regards, tom lane



AW: BUG #15923: Prepared statements take way too much memory.

From
Daniel Migowski
Date:
I never keep more than 250 prepared statements. Sadly, with 10 pooled connections to the database, this is too much to
handle for my poor servers.

But really... 45MB? The query has 132 nodes, and each takes up 300kb of memory? Do you store the table contents in that
thing? I don't want to be greedy, but might it be possible to store your prepared statements in a data structure that is
slightly less memory hungry?

Regards and many thanks ahead,
Daniel Migowski

-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Thursday, July 25, 2019 00:24
To: Daniel Migowski <dmigowski@ikoffice.de>
Cc: pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #15923: Prepared statements take way too much memory.

...



AW: BUG #15923: Prepared statements take way too much memory.

From
Daniel Migowski
Date:
I am considering trying PostgreSQL hacking to better understand the reasons for the memory consumption of query plans,
in order to maybe be able to reduce the memory used by them.

Given the current implementation, and what I learned from reading about memory allocation in the backend, I hope, in my
currently limited understanding of the backend, to find something like:

* MemoryContexts in which stuff is palloc'ed that is not really used anymore. I have read that usually you don't pfree
very much, assuming that the context will be free'd fully after some time anyway. In the case of prepared statements
that will be much later than the devs of the rewriter/planner had in mind.
* Alternate plans that will never be used anymore but whose allocated memory is still not freed.
* References to data like table and datatype definitions that are copied into the plan but are also copied multiple
times. Doesn't matter for small plans, but maybe for large ones where different subselects query the same tables.
Detecting such duplicated references would involve sweeping over a plan after it has been created, placing all stuff
into some kind of map and unifying the references to those definitions. The time spent on this might even speed up
executing the query, because less memory has to be touched when the plan is executed, potentially leading to better CPU
cache hit rates. This seems the most promising to me.
* Or maybe one could, if my previous assumption is true, instead of copying stuff into the plan, build a dictionary of
this stuff in the connection process and just link to entries in it. I assume that when the structure of datatypes
changes on the server the plans have to be discarded anyway, and you already know somehow which plans have to be
discarded and use which structures, so there already might be such a dictionary in place? If not, adding a refcount to
these structures in the dictionary might also increase the speed of dropping stuff, since one would know whether a
structure is referenced somewhere.

Because you know the code better: do you think it might be worthwhile to go for these things, or would it be a waste of
time in your eyes?

Regards,
Daniel Migowski


-----Original Message-----
From: Daniel Migowski
Sent: Thursday, July 25, 2019 00:32
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: pgsql-bugs@lists.postgresql.org
Subject: AW: BUG #15923: Prepared statements take way too much memory.


...



Re: AW: BUG #15923: Prepared statements take way too much memory.

From
Tom Lane
Date:
Daniel Migowski <dmigowski@ikoffice.de> writes:
> I am considering trying PostgreSQL hacking to better understand the reasons
> for the memory consumption of query plans, in order to maybe be able to
> reduce the memory used by them.

FWIW, I've thought for some time that we should invent a memory context
allocator that's meant for data that doesn't get realloc'd (much) after
first allocation, with lower overhead than aset.c has.  Such an allocator
would be ideal for plancache.c, and perhaps other use-cases such as
plpgsql function parsetrees.  IMV this would have these properties:

* Doesn't support retail pfree; to recover space you must destroy the
whole context.  We could just make pfree a no-op.  With the details
sketched below, repalloc would have to throw an error (because it would
not know the size of the old chunk), but I think that's OK for the
intended purpose.

* Minimum chunk header overhead, ie only the context pointer required by
the mmgr.c infrastructure.  In particular, don't store the chunk size.

* Request sizes are only rounded up to a MAXALIGN boundary, not to a
power of 2.

* All else could be pretty much the same as aset.c, except we'd not need
to have any freelist management.  I would imagine that callers might want
to use a smaller initial and maximum block size than is typical with
aset.c, to reduce end-of-context wastage, but that would be up to the
callers.  (Possibly, instead of freelists, we'd take the trouble to track
end-of-block space in more than one active block, so that we avoid
wasting such space when a large allocation doesn't quite fit.)

This wouldn't take very much new code, and it would potentially offer
up to perhaps 2X space savings compared to aset.c, thanks to cutting the
per-chunk overhead.  Whether that's enough to float your boat with respect
to cached plans isn't too clear, though.
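
To make the shape concrete, here's a minimal standalone sketch of such a
bump-style allocator (illustrative only; it even omits the per-chunk
context pointer that mmgr.c would actually require, and it uses plain
libc error handling):

#include <stdlib.h>

#define BUMP_MAXALIGN(len) (((len) + 7) & ~((size_t) 7)) /* assume 8-byte MAXALIGN */

typedef struct BumpBlock
{
    struct BumpBlock *next;     /* blocks can only be freed all at once */
    char       *freeptr;        /* next unused byte in this block */
    char       *endptr;         /* one past the end of this block */
} BumpBlock;

typedef struct BumpContext
{
    BumpBlock  *blocks;
    size_t      blocksize;      /* callers might pick this smaller than for aset.c */
} BumpContext;

static void *
bump_alloc(BumpContext *cxt, size_t size)
{
    BumpBlock  *blk = cxt->blocks;

    size = BUMP_MAXALIGN(size); /* round to MAXALIGN only, not to a power of 2 */
    if (blk == NULL || (size_t) (blk->endptr - blk->freeptr) < size)
    {
        size_t      alloc = sizeof(BumpBlock) +
            (size > cxt->blocksize ? size : cxt->blocksize);

        blk = malloc(alloc);
        if (blk == NULL)
            return NULL;
        blk->freeptr = (char *) blk + sizeof(BumpBlock);
        blk->endptr = (char *) blk + alloc;
        blk->next = cxt->blocks;
        cxt->blocks = blk;
    }
    blk->freeptr += size;
    return blk->freeptr - size; /* no chunk header: no pfree/repalloc on this */
}

static void
bump_destroy(BumpContext *cxt)
{
    while (cxt->blocks)
    {
        BumpBlock  *next = cxt->blocks->next;

        free(cxt->blocks);
        cxt->blocks = next;
    }
}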


A totally different idea is to make a variant version of copyObject
that is intended to produce a compact form of a node tree, and does
not create a separate palloc allocation for each node but just packs
them as tightly as it can in larger palloc chunks.  This could outperform
the no-pfree-context idea because it wouldn't need even context-pointer
overhead for each node.  (This relies on node trees being read-only to
whatever code is looking at them, which should be OK for finished plan
trees that are copied into the plancache; otherwise somebody might think
they could apply repalloc, GetMemoryChunkContext, etc to the nodes,
which'd crash.)  The stumbling block here is that nobody is gonna
tolerate maintaining two versions of copyfuncs.c, so you'd have to
find a way for a single set of copy functions to support this output
format as well as the traditional one.  (Alternatively, maybe we
could learn to autogenerate the copy functions from annotated struct
definitions; people have muttered about that for years but not done
anything.)
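
As a toy illustration of the packing idea (a made-up node type, and the
two-phase measure-then-pack shape, nothing like real copyfuncs.c code):

#include <stdlib.h>

typedef struct ToyNode
{
    int         value;
    struct ToyNode *left;
    struct ToyNode *right;
} ToyNode;

/* phase 1: how much memory will the whole tree need? */
static size_t
toy_tree_size(const ToyNode *n)
{
    if (n == NULL)
        return 0;
    return sizeof(ToyNode) + toy_tree_size(n->left) + toy_tree_size(n->right);
}

/* phase 2: copy nodes back-to-back, bumping a pointer within one chunk */
static ToyNode *
toy_pack(const ToyNode *n, char **freeptr)
{
    ToyNode    *copy;

    if (n == NULL)
        return NULL;
    copy = (ToyNode *) *freeptr;
    *freeptr += sizeof(ToyNode);
    copy->value = n->value;
    copy->left = toy_pack(n->left, freeptr);
    copy->right = toy_pack(n->right, freeptr);
    return copy;
}

static ToyNode *
toy_compact_copy(const ToyNode *root)
{
    char       *buf;
    char       *freeptr;

    if (root == NULL)
        return NULL;
    buf = malloc(toy_tree_size(root));  /* exactly one right-sized allocation */
    if (buf == NULL)
        return NULL;
    freeptr = buf;
    return toy_pack(root, &freeptr);
}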


Or you could imagine flattening the node tree into serialized text
(cf outfuncs.c/readfuncs.c) and compressing that, although I fear that
the overhead of doing so and recovering a node tree later would be
excessive.  Also, the serialized-text representation tends to be pretty
bulky, and I don't know if compression would win back enough to make it
all that much smaller than the existing in-memory representation.


> References to data like table and datatype definitions that are copied
> into the plan but are also copied multiple times. Doesn't matter for
> small plans, but maybe for large ones where different subselects query
> the same tables. Detecting such duplicated references would involve
> sweeping over a plan after it has been created, placing all stuff into
> some kind of map and unifying the references to those definitions. The
> time spend for this might even speed up executing the query because less
> memory has to be touched when the plan is executed potentially leading
> to better cpu cache hit rates. This seems the most promising to me.

TBH that sounds like a dead end to me.  Detecting duplicate subtrees would
be astonishingly expensive, and I doubt that there will be enough of them
to really repay the effort in typical queries.

            regards, tom lane



Re: AW: BUG #15923: Prepared statements take way too much memory.

From
Thomas Munro
Date:
On Fri, Jul 26, 2019 at 10:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> FWIW, I've thought for some time that we should invent a memory context
> allocator that's meant for data that doesn't get realloc'd (much) after
> first allocation, with lower overhead than aset.c has.  Such an allocator
> would be ideal for plancache.c, and perhaps other use-cases such as
> plpgsql function parsetrees.  IMV this would have these properties:
>
> * Doesn't support retail pfree; to recover space you must destroy the
> whole context.  We could just make pfree a no-op.  With the details
> sketched below, repalloc would have to throw an error (because it would
> not know the size of the old chunk), but I think that's OK for the
> intended purpose.

Yeah, I wondered the same thing (in a discussion about shared plan
caches, which are pretty hypothetical at this stage but related to the
OP's complaint):

https://www.postgresql.org/message-id/CA%2BhUKGLOivzO5KuBFP27_jYWi%2B_8ki-Y%2BrdXXJZ4kmqF4kae%2BQ%40mail.gmail.com

> A totally different idea is to make a variant version of copyObject
> that is intended to produce a compact form of a node tree, and does
> not create a separate palloc allocation for each node but just packs
> them as tightly as it can in larger palloc chunks.  This could outperform
> the no-pfree-context idea because it wouldn't need even context-pointer
> overhead for each node.  (This relies on node trees being read-only to
> whatever code is looking at them, which should be OK for finished plan
> trees that are copied into the plancache; otherwise somebody might think
> they could apply repalloc, GetMemoryChunkContext, etc to the nodes,
> which'd crash.)  The stumbling block here is that nobody is gonna
> tolerate maintaining two versions of copyfuncs.c, so you'd have to
> find a way for a single set of copy functions to support this output
> format as well as the traditional one.  (Alternatively, maybe we
> could learn to autogenerate the copy functions from annotated struct
> definitions; people have muttered about that for years but not done
> anything.)

Well, if you had a rip cord type allocator that just advances a
pointer through a chain of chunks and can only free all at once, you'd
get about the same layout with the regular copy functions anyway,
without extra complexity.  IIUC the only advantage of a
special compacting copyObject() would be that it could presumably get
precisely the right sized single piece of memory, but is that worth
the complexity of a two-phase design (IIUC you'd want recursive
tell-me-how-much-memory-you'll-need, then allocate all at once)?  A
chain of smaller chunks should be nearly as good (just a small amount
of wasted space in the final chunk).

(Auto-generated read/write/copy/eq functions would be a good idea anyway.)

>> References to data like table and datatype definitions that are copied
>> into the plan but are also copied multiple times. Doesn't matter for
>> ...
> TBH that sounds like a dead end to me.  Detecting duplicate subtrees would
> be astonishingly expensive, and I doubt that there will be enough of them
> to really repay the effort in typical queries.

Yeah, deduplicating common subplans may be too tricky and expensive.
I don't think it's that crazy to want to be able to share whole
plans between backends though.  Admittedly a lofty and far-off goal,
but I think we can get there eventually.  Having the same set of N
explicitly prepared plans in M pooled backend sessions is fairly
common.

-- 
Thomas Munro
https://enterprisedb.com



Re: BUG #15923: Prepared statements take way too much memory.

From
Andres Freund
Date:
Hi,

On 2019-07-25 18:13:33 -0400, Tom Lane wrote:
> FWIW, I've thought for some time that we should invent a memory context
> allocator that's meant for data that doesn't get realloc'd (much) after
> first allocation, with lower overhead than aset.c has.

Same.

Although I've wondered whether we'd actually want to tie that to a
separate memory context type, or instead add it as an option to set on a
context.  It seems like that could potentially reduce overhead, by
avoiding the need to have multiple contexts around.  E.g. something like
MemoryContextUseBulkAlloc().  For aset.c that could e.g. switch the
context's callbacks to ones where AllocSetAlloc() is much simpler (one
strategy for large allocations, for the rest don't manage a freelist,
and don't round up allocation sizes).  Not sure what's better.
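
Roughly this shape, with made-up names and trivial malloc-based bodies
standing in for the real aset.c machinery:

#include <stdlib.h>

typedef struct Ctx Ctx;

typedef struct CtxMethods
{
    void       *(*alloc) (Ctx *cxt, size_t size);
    void        (*free_p) (Ctx *cxt, void *ptr);
} CtxMethods;

struct Ctx
{
    const CtxMethods *methods;  /* only the method table is swapped */
};

static void *
normal_alloc(Ctx *cxt, size_t size)
{
    (void) cxt;
    return malloc(size);        /* stands in for freelist-managed AllocSetAlloc */
}

static void
normal_free(Ctx *cxt, void *ptr)
{
    (void) cxt;
    free(ptr);
}

static void *
bulk_alloc(Ctx *cxt, size_t size)
{
    (void) cxt;
    return malloc(size);        /* would bump-allocate: no freelist, no rounding */
}

static void
bulk_free(Ctx *cxt, void *ptr)
{
    (void) cxt;                 /* no-op: memory dies with the context */
    (void) ptr;
}

static const CtxMethods normal_methods = {normal_alloc, normal_free};
static const CtxMethods bulk_methods = {bulk_alloc, bulk_free};

/* the hypothetical switch: flip an existing context into bulk mode */
static void
MemoryContextUseBulkAlloc(Ctx *cxt)
{
    cxt->methods = &bulk_methods;
}

/* usage: Ctx cxt = {&normal_methods}; ... MemoryContextUseBulkAlloc(&cxt); */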



FWIW, here's a few steps towards tracing aset.c events:

perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'AllocSetAlloc context->name:string size:u64'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'AllocSetFree:3 context->name:string chunk->size:u64'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'AllocSetFree:18 context->name:string chunk->size:u64'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'AllocSetRealloc:9 context->name:string oldsize=chunk->size:u64 size:u64'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'AllocSetReset context->name:string'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'AllocSetDelete context->name:string'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'AllocSetContextCreateInternal name:string'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'raw_parser'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'raw_parser%return'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'pg_analyze_and_rewrite'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'pg_analyze_and_rewrite%return'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'planner%return'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'planner'

which then can be used with perf record to trace a backend's
allocations:

perf record $(for e in AllocSetContextCreateInternal AllocSetDelete AllocSetReset AllocSetAlloc AllocSetFree AllocSetRealloc AllocSetFree raw_parser raw_parser__return pg_analyze_and_rewrite pg_analyze_and_rewrite__return planner planner__return ; do echo "-e probe_postgres:$e"; done) -p 26963
 
<execute query>
ctrl-c

perf script

        postgres 26963 [000] 58737.065318:  probe_postgres:AllocSetContextCreateInternal: (55eba8540c23) name_string="TopTransactionContext"
        postgres 26963 [000] 58737.065331:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="TopMemoryContext" size_u64=480
        postgres 26963 [000] 58737.065345:                     probe_postgres:raw_parser: (55eba819f68c)
        postgres 26963 [000] 58737.065355:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=160
        postgres 26963 [000] 58737.065362:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=944
        postgres 26963 [000] 58737.065367:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=64
...
        postgres 26963 [000] 58737.067714:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=32
        postgres 26963 [000] 58737.067721:             probe_postgres:raw_parser__return: (55eba819f68c <- 55eba83f2254)
        postgres 26963 [000] 58737.067734:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="TopTransactionContext" size_u64=24
        postgres 26963 [000] 58737.067740:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="TopTransactionContext" size_u64=104
        postgres 26963 [000] 58737.067746:         probe_postgres:pg_analyze_and_rewrite: (55eba83f232b)
        postgres 26963 [000] 58737.067753:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=208
        postgres 26963 [000] 58737.067759:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=216
...
        postgres 26963 [000] 58737.074248:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=32
        postgres 26963 [000] 58737.074260:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=32
        postgres 26963 [000] 58737.074266: probe_postgres:pg_analyze_and_rewrite__return: (55eba83f232b <- 55eba83f2ae9)
        postgres 26963 [000] 58737.074273:                        probe_postgres:planner: (55eba8326c51)
        postgres 26963 [000] 58737.074280:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=128
        postgres 26963 [000] 58737.074305:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=536
        postgres 26963 [000] 58737.074316:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=48
...
        postgres 26963 [000] 58737.076820:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=16
        postgres 26963 [000] 58737.076827:                probe_postgres:AllocSetRealloc: (55eba85403f6) name="MessageContext" oldsize=32 size_u64=24
        postgres 26963 [000] 58737.076843:  probe_postgres:AllocSetContextCreateInternal: (55eba8540c23) name_string="inline_function"
        postgres 26963 [000] 58737.076854:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="inline_function" size_u64=179
        postgres 26963 [000] 58737.076861:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="inline_function" size_u64=48
        postgres 26963 [000] 58737.076869:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="inline_function" size_u64=40
        postgres 26963 [000] 58737.076876:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="inline_function" size_u64=16
        postgres 26963 [000] 58737.076883:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="inline_function" size_u64=8
        postgres 26963 [000] 58737.076891:                     probe_postgres:raw_parser: (55eba819f68c)
        postgres 26963 [000] 58737.076903:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="inline_function" size_u64=160
        postgres 26963 [000] 58737.076910:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="inline_function" size_u64=180
...
        postgres 26963 [000] 58737.091725:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=128
        postgres 26963 [000] 58737.091733:                probe_postgres:planner__return: (55eba8326c51 <- 55eba83f24ba)
        postgres 26963 [000] 58737.091740:                  probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=32
        postgres 26963 [000] 58737.091749:                   probe_postgres:AllocSetFree: (55eba853f4fc) name="TopTransactionContext" size=128
...

The query here was just \dt+.


It's possible that you'd need to change the offsets slightly, depending
on compiler options. One can see traceable lines with:
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --line AllocSetRealloc
and the available variables with:
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --vars AllocSetRealloc:9


Shouldn't be hard to parse that into something giving actual stats for
each memory context, and each query phase.
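
E.g. a quick C hack like this, fed the "perf script" output above on
stdin, would sum the AllocSetAlloc sizes per context (field layout
assumed to be exactly as printed above):

#include <stdio.h>
#include <string.h>

int
main(void)
{
    char        line[1024];
    char        names[128][64];
    long        totals[128];
    int         ncontexts = 0;

    while (fgets(line, sizeof(line), stdin))
    {
        char        name[64];
        long        size;
        char       *p;
        int         i;

        if (strstr(line, "probe_postgres:AllocSetAlloc") == NULL)
            continue;
        p = strstr(line, "name=\"");
        if (p == NULL ||
            sscanf(p, "name=\"%63[^\"]\" size_u64=%ld", name, &size) != 2)
            continue;
        for (i = 0; i < ncontexts; i++)
            if (strcmp(names[i], name) == 0)
                break;
        if (i == ncontexts)
        {
            if (ncontexts >= 128)
                continue;       /* table full, skip new contexts */
            strcpy(names[ncontexts], name);
            totals[ncontexts] = 0;
            ncontexts++;
        }
        totals[i] += size;
    }
    for (int i = 0; i < ncontexts; i++)
        printf("%-30s %ld bytes\n", names[i], totals[i]);
    return 0;
}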


> Such an allocator would be ideal for plancache.c, and perhaps other
> use-cases such as plpgsql function parsetrees.

Probably also for raw parsing. Bet we'd save more than we'd lose by not
freeing during it.


> IMV this would have these properties:
>
> * Doesn't support retail pfree; to recover space you must destroy the
> whole context.  We could just make pfree a no-op.  With the details
> sketched below, repalloc would have to throw an error (because it would
> not know the size of the old chunk), but I think that's OK for the
> intended purpose.

> * Minimum chunk header overhead, ie only the context pointer required by
> the mmgr.c infrastructure.  In particular, don't store the chunk size.

I'm not sure it's worth eliding the size. Will make it hard to adapt the
new type of context to new code - it's pretty hard to guarantee that
there's not one random utility function in some rarely executed path
that does need the size.


> (Possibly, instead of freelists, we'd take the trouble to track
> end-of-block space in more than one active block, so that we avoid
> wasting such space when a large allocation doesn't quite fit.)

Hm, doesn't aset.c already solve that by not putting the oversized block
at the head of the block list?



> A totally different idea is to make a variant version of copyObject
> that is intended to produce a compact form of a node tree, and does
> not create a separate palloc allocation for each node but just packs
> them as tightly as it can in larger palloc chunks.  This could outperform
> the no-pfree-context idea because it wouldn't need even context-pointer
> overhead for each node.

If done right, it'd probably also yield considerably higher spatial
locality in the memory layout.


> The stumbling block here is that nobody is gonna tolerate maintaining
> two versions of copyfuncs.c, so you'd have to find a way for a single
> set of copy functions to support this output format as well as the
> traditional one.  (Alternatively, maybe we could learn to autogenerate
> the copy functions from annotated struct definitions; people have
> muttered about that for years but not done anything.)

Yea, I think we'd have to autogenerate them to go for that.  I'm kinda
hoping that maybe John Naylor would take a look, after he made all our
lives quite a bit easier with the other catalog script work...

In the 'pipe dream' version of this, that script would be able to
generate code to adjust pointers after a memcpy... That'd make
copyObject() of such flattened trees much much cheaper.
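
For a tree packed densely into one buffer, that generated code could boil
down to roughly this (toy node type, purely illustrative):

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct FNode
{
    int         value;
    struct FNode *left;         /* NULL, or points into the same buffer */
    struct FNode *right;
} FNode;

/* copy a buffer of nnodes densely packed FNodes starting at 'old' */
static FNode *
copy_flat_tree(const FNode *old, size_t nnodes)
{
    size_t      bufsize = nnodes * sizeof(FNode);
    char       *newbuf = malloc(bufsize);
    ptrdiff_t   delta;

    if (newbuf == NULL)
        return NULL;
    memcpy(newbuf, old, bufsize);   /* one cheap block copy */
    delta = newbuf - (const char *) old;
    for (size_t i = 0; i < nnodes; i++)
    {
        /* this loop is what the script would generate per node type */
        FNode      *n = (FNode *) newbuf + i;

        if (n->left)
            n->left = (FNode *) ((char *) n->left + delta);
        if (n->right)
            n->right = (FNode *) ((char *) n->right + delta);
    }
    return (FNode *) newbuf;
}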

Greetings,

Andres Freund



Re: BUG #15923: Prepared statements take way too much memory.

From
Andres Freund
Date:
Hi,

On 2019-07-24 21:39:35 +0000, PG Bug reporting form wrote:
> I have the problem on 9.5, and testing this on 11.2 still shows the same
> behaviour. Please, please, someone have a look at where all that memory goes
> and I will immediately roll out a new version to all of my customers. I love
> your database and the stability, but this is definitely a point where you
> can drastically improve!

If you had a reproducible testcase, you'd be more likely to get useful
input... While there are some bigger architectural reasons at play here
too, the overhead you show here still seems a bit abnormal.

IMO bug reports aren't a great avenue for random improvement requests.

Greetings,

Andres Freund



Re: BUG #15923: Prepared statements take way too much memory.

From
Andres Freund
Date:
Hi,

On 2019-07-26 10:55:35 +1200, Thomas Munro wrote:
> IIUC the only advantage of a
> special compacting copyObject() would be that it could presumably get
> precisely the right sized single piece of memory, but is that worth
> the complexity of a two-phase design (IIUC you'd want recursive
> tell-me-how-much-memory-you'll-need, then allocate all at once)?  A
> chain of smaller chunks should be nearly as good (just a small amount
> of wasted space in the final chunk).

I was wondering if we somehow could have a 'header' for Node trees,
which'd keep track of their size when built incrementally. But it seems
like the amount of changes that'd require would be far too
extensive. We, I think, would have to pass down additional pointers in
too many places, not to even talk about the difficulty of tracking the
size precisely during incremental changes.

Although *if* we had it, we could probably avoid a lot of unnecessary
tree copying, by having copy-on-write + reference counting logic.


This kind of thing would be a lot easier to solve with C++ than C (or
well, just any modern-ish language)... We could have "smart" references
to points in the query tree that keep track both of where they are and
what the head of the whole tree is.  And in the immutable flattened tree
we could just use offsets instead of pointers, and resolve those
offsets within the accessor.
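
With plain C, the offset variant could look roughly like this (toy types,
just to show the accessor idea):

#include <stdint.h>

typedef struct OffNode
{
    int         value;
    int32_t     left_off;       /* byte offset from the tree base, 0 = no child */
    int32_t     right_off;
} OffNode;

/*
 * The accessor resolves offsets, so the whole flattened tree can be
 * memcpy'd (or mapped into another process) without pointer fixups.
 */
static inline const OffNode *
off_child(const char *base, int32_t off)
{
    return off ? (const OffNode *) (base + off) : NULL;
}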



> (Auto-generated read/write/copy/eq functions would be a good idea
> anyway.)

Indeed.


> >> References to data like table and datatype definitions that are copied
> >> into the plan but are also copied multiple times. Doesn't matter for
> >> ...
> > TBH that sounds like a dead end to me.  Detecting duplicate subtrees would
> > be astonishingly expensive, and I doubt that there will be enough of them
> > to really repay the effort in typical queries.
> 
> Yeah, deduplicating common subplans may be too tricky and expensive.

It'd sure be interesting to write a function that computes stats about
which types of nodes consume how much memory. Shouldn't be too hard.

I'm mildly suspecting that a surprisingly large overhead in complex
query trees comes from target lists that are the same in many parts of
the tree.
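
Sketched with toy tags instead of the real NodeTag enum, such a walker
could be as simple as:

#include <stddef.h>

typedef enum ToyTag
{
    T_ToyPlan,
    T_ToyTargetEntry,
    T_ToyVar,
    T_NUM_TOY_TAGS
} ToyTag;

typedef struct ToyStatNode
{
    ToyTag      tag;
    size_t      size;           /* real code would derive the size from the tag */
    struct ToyStatNode *left;   /* stand-ins for the per-type child links */
    struct ToyStatNode *right;
} ToyStatNode;

/* accumulate per-node-type byte totals over a whole tree */
static void
tally_node_memory(const ToyStatNode *n, size_t totals[T_NUM_TOY_TAGS])
{
    if (n == NULL)
        return;
    totals[n->tag] += n->size;
    tally_node_memory(n->left, totals);
    tally_node_memory(n->right, totals);
}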


Greetings,

Andres Freund



RE: BUG #15923: Prepared statements take way too much memory.

From
Daniel Migowski
Date:
Hello,

Thanks for all your replies; I'll try to combine the answers to the last mails on the subject.

------ In reply to Tom Lane -------

Creating a new type of MemoryContext (FlatMemoryContext?) sounds like a good idea to me, because it sounds minimally
invasive; I will try that. I believe that it doesn't need a maximum block size, does it? One could give a size for
blocks that are requested by malloc, and every time someone pallocs a larger area it is directly malloced and
returned. I saw that AllocSet uses an initial size which is doubled every time the previous size isn't sufficient
anymore. Should the new MemoryContext behave identically?

I wonder if this MemoryContext should make good use of a "finalize" function, which can be called after no work in the
context is expected anymore, and which reallocs the allocated blocks to the really used size so the C library can reuse
the unused memory. Maybe even AllocSet could profit from it as a first step. Checking the stats, the ratio of used vs.
malloced memory is at a low 60%.

> This wouldn't take very much new code, and it would potentially offer up
> to perhaps 2X space savings compared to aset.c, thanks to cutting the
> per-chunk overhead.  Whether that's enough to float your boat with respect
> to cached plans isn't too clear, though.

No, the boat will still sink, but it's a good start, and servers are less likely to crash.

> A totally different idea is to make a variant version of copyObject that
> is intended to produce a compact form of a node tree, and does not create
> a separate palloc allocation for each node but just packs them as tightly
> as it can in larger palloc chunks. ...

On the copyObject code that should be autogenerated: I thought a lot about how this would be possible with macros, but
couldn't think of anything, because one macro statement would have to produce source code in two distinct places (at
least in two functions). Is there similar code somewhere else in the source tree?

See below for more on that subject.

> ...serialization of node tree...

I also don't like serialization very much. I already rewrote the standard serialization mechanism in Java very
successfully, but I know about the problems in the case of graph-type object dependencies (which don't seem to exist in
node trees, because then the copyObject function would also have to cope with them, and it doesn't). But if everything
else fails in the end, it might be a good idea. However, I believe the rewritten query plan is many times the size of
the original node tree and is used often, so it might not be a good idea.

------ In reply to Thomas Munro ------

Regarding your comments on the variant version of copyObject: I also believe that a FlatMemoryContext would result in
nearly the same memory usage at first glance, except for the chunk headers Tom Lane also noticed. One should measure
how much space could really be saved when we don't need the chunk headers.

But I like the calculate-memory variant one would get when the copyObject code is generated. The two-phase variant
would not even be slower, because the overhead of pallocing a chunk for every node MUST be higher than a simple
calculation of the required space. This feels a bit like inlining the allocation logic into the copyObject code, and it
would be the best variant regarding speed AND size.

------ In reply to Andres Freund ------

> Yea, I think we'd have to autogenerate them to go for that.  I'm kinda
> hoping that maybe John Naylor would take a look, after he made all our
> lives quite a bit easier with the other catalog script work... In the
> 'pipe dream' version of this, that script would be able to generate code
> to adjust pointers after a memcpy... That'd make copyObject() of such
> flattened trees much cheaper.

Thanks for the input, I will have a look at it. Since copyObject on such trees is such a common operation, it might be
feasible to store the length of such a read-only tree somewhere within the tree itself, so one knows the exact length
of a node including all its subnodes, which "should" always be in a contiguous range of memory anyway, and we would
just have to apply an offset to all pointers within the copied area.

Or, when we come back to the two-pass idea of preallocating a large chunk of memory for the nodes, one could use the
same function to calculate the range on the fly.

I see the problem here that the current implementation of copyObject (which is used in many places in the code) would
have to be aware that the source nodes are allocated in a flat MemoryContext for this to work everywhere in the
codebase. This could be achieved if the chunk header for the nodes is not omitted (although the chunk header would not
be set by the MemoryContext but by the copyNode implementation for FlatMemoryContext itself; if the current memory
isn't flat, normal pallocs have to be done). Does code exist with the assumption that every node resides in its own
allocated chunk?

> It'd sure be interesting to write a function that computes stats about
> which types of nodes consume how much memory. Shouldn't be too hard. I'm
> mildly suspecting that a surprisingly large overhead in complex query
> trees comes from target lists that are the same in many parts of the tree.

This would be the second part of the refactoring. I was wondering if one could reduce the amount of data that is stored
in the tree in the first place by using references to existing structures instead.

------ Conclusion -------

* A FlatMemoryContext would reduce memory usage by 40%(?), and the current copyNode infrastructure could remain as is.
* Using a code generator for the nodes is commonly agreed to be a good idea (and the number of node types in nodes.h
makes me believe that the amount of hand-written code would drop significantly when using one).
* An autogenerated copyObjectFlat would be able to determine the total size of a node and its descendants and could
copy a tree with a single palloc, making FlatMemoryContext obsolete and even yielding a higher memory reduction.
* IF we keep the chunk header in FlatMemoryContext, then the "standard" copyNode would be able to fall back to a very
fast memcpy-plus-pointer-offset-change variant. The profit here might outweigh the few percent of memory we gain when
we skip the header. But then we have no memory gains vs. using the FlatMemoryContext alone.

So:

* A FlatMemoryContext is a good idea and I will try to implement it.
* Code generation could speed things up significantly, and a variant would only yield additional memory savings if we
skip the chunk headers (if that is allowed at all).
* There is more work to do afterwards, because the trees are still too big.

------------------------------------------------

Some more questions:
* Why are MemoryContextAllocZeroAligned and MemoryContextAllocZero the same function?
* I wonder if we could just drop the raw_parse_tree altogether for large statements once the query tree has been
generated. I know you need it to recreate the query tree in case the definition of tables etc. changes, but wouldn't it
be possible to recreate the raw_parse_tree from the query_string in that very rare event as well? The query_string is
already the most compact form of the raw_parse_tree, right?
* Does code exist with the assumption that every node resides in its own allocated chunk? (asked already above)
* Where can I find the existing code generation script? Is it called during make or before checkin?

------------------------------------------------

Thanks for reading through this. I sadly cannot provide an example, because that would mean including my company's
whole database schema, and I gave up after spending 30 minutes trying to clean it up enough; but the complexity of the
view can still be seen at https://explain.depesz.com/s/gN2

Regards,
Daniel Migowski



Re: BUG #15923: Prepared statements take way too much memory.

From
Thomas Munro
Date:
On Sun, Jul 28, 2019 at 11:14 PM Daniel Migowski <dmigowski@ikoffice.de> wrote:
> * Why are MemoryContextAllocZeroAligned and MemoryContextAllocZero the same function?

It's not really relevant to the discussion about memory usage, but since
you asked about MemoryContextAllocZeroAligned:

I noticed that stuff a while back when another project I follow was
tweaking related stuff.  The idea behind those functions is that we
have our own specialised memset()-like macros.
MemoryContextAllocZeroAligned() uses MemSetLoop(), while
MemoryContextAllocZero() uses MemSetAligned().  Those date back nearly
two decades and were apparently based on experiments done at the time.
The comments talk about avoiding function call overhead and
specialising for constant lengths, so I suppose we assumed that
compilers at the time didn't automatically know how to inline plain
old memset() calls.

These days, you'd think you could add "assume aligned" attributes to
all memory-allocation-like functions and then change a few key things
to be static inline so that the constants can be seen in the right
places, and get all of the same benefits (and probably more) for free
from the compiler's alignment analysis and inlining powers.

As for what benefits might actually be available, for constants it's
clear but for alignment, I'm doubtful.  When Clang 8 on amd64 inlines
a memset() with a known aligned destination (for various alignment
sizes) and constant size, it generates a bunch of movups instructions,
or with -march set to a modern arch, maybe vmovups, and some
variations depending on size.  Note 'u' (unaligned) instructions, not
'a' (aligned): that is, it doesn't even care that I said the
destination was aligned!  I found claims from a couple of Intel
sources that this sort of thing stopped making any difference to
performance in the Nehalem microarchitecture (2008).  It still matters
whether the data actually is aligned, but not whether you use the
instructions that tolerate misalignment, so there is apparently no
point in generating different code, and therefore, as far as memset()
goes, apparently no gain from annotating allocation functions as
returning aligned pointers.  As for other architectures or compilers,
I don't know.

For curiosity, here's an experimental patch that gets rid of the
MemSetXXX() stuff, adds some (useless?) annotations about alignment
and makes palloc0() and MemoryContextAllocZero() inline so that they
can benefit from inlining with a visible constant size in eg
newNode().  It didn't seem to do anything very interesting apart from
remove a few hundred lines of code, so I didn't get around to digging
further or sharing it earlier.  Or maybe I was just doing it wrong.
(In this patch it's using sizeof(long) which isn't enough to be
considered aligned for the wider move instructions, but I tried
various sizes when trying to trigger different codegen, without
success).
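
Reduced to a standalone toy (GCC/Clang builtins, a hypothetical function
name, not the actual patch), the idea looks like:

#include <stdlib.h>
#include <string.h>

/*
 * Promise 8-byte alignment to the optimizer and keep constant sizes
 * visible by inlining, so the compiler can specialize the memset().
 */
static inline void *
alloc_zero(size_t size)
{
    void       *p = malloc(size);

    if (p == NULL)
        return NULL;
    p = __builtin_assume_aligned(p, 8);
    memset(p, 0, size);         /* inlined/specialized when size is constant */
    return p;
}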


--
Thomas Munro
https://enterprisedb.com

Attachment

AW: BUG #15923: Prepared statements take way too much memory.

From
Daniel Migowski
Date:
Thanks for the insightful and interesting comments. 

-----Original Message-----
From: Thomas Munro <thomas.munro@gmail.com>
Sent: Monday, July 29, 2019 08:11
To: Daniel Migowski <dmigowski@ikoffice.de>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; Andres Freund <andres@anarazel.de>; pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #15923: Prepared statements take way too much memory.

...

Re: BUG #15923: Prepared statements take way too much memory.

From
Kyotaro Horiguchi
Date:
At Mon, 29 Jul 2019 18:11:04 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
<CA+hUKGLfa6ANa0vs7Lf0op0XBH05HE8SyX8NFhDyT7k2CHYLXw@mail.gmail.com>
> On Sun, Jul 28, 2019 at 11:14 PM Daniel Migowski <dmigowski@ikoffice.de> wrote:
> > * Why are MemoryContextAllocZeroAligned and MemoryContextAllocZero the same function?
> 
> It's not really relevant to the discussion about memory usage, but since
> you asked about MemoryContextAllocZeroAligned:
> 
> I noticed that stuff a while back when another project I follow was
> tweaking related stuff.  The idea behind those functions is that we
> have our own specialised memset()-like macros.
> MemoryContextAllocZeroAligned() uses MemSetLoop(), while
> MemoryContextAllocZero() uses MemSetAligned().  Those date back nearly
> two decades and were apparently based on experiments done at the time.
> The comments talk about avoiding function call overhead and
> specialising for constant lengths, so I suppose we assumed that
> compilers at the time didn't automatically know how to inline plain
> old memset() calls.
> 
> These days, you'd think you could add "assume aligned" attributes to
> all memory-allocation-like functions and then change a few key things
> to be static inline so that the constants can be seen in the right
> places, and get all of the same benefits (and probably more) for free
> from the compiler's alignment analysis and inlining powers.
> 
> As for what benefits might actually be available, for constants it's
> clear but for alignment, I'm doubtful.  When Clang 8 on amd64 inlines
> a memset() with a known aligned destination (for various alignment
> sizes) and constant size, it generates a bunch of movups instructions,
> or with -march set to a modern arch, maybe vmovups, and some
> variations depending on size.  Note 'u' (unaligned) instructions, not
> 'a' (aligned): that is, it doesn't even care that I said it the
> destination was aligned!  I found claims from a couple of Intel
> sources that this sort of thing stopped making any difference to
> performance in the Nehalem microarchitecture (2008).  It still matters
> whether the data actually is aligned, but not whether you use the
> instructions that tolerate misalignment, so there is apparently no
> point in generating different code, and therefore, as far as memset()
> goes, apparently no gain from annotating allocation functions as
> returning aligned pointers.  As for other architectures or compilers,
> I don't know.
> 
> For curiosity, here's an experimental patch that gets rid of the
> MemSetXXX() stuff, adds some (useless?) annotations about alignment
> and makes palloc0() and MemoryContextAllocZero() inline so that they
> can benefit from inlining with a visible constant size in eg
> newNode().  It didn't seem to do anything very interesting apart from
> remove a few hundred lines of code, so I didn't get around to digging
> further or sharing it earlier.  Or maybe I was just doing it wrong.
> (In this patch it's using sizeof(long) which isn't enough to be
> considered aligned for the wider move instructions, but I tried
> various sizes when trying to trigger different codegen, without
> success).

FWIW, I saw gcc/x64 -O2 emit "rep stosq" for memset() when a constant
length and a "not-aligned" pointer are passed. Of course aligned(8)
didn't make any difference at all.  It emitted a series of mov
instructions for smaller chunks, and always emitted "call memset" for
non-constant lengths.

That makes me think that the "aligned" attribute works in some
different way. Anyway, I haven't found any further explanation of
the hint beyond "the compiler considers alignment on the
variable/type/function"...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center