Thread: BUG #15923: Prepared statements take way too much memory.
The following bug has been logged on the website:

Bug reference:      15923
Logged by:          Daniel Migowski
Email address:      dmigowski@ikoffice.de
PostgreSQL version: 11.2
Operating system:   Windows, Linux, doesn't matter
Description:

Hello,

Prepared statements take too much memory. I already filed this bug a few years ago as #14726 (https://www.postgresql.org/message-id/20170702090956.1469.41812%40wrigleys.postgresql.org), but now I have proof and can provide a test case so you can simply verify my assumptions.

I have a large query like https://explain.depesz.com/s/gN2, which results in 30 MB query plans. To verify this I wrote a small script that prepares the query (a simple SELECT * FROM myNotSoSimpleFatView) 250 times:

DO $$
DECLARE
    i int4;
BEGIN
    FOR i IN 1..250 LOOP
        RAISE NOTICE 'This is prepared statement %', i;
        EXECUTE 'PREPARE testxy_'||i||' AS SELECT *, ''test_'||i||''' FROM vw_report_salesinvoice WHERE salesinvoice_id = $1;';
    END LOOP;
END $$;

To reproduce, just insert your favorite view and primary key name after the FROM and watch the memory consumption for yourself:

31540 postgres  20   0 1414452 976,9m  26816 R  99,6 12,2  0:11.11 postgres   after a few queries
31540 postgres  20   0 2480276 1,903g  26816 R  99,9 24,3  0:23.10 postgres   after 66 queries
31540 postgres  20   0 3824908 3,097g  26816 R  99,9 39,6  0:38.07 postgres   after 100 queries
31540 postgres  20   0 5727036 4,786g  26816 R  99,9 61,2  0:59.04 postgres   after 160 queries
...
31540 postgres  20   0 8646140 7,351g  19712 S   0,0 94,0  1:31.81 postgres   after 250 queries  <- WTF

7.5 whopping gigs of RAM just for a few prepared statements? That's about 45 MB for each query!

PostgreSQL crashes regularly on my customers' servers, because I use automatic prepared statements for queries that are run often. At least I thought that would be a good idea. Please note that this bug also affects others (see e.g. https://github.com/rails/rails/issues/14645, where they simply stopped using prepared statements altogether as a solution to their crashes). Most users of an ORM enable prepared queries, not seeing that PostgreSQL just isn't capable of handling them.

I have the problem on 9.5, and testing on 11.2 still shows the same behaviour. Please, please, someone have a look at where all that memory goes, and I will immediately roll out a new version to all of my customers. I love your database and its stability, but this is definitely a point where you can drastically improve!

Best and kindest regards,
Daniel Migowski
PG Bug reporting form <noreply@postgresql.org> writes:
> Prepared Statements take too much memory.

I'm not really sure what you expect us to do about this. If you PREPARE a bunch of statements and never DEALLOCATE them, then yes, they're going to consume memory; maybe a lot of memory if the plans are complex.

The only thing we could do to reduce the memory consumption is to throw away the plans, which kind of defeats the purpose of a prepared statement, wouldn't you say? And even holding onto enough information to re-create the plan later would consume some amount of memory, so you will still have a problem if you keep making prepared statements and never release them.

            regards, tom lane
I never keep more than 250 prepared statements. Sadly, with 10 pooled connections to the database, this is too much for my poor servers to handle.

But really... 45 MB? The query has 132 nodes, and each takes up 300 KB of memory? Do you store the table contents in that thing? I don't want to be greedy, but might it be possible to store prepared statements in a data structure that is slightly less memory hungry?

Regards and many thanks ahead,
Daniel Migowski
I am considering trying some PostgreSQL hacking to better understand the reasons for the memory consumption of query plans, in order to maybe reduce the memory they use. Given the current implementation, and what I have learned from reading about memory allocation in the backend, I hope, in my currently limited understanding of the backend, to find something like:

* MemoryContexts in which stuff is palloc'ed that is not really used anymore. I have read that usually you don't pfree very much, assuming that the context will be freed fully after some time anyway. In the case of prepared statements that will be much later than the devs of the rewriter/planner had in mind.

* Alternate plans that will never be used anymore but whose allocated memory is still not freed.

* References to data like table and datatype definitions that are copied into the plan, but are also copied multiple times. This doesn't matter for small plans, but maybe for large ones where different subselects query the same tables. Detecting such duplicated references would involve sweeping over a plan after it has been created, placing all the stuff into some kind of map and unifying the references to those definitions. The time spent on this might even speed up executing the query, because less memory has to be touched when the plan is executed, potentially leading to better CPU cache hit rates. This seems the most promising to me.

* Or maybe one could, if my previous assumption is true, instead of copying stuff into the plan, build a dictionary of this stuff in the connection process and just link to entries in it. I assume that when the structure of datatypes changes on the server the plans have to be discarded anyway, and you already know somehow which plans have to be discarded and which structures they use, so there already might be such a dictionary in place? If not, it might also speed up dropping stuff if a refcount were added to these structures in the dictionary, so one knows whether they are still referenced somewhere.

Because you know the code better: do you think it might be worthwhile to go for these things, or would it be a waste of time in your eyes?

Regards,
Daniel Migowski
Daniel Migowski <dmigowski@ikoffice.de> writes:
> I am considering trying some PostgreSQL hacking to better understand the
> reasons for the memory consumption of query plans, in order to maybe
> reduce the memory they use.

FWIW, I've thought for some time that we should invent a memory context allocator that's meant for data that doesn't get realloc'd (much) after first allocation, with lower overhead than aset.c has. Such an allocator would be ideal for plancache.c, and perhaps other use-cases such as plpgsql function parse trees. IMV this would have these properties:

* Doesn't support retail pfree; to recover space you must destroy the whole context. We could just make pfree a no-op. With the details sketched below, repalloc would have to throw an error (because it would not know the size of the old chunk), but I think that's OK for the intended purpose.

* Minimum chunk header overhead, ie only the context pointer required by the mmgr.c infrastructure. In particular, don't store the chunk size.

* Request sizes are only rounded up to a MAXALIGN boundary, not to a power of 2.

* All else could be pretty much the same as aset.c, except we'd not need any freelist management. I would imagine that callers might want to use a smaller initial and maximum block size than is typical with aset.c, to reduce end-of-context wastage, but that would be up to the callers. (Possibly, instead of freelists, we'd take the trouble to track end-of-block space in more than one active block, so that we avoid wasting such space when a large allocation doesn't quite fit.)

This wouldn't take very much new code, and it would potentially offer up to perhaps 2X space savings compared to aset.c, thanks to cutting the per-chunk overhead. Whether that's enough to float your boat with respect to cached plans isn't too clear, though.

A totally different idea is to make a variant version of copyObject that is intended to produce a compact form of a node tree, and does not create a separate palloc allocation for each node but just packs them as tightly as it can in larger palloc chunks. This could outperform the no-pfree-context idea because it wouldn't need even context-pointer overhead for each node. (This relies on node trees being read-only to whatever code is looking at them, which should be OK for finished plan trees that are copied into the plancache; otherwise somebody might think they could apply repalloc, GetMemoryChunkContext, etc to the nodes, which'd crash.) The stumbling block here is that nobody is gonna tolerate maintaining two versions of copyfuncs.c, so you'd have to find a way for a single set of copy functions to support this output format as well as the traditional one. (Alternatively, maybe we could learn to autogenerate the copy functions from annotated struct definitions; people have muttered about that for years but not done anything.)

Or you could imagine flattening the node tree into serialized text (cf outfuncs.c/readfuncs.c) and compressing that, although I fear that the overhead of doing so and recovering a node tree later would be excessive. Also, the serialized-text representation tends to be pretty bulky, and I don't know if compression would win back enough to make it all that much smaller than the existing in-memory representation.

> References to data like table and datatype definitions that are copied
> into the plan, but are also copied multiple times. This doesn't matter
> for small plans, but maybe for large ones where different subselects
> query the same tables. Detecting such duplicated references would
> involve sweeping over a plan after it has been created, placing all the
> stuff into some kind of map and unifying the references to those
> definitions. The time spent on this might even speed up executing the
> query, because less memory has to be touched when the plan is executed,
> potentially leading to better CPU cache hit rates. This seems the most
> promising to me.

TBH that sounds like a dead end to me. Detecting duplicate subtrees would be astonishingly expensive, and I doubt that there will be enough of them to really repay the effort in typical queries.

            regards, tom lane
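[For readers following the allocator discussion: below is a minimal standalone sketch of the "no retail pfree" bump-style allocator described above. It is not PostgreSQL code and not the design that was ultimately committed; all names (BumpContext, bump_alloc, etc.) are made up for illustration, malloc stands in for block allocation, and the per-chunk context pointer that mmgr.c would still require is deliberately left out to keep the sketch short.]

/*
 * Sketch of a bump allocator: chunks are carved off the current block by
 * advancing a pointer, requests are rounded only to an alignment boundary,
 * there are no freelists and no per-chunk size header, and the only way to
 * reclaim memory is to destroy the whole context.  Not PostgreSQL code.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUMP_ALIGN 8                    /* stand-in for MAXALIGN */
#define ALIGN_UP(sz) (((sz) + BUMP_ALIGN - 1) & ~((size_t) (BUMP_ALIGN - 1)))

typedef struct BumpBlock
{
    struct BumpBlock *next;             /* chain of blocks, newest first */
    char       *freeptr;                /* next free byte in this block */
    char       *endptr;                 /* one past the end of this block */
    /* usable space follows the header */
} BumpBlock;

typedef struct BumpContext
{
    size_t      blockSize;              /* size requested for new blocks */
    BumpBlock  *blocks;
} BumpContext;

static BumpContext *
bump_context_create(size_t blockSize)
{
    BumpContext *cxt = malloc(sizeof(BumpContext));

    cxt->blockSize = blockSize;
    cxt->blocks = NULL;
    return cxt;
}

static void *
bump_alloc(BumpContext *cxt, size_t size)
{
    BumpBlock  *block = cxt->blocks;
    void       *chunk;

    size = ALIGN_UP(size);

    if (block == NULL || (size_t) (block->endptr - block->freeptr) < size)
    {
        /*
         * Start a new block; oversized requests get a block of their own.
         * Tail space left in the previous block is simply wasted, which is
         * the end-of-block wastage mentioned in the mail above.
         */
        size_t  blksize = sizeof(BumpBlock) +
            (size > cxt->blockSize ? size : cxt->blockSize);

        block = malloc(blksize);
        block->next = cxt->blocks;
        block->freeptr = (char *) block + sizeof(BumpBlock);
        block->endptr = (char *) block + blksize;
        cxt->blocks = block;
    }

    chunk = block->freeptr;
    block->freeptr += size;
    return chunk;                       /* no chunk header: pfree/repalloc impossible */
}

static void
bump_context_delete(BumpContext *cxt)
{
    BumpBlock  *block = cxt->blocks;

    while (block != NULL)
    {
        BumpBlock  *next = block->next;

        free(block);
        block = next;
    }
    free(cxt);
}

int main(void)
{
    BumpContext *cxt = bump_context_create(8192);
    char       *s = bump_alloc(cxt, 6);

    memcpy(s, "hello", 6);
    printf("%s\n", s);
    bump_context_delete(cxt);           /* the only way to give memory back */
    return 0;
}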
On Fri, Jul 26, 2019 at 10:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> FWIW, I've thought for some time that we should invent a memory context
> allocator that's meant for data that doesn't get realloc'd (much) after
> first allocation, with lower overhead than aset.c has. Such an allocator
> would be ideal for plancache.c, and perhaps other use-cases such as
> plpgsql function parse trees. IMV this would have these properties:
>
> * Doesn't support retail pfree; to recover space you must destroy the
> whole context. We could just make pfree a no-op. With the details
> sketched below, repalloc would have to throw an error (because it would
> not know the size of the old chunk), but I think that's OK for the
> intended purpose.

Yeah, I wondered the same thing (in a discussion about shared plan caches, which are pretty hypothetical at this stage but related to the OP's complaint):

https://www.postgresql.org/message-id/CA%2BhUKGLOivzO5KuBFP27_jYWi%2B_8ki-Y%2BrdXXJZ4kmqF4kae%2BQ%40mail.gmail.com

> A totally different idea is to make a variant version of copyObject
> that is intended to produce a compact form of a node tree, and does
> not create a separate palloc allocation for each node but just packs
> them as tightly as it can in larger palloc chunks. This could outperform
> the no-pfree-context idea because it wouldn't need even context-pointer
> overhead for each node. The stumbling block here is that nobody is
> gonna tolerate maintaining two versions of copyfuncs.c, so you'd have to
> find a way for a single set of copy functions to support this output
> format as well as the traditional one. (Alternatively, maybe we
> could learn to autogenerate the copy functions from annotated struct
> definitions; people have muttered about that for years but not done
> anything.)

Well, if you had a rip-cord-type allocator that just advances a pointer through a chain of chunks and can only free all at once, you'd get about the same layout with the regular copy functions anyway, without extra complexity. IIUC the only advantage of a special compacting copyObject() would be that it could presumably get precisely the right-sized single piece of memory, but is that worth the complexity of a two-phase design (IIUC you'd want a recursive tell-me-how-much-memory-you'll-need pass, then allocate all at once)? A chain of smaller chunks should be nearly as good (just a small amount of wasted space in the final chunk). (Auto-generated read/write/copy/eq functions would be a good idea anyway.)

>> References to data like table and datatype definitions that are copied
>> into the plan but are also copied multiple times. ...

> TBH that sounds like a dead end to me. Detecting duplicate subtrees would
> be astonishingly expensive, and I doubt that there will be enough of them
> to really repay the effort in typical queries.

Yeah, deduplicating common subplans may be too tricky and expensive. I don't think it's that crazy to want to be able to share whole plans between backends, though. Admittedly a lofty and far-off goal, but I think we can get there eventually. Having the same set of N explicitly prepared plans in M pooled backend sessions is fairly common.

--
Thomas Munro
https://enterprisedb.com
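[To make the two-phase "measure, then pack" variant being weighed here concrete, a toy sketch follows. It is not PostgreSQL code: ToyCons, flat_size and flat_copy are invented stand-ins for real node types and copyfuncs.c, and the point of the comparison above is precisely that a chain-of-chunks allocator may make this extra pass unnecessary.]

/*
 * Two-phase flattening copy: pass 1 measures the tree, pass 2 copies it
 * into a single allocation by advancing a cursor.  Toy example only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct ToyCons                  /* a cons cell: value plus next pointer */
{
    int     value;
    struct ToyCons *next;
} ToyCons;

/* Pass 1: how many bytes would a flattened copy of this list need? */
static size_t
flat_size(const ToyCons *node)
{
    size_t  total = 0;

    for (; node != NULL; node = node->next)
        total += sizeof(ToyCons);
    return total;
}

/* Pass 2: copy the list into the buffer, advancing the cursor as we go. */
static ToyCons *
flat_copy(const ToyCons *node, char **cursor)
{
    ToyCons *copy;

    if (node == NULL)
        return NULL;
    copy = (ToyCons *) *cursor;
    *cursor += sizeof(ToyCons);
    copy->value = node->value;
    copy->next = flat_copy(node->next, cursor);
    return copy;
}

int main(void)
{
    ToyCons c = {3, NULL}, b = {2, &c}, a = {1, &b};
    size_t  need = flat_size(&a);
    char   *buf = malloc(need);         /* exactly one allocation for the whole tree */
    char   *cursor = buf;
    ToyCons *copy = flat_copy(&a, &cursor);

    for (; copy != NULL; copy = copy->next)
        printf("%d ", copy->value);
    printf("(%zu bytes in one chunk)\n", need);
    free(buf);
    return 0;
}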
Hi,

On 2019-07-25 18:13:33 -0400, Tom Lane wrote:
> FWIW, I've thought for some time that we should invent a memory context
> allocator that's meant for data that doesn't get realloc'd (much) after
> first allocation, with lower overhead than aset.c has.

Same. Although I've wondered whether we'd actually want to tie that to a separate memory context type, or instead add it as an option that can be set on a context. It seems like that could potentially reduce overhead, by avoiding the need to have multiple contexts around. E.g. something like MemoryContextUseBulkAlloc(). For aset.c that could e.g. switch the context's callbacks to ones where AllocSetAlloc() is much simpler (one strategy for large allocations; for the rest don't manage a freelist, and don't round up allocation sizes). Not sure what's better.

FWIW, here's a few steps towards tracing aset.c events:

perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'AllocSetAlloc context->name:string size:u64'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'AllocSetFree:3 context->name:string chunk->size:u64'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'AllocSetFree:18 context->name:string chunk->size:u64'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'AllocSetRealloc:9 context->name:string oldsize=chunk->size:u64 size:u64'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'AllocSetReset context->name:string'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'AllocSetDelete context->name:string'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'AllocSetContextCreateInternal name:string'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'raw_parser'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'raw_parser%return'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'pg_analyze_and_rewrite'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'pg_analyze_and_rewrite%return'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'planner%return'
perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --add 'planner'

which can then be used with perf record to trace a backend's allocations:

perf record $(for e in AllocSetContextCreateInternal AllocSetDelete AllocSetReset AllocSetAlloc AllocSetFree AllocSetRealloc raw_parser raw_parser__return pg_analyze_and_rewrite pg_analyze_and_rewrite__return planner planner__return ; do echo "-e probe_postgres:$e"; done) -p 26963
<execute query>
ctrl-c

perf script

postgres 26963 [000] 58737.065318: probe_postgres:AllocSetContextCreateInternal: (55eba8540c23) name_string="TopTransactionContext"
postgres 26963 [000] 58737.065331: probe_postgres:AllocSetAlloc: (55eba853f906) name="TopMemoryContext" size_u64=480
postgres 26963 [000] 58737.065345: probe_postgres:raw_parser: (55eba819f68c)
postgres 26963 [000] 58737.065355: probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=160
postgres 26963 [000] 58737.065362: probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=944
postgres 26963 [000] 58737.065367: probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=64
...
postgres 26963 [000] 58737.067714: probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=32
postgres 26963 [000] 58737.067721: probe_postgres:raw_parser__return: (55eba819f68c <- 55eba83f2254)
postgres 26963 [000] 58737.067734: probe_postgres:AllocSetAlloc: (55eba853f906) name="TopTransactionContext" size_u64=24
postgres 26963 [000] 58737.067740: probe_postgres:AllocSetAlloc: (55eba853f906) name="TopTransactionContext" size_u64=104
postgres 26963 [000] 58737.067746: probe_postgres:pg_analyze_and_rewrite: (55eba83f232b)
postgres 26963 [000] 58737.067753: probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=208
postgres 26963 [000] 58737.067759: probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=216
...
postgres 26963 [000] 58737.074248: probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=32
postgres 26963 [000] 58737.074260: probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=32
postgres 26963 [000] 58737.074266: probe_postgres:pg_analyze_and_rewrite__return: (55eba83f232b <- 55eba83f2ae9)
postgres 26963 [000] 58737.074273: probe_postgres:planner: (55eba8326c51)
postgres 26963 [000] 58737.074280: probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=128
postgres 26963 [000] 58737.074305: probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=536
postgres 26963 [000] 58737.074316: probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=48
...
postgres 26963 [000] 58737.076820: probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=16
postgres 26963 [000] 58737.076827: probe_postgres:AllocSetRealloc: (55eba85403f6) name="MessageContext" oldsize=32 size_u64=24
postgres 26963 [000] 58737.076843: probe_postgres:AllocSetContextCreateInternal: (55eba8540c23) name_string="inline_function"
postgres 26963 [000] 58737.076854: probe_postgres:AllocSetAlloc: (55eba853f906) name="inline_function" size_u64=179
postgres 26963 [000] 58737.076861: probe_postgres:AllocSetAlloc: (55eba853f906) name="inline_function" size_u64=48
postgres 26963 [000] 58737.076869: probe_postgres:AllocSetAlloc: (55eba853f906) name="inline_function" size_u64=40
postgres 26963 [000] 58737.076876: probe_postgres:AllocSetAlloc: (55eba853f906) name="inline_function" size_u64=16
postgres 26963 [000] 58737.076883: probe_postgres:AllocSetAlloc: (55eba853f906) name="inline_function" size_u64=8
postgres 26963 [000] 58737.076891: probe_postgres:raw_parser: (55eba819f68c)
postgres 26963 [000] 58737.076903: probe_postgres:AllocSetAlloc: (55eba853f906) name="inline_function" size_u64=160
postgres 26963 [000] 58737.076910: probe_postgres:AllocSetAlloc: (55eba853f906) name="inline_function" size_u64=180
...
postgres 26963 [000] 58737.091725: probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=128
postgres 26963 [000] 58737.091733: probe_postgres:planner__return: (55eba8326c51 <- 55eba83f24ba)
postgres 26963 [000] 58737.091740: probe_postgres:AllocSetAlloc: (55eba853f906) name="MessageContext" size_u64=32
postgres 26963 [000] 58737.091749: probe_postgres:AllocSetFree: (55eba853f4fc) name="TopTransactionContext" size=128
...

The query here was just \dt+. It's possible that you'd need to change the offsets slightly, depending on compiler options.
One can see traceable lines with:

perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --line AllocSetRealloc

and the available variables with:

perf probe -x /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres --vars AllocSetRealloc:9

Shouldn't be hard to parse that into something giving actual stats for each memory context, and each query phase.

> Such an allocator would be ideal for plancache.c, and perhaps other
> use-cases such as plpgsql function parse trees.

Probably also for raw parsing. Bet we'd save more than we'd lose by not freeing during it.

> IMV this would have these properties:
>
> * Doesn't support retail pfree; to recover space you must destroy the
> whole context. We could just make pfree a no-op. With the details
> sketched below, repalloc would have to throw an error (because it would
> not know the size of the old chunk), but I think that's OK for the
> intended purpose.
>
> * Minimum chunk header overhead, ie only the context pointer required by
> the mmgr.c infrastructure. In particular, don't store the chunk size.

I'm not sure it's worth eliding the size. It will make it hard to adapt the new type of context to new code - it's pretty hard to guarantee that there's not one random utility function in some rarely executed path that does need the size.

> (Possibly, instead of freelists, we'd take the trouble to track
> end-of-block space in more than one active block, so that we avoid
> wasting such space when a large allocation doesn't quite fit.)

Hm, doesn't aset.c already solve that by not putting the oversized block at the head of the block list?

> A totally different idea is to make a variant version of copyObject
> that is intended to produce a compact form of a node tree, and does
> not create a separate palloc allocation for each node but just packs
> them as tightly as it can in larger palloc chunks. This could outperform
> the no-pfree-context idea because it wouldn't need even context-pointer
> overhead for each node.

If done right, it'd probably also yield considerably higher spatial locality in the memory layout.

> The stumbling block here is that nobody is gonna tolerate maintaining
> two versions of copyfuncs.c, so you'd have to find a way for a single
> set of copy functions to support this output format as well as the
> traditional one. (Alternatively, maybe we could learn to autogenerate
> the copy functions from annotated struct definitions; people have
> muttered about that for years but not done anything.)

Yea, I think we'd have to autogenerate them to go for that. I'm kinda hoping that maybe John Naylor would take a look, after he made all our lives quite a bit easier with the other catalog script work...

In the 'pipe dream' version of this, that script would be able to generate code to adjust pointers after a memcpy... That'd make copyObject() of such flattened trees much, much cheaper.

Greetings,

Andres Freund
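[For anyone playing along with the probes: a throwaway filter along the lines suggested above ("parse that into something giving actual stats per context") could look like the sketch below. It is not part of PostgreSQL or perf; it merely assumes the perf script lines contain name="..." and size_u64=... fields as shown, counts only AllocSetAlloc events (free/realloc would be handled similarly for net numbers), and the program name alloc_stats is made up.]

/* Toy aggregator: sums allocation counts and bytes per memory context name
 * from "perf script" output on stdin. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CTX 256

struct ctx_stat { char name[64]; long long calls; long long bytes; };

static struct ctx_stat stats[MAX_CTX];
static int nstats;

static struct ctx_stat *get_ctx(const char *name)
{
    for (int i = 0; i < nstats; i++)
        if (strcmp(stats[i].name, name) == 0)
            return &stats[i];
    if (nstats >= MAX_CTX)
        return NULL;
    snprintf(stats[nstats].name, sizeof(stats[nstats].name), "%s", name);
    return &stats[nstats++];
}

int main(void)
{
    char line[4096];

    while (fgets(line, sizeof(line), stdin))
    {
        char name[64];
        long long size;
        char *p;
        struct ctx_stat *c;

        /* Only count allocation events. */
        if (strstr(line, "probe_postgres:AllocSetAlloc") == NULL)
            continue;
        p = strstr(line, "name=\"");
        if (p == NULL || sscanf(p, "name=\"%63[^\"]\"", name) != 1)
            continue;
        p = strstr(line, "size_u64=");
        if (p == NULL || sscanf(p, "size_u64=%lld", &size) != 1)
            continue;

        c = get_ctx(name);
        if (c)
        {
            c->calls++;
            c->bytes += size;
        }
    }

    for (int i = 0; i < nstats; i++)
        printf("%-30s %10lld allocs %12lld bytes\n",
               stats[i].name, stats[i].calls, stats[i].bytes);
    return 0;
}

Something like "perf script | ./alloc_stats" would then print per-context totals for the traced query.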
Hi,

On 2019-07-24 21:39:35 +0000, PG Bug reporting form wrote:
> I have the problem on 9.5, and testing on 11.2 still shows the same
> behaviour. Please, please, someone have a look at where all that memory
> goes, and I will immediately roll out a new version to all of my
> customers. I love your database and its stability, but this is
> definitely a point where you can drastically improve!

If you had a reproducible test case, you'd be more likely to get useful input... While there are some bigger architectural reasons at play here too, the overhead you show still seems a bit abnormal. IMO bug reports aren't a great avenue for random improvement requests.

Greetings,

Andres Freund
Hi,

On 2019-07-26 10:55:35 +1200, Thomas Munro wrote:
> IIUC the only advantage of a special compacting copyObject() would be
> that it could presumably get precisely the right-sized single piece of
> memory, but is that worth the complexity of a two-phase design (IIUC
> you'd want a recursive tell-me-how-much-memory-you'll-need pass, then
> allocate all at once)? A chain of smaller chunks should be nearly as
> good (just a small amount of wasted space in the final chunk).

I was wondering if we somehow could have a 'header' for Node trees, which'd keep track of their size when built incrementally. But it seems like the amount of changes that'd require would be far too extensive. We, I think, would have to pass down additional pointers in too many places, not to even talk about the difficulty of tracking the size precisely during incremental changes.

Although *if* we had it, we could probably avoid a lot of unnecessary tree copying, by having copy-on-write + reference counting logic.

This kind of thing would be a lot easier to solve with C++ than C (or, well, just about any modern-ish language)... We could have "smart" references to points in the query tree that keep track both of where they are and of what the head of the whole tree is. And for the immutable flattened tree, they could just use offsets instead of pointers, and resolve those offsets within the accessor.

> (Auto-generated read/write/copy/eq functions would be a good idea
> anyway.)

Indeed.

>>> References to data like table and datatype definitions that are copied
>>> into the plan but are also copied multiple times. ...
>> TBH that sounds like a dead end to me. Detecting duplicate subtrees would
>> be astonishingly expensive, and I doubt that there will be enough of them
>> to really repay the effort in typical queries.
> Yeah, deduplicating common subplans may be too tricky and expensive.

It'd sure be interesting to write a function that computes stats about which types of nodes consume how much memory. Shouldn't be too hard. I'm mildly suspecting that a surprisingly large overhead in complex query trees comes from target lists that are the same in many parts of the tree.

Greetings,

Andres Freund
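[A rough idea of what such a per-node-type stats function could look like, as a toy only: the node types below are invented stand-ins, and a backend version would switch on the real NodeTag values and recurse through the existing tree-walker infrastructure rather than this hand-written walk.]

/* Toy sketch of per-node-type memory accounting over a node tree. */
#include <stdio.h>
#include <stddef.h>

typedef enum StubTag { T_StubPlan, T_StubTargetEntry, T_StubTagCount } StubTag;

typedef struct StubNode { StubTag tag; } StubNode;

typedef struct StubTargetEntry
{
    StubNode    node;
    int         resno;
} StubTargetEntry;

typedef struct StubPlan
{
    StubNode    node;
    int         ntargets;
    StubNode  **targetlist;             /* child nodes, e.g. target entries */
} StubPlan;

typedef struct NodeStats
{
    long        count[T_StubTagCount];
    size_t      bytes[T_StubTagCount];
} NodeStats;

static void
collect_stats(const StubNode *n, NodeStats *stats)
{
    if (n == NULL)
        return;

    switch (n->tag)
    {
        case T_StubPlan:
        {
            const StubPlan *p = (const StubPlan *) n;

            stats->count[n->tag]++;
            stats->bytes[n->tag] += sizeof(StubPlan);
            for (int i = 0; i < p->ntargets; i++)
                collect_stats(p->targetlist[i], stats);
            break;
        }
        case T_StubTargetEntry:
            stats->count[n->tag]++;
            stats->bytes[n->tag] += sizeof(StubTargetEntry);
            break;
        default:
            break;
    }
}

int main(void)
{
    StubTargetEntry te1 = {{T_StubTargetEntry}, 1};
    StubTargetEntry te2 = {{T_StubTargetEntry}, 2};
    StubNode   *tlist[] = {&te1.node, &te2.node};
    StubPlan    plan = {{T_StubPlan}, 2, tlist};
    NodeStats   stats = {{0}, {0}};

    collect_stats(&plan.node, &stats);
    printf("plans: %ld (%zu bytes), target entries: %ld (%zu bytes)\n",
           stats.count[T_StubPlan], stats.bytes[T_StubPlan],
           stats.count[T_StubTargetEntry], stats.bytes[T_StubTargetEntry]);
    return 0;
}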
Hello,

Thanks for all your replies; I will try to combine the answers to the last mails on the subject.

------ In reply to Tom Lane -------

Creating a new type of MemoryContext (FlatMemoryContext?) sounds like a good idea to me, because it seems minimally invasive, and I will try that. I believe it doesn't need a maximum block size, does it? One could give a size for the blocks requested with malloc, and every time someone pallocs a larger area it is directly malloced and returned. I saw that AllocSet uses an initial size which is doubled every time the previous size isn't sufficient anymore. Should the new MemoryContext behave identically?

I wonder if this MemoryContext should make good use of a "finalize" function, which can be called once no more work in the context is expected and which realloc's the allocated blocks down to the really used size, so the C library can reuse the unused memory. Maybe even AllocSet could profit from it as a first step. Checking the stats, used vs. malloced memory is at a low 60%.

> This wouldn't take very much new code, and it would potentially offer up
> to perhaps 2X space savings compared to aset.c, thanks to cutting the
> per-chunk overhead. Whether that's enough to float your boat with
> respect to cached plans isn't too clear, though.

No, the boat will still sink, but it's a good start and servers are less likely to crash.

> A totally different idea is to make a variant version of copyObject that
> is intended to produce a compact form of a node tree, and does not
> create a separate palloc allocation for each node but just packs them as
> tightly as it can in larger palloc chunks. ...

On the copyObject code that should be autogenerated: I thought a lot about how this would be possible with macros, but couldn't think of anything, because one macro statement would have to produce source code in two distinct places (at least in two functions). Is there similar code somewhere else in the source tree? See below for more on that subject.

> ...serialization of node tree...

I also don't like serialization very much. I already rewrote the standard serialization mechanism in Java very successfully, but I know about the problems with graph-type object dependencies (which don't seem to exist in node trees, because then copyObject would also have to cope with that, and it doesn't). If everything else fails in the end it might be a good idea. However, I believe the size of the rewritten query plan is many times that of the original node tree and it is used often, so it might not be a good idea after all.

------ In reply to Thomas Munro ------

Regarding your comments on the variant version of copyObject: I also believe that a FlatMemoryContext would result in nearly the same memory usage at first glance, except for the chunk headers Tom Lane also mentioned. One should measure how much space could really be saved when we don't need the chunk headers.

But I like the calculate-memory variant one would get when the copyObject code is generated. The two-phase variant would not even be slower, because the overhead of pallocing every chunk for every node MUST be higher than a simple calculation of the required space. This feels a bit like inlining the allocation logic into the copyObject code, and it would be the best variant regarding speed AND size.

------ In reply to Andres Freund ------

> Yea, I think we'd have to autogenerate them to go for that. I'm kinda
> hoping that maybe John Naylor would take a look, after he made all our
> lives quite a bit easier with the other catalog script work...
> In the 'pipe dream' version of this, that script would be able to
> generate code to adjust pointers after a memcpy... That'd make
> copyObject() of such flattened trees much, much cheaper.

Thanks for the input, I will have a look at it. When copyObject on such trees is such a common operation, it might be feasible to store the length of such a read-only tree somewhere within the tree itself, so one knows the exact length of a node including all its subnodes, which "should" always be in a contiguous range of memory anyway, and we would just have to apply an offset to all pointers within the copied area.

Or, when we come back to the two-pass idea of preallocating a large chunk of memory for the nodes, one could use the same function to calculate the range on the fly.

I see the problem here that the current implementation of copyObject (which is used in many places in the code) would have to be aware that the source nodes are allocated in a flat MemoryContext for this to work everywhere in the code base. This could be achieved if the chunk header for the nodes is not omitted (although the chunk header would not be set by the MemoryContext but by the copyNode implementation for FlatMemoryContext itself; if the current memory isn't flat, normal pallocs have to be done). Does code exist with the assumption that every node resides in its own allocated chunk?

> It'd sure be interesting to write a function that computes stats about
> which types of nodes consume how much memory. Shouldn't be too hard. I'm
> mildly suspecting that a surprisingly large overhead in complex query
> trees comes from target lists that are the same in many parts of the
> tree.

This would be the second part of the refactoring. I was wondering if one could reduce the data stored in the tree in the first place by using references to existing structures instead.

------ Conclusion -------

* A FlatMemoryContext would reduce memory usage by 40%(?), and the current copyNode infrastructure could remain as is.
* Using a code generator for the nodes is commonly agreed to be a good idea (and the number of node types in nodes.h makes me believe that the amount of hand-written code would drop significantly with one).
* An autogenerated copyObjectFlat would be able to determine the total size of a node and its descendants and could copy a tree with a single palloc, making FlatMemoryContext obsolete and yielding an even higher memory reduction.
* IF we keep the chunk header in FlatMemoryContext, then the "standard" copyNode would be able to fall back to a very fast memcpy + pointer-offset-change variant. The profit here might outweigh the few percent of memory we gain when we skip the header. But then we have no memory gains vs. using the FlatMemoryContext alone.

So:

* A FlatMemoryContext is a good idea and I will try to implement it.
* Code generation could speed things up significantly, and a variant of it would only yield further memory savings if we skip the chunk headers (if that is allowed at all).
* There is more work to do afterwards, because the trees are still too big.

------------------------------------------------

Some more questions:

* Why are MemoryContextAllocZeroAligned and MemoryContextAllocZero the same function?
* I wonder if we cannot just drop the raw_parse_tree altogether for large statements once the query tree has been generated. I know you need it in case the definitions of tables etc. change, to recreate the query tree, but wouldn't it be possible to recreate the raw_parse_tree from the query_string in that very rare event as well? The query_string is already the most compacted form of the raw_parse_tree, right?
* Does code exist with the assumption that every node resides in its own allocated chunk? (asked already above)
* Where can I find the existing code generation script? Is it called during make or before check-in?

------------------------------------------------

Thanks for reading through this. I sadly cannot provide an example because that would mean including my company's whole database schema, and I gave up after spending 30 minutes trying to clean it up enough, but the complexity of the view can still be seen at https://explain.depesz.com/s/gN2

Regards,
Daniel Migowski
On Sun, Jul 28, 2019 at 11:14 PM Daniel Migowski <dmigowski@ikoffice.de> wrote:
> * Why are MemoryContextAllocZeroAligned and MemoryContextAllocZero the same function?

It's not really relevant to the discussion about memory usage, but since you asked about MemoryContextAllocZeroAligned:

I noticed that stuff a while back when another project I follow was tweaking related stuff. The idea behind those functions is that we have our own specialised memset()-like macros. MemoryContextAllocZeroAligned() uses MemSetLoop(), while MemoryContextAllocZero() uses MemSetAligned(). Those date back nearly two decades and were apparently based on experiments done at the time. The comments talk about avoiding function call overhead and specialising for constant lengths, so I suppose we assumed that compilers at the time didn't automatically know how to inline plain old memset() calls.

These days, you'd think you could add "assume aligned" attributes to all memory-allocation-like functions and then change a few key things to be static inline so that the constants can be seen in the right places, and get all of the same benefits (and probably more) for free from the compiler's alignment analysis and inlining powers.

As for what benefits might actually be available: for constants it's clear, but for alignment I'm doubtful. When Clang 8 on amd64 inlines a memset() with a known aligned destination (for various alignment sizes) and constant size, it generates a bunch of movups instructions, or with -march set to a modern arch, maybe vmovups, and some variations depending on size. Note 'u' (unaligned) instructions, not 'a' (aligned): that is, it doesn't even care that I said the destination was aligned! I found claims from a couple of Intel sources that this sort of thing stopped making any difference to performance in the Nehalem microarchitecture (2008). It still matters whether the data actually is aligned, but not whether you use the instructions that tolerate misalignment, so there is apparently no point in generating different code, and therefore, as far as memset() goes, apparently no gain from annotating allocation functions as returning aligned pointers. As for other architectures or compilers, I don't know.

For curiosity, here's an experimental patch that gets rid of the MemSetXXX() stuff, adds some (useless?) annotations about alignment, and makes palloc0() and MemoryContextAllocZero() inline so that they can benefit from inlining with a visible constant size in e.g. newNode(). It didn't seem to do anything very interesting apart from removing a few hundred lines of code, so I didn't get around to digging further or sharing it earlier. Or maybe I was just doing it wrong. (In this patch it's using sizeof(long), which isn't enough to be considered aligned for the wider move instructions, but I tried various sizes when trying to trigger different codegen, without success.)

--
Thomas Munro
https://enterprisedb.com
Attachment
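[To see the "static inline + constant size" effect described above without reading the attached patch, here is a minimal illustration outside PostgreSQL. alloc_zeroed() merely stands in for palloc0()/MemoryContextAllocZero() and is not the real API; the claim that the memset gets expanded inline at -O2 is typical compiler behaviour, not a guarantee.]

/*
 * Minimal illustration: because alloc_zeroed() is static inline, the
 * constant sizeof() at the call site is visible to the compiler, so the
 * memset can usually be specialised (compile with -O2 and inspect with -S).
 */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

static inline void *
alloc_zeroed(size_t size)
{
    void   *p = malloc(size);

    if (p == NULL)
        abort();
    memset(p, 0, size);     /* size may be a compile-time constant at the call site */
    return p;
}

typedef struct FortyByteNode
{
    int     tag;
    int     pad;
    double  values[4];
} FortyByteNode;

FortyByteNode *
make_forty_byte_node(void)
{
    /* sizeof() is a constant here, like the sizeof in a newNode()-style macro. */
    return (FortyByteNode *) alloc_zeroed(sizeof(FortyByteNode));
}

int main(void)
{
    FortyByteNode *n = make_forty_byte_node();

    printf("tag=%d value0=%f\n", n->tag, n->values[0]);
    free(n);
    return 0;
}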
Thanks for the insightful and interesting comments.
At Mon, 29 Jul 2019 18:11:04 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in <CA+hUKGLfa6ANa0vs7Lf0op0XBH05HE8SyX8NFhDyT7k2CHYLXw@mail.gmail.com>
> For curiosity, here's an experimental patch that gets rid of the
> MemSetXXX() stuff, adds some (useless?) annotations about alignment,
> and makes palloc0() and MemoryContextAllocZero() inline so that they
> can benefit from inlining with a visible constant size in e.g.
> newNode(). It didn't seem to do anything very interesting apart from
> removing a few hundred lines of code, so I didn't get around to digging
> further or sharing it earlier. Or maybe I was just doing it wrong.
> (In this patch it's using sizeof(long), which isn't enough to be
> considered aligned for the wider move instructions, but I tried
> various sizes when trying to trigger different codegen, without
> success.)

FWIW, I saw gcc/x64 -O2 emit "rep stosq" for a memset() with a constant length and a "not-aligned" pointer passed in. Of course aligned(8) didn't make any difference at all. For smaller chunks it emitted a series of mov instructions, and it always emitted "call memset" for non-constant lengths. That makes me think that the "aligned" attribute works in some different way.

Anyway, I haven't found any further explanation of the hint beyond "the compiler considers alignment on the variable/type/function"...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center