Thread: Copy data to DSA area
Hi,
Related to my development (putting the relcache and catcache onto shared memory)[1],
I have some questions about how to copy variables into shared memory, especially into a DSA area, and how best to implement that.
Under the current architecture, when initializing some data we often copy it using dedicated functions
such as CreateTupleDescCopyConstr(), datumCopy(), pstrdup() and so on. These copy functions call palloc() and place the
copied data in the current memory context.
But palloc() cannot be used for shared memory. I'm now trying to use DSA (and dshash) to handle data in shared memory,
so dsa_allocate(), for example, is needed instead of palloc(). I came up with three possible implementations.
A. Copy the existing functions and write equivalent DSA-version copy functions such as CreateDSATupleDescCopyConstr(),
datumCopyDSA(), dsa_strdup().
In these functions the logic is the same as today's, but palloc() is replaced with dsa_allocate().
However, this is a brute-force approach, and the duplicated code becomes less readable and maintainable.
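(For illustration only, a minimal sketch of what a plan-A style clone might look like; this is just an example of the pattern, not code from any patch in this thread, and it assumes the usual PostgreSQL headers plus utils/dsa.h:)

    static dsa_pointer
    dsa_strdup(dsa_area *area, const char *in)
    {
        Size        len = strlen(in) + 1;
        dsa_pointer result = dsa_allocate(area, len);

        /* copy into the DSA chunk; callers share the dsa_pointer, not a raw pointer */
        memcpy(dsa_get_address(area, result), in, len);
        return result;
    }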
B. Use the current functions to copy the data into a local memory context temporarily, then copy it again into the DSA area.
This method looks better than plan A because we don't need to write copy-pasted clone functions.
But copying twice would degrade performance.
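(Again purely for illustration, plan B for a TupleDesc might look roughly like this, assuming variables src and area already exist; note that TupleDescCopy() copies only the flat part and drops constraints/defaults, which is exactly the limitation discussed later in this thread:)

    /* first copy: into the local memory context */
    TupleDesc   tmp = CreateTupleDescCopyConstr(src);

    /* second copy: the flat part only, into the DSA area */
    dsa_pointer dp = dsa_allocate(area, TupleDescSize(tmp));

    TupleDescCopy((TupleDesc) dsa_get_address(area, dp), tmp);
    pfree(tmp);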
C. Extend palloc() and MemoryContext.
This is a rough idea, but for instance we could add a new global flag that tells palloc() to use a DSA area instead of a normal MemoryContext.
MemoryContextSwitchToDSA(dsa_area *area) would make subsequent palloc() calls allocate memory from the DSA area,
and MemoryContextSwitchBack(area) would make palloc() behave normally again:
    MemoryContextSwitchToDSA(area);
    palloc(size);
    MemoryContextSwitchBack(area);
Plan C seems handy for DSA users because they can take advantage of existing functions.
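(A minimal sketch of how such a switch could be wired up, assuming a process-local flag that palloc() would consult; the function names follow the proposal above, but everything else is an invented illustration:)

    static dsa_area *palloc_dsa_area = NULL;    /* NULL => normal palloc() behaviour */

    void
    MemoryContextSwitchToDSA(dsa_area *area)
    {
        palloc_dsa_area = area;     /* subsequent palloc() calls would use dsa_allocate() */
    }

    void
    MemoryContextSwitchBack(dsa_area *area)
    {
        Assert(palloc_dsa_area == area);
        palloc_dsa_area = NULL;     /* palloc() reverts to CurrentMemoryContext */
    }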
What do you think about these ideas?
[1]
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A567B9245%40G01JPEXMBKW04
Regards,
Takeshi Ideriha
On Wed, Nov 7, 2018 at 3:34 PM Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com> wrote:
> Related to my development (putting the relcache and catcache onto shared memory)[1],
> I have some questions about how to copy variables into shared memory, especially into a DSA area, and how best to implement that.
>
> Under the current architecture, when initializing some data we often copy it using dedicated functions
> such as CreateTupleDescCopyConstr(), datumCopy(), pstrdup() and so on. These copy functions call palloc() and place the
> copied data in the current memory context.

Yeah, I faced this problem in typcache.c, and you can see the function share_tupledesc() which copies TupleDesc objects into a DSA area. This doesn't really address the fundamental problem here though... see below.

> But palloc() cannot be used for shared memory. I'm now trying to use DSA (and dshash) to handle data in shared memory,
> so dsa_allocate(), for example, is needed instead of palloc(). I came up with three possible implementations.
>
> A. Copy the existing functions and write equivalent DSA-version copy functions such as CreateDSATupleDescCopyConstr(),
> datumCopyDSA(), dsa_strdup().
> In these functions the logic is the same as today's, but palloc() is replaced with dsa_allocate().
> However, this is a brute-force approach, and the duplicated code becomes less readable and maintainable.
>
> B. Use the current functions to copy the data into a local memory context temporarily, then copy it again into the DSA area.
> This method looks better than plan A because we don't need to write copy-pasted clone functions.
> But copying twice would degrade performance.

It's nice when you can construct objects directly at an address supplied by the caller. In other words, if allocation and object initialisation are two separate steps, you can put the object anywhere you like without copying. That could be on the stack, in an array, inside another object, in regular heap memory, in traditional shared memory, in a DSM segment or in DSA memory. I asked for an alloc/init separation in the Bloom filter code for the same reason. But this still isn't the real problem here...

> C. Extend palloc() and MemoryContext.
> This is a rough idea, but for instance we could add a new global flag that tells palloc() to use a DSA area instead of a normal MemoryContext.
> MemoryContextSwitchToDSA(dsa_area *area) would make subsequent palloc() calls allocate memory from the DSA area,
> and MemoryContextSwitchBack(area) would make palloc() behave normally again:
>     MemoryContextSwitchToDSA(area);
>     palloc(size);
>     MemoryContextSwitchBack(area);
> Plan C seems handy for DSA users because they can take advantage of existing functions.

The problem with plan C is that palloc() has to return a native pointer, but in order to do anything useful with this memory (ie to share it) you also need to get the dsa_pointer, and the palloc() interface doesn't allow for that. Any object that holds a pointer allocated with DSA-hiding-behind-palloc() will be useless for another process.

> What do you think about these ideas?

The real problem is object graphs with pointers. I solved that problem for TupleDesc by making them *almost* entirely flat, in commit c6293249. I say 'almost' because it won't work for constraints or defaults, but that didn't matter for the typcache.c case because it doesn't use those. In other words I danced carefully around the edge of the problem.

In theory, object graphs, trees, lists etc could be implemented in a way that allows for "flat" storage, if they can be allocated in contiguous memory and refer to sub-objects within that space using offsets from the start of the space; then they could be used without having to know whether they are in DSM/DSA memory or regular memory. That seems like a huge project, though. Otherwise they have to hold dsa_pointer, and deal with that in many places. You can see this in the Parallel Hash code: I had to make the hash table data structure able to deal with raw pointers OR dsa_pointer. That would be theoretically doable, but really quite painful, for the whole universe of PostgreSQL node types and data structures.

I know of 3 ideas that would make your idea C work, so that you could share something as complicated as a query plan directly without having to deserialise it to use it:

1. Allow the creation of DSA areas inside the traditional fixed memory segment (instead of DSM), in a fixed-sized space reserved by the postmaster. That is, use dsa.c's ability to allocate and free memory, and possibly free a whole area at once to avoid leaking memory in some cases (like MemoryContexts), but in this mode dsa_pointer would be directly castable to a raw pointer. Then you could provide a regular MemoryContext interface for it, and use it via palloc(), as you said, and all the code that knows how to construct lists and trees and plan nodes etc would All Just Work. It would be your plan C, and all the pointers would be usable in every process, but limited in total size at start-up time.

2. Allow regular DSA in DSM to use raw pointers into DSM segments, by mapping segments at the same address in every backend. This involves reserving a large virtual address range up front in the postmaster, and then managing the space, trapping SEGV to map/unmap segments into parts of that address space as necessary (instead of doing that in dsa_get_address()). AFAIK that would work, but it seems a bit weird to go to such lengths. It would be a kind of home-made simulation of threads. On the other hand, that is what we're already doing in dsa.c, except more slowly due to extra software address translation from funky pseudo-addresses.

3. Something something threads.

--
Thomas Munro
http://www.enterprisedb.com
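(To make the "offsets instead of pointers" idea above concrete, here is a tiny illustrative sketch, not from any patch, of a structure that refers to its sub-objects by offsets from the start of its own contiguous chunk, so the same bytes are valid at whatever address each backend happens to map them:)

    typedef struct FlatStringList
    {
        int     nitems;
        Size    item_offset[FLEXIBLE_ARRAY_MEMBER];     /* offsets from chunk start */
    } FlatStringList;

    /* Works identically whether 'chunk' lives in DSA, a DSM segment or local memory. */
    static const char *
    flat_string_list_item(const char *chunk, int i)
    {
        const FlatStringList *list = (const FlatStringList *) chunk;

        return chunk + list->item_offset[i];
    }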
Hi, thank you for all the comment. It's really helpful. >From: Thomas Munro [mailto:thomas.munro@enterprisedb.com] >Sent: Wednesday, November 7, 2018 1:35 PM > >On Wed, Nov 7, 2018 at 3:34 PM Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com> >wrote: >> Related to my development (putting relcache and catcache onto shared >> memory)[1], >> >> I have some questions about how to copy variables into shared memory, especially >DSA area, and its implementation way. >> >> Under the current architecture when initializing some data, we >> sometimes copy certain data using some specified functions >> >> such as CreateTupleDescCopyConstr(), datumCopy(), pstrdup() and so on. >> These copy functions calls palloc() and allocates the >> >> copied data into current memory context. > >Yeah, I faced this problem in typcache.c, and you can see the function >share_typledesc() which copies TupleDesc objects into a DSA area. >This doesn't really address the fundamental problem here though... see below. I checked share_tupledesc(). My original motivation came from copying tupledesc with constraint like CreateTupleDescCopyConstr(). But yes, as you stated here and bellow without having to copy tupledesc constraint, this method makes sense. > >> But on the shared memory, palloc() cannot be used. Now I'm trying to >> use DSA (and dshash) to handle data on the shared memory >> >> so for example dsa_allocate() is needed instead of palloc(). I hit upon three ways to >implementation. >> >> A. Copy existing functions and write equivalent DSA version copy >> functions like CreateDSATupleDescCopyConstr(), >> >> datumCopyDSA(), dsa_strdup() >> >> In these functions the logic is same as current one but would be replaced palloc() >with dsa_allocate(). >> >> But this way looks too straight forward and code becomes less readable and >maintainable. >> >> B. Using current functions and copy data on local memory context temporarily and >copy it again to DSA area. >> >> This method looks better compared to the plan A because we don't need to write >clone functions with copy-paste. >> >> But copying twice would degrade the performance. > >It's nice when you can construct objects directly at an address supplied by the caller. >In other words, if allocation and object initialization are two separate steps, you can >put the object anywhere you like without copying. That could be on the stack, in an >array, inside another object, in regular heap memory, in traditional shared memory, in >a DSM segment or in DSA memory. I asked for an alloc/init separation in the Bloom >filter code for the same reason. But this still isn't the real problem here... Yes, actually I tried to create a new function TupleDescCopyConstr() which is almost same as TupleDescCopy() except also copying constraints. This is supposed to separate allocation and initialization. But as you pointed out bellow, I had to manage object graph with pointes and faced the difficulty. >> C. Enhance the feature of palloc() and MemoryContext. >> >> This is a rough idea but, for instance, make a new global flag to tell palloc() to >use DSA area instead of normal MemoryContext. >> >> MemoryContextSwitchToDSA(dsa_area *area) indicates following palloc() to >allocate memory to DSA. >> >> And MemoryContextSwitchBack(dsa_area) indicates to palloc is used as normal >one. >> >> MemoryContextSwitchToDSA(dsa_area); >> >> palloc(size); >> >> MemoryContextSwitchBack(dsa_area); >> >> Plan C seems a handy way for DSA user because people can take advantage of >existing functions. 
> >The problem with plan C is that palloc() has to return a native pointer, but in order to >do anything useful with this memory (ie to share it) you also need to get the >dsa_pointer, but the palloc() interface doesn't allow for that. Any object that holds a >pointer allocated with DSA-hiding-behind-palloc() will be useless for another process. Agreed. I didn't have much consideration on this point. >> What do you think about these ideas? > >The real problem is object graphs with pointers. I solved that problem for TupleDesc >by making them *almost* entirely flat, in commit c6293249. I say 'almost' because it >won't work for constraints or defaults, but that didn't matter for the typcache.c case >because it doesn't use those. In other words I danced carefully around the edge of >the problem. > >In theory, object graphs, trees, lists etc could be implemented in a way that allows for >"flat" storage, if they can be allocated in contiguous memory and refer to sub-objects >within that space using offsets from the start of the space, and then they could be used >without having to know whether they are in DSM/DSA memory or regular memory. >That seems like a huge project though. Otherwise they have to hold dsa_pointer, and >deal with that in many places. You can see this in the Parallel Hash code. I had to >make the hash table data structure able to deal with raw pointers OR dsa_pointer. >That's would be theoretically doable, but really quite painful, for the whole universe of >PostgreSQL node types and data structures. Yeah, thank you for summarizing this point. That's one of the difficulty I've faced with but failed to state it in my previous email. I agreed your point. Making everything "flat" is not a realistic choice and holding dsa_pointer here and there and switching native pointer or dsa_pointer using union type or other things can be done but still huge one. >I know of 3 ideas that would make your idea C work, so that you could share >something as complicated as a query plan directly without having to deserialise it to >use it: > >1. Allow the creation of DSA areas inside the traditional fixed memory segment >(instead of DSM), in a fixed-sized space reserved by the postmaster. That is, use >dsa.c's ability to allocate and free memory, and possibly free a whole area at once to >avoid leaking memory in some cases (like MemoryContexts), but in this mode >dsa_pointer would be directly castable to a raw pointer. Then you could provide a >regular MemoryContext interface for it, and use it via palloc(), as you said, and all the >code that knows how to construct lists and trees and plan nodes etc would All Just >Work. It would be your plan C, and all the pointers would be usable in every process, >but limited in total size at start-up time. > >2. Allow regular DSA in DSM to use raw pointers into DSM segments, by mapping >segments at the same address in every backend. This involves reserving a large >virtual address range up front in the postmaster, and then managing the space, >trapping SEGV to map/unmap segments into parts of that address space as necessary >(instead of doing that in dsa_get_address()). AFAIK that would work, but it seems to >be a bit weird to go to such lengths. It would be a kind of home-made simulation of >threads. On the other hand, that is what we're already doing in dsa.c, except more >slowly due to extra software address translation from funky pseudo-addresses. > >3. Something something threads. I'm thinking to go with plan 1. 
Not having to think about address translation seems tempting. Plan 2 (as well as plan 3) looks like a big project.

Regards,
Takeshi Ideriha
On Fri, Nov 9, 2018 at 2:19 PM Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com> wrote: > From: Thomas Munro [mailto:thomas.munro@enterprisedb.com] > >I know of 3 ideas that would make your idea C work, so that you could share > >something as complicated as a query plan directly without having to deserialise it to > >use it: > > > >1. Allow the creation of DSA areas inside the traditional fixed memory segment > >(instead of DSM), in a fixed-sized space reserved by the postmaster. That is, use > >dsa.c's ability to allocate and free memory, and possibly free a whole area at once to > >avoid leaking memory in some cases (like MemoryContexts), but in this mode > >dsa_pointer would be directly castable to a raw pointer. Then you could provide a > >regular MemoryContext interface for it, and use it via palloc(), as you said, and all the > >code that knows how to construct lists and trees and plan nodes etc would All Just > >Work. It would be your plan C, and all the pointers would be usable in every process, > >but limited in total size at start-up time. > > > >2. Allow regular DSA in DSM to use raw pointers into DSM segments, by mapping > >segments at the same address in every backend. This involves reserving a large > >virtual address range up front in the postmaster, and then managing the space, > >trapping SEGV to map/unmap segments into parts of that address space as necessary > >(instead of doing that in dsa_get_address()). AFAIK that would work, but it seems to > >be a bit weird to go to such lengths. It would be a kind of home-made simulation of > >threads. On the other hand, that is what we're already doing in dsa.c, except more > >slowly due to extra software address translation from funky pseudo-addresses. > > > >3. Something something threads. > > I'm thinking to go with plan 1. No need to think about address translation > seems tempting. Plan 2 (as well as plan 3) looks a big project. The existing function dsa_create_in_place() interface was intended to support that, but has never been used in that way so I'm not sure what extra problems will come up. Here are some assorted thoughts: * You can prevent a DSA area from creating extra DSM segments, so that it is constrained to stay entirely in the space you give it, by calling dsa_set_size_limit(area, size) using the same size that you gave to dsa_create_in_place(); now you have a DSA area that manages a single fixed-sized chunk of memory that you gave it, in your case inside the traditional shared memory segment (but it could be anywhere, including inside a DSM segment or another DSA area!) * You can probably write a MemoryContext wrapper for it, if it has only one segment that is in the traditional shared memory segment. You would need to do very simple kind of address translation: the result from palloc() needs to be base + dsa_allocate()'s result, and the argument to pfree() needs to be subtracted from base when dsa_free() is called. That is a version of your idea C that should work AFAIK. * Once you have that working, you now have a new kind of resource management problem on your hands: memory leaks will be cluster-wide and cluster-life-time! That's hard, because the goal is to be able to use arbitrary code in the tree that deals with plans etc, but that code all assumes that it can "throw" (elog()) on errors. PostgreSQL C is generally "garbage collected" (in a way), but in this sketch, that doesn't work anymore: this area *never* goes out of scope and gets cleaned up. 
Generally, languages with exceptions either need garbage collection or scoped destructors to clean up the mess, but in this sketch we don't have that anymore... much like allocating stuff in TopMemoryContext, except worse because it doesn't go away when one backend exits. * I had some ideas about some kind of "allocation rollback" interface: you begin an "allocation transaction", allocate a bunch of stuff (perhaps indirectly, by calling some API that makes query plans or whatever and is totally unaware of this stuff). Then if there is an error, whatever was allocated so far is freed in the usual cleanup paths by a rollback that happens via the resource manager machinery. If you commit, then the allocation becomes permanent. Then you only commit stuff that you promise not to leak (perhaps stuff that has been added to a very carefully managed cluster-wide plan cache). I am not sure of the details, and this might be crazy... -- Thomas Munro http://www.enterprisedb.com
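(A minimal sketch of the address translation described above, assuming a single in-place DSA area living at a fixed address inside the traditional shared memory segment so that the base is the same in every backend; shm_context_alloc/shm_context_free and the globals are invented names for illustration:)

    static dsa_area *shm_area;      /* created with dsa_create_in_place()            */
    static char     *shm_area_base; /* the "place" address of that in-place area     */

    static void *
    shm_context_alloc(Size size)
    {
        dsa_pointer dp = dsa_allocate(shm_area, size);

        /* same base address in every backend, so a raw pointer works everywhere */
        return shm_area_base + dp;
    }

    static void
    shm_context_free(void *ptr)
    {
        dsa_free(shm_area, (dsa_pointer) ((char *) ptr - shm_area_base));
    }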
On Thu, Nov 8, 2018 at 9:05 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > * I had some ideas about some kind of "allocation rollback" interface: > you begin an "allocation transaction", allocate a bunch of stuff > (perhaps indirectly, by calling some API that makes query plans or > whatever and is totally unaware of this stuff). Then if there is an > error, whatever was allocated so far is freed in the usual cleanup > paths by a rollback that happens via the resource manager machinery. > If you commit, then the allocation becomes permanent. Then you only > commit stuff that you promise not to leak (perhaps stuff that has been > added to a very carefully managed cluster-wide plan cache). I am not > sure of the details, and this might be crazy... Hmm, my first thought was that you were just reinventing memory contexts, but it's really not quite the same thing, because you want the allocations to end up owned by a long-lived context when you succeed but a transient context when you fail. Still, if it weren't for the fact that the memory context interface is hostile to dynamic shared memory's map-this-anywhere semantics, I suspect we'd try to find a way to make memory contexts fit the bill, maybe by reparenting contexts or even individual allocations. You could imagine using the sorts of union algorithms that are described in https://en.wikipedia.org/wiki/Disjoint-set_data_structure to get very low asymptotic complexity here. I wonder if it's possible to rethink our current memory context machinery so that it is not so DSM-hostile. At one point, I had the idea of replacing the pointer in the chunk header with an array-offset, which might also avoid repeatedly creating and destroying AllocSetContext objects over and over at high speed. Or you could come up with some intermediate idea: if the value there is MAXALIGN'd, it's a pointer; if not, it's some other kind of identifier that you have to go look up in a table someplace to find the real context. Part of the problem here is that, on the one hand, it's really useful that all memory management in PostgreSQL currently uses a single system: memory contexts. I'd be loathe to change that. On the other hand, there are several different allocation patterns which have noticeably different optimal strategies: 1. allocate for a while and then free everything at once => allocate from a single slab 2. allocate and free for a while and then free anything that's left => allocate from multiple slabs, dividing allocations by size class 3. perform a short series of allocations, freeing everything if we hit an error midway through => keep an allocation log, roll back by retail frees 4. like (3) but with a long series of allocations => allocate from a slab, free the whole slab on error Our current AllocSetContext is not optimal for any of these situations. It uses size classes, but they are rather coarse-grained, which wastes a lot of memory. They also have big headers, which also wastes a lot of memory -- unnecessarily, in the case where we don't need to free anything until the end. The main advantage of AllocSetAlloc is that it's very fast for small contexts while still being able to support individual freeing individual allocations when necessary, though not efficiently. I'm rambling here, I guess... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
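(A rough illustration of the "intermediate idea" mentioned above: a chunk header word that is either a raw MemoryContext pointer or, when not MAXALIGN'd, a tagged index into a lookup table. Purely hypothetical; this is not how aset.c works today, and the table and names are invented:)

    static MemoryContext context_table[1024];   /* registered shareable contexts */

    /* Encode a table index so the value can never look like a MAXALIGN'd pointer. */
    static inline uintptr_t
    chunk_header_from_index(int index)
    {
        return ((uintptr_t) index << 1) | 1;    /* odd value => not MAXALIGN'd */
    }

    static inline MemoryContext
    chunk_header_get_context(uintptr_t header)
    {
        if (header % MAXIMUM_ALIGNOF == 0)
            return (MemoryContext) header;      /* ordinary back pointer   */
        return context_table[header >> 1];      /* look up the real context */
    }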
On Tue, Nov 13, 2018 at 9:45 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Nov 8, 2018 at 9:05 PM Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > * I had some ideas about some kind of "allocation rollback" interface: > > you begin an "allocation transaction", allocate a bunch of stuff > > (perhaps indirectly, by calling some API that makes query plans or > > whatever and is totally unaware of this stuff). Then if there is an > > error, whatever was allocated so far is freed in the usual cleanup > > paths by a rollback that happens via the resource manager machinery. > > If you commit, then the allocation becomes permanent. Then you only > > commit stuff that you promise not to leak (perhaps stuff that has been > > added to a very carefully managed cluster-wide plan cache). I am not > > sure of the details, and this might be crazy... > > Hmm, my first thought was that you were just reinventing memory > contexts, but it's really not quite the same thing, because you want > the allocations to end up owned by a long-lived context when you > succeed but a transient context when you fail. Still, if it weren't > for the fact that the memory context interface is hostile to dynamic > shared memory's map-this-anywhere semantics, I suspect we'd try to > find a way to make memory contexts fit the bill, maybe by reparenting > contexts or even individual allocations. You could imagine using the > sorts of union algorithms that are described in > https://en.wikipedia.org/wiki/Disjoint-set_data_structure to get very > low asymptotic complexity here. Yeah, I was imagining something that still works with MemoryContexts. Interesting idea re: unions. I haven't got as far as thinking about how you'd actually make that work. But I think we're both describing the same general kind of semantics; you need to be able to build stuff with automatic clean-up on non-local exit, but also keep that stuff for longer once you decide you're ready. Anyway, avoiding all the hard questions about new kinds of foot gun for now, here is a stupid hack that shows a DSA area inside the traditional fixed-address shared memory region, wrapped in a custom (and somewhat defective for now) MemoryContext. It shows a "List" being manipulated by standard PostgreSQL list code that is entirely unaware that it is working in shared memory: postgres=# call hoge_list(3); NOTICE: Contents of list: NOTICE: 1 NOTICE: 2 NOTICE: 3 CALL You can call that procedure from various different backends to add and remove integers from a List. The point is that it could just as easily be a query plan or any other palloc-based stuff in our tree, and the pointers work in any backend. -- Thomas Munro http://www.enterprisedb.com
Attachment
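(The actual code is in the attachment above; purely as a guess at its shape, the demo procedure might look something like the following, where shared_context is a DSA-backed MemoryContext over the fixed-address area and shared_list_head is a List pointer living in shared memory — all of these names are assumptions, not the real patch:)

    #include "postgres.h"
    #include "fmgr.h"
    #include "nodes/pg_list.h"

    /* assumed globals set up elsewhere during shared memory initialization */
    extern MemoryContext shared_context;
    extern List **shared_list_head;

    PG_FUNCTION_INFO_V1(hoge_list);

    Datum
    hoge_list(PG_FUNCTION_ARGS)
    {
        int32       n = PG_GETARG_INT32(0);
        MemoryContext oldcontext;
        ListCell   *lc;

        /* ordinary list code, unaware that it is allocating in shared memory */
        oldcontext = MemoryContextSwitchTo(shared_context);
        *shared_list_head = lappend_int(*shared_list_head, n);
        MemoryContextSwitchTo(oldcontext);

        elog(NOTICE, "Contents of list:");
        foreach(lc, *shared_list_head)
            elog(NOTICE, "%d", lfirst_int(lc));

        PG_RETURN_VOID();
    }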
Thank you for the comment. From: Thomas Munro [mailto:thomas.munro@enterprisedb.com] >> I'm thinking to go with plan 1. No need to think about address >> translation seems tempting. Plan 2 (as well as plan 3) looks a big project. > >The existing function dsa_create_in_place() interface was intended to support that, >but has never been used in that way so I'm not sure what extra problems will come up. >Here are some assorted thoughts: > >* You can prevent a DSA area from creating extra DSM segments, so that it is >constrained to stay entirely in the space you give it, by calling dsa_set_size_limit(area, >size) using the same size that you gave to dsa_create_in_place(); now you have a >DSA area that manages a single fixed-sized chunk of memory that you gave it, in your >case inside the traditional shared memory segment (but it could be anywhere, >including inside a DSM segment or another DSA area!) Yeah, I will use it. >* You can probably write a MemoryContext wrapper for it, if it has only one segment >that is in the traditional shared memory segment. >You would need to do very simple kind of address translation: the result from palloc() >needs to be base + dsa_allocate()'s result, and the argument to pfree() needs to be >subtracted from base when >dsa_free() is called. That is a version of your idea C that should work AFAIK. I didn't notice that if only one segment is used dsa_get_address() is not needed and simple math is enough. >* Once you have that working, you now have a new kind of resource management >problem on your hands: memory leaks will be cluster-wide and cluster-life-time! >That's hard, because the goal is to be able to use arbitrary code in the tree that deals >with plans etc, but that code all assumes that it can "throw" (elog()) on errors. >PostgreSQL C is generally "garbage collected" (in a way), but in this sketch, that >doesn't work anymore: this area *never* goes out of scope and gets cleaned up. >Generally, languages with exceptions either need garbage collection or scoped >destructors to clean up the mess, but in this sketch we don't have that anymore... >much like allocating stuff in TopMemoryContext, except worse because it doesn't go >away when one backend exits. > >* I had some ideas about some kind of "allocation rollback" interface: >you begin an "allocation transaction", allocate a bunch of stuff (perhaps indirectly, by >calling some API that makes query plans or whatever and is totally unaware of this >stuff). Then if there is an error, whatever was allocated so far is freed in the usual >cleanup paths by a rollback that happens via the resource manager machinery. >If you commit, then the allocation becomes permanent. Then you only commit stuff >that you promise not to leak (perhaps stuff that has been added to a very carefully >managed cluster-wide plan cache). I am not sure of the details, and this might be >crazy... Can I check my understanding? The situation you are talking about is the following: Data structure A and B will be allocated on DSA. Data A has a pointer to B. Another data X is already allocated on some area (might be on local heap) and has a pointer variable X->a, which will be linked to A. old_context = MemoryContextSwitchTo(dsa_memory_context); A = palloc(); B = palloc(); A->b = B; X->a = A; MemoryContextSwitchTo(old_context); If error happens in the way of this flow, palloc'd data (A & B) should be freed and pointer to freed data (X->a) should be back to its original one. 
So handling these cases would introduce a "transaction" API like begin_allocate() and end_allocate(). I'm thinking that begin_allocate() starts keeping a record of palloc'd data until end_allocate() is called; if an error occurs, we just pfree() that data. However, to roll back pointers we also need to take note of each pointer's old value. That would introduce another new API, something like points_to_pallocd_data(pointer, new_data), which remembers the old value and then does `X->a = A`. To implement these I still need to think further about how to fit them into, or extend, the existing MemoryContext machinery.

Regards,
Takeshi Ideriha
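(A minimal sketch of the hypothetical points_to_pallocd_data() idea above, just to make the shape of the API concrete; the log structure and all names are invented:)

    typedef struct PointerRollbackEntry
    {
        void      **location;   /* address of the pointer that was overwritten */
        void       *old_value;  /* value to restore on error                   */
    } PointerRollbackEntry;

    static List *pointer_rollback_log = NIL;    /* reset by end_allocate() */

    static void
    points_to_pallocd_data(void **location, void *new_data)
    {
        PointerRollbackEntry *entry = palloc(sizeof(PointerRollbackEntry));

        entry->location = location;
        entry->old_value = *location;
        pointer_rollback_log = lappend(pointer_rollback_log, entry);
        *location = new_data;               /* e.g. X->a = A */
    }

On error, the cleanup path would walk pointer_rollback_log, restore each *location to its old_value, and free the newly allocated data; on end_allocate() the log would simply be discarded.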
From: Thomas Munro [mailto:thomas.munro@enterprisedb.com] > >On Tue, Nov 13, 2018 at 9:45 AM Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Nov 8, 2018 at 9:05 PM Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >> > * I had some ideas about some kind of "allocation rollback" interface: >> > you begin an "allocation transaction", allocate a bunch of stuff >> > (perhaps indirectly, by calling some API that makes query plans or >> > whatever and is totally unaware of this stuff). Then if there is an >> > error, whatever was allocated so far is freed in the usual cleanup >> > paths by a rollback that happens via the resource manager machinery. >> > If you commit, then the allocation becomes permanent. Then you only >> > commit stuff that you promise not to leak (perhaps stuff that has >> > been added to a very carefully managed cluster-wide plan cache). I >> > am not sure of the details, and this might be crazy... >> >> Hmm, my first thought was that you were just reinventing memory >> contexts, but it's really not quite the same thing, because you want >> the allocations to end up owned by a long-lived context when you >> succeed but a transient context when you fail. Still, if it weren't >> for the fact that the memory context interface is hostile to dynamic >> shared memory's map-this-anywhere semantics, I suspect we'd try to >> find a way to make memory contexts fit the bill, maybe by reparenting >> contexts or even individual allocations. You could imagine using the >> sorts of union algorithms that are described in >> https://en.wikipedia.org/wiki/Disjoint-set_data_structure to get very >> low asymptotic complexity here. > >Yeah, I was imagining something that still works with MemoryContexts. >Interesting idea re: unions. I haven't got as far as thinking about how you'd actually >make that work. But I think we're both describing the same general kind of >semantics; you need to be able to build stuff with automatic clean-up on non-local exit, >but also keep that stuff for longer once you decide you're ready. > >Anyway, avoiding all the hard questions about new kinds of foot gun for now, here is a >stupid hack that shows a DSA area inside the traditional fixed-address shared memory >region, wrapped in a custom (and somewhat defective for now) MemoryContext. It >shows a "List" >being manipulated by standard PostgreSQL list code that is entirely unaware that it is >working in shared memory: > >postgres=# call hoge_list(3); >NOTICE: Contents of list: >NOTICE: 1 >NOTICE: 2 >NOTICE: 3 >CALL > >You can call that procedure from various different backends to add and remove >integers from a List. The point is that it could just as easily be a query plan or any >other palloc-based stuff in our tree, and the pointers work in any backend. Thanks for the patch. I know it's a rough sketch but basically that's what I want to do. I tried it and it works well. Regards, Takeshi Ideriha
On Tue, Nov 13, 2018 at 10:59 PM Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com> wrote: > Can I check my understanding? > The situation you are talking about is the following: > Data structure A and B will be allocated on DSA. Data A has a pointer to B. > Another data X is already allocated on some area (might be on local heap) and has a pointer variable X->a, > which will be linked to A. > > old_context = MemoryContextSwitchTo(dsa_memory_context); > A = palloc(); > B = palloc(); > A->b = B; > X->a = A; > MemoryContextSwitchTo(old_context); > > If error happens in the way of this flow, palloc'd data (A & B) should be freed > and pointer to freed data (X->a) should be back to its original one. > So handling these cases introduces an "transaction" API like begin_allocate() and end_allocate(). I wasn't thinking of resetting X->a to its original value, just freeing A and B. For a shared cache in this memory area, you need to make sure that you never leak memory; it must either be referenced by the cache book keeping data structures so that you can find it and manually free it eventually (probably with an LRU system?), or it must somehow be cleaned up if an error occurs before you manage to insert it into the cache. During construction of an object graph, before you attach it to your manual book keeping data structures, there is a danger zone. For example, what if, while copying a query plan or a catcache entry, we successfully allocate a new cache entry with palloc(), but then another palloc() call fails because we run out of memory when copying the keys? It looks like the current catcache.c coding just doesn't care: memory will be leaked permanently in CacheMemoryContext. That's OK because it is process-local. Such a process is probably doomed anyway. If you exit and then start a new backend the problem is gone. Here we need something better, because once this new kind of memory fills up with leaked objects, you have to restart the whole cluster to recover. If the solution involves significantly different control flows (like PG_TRY(), PG_FINALLY() manual cleanup all over the place), then lots of existing palloc-based code will need to be reviewed and rewritten very carefully for use in this new kind of memory context. If it can be done automagically, then more of our existing code will be able to participate in the construction of stuff that we want to put in shared memory. I first thought about this in the context of shared plan caches, a topic that appears on this mailing list from time to time. There are some other fun problems, like cache invalidation for shared caches, space recycling (LRUs etc), and of course locking/concurrency (dsa_allocate() and dsa_free() are concurrency safe, but whatever data structures you build with the memory of course need their own plan for dealing with concurrency). But a strategy for clean-up on errors seems to be a major subproblem to deal with early in the project. I will be very happy if we can come up with something :-) On Wed, Nov 14, 2018 at 12:40 AM Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com> wrote: > From: Thomas Munro [mailto:thomas.munro@enterprisedb.com] > >postgres=# call hoge_list(3); > >NOTICE: Contents of list: > >NOTICE: 1 > >NOTICE: 2 > >NOTICE: 3 > >CALL > > > >You can call that procedure from various different backends to add and remove > >integers from a List. The point is that it could just as easily be a query plan or any > >other palloc-based stuff in our tree, and the pointers work in any backend. > > Thanks for the patch. 
I know it's a rough sketch but basically that's what I want to do. > I tried it and it works well. After your message I couldn't resist a small experiment. But it seems there are plenty more problems to be solved to make it useful... -- Thomas Munro http://www.enterprisedb.com
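(For illustration, the kind of manual-cleanup control flow mentioned above might look like this; the variables area, entry_size and keys, and the helpers copy_keys_into() and insert_into_shared_hash(), are imaginary stand-ins for cache-specific code:)

    dsa_pointer new_entry = dsa_allocate(area, entry_size);

    PG_TRY();
    {
        /* any of this may elog(ERROR), e.g. on out-of-memory */
        copy_keys_into(area, new_entry, keys);
        insert_into_shared_hash(new_entry);   /* now reachable from the cache bookkeeping */
    }
    PG_CATCH();
    {
        dsa_free(area, new_entry);            /* avoid a cluster-lifetime leak */
        PG_RE_THROW();
    }
    PG_END_TRY();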
Hi >From: Thomas Munro [mailto:thomas.munro@enterprisedb.com] >Sent: Wednesday, November 14, 2018 9:50 AM >To: Ideriha, Takeshi/出利葉 健 <ideriha.takeshi@jp.fujitsu.com> > >On Tue, Nov 13, 2018 at 10:59 PM Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com> >wrote: >> Can I check my understanding? >> The situation you are talking about is the following: >> Data structure A and B will be allocated on DSA. Data A has a pointer to B. >> Another data X is already allocated on some area (might be on local >> heap) and has a pointer variable X->a, which will be linked to A. >> >> old_context = MemoryContextSwitchTo(dsa_memory_context); >> A = palloc(); >> B = palloc(); >> A->b = B; >> X->a = A; >> MemoryContextSwitchTo(old_context); >> >> If error happens in the way of this flow, palloc'd data (A & B) should >> be freed and pointer to freed data (X->a) should be back to its original one. >> So handling these cases introduces an "transaction" API like begin_allocate() and >end_allocate(). > >I wasn't thinking of resetting X->a to its original value, just freeing A and B. For a >shared cache in this memory area, you need to make sure that you never leak >memory; it must either be referenced by the cache book keeping data structures so >that you can find it and manually free it eventually (probably with an LRU system?), or >it must somehow be cleaned up if an error occurs before you manage to insert it into >the cache. During construction of an object graph, before you attach it to your >manual book keeping data structures, there is a danger zone. > >For example, what if, while copying a query plan or a catcache entry, we successfully >allocate a new cache entry with palloc(), but then another palloc() call fails because >we run out of memory when copying the keys? It looks like the current catcache.c >coding just doesn't >care: memory will be leaked permanently in CacheMemoryContext. That's OK >because it is process-local. Such a process is probably doomed anyway. If you exit >and then start a new backend the problem is gone. >Here we need something better, because once this new kind of memory fills up with >leaked objects, you have to restart the whole cluster to recover. Thank you for the clarification. I agree that leaving memory leaking in shared memory without freeing it ends up cluster-wide problem. >If the solution involves significantly different control flows (like PG_TRY(), >PG_FINALLY() manual cleanup all over the place), then lots of existing palloc-based >code will need to be reviewed and rewritten very carefully for use in this new kind of >memory context. If it can be done automagically, then more of our existing code will >be able to participate in the construction of stuff that we want to put in shared >memory. About resetting X->a I was thinking about how to treat dangling pointer inside a new memory context machinery instead of using PG_TRY() stuff. That's because putting PG_TRY() all over the place becomes responsible for a developer trying to shared-memory-version MemoryContext and it needs a lot of efforts as you noted. But in my mind even if PG_TRY() is used, it only affects new code uses shared-memory version MemoryContext and doesn't affect current code. I may be missing something here... >I first thought about this in the context of shared plan caches, a topic that appears on >this mailing list from time to time. 
There are some other fun problems, like cache >invalidation for shared caches, space recycling (LRUs etc), and of course >locking/concurrency >(dsa_allocate() and dsa_free() are concurrency safe, but whatever data structures >you build with the memory of course need their own plan for dealing with concurrency). >But a strategy for clean-up on errors seems to be a major subproblem to deal with >early in the project. I will be very happy if we can come up with something :-) Yeah, I hope we can do it somehow. >On Wed, Nov 14, 2018 at 12:40 AM Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com> >wrote: >> From: Thomas Munro [mailto:thomas.munro@enterprisedb.com] >> >postgres=# call hoge_list(3); >> >NOTICE: Contents of list: >> >NOTICE: 1 >> >NOTICE: 2 >> >NOTICE: 3 >> >CALL >> > >> >You can call that procedure from various different backends to add >> >and remove integers from a List. The point is that it could just as >> >easily be a query plan or any other palloc-based stuff in our tree, and the pointers >work in any backend. >> >> Thanks for the patch. I know it's a rough sketch but basically that's what I want to >do. >> I tried it and it works well. > >After your message I couldn't resist a small experiment. But it seems there are >plenty more problems to be solved to make it useful... Thank you for kind remark. By the way I just thought meta variable "hoge" is used only in Japan :) Regards, Takeshi Ideriha
Hello.

At Wed, 28 Nov 2018 05:13:26 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in <4E72940DA2BF16479384A86D54D0988A6F3BD73A@G01JPEXMBKW04>
> Hi
>
> >From: Thomas Munro [mailto:thomas.munro@enterprisedb.com]
> >Here we need something better, because once this new kind of memory fills up with
> >leaked objects, you have to restart the whole cluster to recover.
>
> Thank you for the clarification. I agree that leaving memory leaking in shared memory
> without freeing it ends up cluster-wide problem.

If some of the "leaked" objects are linked with other "live" objects, including objects in local memory, we must tell whether each "maybe-leaked" object is actually a leak or not. That looks clearly unreasonable. Maybe we shouldn't manipulate objects in shared memory through such a low-level interface. We would need a kind of resource manager for the structures holding shared objects, and should (perhaps explicitly) register the objects that should be freed when error-exiting from a scope. This seems a bit different from the reset callbacks of the current MemoryContext.

> >If the solution involves significantly different control flows (like PG_TRY(),
> >PG_FINALLY() manual cleanup all over the place), then lots of existing palloc-based
> >code will need to be reviewed and rewritten very carefully for use in this new kind of
> >memory context. If it can be done automagically, then more of our existing code will
> >be able to participate in the construction of stuff that we want to put in shared
> >memory.
>
> About resetting X->a I was thinking about how to treat dangling pointer inside a new memory
> context machinery instead of using PG_TRY() stuff. That's because putting PG_TRY() all over the
> place becomes responsible for a developer trying to shared-memory-version MemoryContext and
> it needs a lot of efforts as you noted. But in my mind even if PG_TRY() is used, it only affects new
> code uses shared-memory version MemoryContext and doesn't affect current code.
> I may be missing something here...

Thomas mentioned existing upper-layer library code using palloc, like pg_list. As I wrote above, pg_list is one such feature that would need to be carefully adjusted.

> >I first thought about this in the context of shared plan caches, a topic that appears on
> >this mailing list from time to time. There are some other fun problems, like cache
> >invalidation for shared caches, space recycling (LRUs etc), and of course
> >locking/concurrency (dsa_allocate() and dsa_free() are concurrency safe, but whatever data structures
> >you build with the memory of course need their own plan for dealing with concurrency).
> >But a strategy for clean-up on errors seems to be a major subproblem to deal with
> >early in the project. I will be very happy if we can come up with something :-)
>
> Yeah, I hope we can do it somehow.

In addition to the recycling issue within a fixed area, I suppose that we will need to "expand" the fixed-address shared memory. PROT_NONE could work for that, but it would break some platforms..

> >On Wed, Nov 14, 2018 at 12:40 AM Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com> wrote:
> >> From: Thomas Munro [mailto:thomas.munro@enterprisedb.com]
> >> >postgres=# call hoge_list(3);
> >> >NOTICE: Contents of list:
> >> >NOTICE: 1
> >> >NOTICE: 2
> >> >NOTICE: 3
> >> >CALL
> >> >
> >> >You can call that procedure from various different backends to add
> >> >and remove integers from a List. The point is that it could just as
> >> >easily be a query plan or any other palloc-based stuff in our tree, and the pointers
> >> >work in any backend.
> >>
> >> Thanks for the patch. I know it's a rough sketch but basically that's what I want to do.
> >> I tried it and it works well.
> >
> >After your message I couldn't resist a small experiment. But it seems there are
> >plenty more problems to be solved to make it useful...
>
> Thank you for kind remark.
> By the way I just thought meta variable "hoge" is used only in Japan :)

I've seen it many, many times on this list, including in my own mail, so surely you know it is already world-wide jargon? :p

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi, thank you for the comment. >-----Original Message----- >From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] >Sent: Wednesday, November 28, 2018 5:09 PM > >Hello. > >At Wed, 28 Nov 2018 05:13:26 +0000, "Ideriha, Takeshi" ><ideriha.takeshi@jp.fujitsu.com> wrote in ><4E72940DA2BF16479384A86D54D0988A6F3BD73A@G01JPEXMBKW04> >> Hi >> >> >From: Thomas Munro [mailto:thomas.munro@enterprisedb.com] >> >Here we need something better, because once this new kind of memory >> >fills up with leaked objects, you have to restart the whole cluster to recover. >> >> Thank you for the clarification. I agree that leaving memory leaking >> in shared memory without freeing it ends up cluster-wide problem. > >If some of the "leaked" objects are liked each other with other "live" objects including >those on local memory, we must tell whether each "maybe-leaked" object is actually >a leak or not. That looks apprently stupid. Maybe we shouldn't manipulate objects on >shared memory at such a low-level interface. We would need a kind of resource >manager for such structures holding shared objects and should (perhaps explicitly) >register the objects that shuold be free'd on error-exiting from a scope. This seems a >bit different from reset callbacks of the current MemoryContext. Sure. We want to get rid of dangling pointer at the error handling but this should be transparent. >> >If the solution involves significantly different control flows (like >> >PG_TRY(), >> >PG_FINALLY() manual cleanup all over the place), then lots of >> >existing palloc-based code will need to be reviewed and rewritten >> >very carefully for use in this new kind of memory context. If it can >> >be done automagically, then more of our existing code will be able to >> >participate in the construction of stuff that we want to put in shared memory. >> >> About resetting X->a I was thinking about how to treat dangling >> pointer inside a new memory context machinery instead of using >> PG_TRY() stuff. That's because putting PG_TRY() all over the place >> becomes responsible for a developer trying to shared-memory-version >> MemoryContext and it needs a lot of efforts as you noted. But in my mind even if >PG_TRY() is used, it only affects new code uses shared-memory version >MemoryContext and doesn't affect current code. >> I may be missing something here... > >Thomas mentioned exiting upper-layer library code using palloc, like pg_list. As I >wrote above, pg_list is one of such feature that would need to be carefully adjusted. Thank you for the clarification. I think I was misunderstanding. In some previous email I suggested "points_to_pallocd_data(pointer, new_data)" to remember old_value and do reset a pointer at the time of error handling. Yeah, this interface needs to be implemented at existing function wrapping palloc and set the pointer. >> >I first thought about this in the context of shared plan caches, a >> >topic that appears on this mailing list from time to time. There are >> >some other fun problems, like cache invalidation for shared caches, >> >space recycling (LRUs etc), and of course locking/concurrency >> >(dsa_allocate() and dsa_free() are concurrency safe, but whatever >> >data structures you build with the memory of course need their own plan for >dealing with concurrency). >> >But a strategy for clean-up on errors seems to be a major subproblem >> >to deal with early in the project. I will be very happy if we can >> >come up with something :-) >> >> Yeah, I hope we can do it somehow. 
>
> In addition to the recycling issue within a fixed area, I suppose that we will need to
> "expand" the fixed-address shared memory. PROT_NONE could work for that, but it
> would break some platforms..

If we are only talking about DSA created in place inside the Postgres-initialized shared area, I think we can prevent the memory from expanding by using dsa_set_size_limit().

>> Thank you for kind remark.
>> By the way I just thought meta variable "hoge" is used only in Japan :)
>
> I've seen it many, many times on this list, including in my own mail, so surely you know it is already world-wide jargon? :p

Ah, I didn't know that, thanks!

Regards,
Takeshi Ideriha
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>Sent: Wednesday, December 5, 2018 2:42 PM
>Subject: RE: Copy data to DSA area

Hi,

It's been a long while since we discussed this topic. Let me recap first and then give some thoughts.

Things we seem to have consensus on:
- We want to palloc/pfree transparently in DSA.
- We will use Postgres-initialized shared memory as the DSA area.
- We must not leak memory in shared memory.

Things under discussion:
- How we prevent memory leaks.
- How we prevent dangling pointers after cleaning up about-to-leak objects.

Regarding memory leaks, I think Robert's idea of allocating objects under a temporary context while they are being built and re-parenting them to a permanent one at some point is promising. While objects are being built they live under a temporary DSA MemoryContext, which is a child of TopTransactionContext (if we're inside a transaction), and they are freed all at once if an error happens. To delete all the chunks allocated under the temporary DSA context, we need to search for, or remember, the location of every chunk under that context; unlike AllocSet, we don't have block information that lets us delete them all together.

So I'm thinking of managing dsa_allocated chunks with a singly linked list to keep track of them and delete them: the context holds the head of the list, and every chunk has a pointer to the next allocated chunk. This adds space overhead to every dsa_allocated chunk, which we may want to avoid because shared memory is limited. In that case we could free the pointer area at the point where we are sure the allocation has become permanent.

Another question is when we should consider an allocation successful (i.e. when we are sure the object won't leak). If the allocation is done inside a transaction, we can consider it safe once the transaction commits. Alternatively, since I assume this DSA memory context is for caches such as the relcache, catcache, plancache and so on, a cache entry won't leak once it has been added to a hash table or list, because I assume some eviction mechanism like LRU will be implemented and will erase useless cache entries some time later.

What do you think about these ideas?

Regarding dangling pointers, I think they are also a problem: after cleaning up objects to prevent a memory leak, we have no mechanism to reset the pointers that referenced them. I gave this some thought a while ago, though begin_allocate/end_allocate don't seem like good names; more descriptive names would be something like start_pointing_to_dsa_object_under_construction() and end_pointing_to_dsa_object_under_construction().
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F1F259F%40G01JPEXMBKW04
If we can make sure that such dangling pointers never happen, we don't need this. As Thomas mentioned before, where these interfaces should go needs review, but I couldn't hit upon another solution right now.

Do you have any thoughts?

best regards,
Ideriha, Takeshi
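(To make the chunk-tracking idea in the message above concrete, a minimal sketch — all names invented — of a temporary DSA context that chains its chunks so they can be freed together on error:)

    typedef struct dsa_temp_chunk
    {
        dsa_pointer next;           /* previously allocated chunk in this context */
        /* payload follows */
    } dsa_temp_chunk;

    typedef struct dsa_temp_context
    {
        dsa_area   *area;
        dsa_pointer head;           /* most recently allocated chunk, or InvalidDsaPointer */
    } dsa_temp_context;

    static void *
    dsa_temp_alloc(dsa_temp_context *cxt, Size size)
    {
        dsa_pointer dp = dsa_allocate(cxt->area, sizeof(dsa_temp_chunk) + size);
        dsa_temp_chunk *chunk = dsa_get_address(cxt->area, dp);

        chunk->next = cxt->head;    /* push onto the context's chunk list */
        cxt->head = dp;
        return (char *) chunk + sizeof(dsa_temp_chunk);
    }

    /* On error: walk the list and give everything back to the DSA area. */
    static void
    dsa_temp_free_all(dsa_temp_context *cxt)
    {
        while (DsaPointerIsValid(cxt->head))
        {
            dsa_temp_chunk *chunk = dsa_get_address(cxt->area, cxt->head);
            dsa_pointer next = chunk->next;

            dsa_free(cxt->area, cxt->head);
            cxt->head = next;
        }
    }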
Hi, I've updated Thomas's quick PoC. >From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com] >Sent: Wednesday, April 17, 2019 2:07 PM >>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com] >>Sent: Wednesday, December 5, 2018 2:42 PM >>Subject: RE: Copy data to DSA area > >Things under discussion: >- How we prevent memory leak >- How we prevent dangling pointer after cleaning up about-to-leak-objects > >Regarding memory leak, I think Robert's idea that allocate objects under temporal >context while building and re-parent it to permanent one at some point is promising. >While building objects they are under temporal DSA-MemoryContext, which is child of >TopTransactionContext (if it's in the transaction) and are freed all at once when error >happens. >To do delete all the chunks allocated under temporal DSA context, we need to search >or remember all chunks location under the context. Unlike AllocAset we don't have >block information to delete them altogether. > >So I'm thinking to manage dsa_allocated chunks with single linked list to keep track of >them and delete them. >The context has head of linked list and all chunks have pointer to next allocated chunk. >But this way adds space overhead to every dsa_allocated chunk and we maybe want >to avoid it because shared memory size is limited. >In this case, we can free these pointer area at some point when we make sure that >allocation is successful. > >Another debate is when we should think the allocation is successful (when we make >sure object won't leak). >If allocation is done in the transaction, we think if transaction is committed we can >think it's safe. >Or I assume this DSA memory context for cache such as relcache, catcache, plancache >and so on. >In this case cache won't leak once it's added to hash table or list because I assume >some eviction mechanism like LRU will be implemented and it will erase useless cache >some time later. I changed design from my last email. I've introduced "local" and "shared" MemoryContext for DSA. Local means MemoryContext object is allocated on local heap and shared means on shared memory. While dsa_allocating, operation should be done on local context and after developer thinks they don't leak, set local context to shared one. Dsa_pointer to chunks is stored in the array linked by local context. When error happens before chunks become "shared", all chunks are freed by checking the array. On the other hand, chunk gets "committed" and become shared, we don't need that array of pointers. This PoC has three test cases, which is updated version of Thomas's ones. - hoge() : palloc(), pfree() then set the context shared and do it again - hoge_list(): add chunk to shared single linked list and set context to shared - hoge_list_error(): add chunk to linked list but error happens before it becomes shared one, so free the chunk to prevent memory leak Well, after developing PoC, I realized that this PoC doesn't solve the local process is crashed before the context becomes shared because local process keeps track of pointer to chunks. Maybe all of you have already noticed and pointed out this case :) So it needs another work but this poc is a good step for me to advance more. Another thing is that I don't want put backpointer to MemoryContext before every chunks since it has some overhead in limited shared memory. But pfree uses it so I compromised it. And when set local context to shared one, I need to change every backpointer to shared MemoryContext so it has some cost. 
I think there is a more efficient way. (Maybe Robert mentioned it in a previous email?)

>Regarding dangling pointers, I think they are also a problem:
>after cleaning up objects to prevent a memory leak, we have no mechanism to reset
>the pointers that referenced them.

I haven't addressed the dangling pointer yet. In fact, calling hoge_list() after hoge_list_error() has been executed crashes the backend. That seems to be because the allocated chunk is freed after the error but the tail pointer of the shared linked list is not restored, so it becomes a dangling pointer. So this would be a good example of the dangling pointer problem.

Regards,
Takeshi Ideriha
Attachment
Hi,

>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>Sent: Friday, April 26, 2019 11:50 PM
>Well, after developing PoC, I realized that this PoC doesn't solve the local process is
>crashed before the context becomes shared because local process keeps track of
>pointer to chunks.
>Maybe all of you have already noticed and pointed out this case :) So it needs another
>work but this poc is a good step for me to advance more.

I think the key to preventing memory leaks is allocating memory and storing its address into a tracking structure at the same time, and that structure should be trackable from other processes. So I'm going to change the design.

I'll allocate a buffer of pointers to DSA objects which are not yet permanent. This buffer is located in shared memory. For simplicity, the buffer is allocated per process, i.e. there are MaxBackends of them. The developer calls an API to make objects permanent. When objects become permanent, the corresponding pointers are deleted from the buffer. If they fail to become permanent, they are freed from shared memory by checking the buffer, and the corresponding pointers are also deleted from the buffer.

There are two cases of failure: transaction abort and process crash. If a transaction aborts, the process itself is responsible for freeing the objects, and does so. In the case of a process crash, another process needs to take care of it; I'm assuming an on_shmem_exit callback can be used so the objects get freed during shared memory cleanup by the postmaster.

Regarding MemoryContext, one shared MemoryContext lives in shared memory and each chunk has a back pointer to it, for pfree and so on.

I'm going to make a new patch and send it later.

Regards,
Takeshi Ideriha
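(A rough sketch — names and sizes invented — of the per-backend tracking buffer described above, kept in traditional shared memory so that the dsa_pointers of not-yet-permanent objects remain visible independently of the allocating backend's local state:)

    #define MAX_PENDING_DSA_CHUNKS 64   /* arbitrary illustrative limit */

    typedef struct PendingDsaChunks
    {
        int         nchunks;
        dsa_pointer chunks[MAX_PENDING_DSA_CHUNKS];
    } PendingDsaChunks;

    /* One slot per backend, sized by MaxBackends at shared memory creation time. */
    static PendingDsaChunks *pending_chunks_array;

    static void
    record_pending_chunk(dsa_pointer dp)
    {
        PendingDsaChunks *mine = &pending_chunks_array[MyBackendId - 1];

        Assert(mine->nchunks < MAX_PENDING_DSA_CHUNKS);
        mine->chunks[mine->nchunks++] = dp;
    }

    /* Called once the objects are reachable from the cache's own bookkeeping. */
    static void
    make_chunks_permanent(void)
    {
        pending_chunks_array[MyBackendId - 1].nchunks = 0;
    }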
On Wed, May 8, 2019 at 5:29 PM Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com> wrote:
> >From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
> >Sent: Friday, April 26, 2019 11:50 PM
> >Well, after developing PoC, I realized that this PoC doesn't solve the local process is
> >crashed before the context becomes shared because local process keeps track of
> >pointer to chunks.
> >Maybe all of you have already noticed and pointed out this case :) So it needs another
> >work but this poc is a good step for me to advance more.
>
> I think the point to prevent memory leak is allocating memory and storing its
> address into a structure at the same time. This structure should be trackable from
> other process.

I'm not sure that it's necessarily wrong to keep tracking information in private memory. If any backend crashes, the postmaster will terminate all backends and reinitialise everything anyway, because shared memory might be corrupted.

--
Thomas Munro
https://enterprisedb.com
Hi, Thomas

>-----Original Message-----
>From: Thomas Munro [mailto:thomas.munro@gmail.com]
>Subject: Re: Copy data to DSA area
>
>On Wed, May 8, 2019 at 5:29 PM Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com>
>wrote:
>> >From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>> >Sent: Friday, April 26, 2019 11:50 PM Well, after developing PoC, I
>> >realized that this PoC doesn't solve the local process is crashed
>> >before the context becomes shared because local process keeps track
>> >of pointer to chunks.
>> >Maybe all of you have already noticed and pointed out this case :) So
>> >it needs another work but this poc is a good step for me to advance more.
>>
>> I think the point to prevent memory leak is allocating memory and
>> storing its address into a structure at the same time. This structure
>> should be trackable from other process.
>
>I'm not sure that it's necessarily wrong to keep tracking information in private memory.
>If any backend crashes, the postmaster will terminate all backends and reinitialise
>everything anyway, because shared memory might be corrupted.

Thank you very much for the clarification. I hadn't looked into the reinitialization sequence much. I checked that CreateSharedMemoryAndSemaphores() is called again and shared memory gets reinitialized. So I'm going to keep the tracking information in private memory and send a patch.

Regards,
Takeshi Ideriha
Hi,

>>From: Thomas Munro [mailto:thomas.munro@gmail.com]
>>Subject: Re: Copy data to DSA area
>>
>>On Wed, May 8, 2019 at 5:29 PM Ideriha, Takeshi
>><ideriha.takeshi@jp.fujitsu.com>
>>wrote:
>>> >From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>>> >Sent: Friday, April 26, 2019 11:50 PM Well, after developing PoC, I
>>> >realized that this PoC doesn't solve the local process is crashed
>>> >before the context becomes shared because local process keeps track
>>> >of pointer to chunks.
>>> >Maybe all of you have already noticed and pointed out this case :)
>>> >So it needs another work but this poc is a good step for me to advance more.
>>>
>>> I think the point to prevent memory leak is allocating memory and
>>> storing its address into a structure at the same time. This structure
>>> should be trackable from other process.
>>
>>I'm not sure that it's necessarily wrong to keep tracking information in private memory.
>>If any backend crashes, the postmaster will terminate all backends and
>>reinitialise everything anyway, because shared memory might be corrupted.
>
>I'm going to put keep tracking information in private
>memory and send a patch.

I've updated the PoC patch. It keeps a memory-tracking buffer in the local process; the old version had the same mechanism, but I refactored the code. Memory leaks while allocating memory seem to be solved thanks to the tracking buffer.

What I haven't addressed yet is memory leaks while freeing objects. In the current sequence a cache entry (e.g. in relcache) is freed after it has been removed from its hash table. If the cache and the hash table become shared, a leak can happen between removal from the hash table and the free: we lose track of the cache objects if an error happens after the entry is unlinked from the hash table. A cache entry is also a graph of objects, so we have to deal with partially freed entries as well. Maybe we need to remember the pointers of the objects before unlinking them from the hash table, and free them all at once only after we are sure all the pointers are registered in the local buffer. Here is one idea to implement this (see the sketch below):
- change the order of removal from the hash table and deletion
- pfree() in the shared memory context does not dsa_free() but just adds the pointer to the local buffer
- remove the link from the hash table after all pfree() calls are done
- then call a function which does the actual dsa_free() by looking at the local buffer

But I'm not sure this is a good solution. Do you have any thoughts?

Regards,
Takeshi Ideriha
Attachment
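A possible shape for the deferred-free ordering listed above (hypothetical names; only dsa_free(), palloc() and the List APIs are existing PostgreSQL facilities):

/*
 * pfree() on a shared chunk only records the dsa_pointer; the actual
 * dsa_free() happens after the entry has been unlinked from the shared hash
 * table, so an error in between cannot lose track of the memory.
 */
#include "postgres.h"
#include "nodes/pg_list.h"
#include "utils/dsa.h"

typedef struct DeferredFree
{
    dsa_pointer dp;
} DeferredFree;

static List *deferred_free_list = NIL;  /* backend-local */

/* Step 1: "free" pieces of the cache entry -- just remember them. */
static void
shared_pfree_deferred(dsa_pointer dp)
{
    DeferredFree *df = palloc(sizeof(DeferredFree));

    df->dp = dp;
    deferred_free_list = lappend(deferred_free_list, df);
}

/* Step 2: unlink the entry from the shared hash table (not shown). */

/* Step 3: now it is safe to really release the memory. */
static void
shared_free_deferred(dsa_area *area)
{
    ListCell   *lc;

    foreach(lc, deferred_free_list)
    {
        DeferredFree *df = (DeferredFree *) lfirst(lc);

        dsa_free(area, df->dp);
    }
    list_free_deep(deferred_free_list);
    deferred_free_list = NIL;
}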
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>Sent: Friday, April 26, 2019 11:50 PM
>To: 'Kyotaro HORIGUCHI' <horiguchi.kyotaro@lab.ntt.co.jp>;
>thomas.munro@enterprisedb.com; robertmhaas@gmail.com
>Cc: pgsql-hackers@postgresql.org
>Subject: RE: Copy data to DSA area
>
>Hi, I've updated Thomas's quick PoC.

Hi. I've rebased the patch to fit into the core code rather than an extension.

Regarding the shared memory context (ShmContext), I added three APIs (a usage sketch follows below):
- CreatePermShmContext: create a "permanent" context located in shared memory
- CreateTempShmContext: create a "temporary" context located in local memory, which has a buffer to keep track of objects that could leak
- ChangeToPermShmContext: change the parent of allocated objects to the permanent context

When you allocate something, add the object to the temporary ShmContext and re-parent it to the permanent ShmContext once it can no longer leak.

The current method of keeping track of objects and freeing them at rollback works well for the case where both the parent object and the child object pointed to by the parent are deleted, because no dangling pointer remains after rollback. If the child object is freed but the parent object was already allocated in the permanent context, that parent is left with a dangling pointer. On the other hand, if an object is pointed to by an already-permanent object, the just-allocated object won't be leaked, so in such cases we could skip the temporary allocation and allocate it directly in the permanent context. In the rollback case we could just leave it in shared memory and make the upper-layer function handle the "ghost" object in some sensible way. I cannot see a clear solution here. Do you have any thoughts?

MemoryContextMethods are not fully supported yet, but some methods like stats() might be needed.

The current test case is very simple and the same as the previous one, but implemented as an isolation test. It checks that an integer can be stored in shared memory by one process and read by another. It doesn't cover all possible cases, so more cases should be added; I'll do that during the coming CF.

P.S. Thomas, thank you for taking some time at PGCon to discuss this item and the shared catalog cache. It was very helpful. I'm going to send an email about the shared catcache soon.

Regards,
Takeshi Ideriha
Attachment
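As a rough illustration of how the three APIs above might be used together (the signatures of CreateTempShmContext() and ChangeToPermShmContext() are assumptions here; the attached patch is the authoritative definition):

#include "postgres.h"
#include "utils/memutils.h"

void
build_shared_entry_example(MemoryContext perm_shm_context)
{
    /* Temporary context in local memory that tracks not-yet-shared chunks. */
    MemoryContext temp = CreateTempShmContext("entry build", perm_shm_context);
    MemoryContext old = MemoryContextSwitchTo(temp);
    char       *entry;

    /* Allocations go to DSA but are remembered by the temporary context. */
    entry = palloc(128);
    snprintf(entry, 128, "some cached value");

    /* ... link 'entry' into a shared hash table or list here ... */

    /*
     * Once the object is reachable from shared structures and can no longer
     * leak, move its chunks under the permanent context.
     */
    ChangeToPermShmContext(temp, perm_shm_context);

    MemoryContextSwitchTo(old);
}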
On Tue, Jun 25, 2019 at 12:53 AM Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com> wrote:
> I've rebased the patch to fit the core code rather than extension.
> Regarding shared memory context (ShmContext), I added three
> APIs:
> - CreatePermShmContext
> create "permanent" context located in shared memory
> - CreateTempShmContext
> Create "temporary" context located in local memory,
> Which has buffer to keep track of possible memory leak objects
> - ChangeToPermShmContext
> Change allocated objects parent to permanent context
>
> When you allocate something, add an object in the Temp ShmContext,
> and re-parent it to Perm ShmContext after it becomes not leaked.

Hi Ideriha-san,

Thanks for working on this! It's important for us to figure out how to share caches, so we can make better use of memory. I think there are two things going on in this prototyping work:

1. The temporary/permanent concept. I think the same problem will come up in the future when we eventually have a multi-thread model.

2. Getting a MemoryContext-compatible allocator to work with the current multi-process model.

So, I'm wondering if we can find a more generic way to do your ChangeToPermShmContext thing with an API that isn't tied to this implementation. Another thing to consider is that we probably still need an intermediate level of contexts, to hold individual complex long-lived objects like (say) cached plans.

What do you think about the following? Even though I know you want to start with much simpler kinds of cache, I'm looking ahead to the lofty end-goal of having a shared plan cache. No doubt, that involves solving many other problems that don't belong in this thread, but please indulge me:

1. We could provide a way to ask any MemoryContext implementation to create a new context of the same type and set its parent to any parent, so that it will be automatically deleted when that parent is deleted. For example MemoryContextClone(SharedPlanCache, some_short_lived_context). That gives you a memory context that will be deleted with some_short_lived_context (unless you reparent it or delete it first), but it's of the same type as SharedPlanCache and (in this particular case) allocates from the same underlying DSA area. In some hypothetical future, SharedPlanCache might use a different memory context implementation depending on whether you selected a multi-process or multi-thread build, but when you call MemoryContextClone(SharedPlanCache, ...) you don't know or care about that, it just has to be some implementation that supports cloning.

2. Once you have finished constructing (say) a plan in that memory context, you can make it 'permanent' by simply calling MemoryContextSetParent(temp, SharedPlanCache). That follows existing practice for how we put stuff into the existing backend-private cache context. Otherwise, it is deleted when its current parent is deleted.

3. Following existing practice, the cached plan needs to know the context that it lives in, so that DropCachedPlan() can delete the whole plan easily according to your LRU scheme (or whatever -- a topic for another thread).

One of the implications of the above ideas is that the MemoryContext object itself has to be in shared memory, and the management of parent/child links needs to be protected by locks for parts of the context graph that are shared. And probably there are other complications in this area. I don't claim that any of this is easy, or necessarily the right way to do it.
But I'm pretty sure these problems are not artifacts of the multi-process/shared memory model, they're due to sharing and they'll still be waiting for us in the multi-threaded future!

+       /* To avoid double free by ShmContextDelete, remove its reference */
+       for (buf = shmContext->temp_buffer; buf != NULL; buf = buf->next)
+       {
+               for (idx = 0; idx < buf->idx; idx++)
+               {
+                       if (buf->chunks[idx] == dp)
+                       {
+                               buf->chunks[idx] = 0;
+                               break;
+                       }
+               }
+       }

Hmm. I wonder if we should just make ShmContextFree() do nothing! And make ShmContextAlloc() allocate (say) 8KB chunks (or larger if needed for larger allocation) and then hand out small pieces from the 'current' chunk as needed. Then the only way to free memory is to destroy contexts, but for the use case being discussed, that might actually be OK. I suppose you'd want to call this implementation something different, like ShmRegionContext, ShmZoneContext or ShmArenaContext[1]. That's a fairly specialised type of memory allocator, but it seems like it would suit our usage pattern here: you build objects to cache them, and then destroy them all at once. This would only be efficient if the code that you call to build (or, I guess, copy) plans doesn't do much palloc/free of temporary stuff, which would leave dead junk in our context if we did it that way.

Earlier I said that it would probably be OK to use backend-local memory to track all the objects that need to be freed on error, because I was considering only the problem of preventing leaks on error. I wasn't thinking about point 3 above. We eventually need to delete the cached object (plan etc) from the cache (eg DropCachedPlan()), and so we still need to have all its allocations together in one place as a context, and that control data needs to be accessible to other processes. So in the ShmZoneContext idea, there would need to be a chunk list allocated in shared memory for eventual freeing.

[1] https://en.wikipedia.org/wiki/Region-based_memory_management

--
Thomas Munro
https://enterprisedb.com
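To make points 1-3 above concrete, here is a hedged sketch of the lifecycle for a hypothetical shared plan cache. MemoryContextClone() and SharedPlanCache do not exist today and are assumed for illustration; MemoryContextSetParent(), MemoryContextDelete() and palloc0() are existing APIs:

#include "postgres.h"
#include "utils/memutils.h"

extern MemoryContext SharedPlanCache;   /* assumed shared, DSA-backed context */

typedef struct CachedPlanStub
{
    MemoryContext context;      /* the context holding this whole plan */
    /* ... plan data would follow ... */
} CachedPlanStub;

CachedPlanStub *
cache_plan_example(void)
{
    /*
     * 1. Clone: same implementation/area as SharedPlanCache, but initially
     * parented to a short-lived context so it disappears on error.
     */
    MemoryContext plancxt = MemoryContextClone(SharedPlanCache,
                                               CurrentMemoryContext);
    MemoryContext old = MemoryContextSwitchTo(plancxt);
    CachedPlanStub *stub = palloc0(sizeof(CachedPlanStub));

    stub->context = plancxt;    /* 3. the plan remembers its own context */

    /* ... build or copy the plan here ... */

    MemoryContextSwitchTo(old);

    /* 2. Success: make it permanent by reparenting under the shared cache. */
    MemoryContextSetParent(plancxt, SharedPlanCache);

    return stub;
}

/* Later, DropCachedPlan()-style removal is just a context delete. */
void
drop_plan_example(CachedPlanStub *stub)
{
    MemoryContextDelete(stub->context);
}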
On Wed, Jul 10, 2019 at 6:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Hmm. I wonder if we should just make ShmContextFree() do nothing! And
> make ShmContextAlloc() allocate (say) 8KB chunks (or larger if needed
> for larger allocation) and then hand out small pieces from the
> 'current' chunk as needed. Then the only way to free memory is to
> destroy contexts, but for the use case being discussed, that might
> actually be OK. I suppose you'd want to call this implementation
> something different, like ShmRegionContext, ShmZoneContext or
> ShmArenaContext[1].

<after sleeping on this>

I guess what I said above is only really appropriate for complex things like plans that have their own contexts so that we can delete them easily "in bulk". I guess it's not true for caches of simpler objects like catcache, that don't want a context for each cached thing and want to free objects "retail" (one by one). So I guess you might want something more like your current patch for (say) SharedCatCache, and something like the above-quoted idea for (say) SharedPlanCache or SharedRelCache.

For an implementation that supports retail free, perhaps you could store the address of the clean-up list element in some extra bytes before the returned pointer, so you don't have to find it by linear search. Next, I suppose you don't want to leave holes in the middle of the array, so perhaps instead of writing NULL there, you could transfer the last item in the array to this location (with associated concurrency problems).

Since I don't think anyone ever said it explicitly, the above discussion is all about how we get to this situation, while making sure that we're mostly solving problems that occur in both multi-process and multi-threaded designs:

shared_metadata_cache = '100MB'
shared_plan_cache = '100MB'

--
Thomas Munro
https://enterprisedb.com
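A sketch of the retail-free bookkeeping suggested above, ignoring the concurrency problems mentioned; the structures here are invented for illustration and only the dsa_* calls are existing APIs:

#include "postgres.h"
#include "utils/dsa.h"

typedef struct CleanupArray
{
    int         nused;
    dsa_pointer chunks[1024];   /* would be resizable in real code */
} CleanupArray;

typedef struct ChunkHeader
{
    int         cleanup_index;  /* where this chunk sits in the array */
} ChunkHeader;                  /* real code would pad to MAXIMUM_ALIGNOF */

static void *
retail_alloc(dsa_area *area, CleanupArray *cleanup, size_t size)
{
    dsa_pointer dp = dsa_allocate(area, sizeof(ChunkHeader) + size);
    ChunkHeader *hdr = (ChunkHeader *) dsa_get_address(area, dp);

    hdr->cleanup_index = cleanup->nused;
    cleanup->chunks[cleanup->nused++] = dp;
    return hdr + 1;             /* user data starts after the header */
}

static void
retail_free(dsa_area *area, CleanupArray *cleanup, void *ptr)
{
    ChunkHeader *hdr = ((ChunkHeader *) ptr) - 1;
    int         idx = hdr->cleanup_index;
    dsa_pointer dp = cleanup->chunks[idx];

    /* fill the hole with the last entry so the array stays dense */
    cleanup->chunks[idx] = cleanup->chunks[--cleanup->nused];

    /* the moved chunk must learn its new position */
    if (idx < cleanup->nused)
    {
        ChunkHeader *moved =
            (ChunkHeader *) dsa_get_address(area, cleanup->chunks[idx]);

        moved->cleanup_index = idx;
    }
    dsa_free(area, dp);
}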
Hi, thank you for the previous two emails.

Thomas Munro <thomas.munro@gmail.com> wrote:
>What do you think about the following? Even though I know you want to start with
>much simpler kinds of cache, I'm looking ahead to the lofty end-goal of having a shared
>plan cache. No doubt, that involves solving many other problems that don't belong in
>this thread, but please indulge me:

My initial motivation came from the shared catcache and relcache, but I also think a shared plan cache is one of the big topics and I'd be very excited if it came true. Making a plan in each backend sometimes becomes an enormous overhead in terms of speed.

>On Wed, Jul 10, 2019 at 6:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Hmm. I wonder if we should just make ShmContextFree() do nothing! And
>> make ShmContextAlloc() allocate (say) 8KB chunks (or larger if needed
>> for larger allocation) and then hand out small pieces from the
>> 'current' chunk as needed. Then the only way to free memory is to
>> destroy contexts, but for the use case being discussed, that might
>> actually be OK. I suppose you'd want to call this implementation
>> something different, like ShmRegionContext, ShmZoneContext or
>> ShmArenaContext[1].
>
><after sleeping on this>
>
>I guess what I said above is only really appropriate for complex things like plans that
>have their own contexts so that we can delete them easily "in bulk". I guess it's not
>true for caches of simpler objects like catcache, that don't want a context for each
>cached thing and want to free objects "retail" (one by one). So I guess you might
>want something more like your current patch for (say) SharedCatCache, and
>something like the above-quoted idea for (say) SharedPlanCache or SharedRelCache.

Here's what I take away from this:
- The temporary/permanent concept can be implemented by following the current cached-plan scheme of reparenting, plus adding a general API MemoryContextClone.
- There are at least two allocation strategies. One is ShmZoneContext: region-based allocation, which drops chunks in bulk. This type is useful for caches made of complex object graphs, like the plan cache. In this case the clean-up list (the list of pointers to dsa_allocated objects) needs to live in shared memory, because it is used not only in the error case but also for DropCachedPlan().
- The other is not region-based but allocates and frees objects "retail". I would call this the ShmSimpleContext case; it would be suitable for, say, the catalog cache. For ShmSimpleContext it is OK to keep the clean-up list in local memory, right? Unlike ShmZoneContext, this list is only used on error, and a cache entry can be freed by just traversing its simple object graph, as discussed before.

Regarding the tracking list for cleanup, I'd change the design depending on whether it's ShmSimpleContext or ShmZoneContext, assuming the clean-up list is pointed to by the context object in both types.

For ShmSimpleContext, the list is deleted after the context is reparented to the long-lived one. The context for building, say, a catcache entry is located in shared memory but its tracking list is in local memory, so it seems weird that a shared object points to a local object. But there seems to be no problem, because other processes cannot touch the context until it is reparented.

For ShmZoneContext, it would be OK to keep the list even after reparenting, because both this intermediate context and the list are supposed to be located in shared memory.
Regarding the intermediate-level context, I was wondering whether we need it in the ShmSimpleContext case. Every time we build a catcache entry we would create a MemoryContext object in shared memory, which adds some memory overhead. But right now one context object per catcache entry seems like a simple architecture, and it is also easier to handle in the error case.

>For an implementation that supports retail free, perhaps you could store the address of
>the clean-up list element in some extra bytes before the returned pointer, so you don't
>have to find it by linear search. Next, I suppose you don't want to leave holes in the
>middle of the array, so perhaps instead of writing NULL there, you could transfer the
>last item in the array to this location (with associated concurrency problems).

Sure, I'll fix it.

>Since I don't think anyone ever said it explicitly, the above discussion is all about how
>we get to this situation, while making sure that we're mostly solving problems that
>occur in both multi-process and multi-threaded designs:
>
>shared_metadata_cache = '100MB'
>shared_plan_cache = '100MB'

Yes, that is what we are talking about. Thank you for the clarification.

Regards,
Takeshi Ideriha
Should now this be closed as Returned with Feedback, or are we really waiting on an updated patch in the short term? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, sorry for the wait.

>Thomas Munro <thomas.munro@gmail.com> wrote:
>>What do you think about the following? Even though I know you want to
>>start with much simpler kinds of cache, I'm looking ahead to the lofty
>>end-goal of having a shared plan cache. No doubt, that involves
>>solving many other problems that don't belong in this thread, but please indulge me:
>
>My initial motivation came from shared catcache and relcache but I also think shared
>plan cache is one of the big topics and I'd be very excited if it's come true. Sometimes
>making plan at each backend becomes enormous overhead for speed.
>
>>On Wed, Jul 10, 2019 at 6:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>>> Hmm. I wonder if we should just make ShmContextFree() do nothing!
>>> And make ShmContextAlloc() allocate (say) 8KB chunks (or larger if
>>> needed for larger allocation) and then hand out small pieces from the
>>> 'current' chunk as needed. Then the only way to free memory is to
>>> destroy contexts, but for the use case being discussed, that might
>>> actually be OK. I suppose you'd want to call this implementation
>>> something different, like ShmRegionContext, ShmZoneContext or
>>> ShmArenaContext[1].
>>
>><after sleeping on this>
>>
>>I guess what I said above is only really appropriate for complex things
>>like plans that have their own contexts so that we can delete them
>>easily "in bulk". I guess it's not true for caches of simpler objects
>>like catcache, that don't want a context for each cached thing and want
>>to free objects "retail" (one by one). So I guess you might want
>>something more like your current patch for (say) SharedCatCache, and something
>>like the above-quoted idea for (say) SharedPlanCache or SharedRelCache.

I updated the shared memory context for SharedCatCache, which I call ShmRetailContext. I refactored my previous PoC, in which palloc() calls dsa_allocate() every time in a retail way. I also implemented MemoryContextClone(template_context, short_lived_parent_context).

>>For an implementation that supports retail free, perhaps you could
>>store the address of the clean-up list element in some extra bytes
>>before the returned pointer, so you don't have to find it by linear
>>search.

ShmRetailContext is intended to be used by SharedCatCache. Here are some features of the current catcache entries:
1. The number of cache entries is generally larger than for the relcache and plan cache. The relcache is proportional to the number of tables and indexes, whereas the catcache has many more kinds of entries, and some kinds like pg_statistic are proportional to the number of attributes of each table.
2. A cache entry (catctup) is built with only one or two palloc() calls.
3. When a cache entry is evicted from the hash table, it is deleted by pfree() one entry at a time.

Because of point 1, I'd rather not have the extra pointer to the clean-up list element; it would be allocated per catcache entry and take space. And the clean-up list is not very long because of point 2, so a linear search should be fine.

Also because of point 1, I didn't create a MemoryContext header for each catalog cache entry; these headers would live in shared memory and take space. So instead of MemoryContextSetParent() I use ShmRetailContextMoveChunk() (which I called ChangeToPermShmContext() before). This moves only the chunks from the locally allocated parent to the shared parent memory context.

Due to point 3, I think it's also OK not to have a clean-up list in shared memory for ShmRetailContext.
There is no situation where we free all the chunks at once. What do you think about the above?

>>Next, I suppose you don't want to leave holes in the middle of
>>the array, so perhaps instead of writing NULL there, you could transfer the last item
>>in the array to this location (with associated concurrency problems).

Done.

ShmZoneContext for SharedPlanCache and SharedRelCache is not implemented yet, but I'm going to do it following your points.

Regards,
Takeshi Ideriha
Attachment
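For illustration, building a catcache entry with ShmRetailContext as described above might look roughly like this; MemoryContextClone() and ShmRetailContextMoveChunk() are from the proposed patch, and their signatures here are guesses rather than the real definitions:

#include "postgres.h"
#include "utils/memutils.h"

extern MemoryContext SharedCatCacheContext;     /* assumed permanent shared context */

void *
build_catcache_entry_example(Size size)
{
    /*
     * Clone the shared context: allocations use the same DSA area, but the
     * clone is parented to CurrentMemoryContext, so everything is released
     * if an error is raised before the entry is published.
     */
    MemoryContext build_cxt = MemoryContextClone(SharedCatCacheContext,
                                                 CurrentMemoryContext);
    MemoryContext old = MemoryContextSwitchTo(build_cxt);
    void       *entry = palloc(size);   /* one or two pallocs per entry */

    /* ... fill in the entry and insert it into the shared hash table ... */

    MemoryContextSwitchTo(old);

    /*
     * The entry is now reachable from shared structures, so hand its chunks
     * over to the permanent context; no per-entry context header remains.
     */
    ShmRetailContextMoveChunk(build_cxt, SharedCatCacheContext);

    return entry;
}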
Hi,

>ShmZoneContext for SharedPlan and SharedRelCache is not implemented but I'm
>going to do it following your points.

After looking into the existing code, I think the Generation memory context has a similar purpose, so I'll implement ShmZoneContext by referring to it. A generation context manages chunks that have almost the same life span. ShmZoneContext would dsa_allocate() a block, hand out chunks from it, and free the chunks and their block all together (see the sketch below).

Regards,
Takeshi Ideriha
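A minimal sketch of that region-style behaviour on top of DSA (names invented for illustration; only the dsa_* calls and macros are existing APIs, and oversized requests are not handled here):

#include "postgres.h"
#include "utils/dsa.h"

#define ZONE_BLOCK_SIZE 8192

typedef struct ZoneBlock
{
    dsa_pointer self;           /* this block's own dsa_pointer */
    dsa_pointer next;           /* next block in the zone, or InvalidDsaPointer */
    Size        used;           /* bytes handed out so far */
} ZoneBlock;

typedef struct Zone
{
    dsa_area   *area;
    dsa_pointer head;           /* newest block */
} Zone;

static void *
zone_alloc(Zone *zone, Size size)
{
    ZoneBlock  *block = DsaPointerIsValid(zone->head)
        ? (ZoneBlock *) dsa_get_address(zone->area, zone->head)
        : NULL;

    size = MAXALIGN(size);
    if (block == NULL ||
        sizeof(ZoneBlock) + block->used + size > ZONE_BLOCK_SIZE)
    {
        /* current block is full (or absent): get a fresh one from DSA */
        dsa_pointer dp = dsa_allocate(zone->area, ZONE_BLOCK_SIZE);

        block = (ZoneBlock *) dsa_get_address(zone->area, dp);
        block->self = dp;
        block->next = zone->head;
        block->used = 0;
        zone->head = dp;
    }
    block->used += size;
    return ((char *) block) + sizeof(ZoneBlock) + block->used - size;
}

/* Freeing is all-or-nothing: drop every block in the zone. */
static void
zone_reset(Zone *zone)
{
    dsa_pointer dp = zone->head;

    while (DsaPointerIsValid(dp))
    {
        ZoneBlock  *block = (ZoneBlock *) dsa_get_address(zone->area, dp);
        dsa_pointer next = block->next;

        dsa_free(zone->area, dp);
        dp = next;
    }
    zone->head = InvalidDsaPointer;
}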
>>ShmZoneContext for SharedPlan and SharedRelCache is not implemented but
>>I'm going to do it following your points.
>
>After looking into existing code, I'm thinking Generation Memory Context seems to
>have the similar purpose. So I'll implement ShmZoneContext by reference it.
>Generation context manages chunks having almost same life span.
>ShmZoneContext would dsa_allocate() block and hand out chunks and free chunks and
>its block all together.

I added ShmZoneContext to my patch. I haven't added detailed comments or a test set yet, so let me explain how to use it here. I followed Thomas's suggestion.

At startup, a ShmZoneContext is created in shared memory by ShmZoneContextCreateOrigin(). Before allocating memory, another context is created from it and parented to a short-lived context via MemoryContextClone(), so that objects and contexts are automatically freed on error. Then you can use palloc(), which returns a chunk from a dsa_allocated block. When you re-parent it to the long-lived ShmContext with MemoryContextSetParent(), you need to acquire a lock to protect the parent-child links; the LWLock object is obtained with ShmZoneContextGetLock(), and every cloned ShmZoneContext uses the same lock instance. If you want to free the allocated objects, use MemoryContextDelete(). After the context has become long-lived, you need to take the lock again for MemoryContextDelete(), also to protect the MemoryContext parent-child links.

Thinking about the shared relcache/plancache use case, allocation happens only before the parent context is switched to the long-lived shared one, so I think we don't need to take the lock during palloc(). I also think MemoryContextDelete() should be done after the allocated objects have been removed from whatever shared index structure references them (e.g. a hash table or list in shared memory), since other processes may be looking at that structure.

What do you think?

Regards,
Takeshi Ideriha
Attachment
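A hedged walk-through of the ShmZoneContext usage described above; ShmZoneContextCreateOrigin(), ShmZoneContextGetLock() and MemoryContextClone() are from the proposed patch with guessed signatures, while the LWLock and MemoryContext calls are existing PostgreSQL APIs:

#include "postgres.h"
#include "storage/lwlock.h"
#include "utils/memutils.h"

extern MemoryContext ShmZoneOrigin;     /* created once at startup by
                                         * ShmZoneContextCreateOrigin() */

void
shm_zone_example(void)
{
    /* Clone parented to a short-lived context: freed automatically on error. */
    MemoryContext zone = MemoryContextClone(ShmZoneOrigin, CurrentMemoryContext);
    MemoryContext old = MemoryContextSwitchTo(zone);
    LWLock     *lock = ShmZoneContextGetLock();  /* shared by all clones */
    char       *obj;

    obj = palloc(256);          /* chunk carved out of a dsa_allocated block */
    snprintf(obj, 256, "cached data");
    /* ... build a relcache/plancache entry and publish it in shared memory ... */

    MemoryContextSwitchTo(old);

    /* Re-parenting to the long-lived context needs the lock. */
    LWLockAcquire(lock, LW_EXCLUSIVE);
    MemoryContextSetParent(zone, ShmZoneOrigin);
    LWLockRelease(lock);

    /*
     * Much later, after the entry has been unlinked from the shared index
     * structure, the whole zone is dropped in bulk (lock again required).
     */
    LWLockAcquire(lock, LW_EXCLUSIVE);
    MemoryContextDelete(zone);
    LWLockRelease(lock);
}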
On Fri, Oct 18, 2019 at 12:53:16PM +0000, ideriha.takeshi@fujitsu.com wrote: > I added ShmZoneContext to my patch. > I haven't added detailed comments and test set, so let me explain how to > use it here. I followed Thomas' suggestion. The latest patch sent fails to apply. Could you please send a rebase? I am moving this patch to next CF, waiting on author. -- Michael
Attachment
I've marked this patch as returned with feedback. It's been sitting in the CF for a long time without any update (and does not apply anymore). That does not mean we don't want the feature (it seems useful) so feel free to resubmit an updated patch. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services