Thread: Dynamic shared memory areas
Hi hackers,

I would like to propose a new subsystem called Dynamic Shared [Memory] Areas, or "DSA". It provides an object called a "dsa_area" which can be used by multiple backends to share data. Under the covers, a dsa_area is made up of some number of DSM segments, but it appears to client code as a single shared memory heap with a simple allocate/free interface. Because the memory is mapped at different addresses in different backends, it introduces a kind of sharable relative pointer and an operation to convert it to a backend-local pointer.

After you have created or attached to a dsa_area, you can use it much like MemoryContextAlloc/pfree, except for the extra hoop to jump through to get the local address:

  dsa_pointer p;
  char *mem;

  p = dsa_allocate(area, 42);
  mem = (char *) dsa_get_address(area, p);
  if (mem != NULL)
  {
      snprintf(mem, 42, "Hello world");
      dsa_free(area, p);
  }

Exposing the dsa_pointer in this way allows client code to build data structures with internal dsa_pointers that will be usable in all backends that attach to the dsa_area.

DSA areas have many potential uses, including shared workspaces for various kinds of parallel query execution, longer term storage for in-memory database objects, caches and so forth. In some cases it may be useful to use a dsa_area directly, but there could be a library of useful data structures that know how to use DSA memory. More on all of those topics, with patches, soon.

SOME CONTEXT

Currently, Postgres provides three classes of memory:

1. Backend-local memory, managed with palloc/pfree, with MemoryContext providing a hierarchy of memory heaps tied to various scopes. Underneath that, there is of course the C runtime's heap and allocator.

2. Traditional non-extensible shared memory mapped into every backend at the same address. This works on Unix because child processes inherit the memory map of the postmaster. In EXEC_BACKEND builds (including Windows) it works because you can ask for memory to be mapped at a specific address, which will succeed if ASLR is turned off, the backend hasn't been running very long, and the address range happens to be still free. This memory is currently managed with an allocate-only allocator. There is a small library of data structures that know how to use (but never free) this memory.

3. DSM memory, our abstraction for shared memory segments created on demand in non-postmaster backends. This memory is mapped at different addresses in different backends. Currently its main use is to provide a chunk of memory for parallel query. To manage the space inside a DSM segment, shm_toc ('table-of-contents') can be used as a kind of allocate-only space manager which allows backends to find the backend-local address of objects within the segment using integer keys.

This proposal adds a fourth class, building on the third. Compared with the existing memory classes:

* It provides a fully general allocate/free facility, as currently available only in (1), though it does not have (1)'s directly dereferenceable pointers.

* It grows automatically and can in theory grow as big as virtual memory allows, like (1), though it also provides a way to cap total size so that allocations fail beyond some size.

* It provides something like the throw-it-all-away-at-once clean-up facility of (1), since DSA areas can be destroyed, are reference counted, and can optionally be tracked by the resource manager mechanism (riding on DSM's coat tails).
* It provides the data sharing between backends of (2) and (3), though it doesn't have (2)'s directly dereferenceable pointers.

* Through proposals that will follow this one, it will provide for basic data structures that build on top of it, such as hash tables, like (2), except that these ones will be able to grow as required and give memory back.

* Unlike (1) and (2), client code has to deal with incompatible memory maps. This involves calling dsa_get_address(area, relative_pointer), which amounts to a few instructions to perform a base address lookup and pointer arithmetic.

Using processes instead of threads gives Postgres certain advantages, but requires us to deal with shared memory instead of just using something like (1) for all our memory needs, as a hypothetical multi-threaded Postgres fork would presumably do. This proposal is a step towards making our shared memory facilities more powerful and general.

IMPLEMENTATION AND HISTORY

Back in 2014, Robert Haas proposed sb_alloc[1]. It had two layers:

* a 'free page manager', which cuts a piece of memory into 4KB pages and embeds a btree into the empty pages to track contiguous runs of pages, so that you can get and put free page ranges

* an allocator, which manages a set of backend-private memory regions, each of which has a free page manager; large allocations are handled directly with pages from the free page manager in an existing region, or new regions created as required with malloc; allocations <= 8KB are handled with pools (called "heaps" in that patch) of various object sizes ("size classes") that live in 64KB superblocks, which in turn come from the free page manager

DSA uses Robert's free page manager unchanged, except for some debugging by me. It uses the same general approach and much of the code for the higher level allocator, but I have reworked it substantially to replace the MemoryContext interface, put it in DSM segments, introduce the multi-segment relative pointer scheme, and add concurrency support.

Compared to some well known malloc implementations from which this code takes general inspiration, the main differences are obviously the shared memory nature, the lack of per-core pools (an avenue for future research that would increase concurrent performance at the cost of increased fragmentation), and the lower level page manager. Some other systems go directly to the OS (mmap, sbrk) for superblocks and large objects. The equivalent for us would be to throw away the lower layer and simply create a DSM segment for large allocations and 64KB superblocks, but there are implementation and portability reasons not to want to create very large numbers of DSM segments.

Compared to palloc/pfree, DSA aims to waste less space. It has more finely grained size classes (8, 16, 24, 32, 40, 48, ... see dsa_size_classes), and it uses a page map costing 8 bytes per 4KB page to keep track of how to free memory, instead of putting bookkeeping information in front of every object.

Some other notes, in no particular order:

It's admittedly slightly confusing that the patch currently contains two separate relative pointer concepts: relptr is used by Robert's freepage.c code and provides sort-of-type-checked offsets relative to a single base, while dsa_pointer is used by dsa.c to provide multi-segment relative pointers that encode a segment index in the higher bits.
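To show the dsa_pointer half of that in action, here is a minimal sketch (not from the patch; shared_node and push_value are invented for illustration, and error handling is omitted) of a linked structure whose links work in every attached backend:

  /* Illustration only: a shared linked list using dsa_pointers. */
  typedef struct shared_node
  {
      dsa_pointer next;   /* relative pointer, valid in all backends */
      int         value;
  } shared_node;

  static dsa_pointer
  push_value(dsa_area *area, dsa_pointer head, int value)
  {
      dsa_pointer np = dsa_allocate(area, sizeof(shared_node));
      shared_node *node = (shared_node *) dsa_get_address(area, np);

      node->value = value;
      node->next = head;  /* store the dsa_pointer, never a raw pointer */
      return np;
  }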
The lock tranche arguments to dsa_create_dynamic are clunky, but I don't have a better idea currently, since you can't allocate and free tranche IDs, so I don't see how dsa.c can own that problem.

The "dynamic" part of dsa_create_dynamic's name reflects a desire to have an alternative "fixed" version where you can provide it with an already existing piece of memory to manage, such as a pre-existing DSM segment, but that has not been implemented.

It's desirable to allow atomic ops on dsa_pointer; I believe Andres Freund plans to make that happen for 64 bit values on 32 bit systems, but if that turns out to be problematic I would want to make dsa_pointer 32 bits on 32 bit systems.

PATCH

First, please apply dsm-unpin-segment-v2.patch[2], then dsm-handle-invalid.patch (attached, and also proposed), and finally dsa-v1.patch. I have also attached test-dsa.patch, a small module which exercises the allocator and shows some client code.

Thanks to my colleagues Robert Haas for the sb_alloc code that morphed into this patch, and John Gorman and Amit Khandekar for feedback and testing. I'd be most grateful for any feedback. Thanks for reading!

[1] https://www.postgresql.org/message-id/flat/CA%2BTgmobkeWptGwiNa%2BSGFWsTLzTzD-CeLz0KcE-y6LFgoUus4A%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAEepm%3D29DZeWf44-4fzciAQ14iY5vCVZ6RUJ-KR2yzs3hPzrkw%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
Attachment
On Fri, Aug 19, 2016 at 7:07 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I would like to propose a new subsystem called Dynamic Shared [Memory]
> Areas, or "DSA". It provides an object called a "dsa_area" which can
> be used by multiple backends to share data. Under the covers, a
> dsa_area is made up of some number of DSM segments, but it appears to
> client code as a single shared memory heap with a simple allocate/free
> interface. Because the memory is mapped at different addresses in
> different backends, it introduces a kind of sharable relative pointer
> and an operation to convert it to a backend-local pointer.
>
> [...]
>
> [...] It's desirable to allow atomic ops on
> dsa_pointer; I believe Andres Freund plans to make that happen for 64
> bit values on 32 bit systems, but if that turns out to be problematic
> I would want to make dsa_pointer 32 bits on 32 bit systems.

Here's a new version that does that. It provides the type dsa_pointer_atomic and associated operations, using PG_HAVE_ATOMIC_U64_SUPPORT to decide which size to use. The choice of size is overridable at compile time with USE_SMALL_DSA_POINTER.

The other change is that it now creates DSM segments whose sizes don't grow so fast. V1 would create 1MB, 2MB, 4MB, ... segments (geometric growth being necessary because we can't have large numbers of segments, but we want to support large total sizes). V2 creates segments of size 1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4, ... according to the compile time constant DSA_NUM_SEGMENTS_AT_EACH_SIZE. I'm not sure how to select a good number for this yet, and the best answer may depend on whether you're using small pointers.

This version is rebased against master as of today and doesn't depend on any other patches.
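To give a flavour of why atomic dsa_pointers are useful, here's a sketch (illustrative only: the node type is invented, and the operation names assume dsa_pointer_atomic follows the pg_atomic_u32/u64-style interface) of a lock-free push onto a shared list:

  /* Sketch only: lock-free push of a node onto a shared freelist. */
  typedef struct node
  {
      dsa_pointer next;
  } node;

  static void
  push_node(dsa_area *area, dsa_pointer_atomic *head, dsa_pointer np)
  {
      node *n = (node *) dsa_get_address(area, np);
      dsa_pointer old;

      do
      {
          old = dsa_pointer_atomic_read(head);
          n->next = old;      /* link before publishing */
      } while (!dsa_pointer_atomic_compare_exchange(head, &old, np));
  }

--
Thomas Munro
http://www.enterprisedb.com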
Attachment
On Wed, Oct 5, 2016 at 3:00 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Here's a new version that does that.

While testing this patch I found an issue:

+ total_size = DSA_INITIAL_SEGMENT_SIZE;
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+     MAXALIGN(sizeof(dsa_area_control)) +
+     MAXALIGN(sizeof(FreePageManager)) +
+     total_pages * sizeof(dsa_pointer);
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+     metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ usable_pages =
+     (total_size - metadata_bytes) / FPM_PAGE_SIZE;

+ segment = dsm_create(total_size, 0);
+ dsm_pin_segment(segment);

The problem is that the size of dsa_area_control is bigger than DSA_INITIAL_SEGMENT_SIZE, but we are allocating a segment of only DSA_INITIAL_SEGMENT_SIZE:

(gdb) p sizeof(dsa_area_control)
$8 = 67111000
(gdb) p DSA_INITIAL_SEGMENT_SIZE
$9 = 1048576

In dsa-v1 the problem did not exist because DSA_MAX_SEGMENTS was 1024, but in dsa-v2 I think it's calculated wrongly:

(gdb) p DSA_MAX_SEGMENTS
$10 = 16777216
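The numbers are consistent with that, assuming dsa_area_control embeds per-segment arrays sized by DSA_MAX_SEGMENTS (for example its segment_handles array): 16777216 entries * 4 bytes per dsm_handle = 67108864 bytes for that one array alone, which accounts for nearly all of the 67111000 bytes reported and is 64 times the 1MB initial segment.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com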
On Wed, Oct 5, 2016 at 10:04 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Oct 5, 2016 at 3:00 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Here's a new version that does that.
>
> While testing this patch I found an issue:
>
> [...]
>
> The problem is that the size of dsa_area_control is bigger than
> DSA_INITIAL_SEGMENT_SIZE, but we are allocating a segment of only
> DSA_INITIAL_SEGMENT_SIZE:
>
> (gdb) p sizeof(dsa_area_control)
> $8 = 67111000
> (gdb) p DSA_INITIAL_SEGMENT_SIZE
> $9 = 1048576
>
> In dsa-v1 the problem did not exist because DSA_MAX_SEGMENTS was 1024,
> but in dsa-v2 I think it's calculated wrongly:
>
> (gdb) p DSA_MAX_SEGMENTS
> $10 = 16777216

Oops, right, thanks. A last minute change to that macro definition that I stupidly tested only in USE_SMALL_DSA_POINTER mode. Here is a fix for that, capping DSA_MAX_SEGMENTS as before.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
On Wed, Oct 5, 2016 at 11:28 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> [dsa-v3.patch]

Here is a new version which just adds CLOBBER_FREED_MEMORY support to dsa_free.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
On Tue, Nov 1, 2016 at 5:06 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Oct 5, 2016 at 11:28 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> [dsa-v3.patch]
>
> Here is a new version which just adds CLOBBER_FREED_MEMORY support to dsa_free.

Here is a new version that fixes a bug I discovered in freepage.c today.

Details: When dsa_free decides to give a whole superblock back to the free page manager for a segment with FreePageManagerPut, and there was already exactly one span of exactly one free page in that segment, and the span being 'put' is not adjacent to that existing free page, then the singleton format must be converted to a btree with the existing page as root and the newly put span as the sole leaf. But in that special case we forgot to add the newly put span to the appropriate free list. Not only did we lose track of it, but a future call to FreePageManagerPut might try to merge it with another adjacent span, which will try to manipulate the freelist that it expects it to be in and blow up. The fix is just to add a call to FreePagePushSpanLeader in this corner case before the early return.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
On Thu, Nov 10, 2016 at 6:37 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Here is a new version that fixes a bug I discovered in freepage.c today.

And here is a new version rebased on top of commit b40b4dd9e10ea701c8d47ccba9407fc32ed384e5.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
On Thu, Nov 10, 2016 at 12:37 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Tue, Nov 1, 2016 at 5:06 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Here is a new version which just adds CLOBBER_FREED_MEMORY support to dsa_free.
>
> Here is a new version that fixes a bug I discovered in freepage.c today.
>
> [...]

Since a lot of the design of this patch is mine - from my earlier work on sb_alloc - I don't expect to have a ton of objections to it. And I'd like to get it committed, because other parallelism work depends on it (bitmap heap scan and parallel hash join in particular), and because it's got other uses as well. However, I don't want to be perceived as slamming my code (or that of my colleagues) into the tree without due opportunity for other people to provide feedback, so if anyone has questions, comments, concerns, or review to offer, please do.

I think we should develop versions of this that (1) allocate from the main shared memory segment and (2) allocate from backend-private memory. Per my previous benchmarking results, allocating from backend-private memory would be a substantial win for tuplesort.c, because this allocator is substantially more memory-efficient for large memory contexts than aset.c, and Tomas Vondra tested it out and found that it is also faster for logical decoding than the approach he proposed. Perhaps that's not an argument for holding up his proposed patches for that problem, but I think it IS a good argument for pressing forward with a backend-private version of this allocator. I'm not saying that should be part of the initial commit of this code, but I think it's a good direction to pursue.

One question that we need to resolve is where the code should live in the source tree. When I wrote the original patches upon which this work was based, I think that I put all the code in src/backend/utils/mmgr, since it's all memory-management code. In this patch, Thomas left the free page manager code there, but put the allocator itself in src/backend/storage/ipc. There's a certain logic to that, because dynamic shared areas (dsa.c) sit right next to dynamic shared memory (dsm.c), but it feels a bit schizophrenic to have half of the code in storage/ipc and the other half in utils/mmgr. I guess my view is that utils/mmgr is a better fit, because I think that this is basically memory management code that happens to use shared memory, rather than basically IPC that happens to be an allocator.
If we decide that this stuff goes in storage/ipc then that's basically saying that everything that uses dynamic shared memory is going to end up in that directory, which seems like a mess. The fact that the free-page manager, at least, could be used for allocations not based on DSM strengthens that argument in my view. Other opinions?

The #ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P hack in relptr.h, for which I believe I'm responsible, is ugly. There is probably a compiler out there that has __typeof__ but not __builtin_types_compatible_p, and we could cater to that by adding a separate configure test for __typeof__. A little browsing of the documentation at https://gcc.gnu.org/onlinedocs/ seems to suggest that __builtin_types_compatible_p didn't exist before GCC 3.1, but __typeof__ seems to be there even in 2.95.3. That's not very interesting, honestly, because 3.1 came out in 2002, but there might be non-GCC compilers that implement __typeof__ but not __builtin_types_compatible_p. I am inclined not to worry about this unless somebody feels otherwise, but it's not beautiful.

I wonder if it wouldn't be a good idea to allow the dsa_area_control to be stored wherever instead of insisting that it's got to be located inside the first DSM segment backing the pool itself. For example, you could make dsa_create_dynamic() take a third argument which is a pointer to enough space for a dsa_area_control, and it would initialize it in place. Then you could make dsa_attach_dynamic() take a pointer to that same structure instead of taking a dsa_handle. Actually, I think dsa_handle goes away: it becomes the caller's responsibility to figure out the correct pointer address to pass in the second process. The advantage of this design change is that you could stuff the dsa_area_control into the existing parallel DSM and only create additional DSMs if anything is actually allocated. What would be even cooler is to allow a little bit of space inside the parallel DSM that gets used first, and then, when that overflows, we start creating new DSMs. Say, 64kB. Am I sounding greedy yet? It just seems like a good idea not to needlessly multiply the number of DSMs.

+ /* Unlink span. */
+ /* TODO: Does it even need to be linked in in the first place? */
+ LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
+               LW_EXCLUSIVE);
+ unlink_span(area, span);
+ LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));

In answer to the TODO, I think this isn't strictly necessary, but it seems like a good idea to do it anyway for debuggability. If we didn't do this, the space occupied by a large object wouldn't be "known" in any way other than by having disappeared from the free page map, whereas this way it's linked into the DSA's list of allocated chunks like anything else, so for example dsa_dump() can print it. I recommend removing this TODO.

+ /*
+  * TODO: We could take Max(fpm->contiguous_pages, result of
+  * FreePageBtreeCleanup) and give it to FreePageManagerUpdatLargest as a
+  * starting point for its search, potentially avoiding a bunch of work,
+  * since there is no way the largest contiguous run is bigger than that.
+  */

Typo: Updat.

+ /*
+  * TODO: Figure out how to avoid setting this every time. It may not be as
+  * simple as it looks.
+  */

Something isn't right with this function, because it takes the trouble to calculate a value for contiguous_pages that it then doesn't use for anything.
I think the original idea here was that if we calculated a value for contiguous_pages that was less than fpm->contiguous_pages, there was no need to dirty it. If we can't get away with that for some reason, then there's no point in calculating the value in the first place.
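To make the relptr.h discussion above concrete, the scheme is shaped roughly like this (a simplified sketch from memory, not the actual header text):

  /* Simplified sketch of relptr.h's idea.  The union member
   * relptr_type exists only so that __typeof__ can recover the
   * pointee type; only the offset is ever stored. */
  #define relptr(type) union { type *relptr_type; Size relptr_off; }

  #ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P
  /* Type-checked access: the result has the right pointer type. */
  #define relptr_access(base, rp) \
      ((__typeof__((rp).relptr_type)) ((rp).relptr_off == 0 ? NULL : \
          (void *) ((base) + (rp).relptr_off)))
  #else
  /* Fallback without __typeof__: callers just get a void *. */
  #define relptr_access(base, rp) \
      ((void *) ((rp).relptr_off == 0 ? NULL : (base) + (rp).relptr_off))
  #endif

The ugliness, in other words, is that the feature actually being guarded is __typeof__, while the configure symbol we happen to test is the one for __builtin_types_compatible_p.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company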
On Wed, Nov 16, 2016 at 2:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> ... my view is that utils/mmgr is a better fit, ...

OK, changed.

> The #ifdef HAVE__BUILTIN_TYPES_COMPATIBLE_P hack in relptr.h, for
> which I believe I'm responsible, is ugly. [...] I am inclined not to
> worry about this unless somebody feels otherwise, but it's not
> beautiful.

+1

> I wonder if it wouldn't be a good idea to allow the dsa_area_control
> to be stored wherever instead of insisting that it's got to be located
> inside the first DSM segment backing the pool itself. [...] It
> just seems like a good idea not to needlessly multiply the number of
> DSMs.

Alternatively, we could stop using DSM directly for parallel query and just use a DSA area for all the shmem needs of a parallel query execution, as I mentioned elsewhere[1]. That would involve changing a bunch of stuff including the FDW interface, so that's probably a bad idea at this point.

So I tried this in-place idea out today. See the attached version which provides:

  dsa_area *dsa_create(...);
  dsa_area *dsa_attach(dsa_handle handle);

Those replace the functions that previously had _dynamic in the name. Then I have new variants:

  dsa_area *dsa_create_in_place(void *place, size_t size, ...);
  dsa_area *dsa_attach_in_place(void *place);

Those let you create an area in existing memory (in a DSM segment, or traditional inherited shmem). The in-place versions will still create DSM segments on demand as required, though I suppose if you wanted to prevent that you could with dsa_set_size_limit(area, size). One complication is that of course the automatic detach feature doesn't work if you're in some random piece of memory. I have exposed dsa_on_dsm_detach, so that there is a way to hook it up to the detach hook for a pre-existing DSM segment, but that's the caller's responsibility.
This is important because although the first 'segment' is created in place, if other segments have been created we still have to manage those. It gets tricky if you are the last attached process for the area but do not have a particular segment mapped in currently because you've never accessed it. That works with a regular dsa_create area, because everyone has the control segment mapped in, so we use that one's dsm_on_detach hook and from there we can do the cleanup we need to do; but in this new case there is no such thing. You can see an example of manual detach hook installation in dsa-area-for-executor-v2.patch, which I'll now go and post over in that other thread.

> + /* Unlink span. */
> + /* TODO: Does it even need to be linked in in the first place? */
> + LWLockAcquire(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE),
> +               LW_EXCLUSIVE);
> + unlink_span(area, span);
> + LWLockRelease(DSA_SCLASS_LOCK(area, DSA_SCLASS_SPAN_LARGE));
>
> In answer to the TODO, I think this isn't strictly necessary, but it
> seems like a good idea to do it anyway for debuggability. [...] I
> recommend removing this TODO.

Removed.

> + /*
> +  * TODO: We could take Max(fpm->contiguous_pages, result of
> +  * FreePageBtreeCleanup) and give it to FreePageManagerUpdatLargest as a
> +  * starting point for its search, potentially avoiding a bunch of work,
> +  * since there is no way the largest contiguous run is bigger than that.
> +  */
>
> Typo: Updat.

Fixed.

> + /*
> +  * TODO: Figure out how to avoid setting this every time. It may not be as
> +  * simple as it looks.
> +  */
>
> Something isn't right with this function, because it takes the trouble
> to calculate a value for contiguous_pages that it then doesn't use for
> anything. I think the original idea here was that if we calculated a
> value for contiguous_pages that was less than fpm->contiguous_pages,
> there was no need to dirty it. If we can't get away with that for
> some reason, then there's no point in calculating the value in the
> first place.

Yeah. Will come back on this point.

The attached patch is just for discussion only... I need to resolve that contiguous_pages question and do some more testing.

[1] https://www.postgresql.org/message-id/CAEepm=0HmRefi1+xDJ99Gj5APHr8Qr05KZtAxrMj8b+ay3o6sA@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
Attachment
On Thu, Nov 24, 2016 at 1:07 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> The attached patch is just for discussion only... I need to resolve
> that contiguous_pages question and do some more testing.

As Dilip discovered, there was a problem with resource cleanup for DSA areas created inside pre-existing DSM segments, which I've now sorted out in the attached version. I also updated the copyright messages, introduced a couple of the new 'unlikely' macros in the address decoding path, and introduced high_segment_index to avoid scanning bigger segment arrays than is necessary sometimes.

As for contiguous_pages_dirty, I see what was missing from earlier attempts at more subtle invalidation: we had failed to set the flag in cases where FreePageManagerGetInternal was called during a FreePageManagerPut operation. What do you think about the logic in this patch... do you see any ways for contiguous_pages to get out of date? There is a new assertion that contiguous_pages matches the state of the freelists at the end of FreePageManagerGet and FreePageManagerPut, enabled if you define FPM_EXTRA_ASSERTS, and this passes my random allocation pattern testing.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
On Wed, Nov 23, 2016 at 7:07 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Those let you create an area in existing memory (in a DSM segment, or
> traditional inherited shmem). The in-place versions will still create
> DSM segments on demand as required, though I suppose if you wanted to
> prevent that you could with dsa_set_size_limit(area, size). One
> complication is that of course the automatic detach feature doesn't
> work if you're in some random piece of memory. I have exposed
> dsa_on_dsm_detach, so that there is a way to hook it up to the detach
> hook for a pre-existing DSM segment, but that's the caller's
> responsibility.

shm_mq_attach() made the opposite decision about how to solve this problem, and frankly I think that API is a lot more convenient: if the first argument to shm_mq_attach() happens to be located inside of a DSM, you can pass the DSM as the second argument and it registers the on_dsm_detach() hook for you. If not, you can pass NULL and deal with it in some other way. But this makes the common case very simple.
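That is, the caller just gets to write something like this (a sketch; mq, other_mq and seg stand for whatever the caller already has in hand):

  /* The queue lives inside seg: the detach hook is registered for us. */
  shm_mq_handle *mqh = shm_mq_attach(mq, seg, NULL);

  /* The queue lives in memory that isn't a DSM segment: pass NULL and
   * arrange for cleanup some other way. */
  shm_mq_handle *mqh2 = shm_mq_attach(other_mq, NULL, NULL);

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company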
More review:

+ * For large objects, we just stick all of the allocations in fullness class
+ * 0. Since we can just return the space directly to the free page manager,
+ * we don't really need them on a list at all, except that if someone wants
+ * to bulk release everything allocated using this BlockAreaContext, we
+ * have no other way of finding them.

This comment is out-of-date.

+ /*
+  * If this is the only span, and there is no active span, then maybe
+  * we should probably move this span to fullness class 1. (Otherwise
+  * if you allocate exactly all the objects in the only span, it moves
+  * to class 3, then you free them all, it moves to 2, and then is
+  * given back, leaving no active span).
+  */

"maybe we should probably" seems to have one more doubt-expressing word than it needs.

+ if (size_class == DSA_SCLASS_SPAN_LARGE)
+     /* Large object frees give back segments aggressively already. */
+     continue;

We generally use braces in this kind of case.

+ * Search the fullness class 1 only. That is where we expect to find

extra "the"

+ /* Call for effect (we don't need the result). */
+ get_segment_by_index(area, index);
...
+ return area->segment_maps[index].mapped_address + offset;

It isn't guaranteed that area->segment_maps[index].mapped_address will be non-NULL on return from get_segment_by_index, and then this function will return a completely bogus pointer to the caller. I think you should probably elog() instead.

+ elog(ERROR, "dsa: can't attach to area handle %u", handle);

Avoid ":" in elog messages. You don't really need to - and it isn't project style to - tag these with "dsa:"; that's what \errverbose or \set VERBOSITY verbose is for. In this particular case, I might just adopt the formulation from parallel.c:

  ereport(ERROR,
          (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
           errmsg("could not map dynamic shared memory segment")));

+ elog(FATAL,
+      "dsa couldn't find run of pages: fpm_largest out of sync");

Here I'd go with "dsa could not find %u free pages".

+ elog(ERROR, "dsa_pin: area already pinned");

"dsa_area already pinned"

+ elog(ERROR, "dsa_unpin: area not pinned");

"dsa_area not pinned"

+ if (segment == NULL)
+     elog(ERROR, "dsa: can't attach to segment");

As above.

+static dsa_segment_map *
+get_segment_by_index(dsa_area *area, dsa_segment_index index)
+{
+    if (unlikely(area->segment_maps[index].mapped_address == NULL))
+    {
+        dsm_handle handle;
+        dsm_segment *segment;
+        dsa_segment_map *segment_map;
+
+        handle = area->control->segment_handles[index];

Don't you need to acquire the lock for this?

+ /* Check all currently mapped segments to find what's been freed. */
+ for (i = 0; i <= area->high_segment_index; ++i)
+ {
+     if (area->segment_maps[i].header != NULL &&
+         area->segment_maps[i].header->freed)
+     {
+         dsm_detach(area->segment_maps[i].segment);
+         area->segment_maps[i].segment = NULL;
+         area->segment_maps[i].header = NULL;
+         area->segment_maps[i].mapped_address = NULL;
+     }
+ }
+ area->freed_segment_counter = freed_segment_counter;

And this?

+/*
+ * Release a DSA area that was produced by dsa_create_in_place or
+ * dsa_attach_in_place. It is preferable to use one of the 'dsa_on_XXX'
+ * callbacks so that this is managed automatically, because failure to release
+ * an area created in-place leaks its segments permanently.
+ */
+void
+dsa_release_in_place(void *place)
+{
+    decrement_reference_count((dsa_area_control *) place);
+}

Since this seems to be the only caller of decrement_reference_count, you could just put the logic here.
The contract for this function is also a bit unclear from the header comment. I initially thought that it was your intention that this should be called from every process that has either created or attached the segment. But that doesn't seem like it will work, because decrement_reference_count calls dsm_unpin_segment on every segment, and a segment can only be pinned once, so everybody else would fail. So maybe the idea is that ANY ONE process has to call dsa_release_in_place. But then that could lead to failures in other backends inside get_segment_by_index(), because segments they don't have mapped might already be gone. OK, third try: maybe the idea is that the LAST process out has to call dsa_release_in_place(). But how do the various cooperating processes know which one that is?

I've also realized another thing that's not so good about this: superblocks are 64kB, so allocating 64kB of initial space probably just wastes most of it. I think we want to either allocate just enough space to hold the control information, or else that much space plus space for at least a few superblocks. I'm inclined to go the first way, because it seems a bit overenthusiastic to allocate 256kB or 512kB just on the off chance we might need it. On the other hand, including a few bytes in the control segment so that we don't need to allocate a 1MB segment that we might not need sounds pretty sharp. Maybe DSA can expose an API that returns the number of bytes that will be needed for the control structure, and then the caller can arrange for that space to be available during the Estimate phase...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 30, 2016 at 4:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> More review:

Thanks!

> + * For large objects, we just stick all of the allocations in fullness class
> + * 0. [...]
>
> This comment is out-of-date.

Removed.

> + /*
> +  * If this is the only span, and there is no active span, then maybe
> +  * we should probably move this span to fullness class 1. [...]
> +  */
>
> "maybe we should probably" seems to have one more doubt-expressing
> word than it needs.

Fixed.

> + if (size_class == DSA_SCLASS_SPAN_LARGE)
> +     /* Large object frees give back segments aggressively already. */
> +     continue;
>
> We generally use braces in this kind of case.

Fixed.

> + * Search the fullness class 1 only. That is where we expect to find
>
> extra "the"

Fixed.

> + /* Call for effect (we don't need the result). */
> + get_segment_by_index(area, index);
> ...
> + return area->segment_maps[index].mapped_address + offset;
>
> It isn't guaranteed that area->segment_maps[index].mapped_address will
> be non-NULL on return from get_segment_by_index, and then this
> function will return a completely bogus pointer to the caller. I
> think you should probably elog() instead.

Hmm. Right. In fact it's never OK to ask for a segment by index when that segment is gone, since that implies an access-after-free, so there is no reason for NULL to be handled by callers. I have changed get_segment_by_index to raise an error.

> + elog(ERROR, "dsa: can't attach to area handle %u", handle);
>
> Avoid ":" in elog messages. [...] In this particular case, I might just
> adopt the formulation from parallel.c:
>
>   ereport(ERROR,
>           (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>            errmsg("could not map dynamic shared memory segment")));

Fixed.

> + elog(FATAL,
> +      "dsa couldn't find run of pages: fpm_largest out of sync");
>
> Here I'd go with "dsa could not find %u free pages".

Fixed.

> + elog(ERROR, "dsa_pin: area already pinned");
>
> "dsa_area already pinned"

Fixed.

> + elog(ERROR, "dsa_unpin: area not pinned");
>
> "dsa_area not pinned"

Fixed.

> + if (segment == NULL)
> +     elog(ERROR, "dsa: can't attach to segment");
>
> As above.

Fixed.

> +static dsa_segment_map *
> +get_segment_by_index(dsa_area *area, dsa_segment_index index)
> +{
> +    if (unlikely(area->segment_maps[index].mapped_address == NULL))
> +    {
> +        dsm_handle handle;
> +        dsm_segment *segment;
> +        dsa_segment_map *segment_map;
> +
> +        handle = area->control->segment_handles[index];
>
> Don't you need to acquire the lock for this?

No. I've updated the comments to explain, and refactored a bit. I'll explain here in different words: This is memory, you are a C programmer, and as with malloc/free, referencing memory that has been freed invokes undefined behaviour, possibly including but not limited to demons flying out of your nose.
When you call dsa_get_address(some_dsa_pointer) or dsa_free(some_dsa_pointer), you are asserting that the address points to memory allocated with dsa_allocate from this area that has not yet been freed. Given that assertion, area->control->segment_handles[index] (where index is extracted from the address) must be valid and cannot change under your feet. control->segment_handles[index] can only change after everything allocated from that whole segment has been freed; you can think of it as 'locked' as long as any live object exists in the segment it corresponds to.

In general I'm trying not to do anything too clever in the first version of DSA: it uses a plain old LWLock for each size-class's pool and then an area-wide LWLock for segment operations. But in the particular case of dsa_get_address, I think it's really important for the viability of DSA for these address translations to be fast in the likely path, hence my desire to figure out a protocol for lock-free address translation even though segments come and go.

> + /* Check all currently mapped segments to find what's been freed. */
> + for (i = 0; i <= area->high_segment_index; ++i)
> + {
> +     if (area->segment_maps[i].header != NULL &&
> +         area->segment_maps[i].header->freed)
> +     {
> +         dsm_detach(area->segment_maps[i].segment);
> +         area->segment_maps[i].segment = NULL;
> +         area->segment_maps[i].header = NULL;
> +         area->segment_maps[i].mapped_address = NULL;
> +     }
> + }
> + area->freed_segment_counter = freed_segment_counter;
>
> And this?

Hmm. I had a theory for why that didn't need to be locked, though it admittedly lacked a necessary barrier -- d'oh. But I'll spare you the details and just lock it, because this is not a hot path and it's much simpler that way. I've also refactored that code into a new static function check_for_freed_segments, because I realised that dsa_free needs the same treatment as dsa_get_address. The checking for freed segments was also happening at the wrong time, which I've now straightened out -- that must happen before you arrive in get_segment_by_index, as described in the copious new comments. Thoughts?

> +/*
> + * Release a DSA area that was produced by dsa_create_in_place or
> + * dsa_attach_in_place. It is preferable to use one of the 'dsa_on_XXX'
> + * callbacks so that this is managed automatically, because failure to release
> + * an area created in-place leaks its segments permanently.
> + */
> +void
> +dsa_release_in_place(void *place)
> +{
> +    decrement_reference_count((dsa_area_control *) place);
> +}
>
> Since this seems to be the only caller of decrement_reference_count,
> you could just put the logic here.

Ok, done.

> The contract for this function is
> also a bit unclear from the header comment. I initially thought that
> it was your intention that this should be called from every process
> that has either created or attached the segment.

That is indeed my intention.

> But that doesn't
> seem like it will work, because decrement_reference_count calls
> dsm_unpin_segment on every segment, and a segment can only be pinned
> once, so everybody else would fail. So maybe the idea is that ANY ONE
> process has to call dsa_release_in_place. But then that could lead to
> failures in other backends inside get_segment_by_index(), because
> segments they don't have mapped might already be gone. OK, third try:
> maybe the idea is that the LAST process out has to call
> dsa_release_in_place(). But how do the various cooperating processes
> know which one that is?
It decrements the reference count for the area, but only unpins the segments if the reference count reaches zero:

  Assert(control->refcnt > 0);
  if (--control->refcnt == 0)
  {
      /* ... unpin all the segments ... */
  }

> I've also realized another thing that's not so good about this:
> superblocks are 64kB, so allocating 64kB of initial space probably
> just wastes most of it. [...] On the other hand,
> including a few bytes in the control segment so that we don't need to
> allocate a 1MB segment that we might not need sounds pretty sharp.
> Maybe DSA can expose an API that returns the number of bytes that will
> be needed for the control structure, and then the caller can arrange
> for that space to be available during the Estimate phase...

Yeah, I also thought about that, but didn't try to do better before because I couldn't see how to make a nice macro for this without dragging a ton of internal stuff out into the header. I have written a new function dsa_minimum_size(). The caller can use that number directly to get a minimal in-place area that will immediately create an extra DSM segment as soon as you call dsa_allocate. Unfortunately you can't really add more to that number with predictable results unless you know some internal details and your future allocation pattern: to avoid extra segment creation, you'd need to add 4KB for a block of spans and then 64KB for each size class you plan to allocate, and of course that might change. But at least it allows us to create an in-place DSA area for every parallel query cheaply, and then defer creation of the first DSM segment until the first time someone tries to allocate, which seems about right to me.

And in response to your earlier email:

On Tue, Nov 29, 2016 at 7:48 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> shm_mq_attach() made the opposite decision about how to solve this
> problem, and frankly I think that API is a lot more convenient: if the
> first argument to shm_mq_attach() happens to be located inside of a
> DSM, you can pass the DSM as the second argument and it registers the
> on_dsm_detach() hook for you. If not, you can pass NULL and deal with
> it in some other way. But this makes the common case very simple.

Ok, I've now done the same. I feel like some more general destructor callback for objects in a containing object is wanted here, rather than sticking dsm_segment * into various constructor-like functions, but I haven't thought seriously about that and I'm not arguing that case now.

Please find attached dsa-v8.patch, and also a small test module for running random allocate/free exercises and dumping the internal allocator state.
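For parallel query, the intended usage pattern is roughly this (a sketch only: the shm_toc bookkeeping is abbreviated, and tranche_id stands in for whatever we settle on for the lock tranche arguments):

  char       *place;
  dsa_area   *area;

  /* Estimate phase: reserve only what the control structure needs. */
  shm_toc_estimate_chunk(&pcxt->estimator, dsa_minimum_size());
  shm_toc_estimate_keys(&pcxt->estimator, 1);

  /* Leader: create the area in place; no extra DSM segment is
   * created until somebody actually calls dsa_allocate. */
  place = shm_toc_allocate(pcxt->toc, dsa_minimum_size());
  area = dsa_create_in_place(place, dsa_minimum_size(),
                             tranche_id, pcxt->seg);

  /* Worker: attach to the same area, passing its own segment so the
   * detach hook is registered automatically, as discussed above. */
  area = dsa_attach_in_place(place, seg);

--
Thomas Munro
http://www.enterprisedb.com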
Attachment
On Thu, Dec 1, 2016 at 10:33 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Please find attached dsa-v8.patch, and also a small test module for
> running random allocate/free exercises and dumping the internal
> allocator state.
Moved to next CF with "needs review" status.
Regards,
Hari Babu
Fujitsu Australia
On Thu, Dec 1, 2016 at 6:33 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Please find attached dsa-v8.patch, and also a small test module for
> running random allocate/free exercises and dumping the internal
> allocator state.

OK, I've committed the main patch.

As far as test-dsa.patch goes, can we tie that into make check-world so that committing it delivers some buildfarm coverage for this code? Of course, the test settings would have to be fairly conservative, given that some buildfarm machines have very limited resources, but it still seems worth doing. test_shm_mq might provide some useful precedent.

Note that you don't need the prototype if you've already used PG_FUNCTION_INFO_V1.

I'm not sure that using the same random seed every time is a good idea. Maybe you should provide a way to set the seed as part of starting the test, or to not do that (pass NULL?) and then elog(LOG, ...) the seed that's chosen. Then if the BF crashes, we can see what seed was in use for that particular test.
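Something like this, say (a hypothetical sketch; the function and argument layout are invented, not taken from test-dsa.patch):

  /* Hypothetical sketch of seed handling in the test module. */
  Datum
  test_dsa_random(PG_FUNCTION_ARGS)
  {
      unsigned int seed;

      if (PG_ARGISNULL(0))
          seed = (unsigned int) (getpid() ^ time(NULL));
      else
          seed = (unsigned int) PG_GETARG_INT32(0);

      /* Log the seed so a buildfarm failure can be reproduced. */
      elog(LOG, "test_dsa: using random seed %u", seed);
      srandom(seed);

      /* ... run the random allocate/free exercise ... */

      PG_RETURN_VOID();
  }

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company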
On Fri, Dec 2, 2016 at 1:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 1, 2016 at 6:33 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Please find attached dsa-v8.patch, and also a small test module for
>> running random allocate/free exercises and dumping the internal
>> allocator state.
>
> OK, I've committed the main patch.

...but the buildfarm isn't very happy about it. tern complains:

  In file included from dsa.c:58:0:
  ../../../../src/include/utils/dsa.h:59:1: error: unknown type name 'pg_atomic_uint64'
  typedef pg_atomic_uint64 dsa_pointer_atomic;

...but that code is only compiled if #if DSA_POINTER_SIZEOF == 4 fails to be true. And that should always be true unless PG_HAVE_ATOMIC_U64_SUPPORT is defined. So apparently tern claims to PG_HAVE_ATOMIC_U64_SUPPORT but doesn't actually define pg_atomic_uint64? That doesn't seem right. The failures on several other BF members appear to be similar.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Dec 2, 2016 at 2:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 2, 2016 at 1:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> OK, I've committed the main patch.
>
> ...but the buildfarm isn't very happy about it. tern complains:
>
>   In file included from dsa.c:58:0:
>   ../../../../src/include/utils/dsa.h:59:1: error: unknown type name 'pg_atomic_uint64'
>   typedef pg_atomic_uint64 dsa_pointer_atomic;
>
> [...]

No, that's not the problem. Just a garden variety thinko in dsa.h. Will push a fix presently.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Dec 3, 2016 at 9:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 2, 2016 at 2:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> ...but the buildfarm isn't very happy about it.
>>
>> [...]
>
> No, that's not the problem. Just a garden variety thinko in dsa.h.
> Will push a fix presently.

Here's a patch to provide the right format string for dsa_pointer to printf-like functions, which clears a warning coming from dsa_dump (a debugging function) on 32 bit systems.
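The idea is a macro along these lines (a sketch of the approach; the exact spelling in the patch may differ), mirroring the DSA_POINTER_SIZEOF test already used for the typedef:

  #if DSA_POINTER_SIZEOF == 4
  #define DSA_POINTER_FORMAT "%08x"
  #else
  #define DSA_POINTER_FORMAT "%016" INT64_MODIFIER "x"
  #endif

so that callers can write, for example:

  elog(LOG, "span at " DSA_POINTER_FORMAT, span_pointer);

--
Thomas Munro
http://www.enterprisedb.com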
Attachment
On Fri, Dec 2, 2016 at 3:46 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Here's a patch to provide the right format string for dsa_pointer to
> printf-like functions, which clears a warning coming from dsa_dump (a
> debugging function) on 32 bit systems.

Committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 15, 2016 at 5:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think we should develop versions of this that (1) allocate from the
> main shared memory segment and (2) allocate from backend-private
> memory. Per my previous benchmarking results, allocating from
> backend-private memory would be a substantial win for tuplesort.c
> because this allocator is substantially more memory-efficient for
> large memory contexts than aset.c, and Tomas Vondra tested it out and
> found that it is also faster for logical decoding than the approach he
> proposed.

The approach that I'd prefer to take with tuplesort.c is to have a buffer for caller tuples that is written to sequentially, and repalloc()'d as needed, much like the memtuples array. It would be slightly tricky to make this work when memtuples needs to be a heap (I'm mostly thinking of top-N heapsorts here). That has perhaps unbeatable efficiency, while also helping cases with significant physical/logical correlation in their input, which is pretty common. Creating an index on a serial PK within pg_restore would probably get notably faster if we went this way.

--
Peter Geoghegan