things I learned from working on memory allocation - Mailing list pgsql-hackers
From:      Robert Haas
Subject:   things I learned from working on memory allocation
Date:
Msg-id:    CA+TgmoYx0d5aMCQ0TRpNie=kP_30vRf+yMdMSHbxA0H1KvgDNg@mail.gmail.com
Responses: Re: things I learned from working on memory allocation
           Re: things I learned from working on memory allocation
           Re: things I learned from working on memory allocation
List:      pgsql-hackers
I wrote previously about my efforts to create a new memory allocator (then called sb_alloc, now renamed to BlockAllocatorContext in my git branch for consistency with existing identifiers) for PostgreSQL. To make a long story short, many of the basic concepts behind that patch still seem sound to me, but significant problems remain, which is why it has been neither submitted to the current CommitFest nor attached to this email. Working in this area has led me to a few conclusions - comments welcome.

1. I tried to write a single allocator which would be better than aset.c in two ways: first, by allowing allocation from dynamic shared memory; and second, by being more memory-efficient than aset.c. Heikki told me at PGCon that he thought that was a bad idea, and I now think he was right. Allocating from dynamic shared memory requires the use of relative rather than absolute pointers (because the DSM could be mapped at different addresses in different processes) and, if concurrent allocating and freeing is to be supported, locking. Although neither of these things requires very many extra instructions, the common path for aset.c is extremely short, and we haven't got those instructions to spare. Code like "if (lock != NULL) LWLockAcquire(lock, LW_EXCLUSIVE)" is quite expensive even if the branch is never taken, apparently because being prepared to call a function at that point requires storing more stuff in our stack frame, and the extra instructions to save and restore registers are painfully expensive on allocation-heavy workloads. On PPC64, to match aset.c's performance, a palloc/pfree cycle must run in ~120 instructions, ~80 cycles, which doesn't leave much room for fluff.

2. After ripping the relative-pointer and locking stuff out of my allocator, I came up with something which performs about like aset.c on allocation, but with significantly better memory efficiency on small allocations. REINDEX pgbench_accounts_pkey saves about 22% of memory: the SortTuple array is no smaller, but the individual tuples take only half as much space, because they avoid the StandardChunkHeader overhead. This seems like a savings worth pursuing, if we can figure out how to get it. Unfortunately, there's some incremental overhead in pfree, amounting to a single test of a global variable, even when the new allocator is never used, which is measurable on a microbenchmark; I don't remember the exact numbers off-hand. When the new allocator *is* used, pfree gets a lot slower - mostly because one of the data structures used by the new allocator suffers occasional cache misses during free operations. I think there might be an argument that some hit on pfree would be worth taking in exchange for better space-efficiency, because we mostly free contexts by resetting them anyway; but I haven't yet managed to make that hit small enough to feel especially good about it.
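To make the space arithmetic concrete, here's a quick standalone back-of-the-envelope sketch - mine, not code from the patch, and the names and constants are illustrative. It assumes a 16-byte chunk header (a context pointer plus a size field on a 64-bit build), aset.c's power-of-two rounding, and a hypothetical headerless allocator that can place chunks at roughly their MAXALIGN'd request size:

/*
 * Standalone sketch, not from the patch.  Assumptions: 16-byte chunk
 * header on a 64-bit build, aset.c-style power-of-two rounding, and a
 * hypothetical headerless allocator with tight size classes.
 */
#include <stddef.h>
#include <stdio.h>

#define CHUNK_HEADER_SIZE 16                        /* assumed layout */
#define MAXALIGN_UP(s) (((s) + 7) & ~(size_t) 7)    /* 8-byte alignment */

/* aset.c-style rounding: next power of two, with a small minimum */
static size_t
pow2_round(size_t request)
{
    size_t      size = 8;

    while (size < request)
        size <<= 1;
    return size;
}

int
main(void)
{
    size_t      requests[] = {16, 24, 40, 100};     /* small-tuple sizes */
    int         i;

    for (i = 0; i < 4; i++)
    {
        size_t      r = requests[i];

        printf("request %3zu: aset-style %3zu bytes, headerless %3zu bytes\n",
               r, CHUNK_HEADER_SIZE + pow2_round(r), MAXALIGN_UP(r));
    }
    return 0;
}

For the small requests that dominate a sort of narrow tuples, the header plus rounding roughly doubles the footprint, which is consistent with the tuples themselves shrinking by about half even though the SortTuple array doesn't.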
3. The current MemoryContext abstraction layer is inadequate for almost everything one might want to do. The idea of a context that allows only allocation, and ignores requests to free memory, has been discussed more than once. Such an allocator could skip the power-of-two rounding that aset.c does, but it couldn't omit the chunk header, because pfree() and repalloc() locate the owning context through that header; so in practice it would save no memory at all for cases like the one mentioned above (REINDEX pgbench_accounts_pkey). While there might be some benefit in other cases, without getting rid of, or at least reducing the size of, the chunk header, I expect the benefits will be too narrow to make this approach worth pursuing. The chunk-header requirement is also a killer for DSM, again because segments can be mapped at different addresses in different backends: the context pointer stored in each header would be an absolute address, valid in one process but not in another. I'm inclined to think that if we can't find a way to get rid of the chunk-header requirement, we ought to consider ripping out the whole abstraction layer as a performance-wasting exercise in futility.

4. The MemoryContext layer embeds assumptions not only about the layout of memory, but also about the performance of various operations. For example, GetMemoryChunkSpace() is assumed to be cheap, so there's no interface like void *MemoryContextAlloc(Size request_size, Size *chunk_size) or Size MemoryContextFree(void *pointer) that reports the size of the chunk allocated or freed, but those would be significantly more efficient for my new allocator. Possibly better still would be to have the context itself track utilization and soft-fail when we run out of space; even for aset.c, it makes little sense to refuse to accept another tuple when the AllocSet has a suitable object on the freelist. That's probably not a critical flaw, because tuples being sorted are likely all about the same size, so the waste may be little or nothing; but other allocators might have different space-management strategies where this matters more.

5. It's tempting to look at other ways of solving the parallel sort problem that don't need an allocator - perhaps by simply packing all the tuples into a DSM one after the next (a rough sketch of what I mean follows after point 6). But this is not easy to do, or at least it's not easy to do efficiently. Tuples enter tuplesort.c through one of the tuplesort_put*() functions, most of which end up calling one of the copytup_*() functions. But you can't easily just change those functions to stick the tuple someplace else, because they don't necessarily get the address of the space to be used from palloc directly. In particular, copytup_heap() calls ExecCopySlotMinimalTuple(), and copytup_cluster() calls heap_copytuple(). heap_copytuple() is simple enough that you might be able to finagle a new API that would make it work, but I'm not sure what we could really do about ExecCopySlotMinimalTuple() except call it and then copy the result. Perhaps that'd be OK for a first version.

6. In general, I'm worried that it's going to be hard to keep the overhead of parallel sort from leaking into the non-parallel case. With the no-allocator approach, every place that uses GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the DSM and non-DSM cases differently, which isn't great for either performance or maintainability. And even with an allocator, the SortTuple array will need to use relative pointers in a DSM; that might burden the non-DSM case.
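For concreteness, here's a minimal standalone sketch of the pack-the-tuples-end-to-end idea from point 5, storing offsets rather than absolute pointers because the segment may be mapped at different addresses in each backend. None of these names exist anywhere; it's just an illustration:

#include <stddef.h>
#include <string.h>

typedef size_t relptr_off;          /* offset from the segment base */

typedef struct PackedSegment
{
    char       *base;               /* this backend's mapping address */
    size_t      size;               /* total bytes in the segment */
    size_t      used;               /* bump pointer */
} PackedSegment;

/*
 * Append a tuple to the end of the segment and return its offset, or
 * (relptr_off) -1 if the segment is full (a soft failure; the caller
 * might switch to an external sort at that point).
 */
static relptr_off
pack_tuple(PackedSegment *seg, const void *tup, size_t len)
{
    size_t      off = (seg->used + 7) & ~(size_t) 7;    /* MAXALIGN-ish */

    if (off + len > seg->size)
        return (relptr_off) -1;
    memcpy(seg->base + off, tup, len);
    seg->used = off + len;
    return off;
}

/* Decode an offset against whatever address this backend mapped. */
static void *
unpack_tuple(PackedSegment *seg, relptr_off off)
{
    return seg->base + off;
}

The SortTuple-equivalent array would then hold relptr_off values instead of pointers, which is exactly the extra decode step that I worry might burden the non-DSM case in point 6.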
Thoughts, comments, suggestions?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company