Thread: things I learned from working on memory allocation
I wrote previously about my efforts to create a new memory allocator (then called sb_alloc, now renamed to BlockAllocatorContext in my git branch for consistency with existing identifiers) for PostgreSQL. To make a long story short, many of the basic concepts behind that patch still seem sound to me, but significant problems remain, which account for it not having been submitted to the current CommitFest nor attached to this email. Working in this area has led me to a few conclusions - comments welcome.

1. I tried to write a single allocator which would be better than aset.c in two ways: first, by allowing allocation from dynamic shared memory; and second, by being more memory-efficient than aset.c. Heikki told me at PGCon that he thought that was a bad idea, and I now think he was right. Allocating from dynamic shared memory requires the use of relative rather than absolute pointers (because the DSM could be mapped at different addresses in different processes) and, if concurrent allocating and freeing is to be supported, locking. Although neither of these things requires very many extra instructions, the common path for aset.c is extremely short, and we haven't got those instructions to spare. Code like "if (lock != NULL) LWLockAcquire(lock, LW_EXCLUSIVE)" is quite expensive even if the branch is never taken, apparently because being prepared to call a function at that point requires storing more stuff in our stack frame, and the extra instructions to save and restore registers are painfully expensive on allocation-heavy workloads. On PPC64, to match aset.c's performance, a palloc/pfree cycle must run in ~120 instructions, ~80 cycles, which doesn't leave much room for fluff.

2. After ripping the relative-pointer and locking stuff out of my allocator, I came up with something which performs about like aset.c on allocation, but with significantly better memory efficiency on small allocations.
REINDEX pgbench_accounts_pkey saves about 22%, because the SortTuple array saves nothing but the individual tuples take only half as much space, by avoiding the StandardChunkHeader overhead. This seems like a savings worth pursuing, if we can figure out how to get it. Unfortunately, there's some incremental overhead in pfree, amounting to a single test of a global variable, even when the new allocator is never used, which is measurable on a microbenchmark; I don't remember the exact numbers off-hand. When the new allocator *is* used, pfree gets a lot slower - mostly because one of the data structures used by the new allocator suffers occasional cache misses during free operations. I think there might be an argument that some hit on pfree would be worth taking in exchange for better space-efficiency, because we mostly free contexts by resetting them anyway; but I haven't yet managed to make that hit small enough to feel especially good about it.

3. The current MemoryContext abstraction layer is inadequate for almost everything one might want to do. The idea of a context that allows only allocation, and ignores requests to free memory, has been discussed more than once. Such an allocator could skip the power-of-two rounding that aset.c does, but it couldn't omit the chunk header, which means that in practice it would save no memory at all for cases like the one mentioned above (REINDEX pgbench_accounts_pkey). While there might be some benefit in other cases, without getting rid of or at least reducing the size of the chunk header, I expect that the benefits will be too narrow to make this approach worth pursuing. The chunk-header requirement is also a killer for DSM, again because segments can be mapped at different addresses in different backends. I'm inclined to think that if we can't find a way to get rid of the chunk-header requirement, we ought to consider ripping out the whole abstraction layer as a performance-wasting exercise in futility.

4.
The MemoryContext layer embeds assumptions not only about the layout of memory, but the performance of various operations. For example, GetMemoryContextSpace() is assumed to be cheap, so there's no interface like void *MemoryContextAlloc(Size request_size, Size *chunk_size) or Size MemoryContextFree(void *chunk), but those would be significantly more efficient for my new allocator. Possibly better still would be to have the context itself track utilization and soft-fail when we run out of space; even for aset.c, it makes little sense to refuse to accept another tuple when the AllocSet has a suitable object on the freelist. That's probably not a critical flaw because the tuples being sorted are likely all about the same size, so the waste may be little or nothing. Other allocators might have different space-management strategies, though, where this matters more.

5. It's tempting to look at other ways of solving the parallel sort problem that don't need an allocator - perhaps by simply packing all the tuples into a DSM one after the next. But this is not easy to do, or at least it's not easy to do efficiently. Tuples enter tuplesort.c through one of the tuplesort_put*() functions, most of which end up calling one of the copytup_*() functions. But you can't easily just change those functions to stick the tuple someplace else, because they don't necessarily get the address of the space to be used from palloc directly. In particular, copytup_heap() calls ExecCopySlotMinimalTuple(), and copytup_cluster() calls heap_copytuple(). heap_copytuple() is simple enough that you might be able to finagle a new API that would make it work, but I'm not sure what we could really do about ExecCopySlotMinimalTuple() except call it and then copy the result. Perhaps that'd be OK for a first version.

6. In general, I'm worried that it's going to be hard to keep the overhead of parallel sort from leaking into the non-parallel case.
With the no-allocator approach, every place that uses GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the DSM and non-DSM cases differently, which isn't great for either performance or maintainability. And even with an allocator, the SortTuple array will need to use relative pointers in a DSM; that might burden the non-DSM case. Thoughts, comments, suggestions? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jul 7, 2014 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote: > 5. It's tempting to look at other ways of solving the parallel sort > problem that don't need an allocator - perhaps by simply packing all > the tuples into a DSM one after the next. But this is not easy to do, > or at least it's not easy to do efficiently. Tuples enter tuplesort.c > through one of the tuplesort_put*() functions, most of which end up > calling one of the copytup_*() functions. But you can't easily just > change those functions to stick the tuple someplace else, because they > don't necessarily get the address of the space to be used from palloc > directly. In particular, copytup_heap() calls > ExecCopySlotMinimalTuple(), and copytup_cluster() calls > heap_copytuple(). heap_copytuple() is simple enough that you might be > able to finagle a new API that would make it work, but I'm not sure > what we could really do about ExecCopySlotMinimalTuple() except call > it and then copy the result. Perhaps that'd be OK for a first > version. Perhaps. If my experience with sort support for text is anything to go on, the cost of copying over the tuples isn't really that high relative to the cost of performing the sort, particularly when there are many tuples to sort. OTOH, I'd be concerned about the cost of main memory accesses for pass-by-reference types during parallel tuple sorts in particular. The same concern applies to cases where the tuple proper must be accessed, although presumably that occurs less frequently. The LLVM project's new libc++ library remarks on a similar issue in its rationale for yet another C++ standard library: "For example, it is generally accepted that building std::string using the "short string optimization" instead of using Copy On Write (COW) is a superior approach for multicore machines...". It seems likely that they're mostly referencing the apparent trend of memory bandwidth per core stagnation [1], [2].
To get the real benefit of a parallel sort, we must do everything we can to avoid having cores/workers remain idle during the sort while waiting on memory access. I believe that that's the bigger problem.

[1] http://www.admin-magazine.com/HPC/Articles/Finding-Memory-Bottlenecks-with-Stream
[2] https://software.intel.com/en-us/articles/detecting-memory-bandwidth-saturation-in-threaded-applications

-- Peter Geoghegan
On Mon, Jul 7, 2014 at 5:37 PM, Peter Geoghegan <pg@heroku.com> wrote: > On Mon, Jul 7, 2014 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> 5. It's tempting to look at other ways of solving the parallel sort >> problem that don't need an allocator - perhaps by simply packing all >> the tuples into a DSM one after the next. But this is not easy to do, >> or at least it's not easy to do efficiently. Tuples enter tuplesort.c >> through one of the tuplesort_put*() functions, most of which end up >> calling one of the copytup_*() functions. But you can't easily just >> change those functions to stick the tuple someplace else, because they >> don't necessarily get the address of the space to be used from palloc >> directly. In particular, copytup_heap() calls >> ExecCopySlotMinimalTuple(), and copytup_cluster() calls >> heap_copytuple(). heap_copytuple() is simple enough that you might be >> able to finagle a new API that would make it work, but I'm not sure >> what we could really do about ExecCopySlotMinimalTuple() except call >> it and then copy the result. Perhaps that'd be OK for a first >> version. > > Perhaps. If my experience with sort support for text is anything to go > on, the cost of copying over the tuples isn't really that high > relative to the cost of performing the sort, particularly when there > are many tuples to sort. The testing I did showed about a 5% overhead on REINDEX INDEX pgbench_accounts_pkey from one extra tuple copy (cf. 9f03ca915196dfc871804a1f8aad26207f601fd6). Of course that could vary by circumstance for a variety of reasons. > OTOH, I'd be concerned about the cost of main > memory accesses for pass-by-reference types during parallel tuple > sorts in particular. The same concern applies to cases where the tuple > proper must be accessed, although presumably that occurs less > frequently. 
I do think that's a problem with our sort implementation, but it's not clear to me whether it's *more* of a problem for parallel sort than it is for single-backend sort. > The LLVM project's new libc++ library passes remark on a similar issue > in their rationale for yet another C++ standard library: "For example, > it is generally accepted that building std::string using the "short > string optimization" instead of using Copy On Write (COW) is a > superior approach for multicore machines...". It seems likely that > they're mostly referencing the apparent trend of memory bandwidth per > core stagnation [1], [2]. To get the real benefit of a parallel sort, > we must do anything we can to avoid having cores/workers remain idle > during the sort while waiting on memory access. I believe that that's > the bigger problem. Yes, I agree that's a problem. There are algorithms for parallel quicksort in the literature which seem to offer promising solutions, but unfortunately these infrastructure issues have prevented me from reaching them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jul 7, 2014 at 7:04 PM, Robert Haas <robertmhaas@gmail.com> wrote: > The testing I did showed about a 5% overhead on REINDEX INDEX > pgbench_accounts_pkey from one extra tuple copy (cf. > 9f03ca915196dfc871804a1f8aad26207f601fd6). Of course that could vary > by circumstance for a variety of reasons. Be careful of the check for pre-sorted input within qsort() giving an overly optimistic picture of things, thereby exaggerating the proportion of time spent copying if taken as generally representative. >> OTOH, I'd be concerned about the cost of main >> memory accesses for pass-by-reference types during parallel tuple >> sorts in particular. The same concern applies to cases where the tuple >> proper must be accessed, although presumably that occurs less >> frequently. > > I do think that's a problem with our sort implementation, but it's not > clear to me whether it's *more* of a problem for parallel sort than it > is for single-backend sort. If you'll forgive me for going on about my patch on this thread, I think the pgbench "-c 4" and "-c 1" cases that I tested suggest it is a particular problem for parallel sorts, as there is a much bigger both absolute and proportional difference in transaction throughput between those two with the patch applied. It seems reasonable to suppose the difference would be larger still if we were considering a single parallel sort, as opposed to multiple independent sorts (of the same data) that happen to occur in parallel. -- Peter Geoghegan
On Mon, Jul 7, 2014 at 7:29 PM, Peter Geoghegan <pg@heroku.com> wrote: >> I do think that's a problem with our sort implementation, but it's not >> clear to me whether it's *more* of a problem for parallel sort than it >> is for single-backend sort. > > If you'll forgive me for going on about my patch on this thread, I > think the pgbench "-c 4" and "-c 1" cases that I tested suggest it is > a particular problem for parallel sorts, as there is a much bigger > both absolute and proportional difference in transaction throughput > between those two with the patch applied. It seems reasonable to > suppose the difference would be larger still if we were considering a > single parallel sort, as opposed to multiple independent sorts (of the > same data) that happen to occur in parallel. I think that I may have been too optimistic when I said that there was an apparent trend of memory bandwidth per core merely stagnating: http://users.ece.cmu.edu/~omutlu/pub/mutlu_memory-scaling_imw13_invited-talk.pdf As slide 8 indicates, memory capacity per core is expected to go down 30% every two years, while the trend for memory bandwidth per core is even worse. -- Peter Geoghegan
On Tue, Jul 8, 2014 at 1:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> 6. In general, I'm worried that it's going to be hard to keep the
> overhead of parallel sort from leaking into the non-parallel case.
> With the no-allocator approach, every place that uses
> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
> DSM and non-DSM cases differently, which isn't great for either
> performance or maintainability. And even with an allocator, the
> SortTuple array will need to use relative pointers in a DSM; that
> might burden the non-DSM case.
I think you are right in saying that there can be a performance impact
for the non-parallel case due to pfree() (or other similar calls) and/or due
to the use of relative pointers, but can it have a measurable impact during
the actual sort operation? Changes for either of them don't seem likely to cause
much impact on the overall sort operation, although I am not sure what kind
of impact the use of relative pointers can have. Do you have any specific point
in mind due to which you think that it can cause a noticeable performance
impact?
If there is a noticeable impact, then do you think having a separate
file/infrastructure for parallel sort can help? Basically, non-parallel
and parallel sort would have some common and some specific parts,
and the layer above would call the parallel/non-parallel functions based on
the operation's needs.
On Thu, Jul 10, 2014 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jul 8, 2014 at 1:27 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> 6. In general, I'm worried that it's going to be hard to keep the >> overhead of parallel sort from leaking into the non-parallel case. >> With the no-allocator approach, every place that uses >> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the >> DSM and non-DSM cases differently, which isn't great for either >> performance or maintainability. And even with an allocator, the >> SortTuple array will need to use relative pointers in a DSM; that >> might burden the non-DSM case. > > I think you are right in saying that there can be a performance impact > for non-parallel case due to pfree() (or other similar calls) and/or due > to usage of relative pointers, but can it have measurable impact during > actual sort operation? Benchmarking seems to indicate that it indeed can. > If there is a noticeable impact, then do you think having separate > file/infrastructure for parallel sort can help, basically non-parallel > and parallel sort will have some common and some specific parts. > The above layer will call the parallel/non-parallel functions based on > operation need. tuplesort.c already handles so many cases that adding yet another axis of variability is intimidating. But it may work out that there's no better option. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 11, 2014 at 11:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jul 10, 2014 at 1:05 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > If there is an noticeable impact, then do you think having separate
> > file/infrastructure for parallel sort can help, basically non-parallel
> > and parallel sort will have some common and some specific parts.
> > The above layer will call the parallel/non-parallel functions based on
> > operation need.
>
> The tuplesort.c already handles so many cases already that adding yet
> another axis of variability is intimidating. But it may work out that
> there's no better option.
For using the new allocator, one idea is that it works seamlessly with the current
memory allocation routines (palloc, pfree, repalloc, ..); another could be
to have separate routines (shm_palloc, shm_pfree, ..) for allocation
from the new allocator. I think having separate routines means all the call
sites for allocation/deallocation need to be aware of the operation type
(parallel or non-parallel); however, I think this can avoid the overhead
for non-parallel cases.
I think it is a bit difficult to say which approach ("with allocator" or
"directly use dsm") will turn out to be better, as both have their pros and
cons as listed by you; however, I feel that if we "directly use dsm", we
might need to change more core logic than "with allocator".
I believe we will face the same question again if somebody wants to
parallelize the join, so one parameter for the decision could be
which approach will lead to less change in core logic/design.
Hi Robert, On 2014-07-07 15:57:00 -0400, Robert Haas wrote: > 1. I tried to write a single allocator which would be better than > aset.c in two ways: first, by allowing allocation from dynamic shared > memory; and second, by being more memory-efficient than aset.c. > Heikki told me at PGCon that he thought that was a bad idea, and I now > think he was right. I agree with that as well. I am wondering whether it's possible to use most of the same code and compile it once as a per-process allocator and once as the allocator for shared memory. > Allocating from dynamic shared memory requires > the use of relative rather than absolute pointers (because the DSM > could be mapped at different addresses in different processes) and, if > concurrent allocating and freeing is to be supported, locking. > Although neither of these things requires very many extra > instructions, the common path for aset.c is extremely short, and we > haven't got those instructions to spare. Code like "if (lock != NULL) > LWLockAcquire(lock, LW_EXCLUSIVE)" is quite expensive even if the > branch is never taken, apparently because being prepared to call a > function at that point requires storing more stuff in our stack frame, > and extra instructions to save and restore registers are painfully > expensive on allocation-heavy workloads. On PPC64, to match aset.c's > performance, a palloc/pfree cycle must run in ~120 instructions, ~80 > cycles, which doesn't leave much room for fluff. The actual if (lock != NULL) bit costs significant amounts of cycles? I'd have assumed that branch prediction takes care of that. Or is it actually the icache not keeping up? Did you measure icache vs. dcache misses? Have you played with unlikely()/likely() type of macros? I don't think that any such fixes will make enough of a difference to negate the point that these simply are two different allocators with different requirements, but it's still interesting. > 2.
After ripping the relative-pointer and locking stuff out of my > allocator, I came up with something which performs about like aset.c > on allocation, but with significantly better memory efficiency on > small allocations. REINDEX pgbench_accounts_pkey saves about 22%, > because the SortTuple array saves nothing but the individual tuples > take only half as much space, by avoiding the StandardChunkHeader > overhead. This seems like a savings worth pursuing, if we can figure > out how to get it. Yes, that'd certainly be nice. > Unfortunately, there's some incremental overhead > in pfree, amounting to a single test of global variable, even when the > new allocator is never used, which is measurable on a microbenchmark; > I don't remember the exact numbers off-hand. Hm. Why's that needed? Because you're searching for your allocator's superblocks in pfree()? I guess you don't store information about the size/context for every allocation like aset.c does? > When the new allocator > *is* used, pfree gets a lot slower - mostly because one of the data > structures used by the new allocator suffer occasional cache misses > during free operations. I think there might be an argument that some > hit on pfree would be worth taking in exchange for better > space-efficiency, because we mostly free contexts by resetting them > anyway; but I haven't yet managed to make that hit small enough to > feel especially good about it. The speed bump is hit when pfree()ing an allocation that's been allocated with the new allocator or also when there's any concurrent activity for that allocator? > 3. The current MemoryContext abstraction layer is inadequate for > almost everything one might want to do. The idea of a context that > allows only allocation, and ignores requests to free memory, has been > discussed more than once.
Such an allocator could skip the > power-of-two rounding that aset.c does, but it couldn't omit the chunk > header, which means that in practice it would save no memory at all > for cases like the one mentioned above (REINDEX > pgbench_accounts_pkey). I personally think that the performance benefit of not doing the rounding, not accessing the freelist, and such is more interesting for such an allocator than the memory savings. I'm pretty sure such an 'allocation only' allocator for e.g. the parser and parts of the executor would be a rather noticeable performance benefit. But even for the cases where the space savings are the important bit: To me it sounds feasible to declare that some allocators don't allow reallocations and freeing. Those then would be allowed to omit the chunk header. To make that debuggable we would need to add a chunk header when assertions are enabled to error out when such an operation is performed - but that's an acceptable price imo. Btw, am I the only one that wondered if it wouldn't be rather beneficial to make aset.c add the chunk header before rounding? > While there might be some benefit in other > cases, without getting rid of or at least reducing the size of the > chunk-header, I expect that the benefits will be too narrow to make > this approach worth pursuing. The chunk-header requirement is also a > killer for DSM, again because segments can be mapped at different > addresses in different backends. I'm inclined to think that if we > can't find a way to get rid of the chunk-header requirement, we ought > to consider ripping out the whole abstraction layer as a > performance-wasting exercise in futility. FWIW, I have done that in an experimental patch and at least on x86-64 the performance benefits were negligible. Didn't evaluate memory efficiency though. > 4. The MemoryContext layer embeds assumptions not only about the > layout of memory, but the performance of various operations.
For > example, GetMemoryContextSpace() is assumed to be cheap, so there's no > interface like void *MemoryContextAlloc(Size request_size, Size > *chunk_size) or Size MemoryContextFree(void *chunk), but those > would be significantly more efficient for my new allocator. I think mcxt.c is at least partially right in that. A memory allocation interface that requires keeping track of the current allocation size in userland *sucks*. Sure there are some cases where that's not too much of a bother, but in the majority of cases it'll make code harder to read and quite possibly just move the overhead into lots more places. I don't really see a problem with tweaking the interface so it allows MemoryContextAlloc to immediately return the actually allocated size - if we're doing that in a way that doesn't break existing users - why not? > Possibly > better still would be to have the context itself track utilization and > soft-fail when we run out of space; even for aset.c, it makes little > sense to refuse to accept another tuple when the AllocSet has a > suitable object on the freelist. That's probably not a critical flaw > because, since tuples being sorted are likely all about the same size, > the waste may be little or nothing. Other allocators might have > different space-management strategies, though, where this matters > more. I can't really figure out what you're talking about here. > 5. It's tempting to look at other ways of solving the parallel sort > problem that don't need an allocator - perhaps by simply packing all > the tuples into a DSM one after the next. But this is not easy to do, > or at least it's not easy to do efficiently. Tuples enter tuplesort.c > through one of the tuplesort_put*() functions, most of which end up > calling one of the copytup_*() functions. But you can't easily just > change those functions to stick the tuple someplace else, because they > don't necessarily get the address of the space to be used from palloc > directly.
In particular, copytup_heap() calls > ExecCopySlotMinimalTuple(), and copytup_cluster() calls > heap_copytuple(). heap_copytuple() is simple enough that you might be > able to finagle a new API that would make it work, but I'm not sure > what we could really do about ExecCopySlotMinimalTuple() except call > it and then copy the result. Perhaps that'd be OK for a first > version. This sounds like a problem that can be solved in a relatively localized fashion. I think it'd be fair enough to provide one function that does it efficiently and just fall back to copying for everything else. > 6. In general, I'm worried that it's going to be hard to keep the > overhead of parallel sort from leaking into the non-parallel case. > With the no-allocator approach, every place that uses > GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the > DSM and non-DSM cases differently, which isn't great for either > performance or maintainability. And even with an allocator, the > SortTuple array will need to use relative pointers in a DSM; that > might burden the non-DSM case. Yes, I share that concern. I somewhat doubt that tuplesort.c really is the right layer to do the parallel part of parallel sort including the chunked storage. Why not pack the tuples outside it and teach tuplesort.c to not copy tuples for some _put* routine? Then tuplesort can entirely happen inside a single process and doesn't have to worry about indirect pointers and anything? Yes, it'll need slightly more memory, but that doesn't sound too bad. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Jul 13, 2014 at 2:23 PM, Andres Freund <andres@2ndquadrant.com> wrote: > The actual if (lock != NULL) bit costs significant amounts of cycles? > I'd have assumed that branch prediction takes care of that. Or is it > actually the icache not keeping up? Did you measure icache vs. dcache > misses? > Have you played with unlikely()/likely() type of macros? I have not. I think it's really got more to do with how much stuff needs to be saved in the stack frame, but I might be wrong about that. >> Unfortunately, there's some incremental overhead >> in pfree, amounting to a single test of global variable, even when the >> new allocator is never used, which is measurable on a microbenchmark; >> I don't remember the exact numbers off-hand. > > Hm. Why's that needed? Because you're searching for your allocator's > superblocks in pfree()? I guess you don't store information about the > size/context for every allocatation like aset.c does? Since the chunks don't have a StandardChunkHeader, pfree has got to decide whether it's safe to look at the 16 bytes preceding the chunk before doing so. If the new allocator's not in use (single global variable test) that's pretty cheap, but not so cheap as you might hope. I found that it was possible to buy back most of the cost by replacing (*context->methods->free_p) (context, pointer); with a hard-coded AllocSetFree(context, pointer), so that gives you some idea what order of magnitude we're talking about here. >> When the new allocator >> *is* used, pfree gets a lot slower - mostly because one of the data >> structures used by the new allocator suffer occasional cache misses >> during free operations. I think there might be an argument that some >> hit on pfree would be worth taking in exchange for better >> space-efficiency, because we mostly free contexts by resetting them >> anyway; but I haven't yet managed to make that hit small enough to >> feel especially good about it. 
> > The speed bump is hit when pfree()ing a allocation that's been allocated > with the new allocator or also when there's any concurrent activity for > that allocator? The latter. >> 3. The current MemoryContext abstraction layer is inadequate for >> almost everything one might want to do. The idea of a context that >> allows only allocation, and ignores requests to free memory, has been >> discussed more than once. Such an allocator could skip the >> power-of-two rounding that aset.c does, but it couldn't omit the chunk >> header, which means that in practice it would save no memory at all >> for cases like the one mentioned above (REINDEX >> pgbench_accounts_pkey). > > I personally think that the performance benefit of not doing the > rounding, not accessing the freelist, and such is more interesting for > such an allocator than the memory savings. I'm pretty sure such a > 'allocation only' allocator for e.g. the parser and parts of the > executor would be a rather noticeable performance benefit. I don't have any reason to believe that. All palloc() does in the common case is bump the boundary pointer and stick the chunk header in there. You're not going to be able to make that much faster. > But even for the cases where the space savings are the important bit: To > me it sounds feasible to declare that some allocators don't allow > reallocations and freeing. Those then would be allowed to omit the chunk > header. To make that debuggable we would need to add a chunk header > when assertions are enabled to error out when such an operation is > performed - but that's a acceptable price imo. Hmm, yeah, that could be done. However, it does have the disadvantage of requiring you to affirmatively avoid pfreeing anything that might have come from such a context, rather than simply making pfree a noop. > Btw, am I the only one that wondered if it wouldn't be rather > beneficial to make aset.c add the chunk header before rounding? 
Well, it would help in some cases, hurt in others, and make no difference at all in still others. It just has to do with how well your size classes fit your actual memory allocation pattern: you'd effectively be changing the size classes from 32, 64, 128, 256 to 16, 48, 112, 240, and that could be a win or a loss depending on the actual allocation pattern. I'm sure it's pretty trivial to construct a test case favoring either outcome.

>> 5. It's tempting to look at other ways of solving the parallel sort
>> problem that don't need an allocator - perhaps by simply packing all
>> the tuples into a DSM one after the next. But this is not easy to do,
>> or at least it's not easy to do efficiently. Tuples enter tuplesort.c
>> through one of the tuplesort_put*() functions, most of which end up
>> calling one of the copytup_*() functions. But you can't easily just
>> change those functions to stick the tuple someplace else, because they
>> don't necessarily get the address of the space to be used from palloc
>> directly. In particular, copytup_heap() calls
>> ExecCopySlotMinimalTuple(), and copytup_cluster() calls
>> heap_copytuple(). heap_copytuple() is simple enough that you might be
>> able to finagle a new API that would make it work, but I'm not sure
>> what we could really do about ExecCopySlotMinimalTuple() except call
>> it and then copy the result. Perhaps that'd be OK for a first
>> version.
>
> This sounds like a problem that can be solved in a relatively localized
> fashion. I think it'd be fair enough to provide one function that does
> it efficiently and just fall back to copying for everything else.

OK.

>> 6. In general, I'm worried that it's going to be hard to keep the
>> overhead of parallel sort from leaking into the non-parallel case.
>> With the no-allocator approach, every place that uses
>> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
>> DSM and non-DSM cases differently, which isn't great for either
>> performance or maintainability. And even with an allocator, the
>> SortTuple array will need to use relative pointers in a DSM; that
>> might burden the non-DSM case.
>
> Yes, I share that concern.
>
> I somewhat doubt that tuplesort.c really is the right layer to do the
> parallel part of parallel sort including the chunked storage. Why not
> pack the tuples outside it and teach tuplesort.c to not copy tuples for
> some _put* routine? Then tuplesort can entirely happen inside a single
> process and doesn't have to worry about indirect pointers and anything?
> Yes, it'll need slightly more memory, but that doesn't sound too bad.

I'm not sure I understand what you're describing here:

- the _put* routine has to do with how tuples get into the tuplesort; but parallel sort has to do with how we get them in order once they're in there
- having tuplesort happen inside a single process seems like the exact opposite of parallel sort
- why would we need more memory?

But having expressed my confusion, I'd certainly welcome further comment on this point, because I think figuring out how to set up the abstraction is in some ways the hardest part of this problem - and it is certainly made harder by the complexity of the existing abstraction.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
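The relative-pointer burden mentioned above is easy to sketch with toy helpers (hypothetical names, not an actual PostgreSQL API): a pointer stored in a DSM segment becomes an offset from the segment base, and every dereference pays an extra add against whatever address the segment happens to be mapped at in the current process.

```c
#include <stddef.h>

/*
 * Toy sketch (hypothetical helpers, not an actual PostgreSQL API):
 * a DSM segment can be mapped at a different address in each process,
 * so pointers stored inside the segment must be offsets from its base
 * rather than absolute addresses.
 */
typedef size_t relptr_off;

static relptr_off
relptr_store(char *base, void *ptr)
{
	return (size_t) ((char *) ptr - base);
}

static void *
relptr_access(char *base, relptr_off off)
{
	/*
	 * One extra add on every dereference - the cost the non-DSM case
	 * would also pay if SortTuple always used relative pointers.
	 */
	return base + off;
}
```

The same stored offset resolves to the right object no matter where the segment is mapped, which is exactly why absolute pointers can't be used.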
On 2014-07-14 11:24:26 -0400, Robert Haas wrote:
> On Sun, Jul 13, 2014 at 2:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > The actual if (lock != NULL) bit costs significant amounts of cycles?
> > I'd have assumed that branch prediction takes care of that. Or is it
> > actually the icache not keeping up? Did you measure icache vs. dcache
> > misses?
> > Have you played with unlikely()/likely() type of macros?
>
> I have not. I think it's really got more to do with how much stuff
> needs to be saved in the stack frame, but I might be wrong about that.

I don't really see why that'd play such a big role. The register pressure on ppc/amd64 shouldn't be high enough that the (lock != NULL) test alone requires pushing anything onto the stack. Sure, the call to LWLockAcquire() will, but if that's done in a separate branch...

> >> Unfortunately, there's some incremental overhead
> >> in pfree, amounting to a single test of a global variable, even when the
> >> new allocator is never used, which is measurable on a microbenchmark;
> >> I don't remember the exact numbers off-hand.
> >
> > Hm. Why's that needed? Because you're searching for your allocator's
> > superblocks in pfree()? I guess you don't store information about the
> > size/context for every allocation like aset.c does?
>
> Since the chunks don't have a StandardChunkHeader, pfree has got to
> decide whether it's safe to look at the 16 bytes preceding the chunk
> before doing so. If the new allocator's not in use (single global
> variable test) that's pretty cheap, but not so cheap as you might
> hope.

That's what I was afraid of :/

> I found that it was possible to buy back most of the cost by
> replacing (*context->methods->free_p) (context, pointer); with a
> hard-coded AllocSetFree(context, pointer), so that gives you some idea
> what order of magnitude we're talking about here.

That's measured with a microbenchmark or actual postgres workloads?
Because in my measurements there wasn't consistent benefit in doing so even when benchmarking workloads where allocation is a bottleneck.

> >> 3. The current MemoryContext abstraction layer is inadequate for
> >> almost everything one might want to do. The idea of a context that
> >> allows only allocation, and ignores requests to free memory, has been
> >> discussed more than once. Such an allocator could skip the
> >> power-of-two rounding that aset.c does, but it couldn't omit the chunk
> >> header, which means that in practice it would save no memory at all
> >> for cases like the one mentioned above (REINDEX
> >> pgbench_accounts_pkey).
> >
> > I personally think that the performance benefit of not doing the
> > rounding, not accessing the freelist, and such is more interesting for
> > such an allocator than the memory savings. I'm pretty sure such an
> > 'allocation only' allocator for e.g. the parser and parts of the
> > executor would be a rather noticeable performance benefit.
>
> I don't have any reason to believe that. All palloc() does in the
> common case is bump the boundary pointer and stick the chunk header in
> there. You're not going to be able to make that much faster.

In my profiling, the access to the freelist (to test whether there are free chunks in there) actually is a noticeable stall. Removing it makes things measurably faster. Also adds memory overhead :)

If you look at it, a simple AllocSetAlloc() will currently access three cachelines inside AllocSetContext - with some care that can be one.

> > But even for the cases where the space savings are the important bit: To
> > me it sounds feasible to declare that some allocators don't allow
> > reallocations and freeing. Those then would be allowed to omit the chunk
> > header. To make that debuggable we would need to add a chunk header
> > when assertions are enabled to error out when such an operation is
> > performed - but that's an acceptable price imo.
>
> Hmm, yeah, that could be done.
> However, it does have the disadvantage
> of requiring you to affirmatively avoid pfreeing anything that might
> have come from such a context, rather than simply making pfree a noop.

Yep. Sounds feasible for a number of cases imo.

I have serious doubts that adding a measurable cost to pfree()s for allocations from all contexts is going to be fun. There are some operators that do a lot of those - especially inside btree opclasses, which have to. Since those are the ones you'll often be calling from parallel sort...

> > Btw, am I the only one that wondered if it wouldn't be rather
> > beneficial to make aset.c add the chunk header before rounding?
>
> Well, it would help in some cases, hurt in others, and make no
> difference at all in still others. I mean, it just has to do with how
> well your size classes fit your actual memory allocation pattern;
> you'd be effectively changing the size classes from 32, 64, 128, 256
> to 16, 48, 112, 240 and that could be a win or a loss depending on the
> actual allocation pattern. I'm sure it's pretty trivial to construct
> a test case favoring either outcome.

I was actually thinking about it more from the CPU than from the memory-packing angle. There's a fair amount of +-ALLOC_CHUNKHDRSZ arithmetic and it doesn't look like the compiler removes it all.

> >> 6. In general, I'm worried that it's going to be hard to keep the
> >> overhead of parallel sort from leaking into the non-parallel case.
> >> With the no-allocator approach, every place that uses
> >> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
> >> DSM and non-DSM cases differently, which isn't great for either
> >> performance or maintainability. And even with an allocator, the
> >> SortTuple array will need to use relative pointers in a DSM; that
> >> might burden the non-DSM case.
> >
> > Yes, I share that concern.
> >
> > I somewhat doubt that tuplesort.c really is the right layer to do the
> > parallel part of parallel sort including the chunked storage. Why not
> > pack the tuples outside it and teach tuplesort.c to not copy tuples for
> > some _put* routine? Then tuplesort can entirely happen inside a single
> > process and doesn't have to worry about indirect pointers and anything?
> > Yes, it'll need slightly more memory, but that doesn't sound too bad.
>
> I'm not sure I understand what you're describing here:
>
> - the _put* routine has to do with how tuples get into the tuplesort;
> but parallel sort has to do with how we get them in order once they're
> in there
> - having tuplesort happen inside a single process seems like the exact
> opposite of parallel sort
> - why would we need more memory?
>
> But having expressed my confusion, I'd certainly welcome further
> comment on this point, because I think figuring out how to set up the
> abstraction is in some ways the hardest part of this problem - and it
> is certainly made harder by the complexity of the existing
> abstraction.

I think we're imagining quite different approaches to the problem. In my mind the parallelism part is essentially a layer *above* tuplesort.c, but I think you're thinking about a much more integrated approach. Could you maybe describe what the layering, distribution and locking would look like in your mind (or possibly even prototype it)?

Greetings,

Andres Freund

--
Andres Freund	http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
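As an aside on the header-before-rounding idea discussed upthread, the size-class arithmetic is easy to check with a toy model (not aset.c's actual freelist code): rounding the request and then prepending a 16-byte header yields usable classes of 32, 64, 128, 256 with 16 bytes of overhead each, whereas folding the header in before rounding yields usable classes of 16, 48, 112, 240 inside power-of-two total chunks.

```c
#include <stddef.h>

#define SKETCH_CHUNKHDRSZ 16	/* stand-in for the real header size */

/*
 * Smallest power of two >= n, starting at 8 (a toy model of the
 * freelist size classes, not aset.c's actual index computation).
 */
static size_t
pow2_ceil(size_t n)
{
	size_t		p = 8;

	while (p < n)
		p <<= 1;
	return p;
}

/* aset.c today: round the request, then prepend the header. */
static size_t
total_round_then_header(size_t request)
{
	return pow2_ceil(request) + SKETCH_CHUNKHDRSZ;
}

/*
 * The proposal: fold the header in first, then round the whole chunk,
 * so total allocations come out as powers of two.
 */
static size_t
usable_header_then_round(size_t request)
{
	return pow2_ceil(request + SKETCH_CHUNKHDRSZ) - SKETCH_CHUNKHDRSZ;
}
```

For example, a 33-byte request costs 64 + 16 = 80 bytes under today's scheme, while under the header-first scheme any request of 17-48 bytes lands in a 64-byte total chunk with 48 usable bytes - which is where the 16, 48, 112, 240 classes come from.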
On Mon, Jul 14, 2014 at 12:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-07-14 11:24:26 -0400, Robert Haas wrote:
>> On Sun, Jul 13, 2014 at 2:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > The actual if (lock != NULL) bit costs significant amounts of cycles?
>> > I'd have assumed that branch prediction takes care of that. Or is it
>> > actually the icache not keeping up? Did you measure icache vs. dcache
>> > misses?
>> > Have you played with unlikely()/likely() type of macros?
>>
>> I have not. I think it's really got more to do with how much stuff
>> needs to be saved in the stack frame, but I might be wrong about that.
>
> I don't really see why that'd play such a big role. The register
> pressure on ppc/amd64 shouldn't be high enough that the (lock != NULL)
> test alone requires pushing anything onto the stack. Sure, the call to
> LWLockAcquire() will, but if that's done in a separate branch...

I can't tell you for sure what the cause is; I just saw that the difference was very significant.

>> I found that it was possible to buy back most of the cost by
>> replacing (*context->methods->free_p) (context, pointer); with a
>> hard-coded AllocSetFree(context, pointer), so that gives you some idea
>> what order of magnitude we're talking about here.
>
> That's measured with a microbenchmark or actual postgres workloads?
> Because in my measurements there wasn't consistent benefit in doing so
> even when benchmarking workloads where allocation is a bottleneck.

Microbenchmark.

>> >> 6. In general, I'm worried that it's going to be hard to keep the
>> >> overhead of parallel sort from leaking into the non-parallel case.
>> >> With the no-allocator approach, every place that uses
>> >> GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
>> >> DSM and non-DSM cases differently, which isn't great for either
>> >> performance or maintainability.
>> >> And even with an allocator, the
>> >> SortTuple array will need to use relative pointers in a DSM; that
>> >> might burden the non-DSM case.
>> >
>> > Yes, I share that concern.
>> >
>> > I somewhat doubt that tuplesort.c really is the right layer to do the
>> > parallel part of parallel sort including the chunked storage. Why not
>> > pack the tuples outside it and teach tuplesort.c to not copy tuples for
>> > some _put* routine? Then tuplesort can entirely happen inside a single
>> > process and doesn't have to worry about indirect pointers and anything?
>> > Yes, it'll need slightly more memory, but that doesn't sound too bad.
>>
>> I'm not sure I understand what you're describing here:
>>
>> - the _put* routine has to do with how tuples get into the tuplesort;
>> but parallel sort has to do with how we get them in order once they're
>> in there
>> - having tuplesort happen inside a single process seems like the exact
>> opposite of parallel sort
>> - why would we need more memory?
>>
>> But having expressed my confusion, I'd certainly welcome further
>> comment on this point, because I think figuring out how to set up the
>> abstraction is in some ways the hardest part of this problem - and it
>> is certainly made harder by the complexity of the existing
>> abstraction.
>
> I think we're imagining quite different approaches to the problem. In my
> mind the parallelism part is essentially a layer *above* tuplesort.c,
> but I think you're thinking about a much more integrated approach.

So, I don't see how that can work. tuplesort.c is *the* entrypoint for sorting throughout the backend. If you didn't use those same entrypoints, you'd have to update every caller, which seems tedious and rather pointless.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
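For reference on the palloc fast-path disagreement upthread, here is a bump-pointer sketch of an allocation-only context (toy code with assumed names, not the patch or aset.c): the hot path is an alignment round-up, a bounds check, and a pointer increment - roughly what palloc's common case already does apart from the freelist probe and the chunk-header store.

```c
#include <stddef.h>

/*
 * Toy allocation-only context: no chunk headers, no freelist, and no
 * per-chunk pfree - the only way to reclaim memory is to reset the
 * whole context.  Hypothetical names, not actual PostgreSQL code.
 */
typedef struct BumpContext
{
	char	   *freeptr;		/* next free byte in current block */
	char	   *endptr;			/* end of current block */
} BumpContext;

static void *
bump_alloc(BumpContext *cxt, size_t size)
{
	char	   *chunk;

	size = (size + 7) & ~(size_t) 7;	/* keep 8-byte alignment */
	if (cxt->freeptr + size > cxt->endptr)
		return NULL;			/* real code would grab a fresh block */
	chunk = cxt->freeptr;
	cxt->freeptr += size;		/* the entire fast path */
	return chunk;
}
```

Note there is no header store at all here; if the chunk header cannot be omitted, the fast path gains one extra 16-byte write, which is the crux of the space-versus-abstraction argument in this thread.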