Thread: Performance problem in aset.c
While testing the just-committed executor memory leak fixes, I observe
that there are still slow leaks in some operations such as sorts on
large text fields.  With a little digging, it seems that the leak must
be blamed on the behavior of aset.c for large chunks.  Chunks between
1K and 64K (ALLOC_SMALLCHUNK_LIMIT and ALLOC_BIGCHUNK_LIMIT) are all
kept in the same freelist, and when a request is made for an amount
of memory in that range, aset.c gives back the first chunk that's
big enough from that freelist.

For example, let's say you are allocating and freeing roughly equal
numbers of 2K and 10K blocks.  Over time, about half of the 2K
requests will be answered by returning a 10K block --- which will
prevent the next 10K request from being filled by recycling that
10K block, causing a new 10K block to be allocated.  If aset.c
had given back a 2K block whenever possible, the net memory usage
would be static, but since it doesn't, the usage gradually creeps
up as more and more chunks are used inefficiently.  Our actual
maximum memory usage might be m * 2K plus n * 10K but the allocated
space will creep towards (m + n) * 10K where *all* the active blocks
are the larger size.

A straightforward solution would be to scan the entire freelist
and give back the smallest block that's big enough for the request.
That could be pretty slow (and induce lots of swap thrashing) so
I don't like it much.

Another idea is to split the returned chunk and put the wasted part back
as a smaller free chunk, but I don't think this solves the problem; it
just means that the wasted space ends up on a small-chunk freelist, not
that you can actually do anything with it.  But maybe someone can figure
out a variant that works better.

It might be that this behavior won't be seen much in practice and we
shouldn't slow down aset.c at all to try to deal with it.  But I think
it's worth looking for solutions that won't slow typical cases much.

regards, tom lane
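To make the failure mode concrete, here is a minimal sketch (in the
spirit of, but not copied from, aset.c) of a single shared freelist
with first-fit allocation.  The names FreeChunk, first_fit_alloc, and
first_fit_free are made up for illustration, and the toy reuses the
header space as payload:

#include <stddef.h>
#include <stdlib.h>

typedef struct FreeChunk
{
    size_t            size;   /* usable size of this chunk */
    struct FreeChunk *next;   /* next chunk on the shared freelist */
} FreeChunk;

static FreeChunk *freelist = NULL;

static void *
first_fit_alloc(size_t request)
{
    FreeChunk **link = &freelist;
    FreeChunk  *chunk;

    /* Scan the single shared freelist, front to back. */
    for (chunk = freelist; chunk != NULL;
         link = &chunk->next, chunk = chunk->next)
    {
        if (chunk->size >= request)
        {
            /*
             * First fit: a 2K request happily takes a 10K chunk here,
             * wasting 8K and starving the next 10K request.
             */
            *link = chunk->next;
            return chunk;
        }
    }

    /* Nothing big enough: get fresh memory (stand-in for a new block). */
    return malloc(request < sizeof(FreeChunk) ? sizeof(FreeChunk) : request);
}

static void
first_fit_free(void *ptr, size_t size)
{
    /* Toy version: the caller passes the size a real header would record. */
    FreeChunk *chunk = (FreeChunk *) ptr;

    chunk->size = size;
    chunk->next = freelist;   /* push back on the shared list */
    freelist = chunk;
}

Running the 2K/10K workload above through this sketch shows the creep:
once a 2K request consumes a 10K chunk, the following 10K request
misses the freelist and allocates anew.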
Could you have a set of free lists, say for chunks of 2^8 or bigger:
2^9, 2^10, 2^11, etc.?  It should be faster than finding the first
block that fits, as now, and the wasted space would be mostly bounded
to a factor of 2.

Tom Lane wrote:
>
> While testing the just-committed executor memory leak fixes, I observe
> that there are still slow leaks in some operations such as sorts on
> large text fields.  With a little digging, it seems that the leak must
> be blamed on the behavior of aset.c for large chunks.  Chunks between
> 1K and 64K (ALLOC_SMALLCHUNK_LIMIT and ALLOC_BIGCHUNK_LIMIT) are all
> kept in the same freelist, and when a request is made for an amount
> of memory in that range, aset.c gives back the first chunk that's
> big enough from that freelist.
> [...]
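For illustration, a hypothetical sketch of this suggestion: one
freelist per power of two starting at 2^8 = 256 bytes, with the
request rounded up to the bucket size.  MIN_LOG2, NUM_LISTS, and
freelist_index are assumed names, not anything in aset.c:

#include <stddef.h>

#define MIN_LOG2   8            /* smallest bucket: 2^8 = 256 bytes */
#define NUM_LISTS 16            /* buckets 2^8 .. 2^23 */

static int
freelist_index(size_t request)
{
    int    idx = 0;
    size_t limit = (size_t) 1 << MIN_LOG2;

    /* Find the smallest power of two >= request. */
    while (limit < request && idx < NUM_LISTS - 1)
    {
        limit <<= 1;
        idx++;
    }

    /* Every chunk on list idx holds exactly 2^(MIN_LOG2 + idx) bytes. */
    return idx;
}

Because any chunk on a list satisfies any request mapped to that list,
lookup is constant time, at the cost of up to 2x internal waste.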
At 23:48 11/07/00 -0400, Tom Lane wrote:
>Another idea is to split the returned chunk and put the wasted part back
>as a smaller free chunk, but I don't think this solves the problem; it
>just means that the wasted space ends up on a small-chunk freelist, not
>that you can actually do anything with it.  But maybe someone can figure
>out a variant that works better.

Can you maintain one free list for each power of 2 (which it might
already be doing, by the look of it), and always allocate the max size
for the list?  Then when you want a 10K chunk, you get a 16K chunk, but
you know from the request size which list to go to, and anything on the
list will satisfy the requirement.

In the absolute worst case this will double your memory consumption,
and on average it should only use 50% more memory.  If this worries
you, then maintain twice as many lists and halve the wastage.

The process of finding the right list is just a matter of examining the
MSB of the chunk size.  If you want twice the number of lists, then
look at the top two bits, etc.

----------------------------------------------------------------
Philip Warner
Albatross Consulting Pty. Ltd. (A.C.N. 008 659 498)
Http://www.rhyme.com.au
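A sketch of the top-two-bits refinement, under the same assumptions
(illustrative code, not anything from aset.c; msb and half_step_index
are made-up names, and request is assumed > 0).  Splitting each
power-of-two range at its midpoint gives buckets of 2^k and 1.5 * 2^k,
cutting the worst-case over-allocation from 2x to 1.5x:

#include <stddef.h>

static int
msb(size_t n)                   /* position of the most significant set bit */
{
    int bit = -1;

    while (n != 0)
    {
        n >>= 1;
        bit++;
    }
    return bit;
}

static int
half_step_index(size_t request)
{
    int    k = msb(request);
    size_t base = (size_t) 1 << k;

    if (request == base)
        return 2 * k;           /* exactly 2^k: smallest bucket that fits */
    if (request <= base + base / 2)
        return 2 * k + 1;       /* round up to 1.5 * 2^k */
    return 2 * k + 2;           /* round up to 2^(k+1) */
}

For example, a 10K request (k = 13, base = 8K) lands in the 12K bucket
rather than the 16K one that plain power-of-two rounding would use.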
Philip Warner <pjw@rhyme.com.au> writes:
> Can you maintain one free list for each power of 2 (which it might
> already be doing, by the look of it), and always allocate the max size
> for the list?  Then when you want a 10K chunk, you get a 16K chunk, but
> you know from the request size which list to go to, and anything on the
> list will satisfy the requirement.

That is how it works for small chunks (< 1K with the current
parameters).  I don't think we want to do it that way for really
huge chunks though.

Maybe the right answer is to eliminate the gap between small chunks
(which basically work as Philip sketches above) and huge chunks (for
which we fall back on malloc).  The problem is with the stuff in
between, for which we have a kind of half-baked approach...

regards, tom lane
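The three tiers being discussed can be summarized in a sketch.  The
limits mirror the ALLOC_SMALLCHUNK_LIMIT and ALLOC_BIGCHUNK_LIMIT
values mentioned upthread; the three policy functions are stubs made
up for illustration (here they all just call malloc):

#include <stddef.h>
#include <stdlib.h>

#define ALLOC_SMALLCHUNK_LIMIT  1024            /* 1K */
#define ALLOC_BIGCHUNK_LIMIT    (64 * 1024)     /* 64K */

/* Stubs standing in for the three allocation policies. */
static void *alloc_power_of_2(size_t size) { return malloc(size); }
static void *alloc_first_fit(size_t size)  { return malloc(size); }
static void *alloc_dedicated(size_t size)  { return malloc(size); }

static void *
context_alloc(size_t size)
{
    if (size <= ALLOC_SMALLCHUNK_LIMIT)
        return alloc_power_of_2(size);  /* rounded up, per-size freelists */
    if (size > ALLOC_BIGCHUNK_LIMIT)
        return alloc_dedicated(size);   /* "fall back on malloc" */
    return alloc_first_fit(size);       /* the half-baked middle ground */
}

Eliminating the gap would mean extending one of the outer policies to
cover the middle branch.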
* Tom Lane <tgl@sss.pgh.pa.us> [000711 22:23] wrote:
> [...]
> That is how it works for small chunks (< 1K with the current
> parameters).  I don't think we want to do it that way for really
> huge chunks though.
>
> Maybe the right answer is to eliminate the gap between small chunks
> (which basically work as Philip sketches above) and huge chunks (for
> which we fall back on malloc).  The problem is with the stuff in
> between, for which we have a kind of half-baked approach...

Er, are you guys seriously layering your own general-purpose allocator
over the OS/C library allocator?

Don't do that!

The only time you may want to do this is if you're doing a
special-purpose allocator like a zone or slab allocator; otherwise it's
a pessimization.  The algorithms you're discussing to fix these leaks
have been implemented in almost any modern allocator that I know of.

Sorry if I'm totally off base, but "for which we fall back on malloc"
makes me wonder what's going on here.

thanks,
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
At 01:21 12/07/00 -0400, Tom Lane wrote:
>Philip Warner <pjw@rhyme.com.au> writes:
>> [...]
>
>Maybe the right answer is to eliminate the gap between small chunks
>(which basically work as Philip sketches above) and huge chunks (for
>which we fall back on malloc).  The problem is with the stuff in
>between, for which we have a kind of half-baked approach...

That sounds good to me.  You *might* want to enable some kind of memory
statistics in shared memory (for a mythical future reporting tool) so
you can see how many memory allocations fall into the 'big chunk'
range, and adjust your definition of 'big chunk' appropriately.

----------------------------------------------------------------
Philip Warner
Albatross Consulting Pty. Ltd. (A.C.N. 008 659 498)
Http://www.rhyme.com.au
At 22:33 11/07/00 -0700, Alfred Perlstein wrote:
>
>The only time you may want to do this is if you're doing a
>special-purpose allocator like a zone or slab allocator; otherwise it's
>a pessimization.

That is why it is done (I believe): memory can be allocated using one
context, then freed as a block when the context is no longer used.

----------------------------------------------------------------
Philip Warner
Albatross Consulting Pty. Ltd. (A.C.N. 008 659 498)
Http://www.rhyme.com.au
Alfred Perlstein wrote:
> [...]
> Er, are you guys seriously layering your own general-purpose allocator
> over the OS/C library allocator?
>
> Don't do that!
>
> The only time you may want to do this is if you're doing a
> special-purpose allocator like a zone or slab allocator; otherwise it's
> a pessimization.  The algorithms you're discussing to fix these leaks
> have been implemented in almost any modern allocator that I know of.
>
> Sorry if I'm totally off base, but "for which we fall back on malloc"
> makes me wonder what's going on here.

To clarify this: I developed this in aset.c because of the fact that we
use a lot (really a lot) of very small chunks being palloc()'d.  Any
allocation must be remembered in some linked lists to know what to free
at memory context reset or destruction.  In the old version, every
however-small amount was allocated using malloc() and remembered
separately in one huge List for the context.  Traversing this list was
awfully slow when a context said bye, and I saw no way to speed up this
traversal.

With the actual concept, only big chunks are remembered on their own.
All small allocations aren't tracked that accurately, and memory
context destruction can simply throw away all the blocks allocated for
it.  At the time I implemented it, it gained a speedup of ~10% for the
regression test.  It's an approach of gaining speed by wasting memory.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
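A minimal sketch of the design Jan describes, assuming hypothetical
names (Block, Context, ctx_alloc, ctx_reset): small allocations are
carved out of larger malloc'd blocks, and the context remembers only
the blocks, so a reset frees everything in one short loop instead of
traversing a list of every small allocation:

#include <stddef.h>
#include <stdlib.h>

typedef struct Block
{
    struct Block *next;     /* chain of blocks owned by the context */
    char         *freeptr;  /* next free byte within this block */
    char         *endptr;   /* one past the end of this block */
} Block;

typedef struct Context
{
    Block  *blocks;         /* all blocks allocated for this context */
    size_t  blocksize;      /* size of each new block, e.g. 8K */
} Context;

static void *
ctx_alloc(Context *ctx, size_t size)
{
    Block *block = ctx->blocks;

    /*
     * Carve from the current block if it has room; otherwise malloc a
     * whole new block and chain it.  A real implementation would also
     * align size and special-case requests larger than blocksize.
     */
    if (block == NULL || (size_t) (block->endptr - block->freeptr) < size)
    {
        block = malloc(sizeof(Block) + ctx->blocksize);
        if (block == NULL)
            return NULL;
        block->next = ctx->blocks;
        block->freeptr = (char *) (block + 1);
        block->endptr = block->freeptr + ctx->blocksize;
        ctx->blocks = block;
    }
    block->freeptr += size;
    return block->freeptr - size;
}

static void
ctx_reset(Context *ctx)
{
    /* One pass over the blocks frees every small chunk at once. */
    Block *block = ctx->blocks;

    while (block != NULL)
    {
        Block *next = block->next;

        free(block);
        block = next;
    }
    ctx->blocks = NULL;
}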
> At 23:48 11/07/00 -0400, Tom Lane wrote:
> >Another idea is to split the returned chunk and put the wasted part
> >back as a smaller free chunk [...]
>
> Can you maintain one free list for each power of 2 (which it might
> already be doing, by the look of it), and always allocate the max size
> for the list?  Then when you want a 10K chunk, you get a 16K chunk, but
> you know from the request size which list to go to, and anything on the
> list will satisfy the requirement.

Sounds like the BSD malloc memory manager.  I can post details, or the
pages from a kernel book I have about it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
On Tue, 11 Jul 2000, Tom Lane wrote:

> A straightforward solution would be to scan the entire freelist
> and give back the smallest block that's big enough for the request.
> That could be pretty slow (and induce lots of swap thrashing) so
> I don't like it much.

Yes, I think that is the right solution.  Chunk selection could also be
improved by allowing more freelists than the current 8.  With 16 free
lists, a chunk's real size and the request size would be more similar.

> Another idea is to split the returned chunk and put the wasted part back
> as a smaller free chunk, but I don't think this solves the problem; it
> just means that the wasted space ends up on a small-chunk freelist, not
> that you can actually do anything with it.  But maybe someone can figure
> out a variant that works better.

If I understand correctly, this is a little like my idea of making one
free chunk from a block's residual space.  It is not a bad idea, but we
have aligned chunks now.  One question is whether the wasted part will
always have room for a new *aligned* chunk (see the sketch below).  A
second question: won't this method create smaller and smaller chunks?
For example, after 1000 free/alloc cycles you could have a great many
small chunks (from the wasted parts).  A chunk made from the wasted
part is only useful if something actually requests chunks of that size.

*IMHO* the current new memory design creates new demands on contexts.
In the old design we used one context for many different processes, and
request-allocation sizes were more heterogeneous.  But now we use
specific contexts for specific operations, and that probably very often
creates contexts that use very homogeneous chunks.  For example, in
testing my query cache, 90% of all allocations are in the range 16-32
bytes, and one cached plan needs 2000-5000 bytes --- the effect is that
the query cache does not need 8KB blocks, etc.  The situation will be
very similar in other parts of PG.  Tom added default/max block size
settings to the AllocSet context for each context.  That is good, and
it is a first step toward more specific context settings.  Hmm, I don't
have an idea for the next step yet...  But some settings are needed for
chunk operations (for example, a primary default chunk size and, from
that, the most frequently used freelist; a big-chunk limit setting;
etc.)

Karel
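A sketch of the split-and-recycle idea with the alignment concern made
explicit: split only when the remainder is big enough to stand on its
own as an aligned chunk, and otherwise hand out the whole chunk and
accept the internal waste.  ALIGNOF, CHUNK_HDRSZ, MIN_CHUNK, and
split_point are assumed parameters and names, not values from aset.c:

#include <stddef.h>

#define ALIGNOF      8                       /* MAXALIGN, typically 8 */
#define CHUNK_HDRSZ  16                      /* assumed per-chunk header */
#define MIN_CHUNK    (CHUNK_HDRSZ + ALIGNOF) /* smallest useful free chunk */

#define ALIGN_UP(x)  (((x) + ALIGNOF - 1) & ~((size_t) (ALIGNOF - 1)))

static size_t
split_point(size_t chunk_size, size_t request)
{
    size_t used = ALIGN_UP(request);

    /*
     * Split only if the leftover can hold a header plus at least one
     * aligned payload; otherwise return 0, meaning "don't split".
     * The remainder (chunk_size - used) would go back on a freelist,
     * which is where the smaller-and-smaller-chunks worry comes in.
     */
    if (chunk_size >= used + MIN_CHUNK)
        return used;
    return 0;
}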