Thread: copy.c allocation constant
While reading copy.c I noticed this line:

#define RAW_BUF_SIZE 65536 /* we palloc RAW_BUF_SIZE+1 bytes */

Doesn't that seem rather odd? If we're adding 1 wouldn't it be better as
65535 so we palloc a power of 2?

I have no idea if this affects performance, but it did strike me as strange.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
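For readers without the source at hand, here is a minimal stand-alone
sketch of the pattern in question. Plain malloc() stands in for the
backend's palloc(), and the reading that the extra byte leaves room for
a terminating '\0' is an assumption, not something stated in the define
itself:

    #include <stdlib.h>

    #define RAW_BUF_SIZE 65536      /* copy.c pallocs RAW_BUF_SIZE+1 bytes */

    /* Illustrative stand-in for the copy.c pattern; not the real code. */
    static char *
    alloc_raw_buf(void)
    {
        /*
         * The +1 presumably leaves room for a terminating '\0' after
         * whatever data is read in, so the total request is 65537 bytes
         * -- one byte past a power of two, which is what prompted the
         * question above.
         */
        char *raw_buf = malloc(RAW_BUF_SIZE + 1);

        if (raw_buf != NULL)
            raw_buf[0] = '\0';
        return raw_buf;
    }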
On Tue, Nov 28, 2017 at 11:51:28AM -0500, Andrew Dunstan wrote:
> While reading copy.c I noticed this line:
>
> #define RAW_BUF_SIZE 65536 /* we palloc RAW_BUF_SIZE+1 bytes */
>
> Doesn't that seem rather odd? If we're adding 1 wouldn't it be better as
> 65535 so we palloc a power of 2?
>
> I have no idea if this affects performance, but it did strike me as strange.

Coming in late here, but it does seem very odd.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +
On 01/24/2018 04:14 AM, Bruce Momjian wrote:
> On Tue, Nov 28, 2017 at 11:51:28AM -0500, Andrew Dunstan wrote:
>> While reading copy.c I noticed this line:
>>
>> #define RAW_BUF_SIZE 65536 /* we palloc RAW_BUF_SIZE+1 bytes */
>>
>> Doesn't that seem rather odd? If we're adding 1 wouldn't it be better as
>> 65535 so we palloc a power of 2?
>>
>> I have no idea if this affects performance, but it did strike me
>> as strange.
>
> Coming in late here, but it does seem very odd.

I think there are two things to consider - aset.c and glibc.

In AllocSet this is handled as an over-sized chunk, that is, a chunk
greater than ALLOC_CHUNK_LIMIT (which ends up being 8kB). Which means
it's allocated as a separate block, and does not go through the
freelists. So the size does not follow the usual "power of 2" pattern
and is only maxalign-ed, and then freed immediately on pfree.

So for AllocSet we're fine, and this should not result in any memory
inefficiency, assuming there are not many such requests (which would
result in many malloc/free calls, which may be somewhat expensive).

At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
with similar ideas (freelists, ...) so hopefully it's fine too.

And then there are the systems without glibc, or with other libc
implementations. No idea about those.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
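To make the aset.c behaviour described above concrete, here is a
self-contained sketch. The constants and the power-of-2 rounding are
illustrative; the real logic in src/backend/utils/mmgr/aset.c is
considerably more involved:

    #include <stdlib.h>

    /* Simplified stand-ins for the aset.c constants and macros. */
    #define ALLOC_CHUNK_LIMIT 8192
    #define MAXIMUM_ALIGNOF   8
    #define MAXALIGN(len) \
        (((len) + MAXIMUM_ALIGNOF - 1) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

    /* Round up to the next power of 2 (for normal-sized requests). */
    static size_t
    pow2_round_up(size_t size)
    {
        size_t result = 8;

        while (result < size)
            result <<= 1;
        return result;
    }

    static void *
    alloc_chunk(size_t size)
    {
        if (size > ALLOC_CHUNK_LIMIT)
        {
            /*
             * Over-sized chunk: becomes its own block, only maxalign-ed,
             * malloc'd and free'd directly rather than kept on a
             * freelist.  A 65537-byte request therefore costs one malloc
             * of ~64kB+1, not a rounded-up 128kB.
             */
            return malloc(MAXALIGN(size));
        }

        /* Normal chunk: rounded to a power of 2, served from a freelist
         * (freelist bookkeeping elided in this sketch). */
        return malloc(pow2_round_up(size));
    }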
On Wed, Jan 24, 2018 at 06:25:01PM +0100, Tomas Vondra wrote:
> I think there are two things to consider - aset.c and glibc.
>
> In AllocSet this is handled as an over-sized chunk, that is, a chunk
> greater than ALLOC_CHUNK_LIMIT (which ends up being 8kB). Which means
> it's allocated as a separate block, and does not go through the
> freelists. So the size does not follow the usual "power of 2" pattern
> and is only maxalign-ed, and then freed immediately on pfree.
>
> So for AllocSet we're fine, and this should not result in any memory
> inefficiency, assuming there are not many such requests (which would
> result in many malloc/free calls, which may be somewhat expensive).
>
> At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
> with similar ideas (freelists, ...) so hopefully it's fine too.
>
> And then there are the systems without glibc, or with other libc
> implementations. No idea about those.

Thanks for the analysis.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +
On Wed, Jan 24, 2018 at 12:25 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
> with similar ideas (freelists, ...) so hopefully it's fine too.
>
> And then there are the systems without glibc, or with other libc
> implementations. No idea about those.

My guess is that a fairly common pattern for larger chunks will be to
round the size up to a multiple of 4kB, the usual memory page size.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
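Under that guess, the arithmetic for the copy.c buffer works out as
follows (a sketch assuming 4kB pages and ignoring any allocator
header):

    #include <stdio.h>

    int
    main(void)
    {
        size_t page = 4096;             /* assumed page size */
        size_t request = 65536 + 1;     /* RAW_BUF_SIZE + 1 */
        size_t pages = (request + page - 1) / page;

        /* prints: 17 pages, 4095 bytes wasted */
        printf("%zu pages, %zu bytes wasted\n",
               pages, pages * page - request);
        return 0;
    }

That single extra byte tips the request into a 17th page, so nearly a
whole page is wasted under this rounding model.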
On 2018-01-24 13:19:19 -0500, Robert Haas wrote:
> On Wed, Jan 24, 2018 at 12:25 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
>> with similar ideas (freelists, ...) so hopefully it's fine too.
>>
>> And then there are the systems without glibc, or with other libc
>> implementations. No idea about those.
>
> My guess is that a fairly common pattern for larger chunks will be to
> round the size up to a multiple of 4kB, the usual memory page size.

Indeed. Don't think RAW_BUF_SIZE is quite big enough for that on most
platforms though. From man mallopt:

    Balancing these factors leads to a default setting of 128*1024 for
    the M_MMAP_THRESHOLD parameter.

Additionally, even when malloc() chooses to use mmap() to back an
allocation, it'll still need a header to know the size of the
allocation and such. So exactly using a size of a multiple of 4kB will
still leave you with wasted space. Due to the latter I can't see it
mattering whether or not we add +1 to a power-of-two size.

There's malloc_usable_size() returning the actual size of an
allocation. It'd probably be worthwhile to use that in a few of our own
allocator routines. If not available on a platform we can just have a
stub that returns the requested size...

Greetings,

Andres Freund
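A minimal sketch of that suggestion, assuming a configure-style
HAVE_MALLOC_USABLE_SIZE guard; the function name here is hypothetical,
not an existing PostgreSQL routine:

    #include <stdlib.h>
    #ifdef HAVE_MALLOC_USABLE_SIZE
    #include <malloc.h>         /* malloc_usable_size() on glibc */
    #endif

    /*
     * Report how much usable space a malloc'd chunk really has, falling
     * back to the requested size where the platform offers no way to
     * ask.  "actual_alloc_size" is a made-up name for illustration.
     */
    static size_t
    actual_alloc_size(void *ptr, size_t requested)
    {
    #ifdef HAVE_MALLOC_USABLE_SIZE
        return malloc_usable_size(ptr);
    #else
        (void) ptr;
        return requested;       /* stub: assume the request was exact */
    #endif
    }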
On Wed, Jan 24, 2018 at 1:43 PM, Andres Freund <andres@anarazel.de> wrote:
> Indeed. Don't think RAW_BUF_SIZE is quite big enough for that on most
> platforms though. From man mallopt:
>
>     Balancing these factors leads to a default setting of 128*1024 for
>     the M_MMAP_THRESHOLD parameter.
>
> Additionally, even when malloc() chooses to use mmap() to back an
> allocation, it'll still need a header to know the size of the
> allocation and such. So exactly using a size of a multiple of 4kB will
> still leave you with wasted space. Due to the latter I can't see it
> mattering whether or not we add +1 to a power-of-two size.

Well, it depends on how it works. dsa_allocate, for example, never
adds a header to the size of the allocation. Allocations < 8kB are
bucketed by size class and stored in superblocks carved up into
equal-sized chunks. Allocations > 8kB are rounded to a multiple of
the 4kB page size and we grab that many consecutive free pages. I
didn't make those behaviors up; I copied them from elsewhere. Some
other allocator I read about did small-medium-large allocations: large
with mmap(), medium with multiples of the page size, small with
closely-spaced size classes.

It doesn't seem like a particularly good idea to take a 64kB+1 byte
allocation, stick a header on it, and pack it tightly up against other
allocations on both sides. Seems like that could lead to
fragmentation problems. Is that really what it does?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
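A rough sketch of that two-tier rounding, using the thresholds named
above; the real dsa.c size-class table is more finely spaced than the
power-of-2 stand-in used here, and the constant names are simplified:

    #include <stddef.h>

    #define SUPERBLOCK_LIMIT 8192   /* <= 8kB: size classes in superblocks */
    #define DSA_PAGE_SIZE    4096   /* > 8kB: whole 4kB pages */

    static size_t
    dsa_rounded_size(size_t size)
    {
        if (size <= SUPERBLOCK_LIMIT)
        {
            /*
             * Small request: bump to the next size class.  No per-chunk
             * header; chunks pack tightly within a superblock.  (The
             * real class table is hand-tuned; power-of-2 is a stand-in.)
             */
            size_t class_size = 8;

            while (class_size < size)
                class_size <<= 1;
            return class_size;
        }

        /* Large request: round up to whole pages, again with no header. */
        return ((size + DSA_PAGE_SIZE - 1) / DSA_PAGE_SIZE) * DSA_PAGE_SIZE;
    }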
On 2018-01-24 14:25:37 -0500, Robert Haas wrote:
> On Wed, Jan 24, 2018 at 1:43 PM, Andres Freund <andres@anarazel.de> wrote:
>> Indeed. Don't think RAW_BUF_SIZE is quite big enough for that on most
>> platforms though. From man mallopt:
>>
>>     Balancing these factors leads to a default setting of 128*1024 for
>>     the M_MMAP_THRESHOLD parameter.
>>
>> Additionally, even when malloc() chooses to use mmap() to back an
>> allocation, it'll still need a header to know the size of the
>> allocation and such. So exactly using a size of a multiple of 4kB will
>> still leave you with wasted space. Due to the latter I can't see it
>> mattering whether or not we add +1 to a power-of-two size.
>
> Well, it depends on how it works. dsa_allocate, for example, never
> adds a header to the size of the allocation.

glibc's malloc does add a header. My half-informed suspicion is that
most newer malloc backing allocators will have a header, because a
shared lookup-by-address table is pretty expensive to maintain. A bit
of metadata indicating size and/or source of the allocation makes using
thread-local information a lot easier.

> Allocations < 8kB are
> bucketed by size class and stored in superblocks carved up into
> equal-sized chunks. Allocations > 8kB are rounded to a multiple of
> the 4kB page size and we grab that many consecutive free pages. I
> didn't make those behaviors up; I copied them from elsewhere. Some
> other allocator I read about did small-medium-large allocations: large
> with mmap(), medium with multiples of the page size, small with
> closely-spaced size classes.

Sure - all I'm trying to say is that it likely won't matter whether we
use power-of-two or power-of-two + 1, because due to overhead
considerations we'll likely not quite fit into a size class anyway.

> It doesn't seem like a particularly good idea to take a 64kB+1 byte
> allocation, stick a header on it, and pack it tightly up against other
> allocations on both sides. Seems like that could lead to
> fragmentation problems. Is that really what it does?

No, I'm fairly sure it's not.

Greetings,

Andres Freund
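For illustration, roughly the shape of the per-chunk header being
referred to, simplified from glibc's malloc/malloc.c (field names and
flag handling differ in the real source):

    #include <stddef.h>

    /* Simplified; glibc packs flag bits into the low bits of "size". */
    struct malloc_chunk_sketch
    {
        size_t prev_size;   /* size of previous chunk, when it is free */
        size_t size;        /* this chunk's size, plus flag bits */
        /*
         * User data follows the header, so a 65537-byte request occupies
         * 65537 bytes plus this header regardless of how the size is
         * rounded -- which is why the +1 alone is unlikely to matter.
         */
    };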
Andres Freund wrote:
> On 2018-01-24 14:25:37 -0500, Robert Haas wrote:
>> On Wed, Jan 24, 2018 at 1:43 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Indeed. Don't think RAW_BUF_SIZE is quite big enough for that on most
>>> platforms though. From man mallopt:
>>>
>>>     Balancing these factors leads to a default setting of 128*1024 for
>>>     the M_MMAP_THRESHOLD parameter.
>>>
>>> Additionally, even when malloc() chooses to use mmap() to back an
>>> allocation, it'll still need a header to know the size of the
>>> allocation and such. So exactly using a size of a multiple of 4kB will
>>> still leave you with wasted space. Due to the latter I can't see it
>>> mattering whether or not we add +1 to a power-of-two size.
>>
>> Well, it depends on how it works. dsa_allocate, for example, never
>> adds a header to the size of the allocation.
>
> glibc's malloc does add a header. My half-informed suspicion is that
> most newer malloc backing allocators will have a header, because a
> shared lookup-by-address table is pretty expensive to maintain. A bit
> of metadata indicating size and/or source of the allocation makes
> using thread-local information a lot easier.

Sounds like it'd be smart to allocate something closer to
M_MMAP_THRESHOLD (which with typical values would be about double the
current RAW_BUF_SIZE value), minus a few dozen bytes to allow for
palloc's and malloc's respective headers. That way we bet on a mmap()ed
allocation with minimum space wastage across all layers.

Not sure whether we want to try to minimize wastage through clever use
of malloc_usable_size() on each backend's first allocation -- that'd
probably just be overengineering.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
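A back-of-the-envelope version of that proposal, where the header
allowance is a pure guess rather than a measured value:

    #include <stdio.h>

    int
    main(void)
    {
        size_t mmap_threshold = 128 * 1024; /* typical M_MMAP_THRESHOLD */
        size_t header_slop = 64;            /* guessed palloc+malloc overhead */

        /* prints: candidate RAW_BUF_SIZE: 131008 */
        printf("candidate RAW_BUF_SIZE: %zu\n",
               mmap_threshold - header_slop);
        return 0;
    }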
Hi,

On 2018-01-24 17:07:01 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
>> glibc's malloc does add a header. My half-informed suspicion is that
>> most newer malloc backing allocators will have a header, because a
>> shared lookup-by-address table is pretty expensive to maintain. A bit
>> of metadata indicating size and/or source of the allocation makes
>> using thread-local information a lot easier.
>
> Sounds like it'd be smart to allocate something closer to
> M_MMAP_THRESHOLD (which with typical values would be about double the
> current RAW_BUF_SIZE value), minus a few dozen bytes to allow for
> palloc's and malloc's respective headers. That way we bet on a
> mmap()ed allocation with minimum space wastage across all layers.

In general there might be cases where that's worthwhile, although
M_MMAP_THRESHOLD IIRC isn't a constant anymore, complicating things.
The specific case this thread is discussing doesn't seem worthy of such
attention though; there's afaict no actual practical problem here.

> Not sure whether we want to try to minimize wastage through clever use
> of malloc_usable_size() on each backend's first allocation -- that'd
> probably just be overengineering.

Likely also dangerous, since I don't think this is a constant.

Greetings,

Andres Freund
On Thu, Jan 25, 2018 at 7:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 24, 2018 at 12:25 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
>> with similar ideas (freelists, ...) so hopefully it's fine too.
>>
>> And then there are the systems without glibc, or with other libc
>> implementations. No idea about those.
>
> My guess is that a fairly common pattern for larger chunks will be to
> round the size up to a multiple of 4kB, the usual memory page size.

See also this discussion:

https://www.postgresql.org/message-id/flat/CAEepm%3D1bRyd%2B_W9eW-QmP1RGP03ti48zgd%3DK11Q6o4edQLgkcg%40mail.gmail.com#CAEepm=1bRyd+_W9eW-QmP1RGP03ti48zgd=K11Q6o4edQLgkcg@mail.gmail.com

TL;DR glibc doesn't actually round up like that below 128kB, but many
others, including FreeBSD and macOS, round up to various page sizes or
size classes, including 8kB (!) and 512 bytes. I find this a bit
frustrating because it means that the most popular libc implementation
doesn't have the problem, so this kind of thing probably isn't a high
priority; but on most other Unices (and I have no clue about Windows),
including my current favourite, we probably waste a bunch of memory.

--
Thomas Munro
http://www.enterprisedb.com
On Thu, Jan 25, 2018 at 09:30:54AM +1300, Thomas Munro wrote:
> On Thu, Jan 25, 2018 at 7:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Jan 24, 2018 at 12:25 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> At the glibc level ... I'm not so sure. AFAIK glibc uses an allocator
>>> with similar ideas (freelists, ...) so hopefully it's fine too.
>>>
>>> And then there are the systems without glibc, or with other libc
>>> implementations. No idea about those.
>>
>> My guess is that a fairly common pattern for larger chunks will be to
>> round the size up to a multiple of 4kB, the usual memory page size.
>
> See also this discussion:
>
> https://www.postgresql.org/message-id/flat/CAEepm%3D1bRyd%2B_W9eW-QmP1RGP03ti48zgd%3DK11Q6o4edQLgkcg%40mail.gmail.com#CAEepm=1bRyd+_W9eW-QmP1RGP03ti48zgd=K11Q6o4edQLgkcg@mail.gmail.com
>
> TL;DR glibc doesn't actually round up like that below 128kB, but many
> others, including FreeBSD and macOS, round up to various page sizes or
> size classes, including 8kB (!) and 512 bytes. I find this a bit
> frustrating because it means that the most popular libc implementation
> doesn't have the problem, so this kind of thing probably isn't a high
> priority; but on most other Unices (and I have no clue about Windows),
> including my current favourite, we probably waste a bunch of memory.

The BSD memory allocator used to allocate in powers of two, and keep the
header in a separate location. They did this so they could combine two
free, identically-sized memory blocks into a single one that was double
the size. I have no idea how it works now.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +
On Thu, Jan 25, 2018 at 9:35 AM, Bruce Momjian <bruce@momjian.us> wrote:
> The BSD memory allocator used to allocate in powers of two, and keep the
> header in a separate location. They did this so they could combine two
> free, identically-sized memory blocks into a single one that was double
> the size. I have no idea how it works now.

According to "man malloc" on my FreeBSD 11.1 box (which uses jemalloc;
I think NetBSD and OpenBSD use something else), the code Andrew showed
at the top of this thread will waste ~16kB (because it'll round up to
80kB), and nodeHash.c will waste ~8kB for every ~32kB chunk of tuples
as described in that other thread.

+----------------------------------------------------------+
|Category  Spacing  Size                                    |
|Small     lg       [8]                                     |
|          16       [16, 32, 48, 64, 80, 96, 112, 128]      |
|          32       [160, 192, 224, 256]                    |
|          64       [320, 384, 448, 512]                    |
|          128      [640, 768, 896, 1024]                   |
|          256      [1280, 1536, 1792, 2048]                |
|          512      [2560, 3072, 3584, 4096]                |
|          1 KiB    [5 KiB, 6 KiB, 7 KiB, 8 KiB]            |
|          2 KiB    [10 KiB, 12 KiB, 14 KiB]                |
|Large     2 KiB    [16 KiB]                                |
|          4 KiB    [20 KiB, 24 KiB, 28 KiB, 32 KiB]        |
|          8 KiB    [40 KiB, 48 KiB, 56 KiB, 64 KiB]        |
|          16 KiB   [80 KiB, 96 KiB, 112 KiB, 128 KiB]      |
|          32 KiB   [160 KiB, 192 KiB, 224 KiB, 256 KiB]    |
|          64 KiB   [320 KiB, 384 KiB, 448 KiB, 512 KiB]    |
|          128 KiB  [640 KiB, 768 KiB, 896 KiB, 1 MiB]      |
|          256 KiB  [1280 KiB, 1536 KiB, 1792 KiB]          |
|Huge      256 KiB  [2 MiB]                                 |
|          512 KiB  [2560 KiB, 3 MiB, 3584 KiB, 4 MiB]      |
|          1 MiB    [5 MiB, 6 MiB, 7 MiB, 8 MiB]            |
|          2 MiB    [10 MiB, 12 MiB, 14 MiB, 16 MiB]        |
|          4 MiB    [20 MiB, 24 MiB, 28 MiB, 32 MiB]        |
|          8 MiB    [40 MiB, 48 MiB, 56 MiB, 64 MiB]        |
|          ...      ...                                     |
|          512 PiB  [2560 PiB, 3 EiB, 3584 PiB, 4 EiB]      |
|          1 EiB    [5 EiB, 6 EiB, 7 EiB]                   |
+----------------------------------------------------------+

--
Thomas Munro
http://www.enterprisedb.com
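Spelling out the ~16kB figure against that table (the 80 KiB class
comes from the Large/16 KiB row, the next class above 64 KiB):

    #include <stdio.h>

    int
    main(void)
    {
        size_t request = 65536 + 1;     /* RAW_BUF_SIZE + 1 */
        size_t class = 80 * 1024;       /* next jemalloc class above 64 KiB */

        /* prints: 16383 bytes wasted */
        printf("%zu bytes wasted\n", class - request);
        return 0;
    }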
Bruce Momjian <bruce@momjian.us> writes:
> On Thu, Jan 25, 2018 at 09:30:54AM +1300, Thomas Munro wrote:
>> On Thu, Jan 25, 2018 at 7:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> My guess is that a fairly common pattern for larger chunks will be to
>>> round the size up to a multiple of 4kB, the usual memory page size.
>>
>> See also this discussion:
>> https://www.postgresql.org/message-id/flat/CAEepm%3D1bRyd%2B_W9eW-QmP1RGP03ti48zgd%3DK11Q6o4edQLgkcg%40mail.gmail.com#CAEepm=1bRyd+_W9eW-QmP1RGP03ti48zgd=K11Q6o4edQLgkcg@mail.gmail.com
>> TL;DR glibc doesn't actually round up like that below 128kB, but many
>> others, including FreeBSD and macOS, round up to various page sizes or
>> size classes, including 8kB (!) and 512 bytes. I find this a bit
>> frustrating because it means that the most popular libc implementation
>> doesn't have the problem, so this kind of thing probably isn't a high
>> priority; but on most other Unices (and I have no clue about Windows),
>> including my current favourite, we probably waste a bunch of memory.

> The BSD memory allocator used to allocate in powers of two, and keep the
> header in a separate location. They did this so they could combine two
> free, identically-sized memory blocks into a single one that was double
> the size. I have no idea how it works now.

It seems like there's fairly good reason to suppose that most versions
of malloc are efficient for power-of-2 request sizes. Our big problem
is the overhead that we add ourselves. That, however, we could
compensate for by adjusting request sizes in cases where the request is
flexible. I'd definitely support some sort of memory context API to
allow such adjustment, as we speculated about in the thread Thomas
points to above.

			regards, tom lane
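A sketch of what such an API might look like. Both the name and the
power-of-2 rounding are hypothetical, chosen only to show the shape of
"ask for at least N bytes, learn what you actually got":

    #include <stdlib.h>

    /*
     * Hypothetical "allocate at least" primitive: the caller states a
     * minimum, the allocator grants a size that rounds well for it, and
     * the caller uses the whole granted amount.  No such routine exists
     * in PostgreSQL today.
     */
    static void *
    alloc_at_least(size_t min_size, size_t *actual_size)
    {
        /* A real implementation would consult the allocator's actual
         * size classes; powers of 2 stand in for them here. */
        size_t granted = 8;

        while (granted < min_size)
            granted <<= 1;

        *actual_size = granted;
        return malloc(granted);
    }

With something like this, copy.c could request "at least 64kB+1" and
simply use whatever buffer size comes back, leaving no wasted tail in
any layer's size class.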