Thread: POSIX question
Hello,

I had an idea involving hugepages, and I read why PostgreSQL doesn't support POSIX shared memory (the need for nattach). While reading about POSIX/SysV I found this thread about dynamic chunking of shared memory:
http://archives.postgresql.org/pgsql-hackers/2010-08/msg00586.php

When playing with mmap I worked out an approach for dealing with growing files, so maybe this approach could resolve both of the above problems (POSIX and dynamic shared memory). Here is the idea:

1. mmap some large amount of anonymous virtual memory (this will be the maximum size of shared memory).
2. Init a small SysV chunk for the shmem header (to keep the "fallout" requirements).
3. SysV remap is Linux-specific, so unmap the first few VM pages from step 1 and attach (2.) there.
3a. Lock the header when adding chunks (the 1st chunk is the header); we don't want concurrent chunk allocation.
4. Allocate some other chunks of shared memory (POSIX is the best way) and record them in the shmem header; put there information like chunk id/name, whether it is POSIX or SysV, and some useful flags (hugepage?) needed for reattaching; attach those in the area from step 1.
4b. Unlock 3a.

Point 1 will not eat memory, as memory allocation is delayed, and on 64-bit platforms you may reserve quite a huge chunk of this; in future it may be possible to use mmap / munmap to concat chunks / defrag it etc.

mmap guarantees that mapping with MAP_FIXED over an already mapped area will unmap the old mapping.

A working "preview" changeset applied to sysv_memory.c may be quite small.

If someone wants to "extend" memory, he may add a new chunk (of course, to keep the header memory contiguous, the number of chunks is limited).

What do you think about this?

Regards,
Radek
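A minimal sketch of steps 1 and 4, assuming Linux and a compiler that exposes MAP_ANONYMOUS. It is not the proposed patch: the names reserve_region and attach_chunk are hypothetical, the reservation uses PROT_NONE (an assumption, the message does not name a protection), and an anonymous shared mapping stands in for the POSIX or SysV segment that the real scheme would attach. What it does show is the MAP_FIXED behaviour the message relies on: mapping over part of an existing reservation replaces the old pages there.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Step 1: reserve address space only; PROT_NONE keeps it inaccessible. */
static void *reserve_region(size_t max_size)
{
    void *base = mmap(NULL, max_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return base == MAP_FAILED ? NULL : base;
}

/* Step 4 (simplified): map a shared chunk at a page offset inside the
 * reservation. MAP_FIXED replaces whatever was mapped there before.
 * The anonymous shared mapping is a stand-in for a real shm segment. */
static void *attach_chunk(void *base, size_t first_page, size_t npages)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    void *want = (char *) base + first_page * page;
    void *got = mmap(want, npages * page, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return got == MAP_FAILED ? NULL : got;
}
```

A caller would reserve once at startup and then call attach_chunk for each new chunk, with the header lock from steps 3a/4b held around the call.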
On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
> 1. mmap some large amount of anonymous virtual memory (this will be maximum size of shared memory).
> ...
> Point 1. will no eat memory, as memory allocation is delayed and in 64bit platforms you may reserve quite huge chunk of this, and in future it may be possible using mmap / munmap to concat chunks / defrag it etc.

I think this breaks with strict overcommit settings (i.e. vm.overcommit_memory = 2 on Linux). To fix that, you'd need a way to tell the kernel (or glibc) to simply reserve a chunk of virtual address space for further use. Not sure if there's an API for that...

best regards,
Florian Pflug
Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 16:16:58
> On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
> > 1. mmap some large amount of anonymous virtual memory (this will be
> > maximum size of shared memory). ...
> > Point 1. will no eat memory, as memory allocation is delayed and in 64bit
> > platforms you may reserve quite huge chunk of this, and in future it may
> > be possible using mmap / munmap to concat chunks / defrag it etc.
>
> I think this breaks with strict overcommit settings (i.e.
> vm.overcommit_memory = 2 on linux). To fix that, you'd need a way to tell
> the kernel (or glibc) to simply reserve a chunk of virtual address space
> for further user. Not sure if there's a API for that...
>
> best regards,
> Florian Pflug

This may be achieved by many other things, like mmap /dev/null.

Regards,
Radek
* Florian Pflug:
> I think this breaks with strict overcommit settings
> (i.e. vm.overcommit_memory = 2 on linux). To fix that, you'd need a
> way to tell the kernel (or glibc) to simply reserve a chunk of virtual
> address space for further user. Not sure if there's a API for that...

mmap with PROT_NONE and subsequent update with mprotect does this on Linux.

(It's not clear to me what this is trying to solve, though.)

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99
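The PROT_NONE plus mprotect pattern mentioned above can be sketched as follows, assuming Linux. The function names reserve_address_space and commit_prefix are hypothetical. How the kernel accounts such a reservation under vm.overcommit_memory = 2 is as described in the message above and is not verified by this code.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Reserve address space only: PROT_NONE pages cannot be read or written. */
static void *reserve_address_space(size_t len)
{
    void *p = mmap(NULL, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Make the first `len` bytes of the reservation usable.
 * Returns 0 on success, -1 on error (errno is set by mprotect). */
static int commit_prefix(void *base, size_t len)
{
    return mprotect(base, len, PROT_READ | PROT_WRITE);
}
```

Touching a page before commit_prefix has covered it raises SIGSEGV, so the usable size has to be tracked by the caller.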
On Jun20, 2011, at 16:39 , Radosław Smogura wrote:
> Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 16:16:58
>> On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
>>> 1. mmap some large amount of anonymous virtual memory (this will be
>>> maximum size of shared memory). ...
>>> Point 1. will no eat memory, as memory allocation is delayed and in 64bit
>>> platforms you may reserve quite huge chunk of this, and in future it may
>>> be possible using mmap / munmap to concat chunks / defrag it etc.
>>
>> I think this breaks with strict overcommit settings (i.e.
>> vm.overcommit_memory = 2 on linux). To fix that, you'd need a way to tell
>> the kernel (or glibc) to simply reserve a chunk of virtual address space
>> for further user. Not sure if there's a API for that...
>>
>> best regards,
>> Florian Pflug
>
> This may be achived by many other things, like mmap /dev/null.

Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?

Or at least this is what I always thought glibc does when you malloc() a large block at once. (This allows it to actually return the memory to the kernel once you free() it, which isn't possible if the memory was allocated simply by extending the heap).

You can work around this by mmap()ing an actual file, because then the kernel knows it can use the file as backing store and thus doesn't need to reserve actual physical memory. (In a way, this just adds additional swap space). Doesn't seem very clean though...

Even if there's a way to work around a strict overcommit setting, unless the workaround is a syscall *explicitly* designed for that purpose, I'd be very careful with using it. You might just as well be exploiting a bug in the overcommit accounting logic and future kernel versions may simply choose to fix the bug...

best regards,
Florian Pflug
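The file-backed workaround described above could look roughly like this. It is a sketch under assumptions: the helper name map_temp_file and the /tmp path template are made up for illustration, and the file is unlinked right away so it disappears once the mapping is gone.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `len` bytes backed by a temporary file instead of anonymous memory.
 * Returns NULL on any failure. */
static void *map_temp_file(size_t len)
{
    char path[] = "/tmp/shmem-sketch-XXXXXX";   /* hypothetical location */
    int fd = mkstemp(path);
    void *p;

    if (fd < 0)
        return NULL;
    unlink(path);                        /* file lives on until unmapped */
    if (ftruncate(fd, (off_t) len) != 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                           /* the mapping keeps the file open */
    return p == MAP_FAILED ? NULL : p;
}
```

Dirty pages of such a mapping can be written back to the file, which is the "additional swap space" effect mentioned in the message.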
Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 17:01:40
> On Jun20, 2011, at 16:39 , Radosław Smogura wrote:
> > Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 16:16:58
> >
> >> On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
> >>> 1. mmap some large amount of anonymous virtual memory (this will be
> >>> maximum size of shared memory). ...
> >>> Point 1. will no eat memory, as memory allocation is delayed and in
> >>> 64bit platforms you may reserve quite huge chunk of this, and in
> >>> future it may be possible using mmap / munmap to concat chunks /
> >>> defrag it etc.
> >>
> >> I think this breaks with strict overcommit settings (i.e.
> >> vm.overcommit_memory = 2 on linux). To fix that, you'd need a way to
> >> tell the kernel (or glibc) to simply reserve a chunk of virtual address
> >> space for further user. Not sure if there's a API for that...
> >>
> >> best regards,
> >> Florian Pflug
> >
> > This may be achived by many other things, like mmap /dev/null.
>
> Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?
>
> Or at least this is what I always thought glibc does when you malloc()
> are large block at once. (This allows it to actually return the memory
> to the kernel once you free() it, which isn't possible if the memory
> was allocated simply by extending the heap).
>
> You can work around this by mmap()ing an actual file, because then
> the kernel knows it can use the file as backing store and thus doesn't
> need to reserve actual physical memory. (In a way, this just adds
> additional swap space). Doesn't seem very clean though...
>
> Even if there's a way to work around a strict overcommit setting, unless
> the workaround is a syscall *explicitly* designed for that purpose, I'd
> be very careful with using it. You might just as well be exploiting a
> bug in the overcommit accounting logic and future kernel versions may
> simply choose to fix the bug...
>
> best regards,
> Florian Pflug

I'm 99% sure. When I was "playing" with mmap I preallocated, probably, about 100GB of memory.

Regards,
Radek
On Jun20, 2011, at 17:05 , Radosław Smogura wrote:
> I'm sure at 99%. When I ware "playing" with mmap I preallocated, probably,
> about 100GB of memory.

You need to set vm.overcommit_memory to "2" to see the difference. Did you do that?

You can do that either with "echo 2 > /proc/sys/vm/overcommit_memory" or by editing /etc/sysctl.conf and issuing "sysctl -p".

best regards,
Florian Pflug
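Before running such an experiment, a test program can check which mode is active by reading the same /proc file. A small sketch, assuming Linux; read_overcommit_mode is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>

/* Returns the current vm.overcommit_memory value (0, 1 or 2),
 * or -1 if the file cannot be read or parsed (e.g. not on Linux). */
static int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```

A result of 2 means strict accounting is in effect; any other value means a large-mmap test says little about the strict case discussed here.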
On Monday, June 20, 2011 17:05:48 Radosław Smogura wrote:
> Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 17:01:40
>
> > On Jun20, 2011, at 16:39 , Radosław Smogura wrote:
> > > Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 16:16:58
> > >
> > >> On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
> > >>> 1. mmap some large amount of anonymous virtual memory (this will be
> > >>> maximum size of shared memory). ...
> > >>> Point 1. will no eat memory, as memory allocation is delayed and in
> > >>> 64bit platforms you may reserve quite huge chunk of this, and in
> > >>> future it may be possible using mmap / munmap to concat chunks /
> > >>> defrag it etc.
> > >>
> > >> I think this breaks with strict overcommit settings (i.e.
> > >> vm.overcommit_memory = 2 on linux). To fix that, you'd need a way to
> > >> tell the kernel (or glibc) to simply reserve a chunk of virtual
> > >> address space for further user. Not sure if there's a API for that...
> > >>
> > >> best regards,
> > >> Florian Pflug
> > >
> > > This may be achived by many other things, like mmap /dev/null.
> >
> > Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?
> >
> > Or at least this is what I always thought glibc does when you malloc()
> > are large block at once. (This allows it to actually return the memory
> > to the kernel once you free() it, which isn't possible if the memory
> > was allocated simply by extending the heap).
> >
> > You can work around this by mmap()ing an actual file, because then
> > the kernel knows it can use the file as backing store and thus doesn't
> > need to reserve actual physical memory. (In a way, this just adds
> > additional swap space). Doesn't seem very clean though...
> >
> > Even if there's a way to work around a strict overcommit setting, unless
> > the workaround is a syscall *explicitly* designed for that purpose, I'd
> > be very careful with using it. You might just as well be exploiting a
> > bug in the overcommit accounting logic and future kernel versions may
> > simply choose to fix the bug...
> >
> > best regards,
> > Florian Pflug
>
> I'm sure at 99%. When I ware "playing" with mmap I preallocated, probably,
> about 100GB of memory.

The default setting is to allow overcommit.

Andres
On Mon, Jun 20, 2011 at 4:01 PM, Florian Pflug <fgp@phlo.org> wrote:
> Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?
>
> Or at least this is what I always thought glibc does when you malloc()

It mmaps /dev/zero actually.

--
greg
On Mon, Jun 20, 2011 at 04:16:58PM +0200, Florian Pflug wrote:
> On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
> > 1. mmap some large amount of anonymous virtual memory (this will be maximum size of shared memory).
> > ...
> > Point 1. will no eat memory, as memory allocation is delayed and in 64bit platforms you may reserve quite huge chunk of this, and in future it may be possible using mmap / munmap to concat chunks / defrag it etc.
>
> I think this breaks with strict overcommit settings (i.e. vm.overcommit_memory = 2 on linux). To fix that, you'd need a way to tell the kernel (or glibc) to simply reserve a chunk of virtual address space for further user. Not sure if there's a API for that...

I run diskless swapless cluster systems with zero overcommit (i.e. it's entirely disabled), which means that all operations are strict success/fail due to allocation being immediate. mmap of a large amount of anonymous memory would almost certainly fail on such a setup--you definitely can't assume that a large anonymous mmap will always succeed, since there is no delayed allocation.

[we do in reality have a small overcommit allowance to permit efficient fork(2), but it's tiny and (in this context) irrelevant]

Regards,
Roger

--
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
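In code, the practical consequence is that every mmap call needs its result checked against MAP_FAILED and the error reported. A minimal sketch, assuming Linux; try_map is a hypothetical wrapper name. On a strict-accounting system the expected errno for an oversized request is ENOMEM.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Try to map `len` bytes of writable anonymous memory.
 * On success returns the address and sets *err to 0.
 * On failure returns NULL and stores the errno from mmap in *err. */
static void *try_map(size_t len, int *err)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        *err = errno;
        return NULL;
    }
    *err = 0;
    return p;
}
```

The caller can then decide whether to abort startup or retry with a smaller size, rather than assuming the reservation always succeeds.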
Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 17:07:55
> On Jun20, 2011, at 17:05 , Radosław Smogura wrote:
> > I'm sure at 99%. When I ware "playing" with mmap I preallocated,
> > probably, about 100GB of memory.
>
> You need to set vm.overcommit_memory to "2" to see the difference. Did
> you do that?
>
> You can do that either with "echo 2 > /proc/sys/vm/overcommit_memory"
> or by editing /etc/sysctl.conf and issuing "sysctl -p".
>
> best regards,
> Florian Pflug

I've just created a 127TB mapping on Linux, the maximum allowed by the VM. Trying with overcommit set to 0, 1, 2.

Regards,
Radek
On Monday, June 20, 2011 17:11:14 Greg Stark wrote:
> On Mon, Jun 20, 2011 at 4:01 PM, Florian Pflug <fgp@phlo.org> wrote:
> > Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?
> >
> > Or at least this is what I always thought glibc does when you malloc()
>
> It mmaps /dev/zero actually.

As the nitpicking has already started: AFAIR it's just passing -1 as fd and using the MAP_ANONYMOUS flag argument ;)

Andres
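The two variants under discussion look like this side by side. This is only an illustration of the calls, assuming Linux with /dev/zero available; it does not show what any particular glibc version does internally, and the function names are hypothetical. Both return zero-filled private memory.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Anonymous mapping: no file at all, fd is -1 with MAP_ANONYMOUS. */
static void *map_anonymous(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Older idiom: a private mapping of /dev/zero gives the same result. */
static void *map_dev_zero(size_t len)
{
    int fd = open("/dev/zero", O_RDWR);
    void *p;

    if (fd < 0)
        return NULL;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}
```

Mapping /dev/null, as suggested earlier in the thread, is a different thing from either of these.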
Radek,

On 06/20/2011 03:27 PM, Radosław Smogura wrote:
> When playing with mmap I done some approach how to deal with growing
> files, so...

Your approach seems to require a SysV alloc (for nattach) as well as POSIX shmem and/or mmap. Adding requirements for these syscalls certainly needs to give a good benefit for Postgres, as they presumably pose portability issues.

> 3. a. Lock header when adding chunks (1st chunk is header) (we don't
> want concurrent chunk allocation)

Sure we don't? There are at least a dozen memory allocators for multi-threaded applications, all trying to optimize for concurrency. The programmer of a multi-threaded application doesn't need to care much about concurrent allocations. He can allocate (and free) quite a lot of tiny chunks concurrently from shared memory.

Regards

Markus Wanner
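As one illustration of allocation without a global lock, here is a very small bump allocator built on C11 atomics. It is not taken from any of the allocators alluded to above and is far simpler than them: the type bump_arena and the function bump_alloc are hypothetical, there is no free(), and a failed request leaves the arena exhausted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical arena: `next` is the offset of the first free byte. */
typedef struct {
    atomic_size_t  next;
    size_t         capacity;
    unsigned char *data;
} bump_arena;

/* Hand out `len` bytes, rounded up to 16, without taking a lock.
 * Concurrent callers each get a distinct range from the fetch-add. */
static void *bump_alloc(bump_arena *a, size_t len)
{
    size_t aligned, start;

    if (len > a->capacity)
        return NULL;
    aligned = (len + 15) & ~(size_t) 15;
    start = atomic_fetch_add(&a->next, aligned);
    if (start + aligned > a->capacity)
        return NULL;           /* arena is full from now on */
    return a->data + start;
}
```

Whether something along these lines would fit the chunk header from the original proposal is an open question; the sketch only shows that handing out ranges concurrently does not by itself require the header lock.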