Thread: POSIX question

POSIX question

From
Radosław Smogura
Date:
Hello,
I had an idea about hugepages, and I read why PostgreSQL doesn't support POSIX shared memory (the need for nattach).
While reading about POSIX/SysV I found this thread about dynamically chunking shared memory:

http://archives.postgresql.org/pgsql-hackers/2010-08/msg00586.php

When playing with mmap I worked out an approach for dealing with growing files, so...
Maybe this approach could resolve both of the above problems (POSIX and dynamic shared memory). Here is the idea:

1. mmap some large amount of anonymous virtual memory (this will be maximum size of shared memory).
2. Init a small SysV chunk for the shmem header (to keep the "fallout" requirements).
3. SysV remap is Linux-specific, so unmap the first few VM pages of step 1 and attach (2) there.
3a. Lock header when adding chunks (1st chunk is header) (we don't want concurrent chunk allocation).
4. Allocate some other chunks of shared memory (POSIX is the best way) and record them in the shmem header: put there
   information like chunk id/name, whether it is POSIX or SysV, and some useful flags (hugepage?) needed for
   reattaching; attach those inside the region from step 1.
4b. Unlock 3a.

Point 1 will not eat memory, as memory allocation is delayed, and on 64-bit platforms you may reserve quite a huge chunk
of address space; in future it may be possible to use mmap / munmap to concatenate chunks, defragment them, etc.

mmap guarantees that mapping with MAP_FIXED over an already mapped area will unmap the old mapping.
A working "preview" changeset applied to sysv_memory.c may be quite small.
If someone wants to "extend" memory, he may add a new chunk (of course, to keep the header memory contiguous, the number
of chunks is limited).

What do you think about this?
Regards,
Radek
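
A minimal C sketch of steps 1 and 4 above, assuming Linux-style mmap flags. The function names reserve_region and attach_chunk are hypothetical, and an anonymous MAP_SHARED mapping stands in for a real chunk; an actual implementation would pass a descriptor from shm_open (or use shmat for SysV segments) at the marked place. The header, its lock, and the chunk bookkeeping are left out.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Step 1: reserve address space only. PROT_NONE pages cannot be touched,
 * so nothing is backed by memory until something is mapped over them. */
static void *reserve_region(size_t max_size)
{
    void *base = mmap(NULL, max_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return base == MAP_FAILED ? NULL : base;
}

/* Step 4: place a shared chunk at a page-aligned offset inside the
 * reservation. MAP_FIXED replaces whatever was mapped in that range.
 * Assumption: an anonymous shared mapping stands in for a POSIX chunk;
 * with shm_open() the fd would replace -1 and MAP_ANONYMOUS would go. */
static void *attach_chunk(void *base, size_t offset, size_t size)
{
    void *p = mmap((char *) base + offset, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

For example, reserve_region(1 << 30) followed by attach_chunk(base, 1 << 20, 1 << 16) would give a writable 64 kB chunk located 1 MB into the reservation. Offset and size must be multiples of the page size.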


Re: POSIX question

From
Florian Pflug
Date:
On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
> 1. mmap some large amount of anonymous virtual memory (this will be maximum size of shared memory).
> ...
> Point 1. will not eat memory, as memory allocation is delayed and in 64bit platforms you may reserve quite huge chunk
> of this, and in future it may be possible using mmap / munmap to concat chunks / defrag it etc.

I think this breaks with strict overcommit settings (i.e. vm.overcommit_memory = 2 on Linux). To fix that, you'd need a
way to tell the kernel (or glibc) to simply reserve a chunk of virtual address space for further use. Not sure if
there's an API for that...

best regards,
Florian Pflug



Re: POSIX question

From
Radosław Smogura
Date:
Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 16:16:58
> On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
> > 1. mmap some large amount of anonymous virtual memory (this will be
> > maximum size of shared memory). ...
> > Point 1. will not eat memory, as memory allocation is delayed and in 64bit
> > platforms you may reserve quite huge chunk of this, and in future it may
> > be possible using mmap / munmap to concat chunks / defrag it etc.
>
> I think this breaks with strict overcommit settings (i.e.
> vm.overcommit_memory = 2 on linux). To fix that, you'd need a way to tell
> the kernel (or glibc) to simply reserve a chunk of virtual address space
> for further use. Not sure if there's an API for that...
>
> best regards,
> Florian Pflug

This may be achieved by many other things, like mmap /dev/null.

Regards,
Radek


Re: POSIX question

From
Florian Weimer
Date:
* Florian Pflug:

> I think this breaks with strict overcommit settings
> (i.e. vm.overcommit_memory = 2 on linux). To fix that, you'd need a
> way to tell the kernel (or glibc) to simply reserve a chunk of virtual
> address space for further use. Not sure if there's an API for that...

mmap with PROT_NONE and subsequent update with mprotect does this on
Linux.

(It's not clear to me what this is trying to solve, though.)
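
As a rough illustration of that suggestion, the sketch below reserves address space with PROT_NONE and later makes part of it usable with mprotect. The helper names reserve_address_space and commit_range are hypothetical; how such a mapping is accounted under vm.overcommit_memory = 2 depends on the kernel and is not something this sketch checks.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Reserve a range of addresses that cannot be read or written yet. */
static void *reserve_address_space(size_t size)
{
    void *p = mmap(NULL, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Make a page-aligned part of the reservation readable and writable.
 * Returns 0 on success, -1 on failure (errno is set by mprotect). */
static int commit_range(void *addr, size_t len)
{
    return mprotect(addr, len, PROT_READ | PROT_WRITE);
}
```

Only the committed range may be touched; an access anywhere else in the reservation faults.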

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: POSIX question

From
Florian Pflug
Date:
On Jun20, 2011, at 16:39 , Radosław Smogura wrote:
> Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 16:16:58
>> On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
>>> 1. mmap some large amount of anonymous virtual memory (this will be
>>> maximum size of shared memory). ...
>>> Point 1. will not eat memory, as memory allocation is delayed and in 64bit
>>> platforms you may reserve quite huge chunk of this, and in future it may
>>> be possible using mmap / munmap to concat chunks / defrag it etc.
>>
>> I think this breaks with strict overcommit settings (i.e.
>> vm.overcommit_memory = 2 on linux). To fix that, you'd need a way to tell
>> the kernel (or glibc) to simply reserve a chunk of virtual address space
>> for further use. Not sure if there's an API for that...
>>
>> best regards,
>> Florian Pflug
>
> This may be achieved by many other things, like mmap /dev/null.

Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?

Or at least this is what I always thought glibc does when you malloc()
a large block at once. (This allows it to actually return the memory
to the kernel once you free() it, which isn't possible if the memory
was allocated simply by extending the heap).

You can work around this by mmap()ing an actual file, because then
the kernel knows it can use the file as backing store and thus doesn't
need to reserve actual physical memory. (In a way, this just adds
additional swap space). Doesn't seem very clean though...
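
A sketch of the file-backed variant described above, assuming a POSIX system with a writable /tmp. The function name map_file_backed and the template path are hypothetical. The file is sparse after ftruncate, so touching pages can still fail later (typically with SIGBUS) if the filesystem runs out of space.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes mkstemp and ftruncate on glibc */
#endif
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `size` bytes backed by a temporary file instead of anonymous memory. */
static void *map_file_backed(size_t size)
{
    char path[] = "/tmp/shmem_backing_XXXXXX";  /* hypothetical location */
    int fd = mkstemp(path);
    if (fd < 0)
        return NULL;
    unlink(path);                /* the file lives on until it is unmapped */
    if (ftruncate(fd, (off_t) size) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                   /* the mapping keeps the file referenced */
    return p == MAP_FAILED ? NULL : p;
}
```

If /tmp is itself a memory-backed filesystem, the file does not provide a disk backing store, so a real test would need a path on disk.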

Even if there's a way to work around a strict overcommit setting, unless
the workaround is a syscall *explicitly* designed for that purpose, I'd
be very careful with using it. You might just as well be exploiting a
bug in the overcommit accounting logic and future kernel versions may
simply choose to fix the bug...

best regards,
Florian Pflug



Re: POSIX question

From
Radosław Smogura
Date:
Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 17:01:40
> On Jun20, 2011, at 16:39 , Radosław Smogura wrote:
> > Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 16:16:58
> >
> >> On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
> >>> 1. mmap some large amount of anonymous virtual memory (this will be
> >>> maximum size of shared memory). ...
> >>> Point 1. will not eat memory, as memory allocation is delayed and in
> >>> 64bit platforms you may reserve quite huge chunk of this, and in
> >>> future it may be possible using mmap / munmap to concat chunks /
> >>> defrag it etc.
> >>
> >> I think this breaks with strict overcommit settings (i.e.
> >> vm.overcommit_memory = 2 on linux). To fix that, you'd need a way to
> >> tell the kernel (or glibc) to simply reserve a chunk of virtual address
> >> space for further use. Not sure if there's an API for that...
> >>
> >> best regards,
> >> Florian Pflug
> >
> > This may be achieved by many other things, like mmap /dev/null.
>
> Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?
>
> Or at least this is what I always thought glibc does when you malloc()
> a large block at once. (This allows it to actually return the memory
> to the kernel once you free() it, which isn't possible if the memory
> was allocated simply by extending the heap).
>
> You can work around this by mmap()ing an actual file, because then
> the kernel knows it can use the file as backing store and thus doesn't
> need to reserve actual physical memory. (In a way, this just adds
> additional swap space). Doesn't seem very clean though...
>
> Even if there's a way to work around a strict overcommit setting, unless
> the workaround is a syscall *explicitly* designed for that purpose, I'd
> be very careful with using it. You might just as well be exploiting a
> bug in the overcommit accounting logic and future kernel versions may
> simply choose to fix the bug...
>
> best regards,
> Florian Pflug

I'm 99% sure. When I was "playing" with mmap I preallocated, probably,
about 100GB of memory.

Regards,
Radek


Re: POSIX question

From
Florian Pflug
Date:
On Jun20, 2011, at 17:05 , Radosław Smogura wrote:
> I'm 99% sure. When I was "playing" with mmap I preallocated, probably,
> about 100GB of memory.

You need to set vm.overcommit_memory to "2" to see the difference. Did
you do that?

You can do that either with "echo 2 > /proc/sys/vm/overcommit_memory"
or by editing /etc/sysctl.conf and issuing "sysctl -p".
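
To check which mode a test actually ran under, the current value can be read back from the same /proc file. A small sketch, assuming Linux; read_overcommit_mode is a hypothetical helper name and returns -1 where the file is not available.

```c
#include <stdio.h>

/* Returns the current vm.overcommit_memory mode:
 *   0 = heuristic overcommit (the default), 1 = always overcommit,
 *   2 = strict accounting; -1 if the value cannot be read. */
static int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```

Printing this value next to the result of a large mmap makes it clear whether the mapping succeeded under strict accounting or only under the default.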

best regards,
Florian Pflug



Re: POSIX question

From
Andres Freund
Date:
On Monday, June 20, 2011 17:05:48 Radosław Smogura wrote:
> Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 17:01:40
>
> > On Jun20, 2011, at 16:39 , Radosław Smogura wrote:
> > > Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 16:16:58
> > >
> > >> On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
> > >>> 1. mmap some large amount of anonymous virtual memory (this will be
> > >>> maximum size of shared memory). ...
> > >>> Point 1. will not eat memory, as memory allocation is delayed and in
> > >>> 64bit platforms you may reserve quite huge chunk of this, and in
> > >>> future it may be possible using mmap / munmap to concat chunks /
> > >>> defrag it etc.
> > >>
> > >> I think this breaks with strict overcommit settings (i.e.
> > >> vm.overcommit_memory = 2 on linux). To fix that, you'd need a way to
> > >> tell the kernel (or glibc) to simply reserve a chunk of virtual
> > >> address space for further use. Not sure if there's an API for that...
> > >>
> > >> best regards,
> > >> Florian Pflug
> > >
> > > This may be achieved by many other things, like mmap /dev/null.
> >
> > Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?
> >
> > Or at least this is what I always thought glibc does when you malloc()
> > a large block at once. (This allows it to actually return the memory
> > to the kernel once you free() it, which isn't possible if the memory
> > was allocated simply by extending the heap).
> >
> > You can work around this by mmap()ing an actual file, because then
> > the kernel knows it can use the file as backing store and thus doesn't
> > need to reserve actual physical memory. (In a way, this just adds
> > additional swap space). Doesn't seem very clean though...
> >
> > Even if there's a way to work around a strict overcommit setting, unless
> > the workaround is a syscall *explicitly* designed for that purpose, I'd
> > be very careful with using it. You might just as well be exploiting a
> > bug in the overcommit accounting logic and future kernel versions may
> > simply choose to fix the bug...
> >
> > best regards,
> > Florian Pflug
>
> I'm 99% sure. When I was "playing" with mmap I preallocated, probably,
> about 100GB of memory.
The default setting is to allow overcommit.

Andres


Re: POSIX question

From
Greg Stark
Date:
On Mon, Jun 20, 2011 at 4:01 PM, Florian Pflug <fgp@phlo.org> wrote:
> Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?
>
> Or at least this is what I always thought glibc does when you malloc()

It mmaps /dev/zero actually.


-- 
greg


Re: POSIX question

From
Roger Leigh
Date:
On Mon, Jun 20, 2011 at 04:16:58PM +0200, Florian Pflug wrote:
> On Jun20, 2011, at 15:27 , Radosław Smogura wrote:
> > 1. mmap some large amount of anonymous virtual memory (this will be maximum size of shared memory).
> > ...
> > Point 1. will not eat memory, as memory allocation is delayed and in 64bit platforms you may reserve quite huge
> > chunk of this, and in future it may be possible using mmap / munmap to concat chunks / defrag it etc.
>
> I think this breaks with strict overcommit settings (i.e. vm.overcommit_memory = 2 on linux). To fix that, you'd need
> a way to tell the kernel (or glibc) to simply reserve a chunk of virtual address space for further use. Not sure if
> there's an API for that...

I run diskless, swapless cluster systems with zero overcommit (i.e.
it's entirely disabled), which means that all operations are
strict success/fail due to allocation being immediate.  mmap of a
large amount of anonymous memory would almost certainly fail on
such a setup--you definitely can't assume that a large anonymous
mmap will always succeed, since there is no delayed allocation.

[we do in reality have a small overcommit allowance to permit
efficient fork(2), but it's tiny and (in this context) irrelevant]

Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.

Re: POSIX question

From
Radosław Smogura
Date:
Florian Pflug <fgp@phlo.org> Monday 20 of June 2011 17:07:55
> On Jun20, 2011, at 17:05 , Radosław Smogura wrote:
> > I'm 99% sure. When I was "playing" with mmap I preallocated,
> > probably, about 100GB of memory.
>
> You need to set vm.overcommit_memory to "2" to see the difference. Did
> you do that?
>
> You can do that either with "echo 2 > /proc/sys/vm/overcommit_memory"
> or by editing /etc/sysctl.conf and issuing "sysctl -p".
>
> best regards,
> Florian Pflug
I've just created a 127TB mapping on Linux, the maximum allowed by the VM. I tried
it with overcommit set to 0, 1 and 2.

Regards,
Radek


Re: POSIX question

From
Andres Freund
Date:
On Monday, June 20, 2011 17:11:14 Greg Stark wrote:
> On Mon, Jun 20, 2011 at 4:01 PM, Florian Pflug <fgp@phlo.org> wrote:
> > Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory?
> > 
> > Or at least this is what I always thought glibc does when you malloc()
> 
> It mmaps /dev/zero actually.
As the nitpicking has already started: AFAIR it's just passing -1 as fd and
using the MAP_ANONYMOUS flag argument ;)
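
What that looks like in code, as a minimal sketch: an anonymous private mapping with fd set to -1, the same kind of call being described for large malloc() requests. The names anon_alloc and anon_free are hypothetical, and this is not glibc's actual code.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Get zero-filled memory straight from the kernel, without /dev/zero. */
static void *anon_alloc(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hand the memory back to the kernel. Returns 0 on success. */
static int anon_free(void *p, size_t size)
{
    return munmap(p, size);
}
```

Because the block is its own mapping, freeing it returns the pages to the kernel immediately, which is the property mentioned earlier in the thread.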

Andres


Re: POSIX question

From
Markus Wanner
Date:
Radek,

On 06/20/2011 03:27 PM, Radosław Smogura wrote:
> When playing with mmap I worked out an approach for dealing with growing
> files, so...

Your approach seems to require a SysV alloc (for nattach) as well as
POSIX shmem and/or mmap.  Adding requirements for these syscalls
certainly needs to give a good benefit for Postgres, as they presumably
pose portability issues.

> 3. a. Lock header when adding chunks (1st chunk is header) (we don't
> want concurrent chunk allocation)

Sure we don't?  There are at least a dozen memory allocators for
multi-threaded applications, all trying to optimize for concurrency.
The programmer of a multi-threaded application doesn't need to care much
about concurrent allocations.  He can allocate (and free) quite a lot of
tiny chunks concurrently from shared memory.
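
To illustrate the point about concurrent allocation, here is a small lock-free bump allocator over a fixed region, using C11 atomics. It is only a sketch: arena_header, arena_init and arena_alloc are hypothetical names, there is no free(), and using it across processes assumes that atomic_size_t is lock-free on the platform. Real concurrent allocators are considerably more involved.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Lives at the start of the region; the usable bytes follow it. */
typedef struct {
    atomic_size_t used;      /* bytes handed out so far */
    size_t        capacity;  /* usable bytes after the header */
} arena_header;

static void arena_init(arena_header *h, size_t capacity)
{
    atomic_init(&h->used, 0);
    h->capacity = capacity;
}

/* Claim `size` bytes (rounded up to a multiple of 16) without a lock.
 * Concurrent callers race on the compare-and-swap; losers retry. */
static void *arena_alloc(arena_header *h, size_t size)
{
    if (size == 0 || size > h->capacity)
        return NULL;
    size = (size + 15) & ~(size_t) 15;
    size_t old = atomic_load(&h->used);
    do {
        if (size > h->capacity - old)
            return NULL;                 /* region exhausted */
    } while (!atomic_compare_exchange_weak(&h->used, &old, old + size));
    return (unsigned char *) (h + 1) + old;
}
```

Each successful call returns a distinct range, so no lock is held while chunks are handed out.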

Regards

Markus Wanner