Thread: [PATCHES] A patch for xlog.c

[PATCHES] A patch for xlog.c

From
Bruce Momjian
Date:
[ Send to hackers]

> I'd be willing to consider using mmap as a compile-time option if it
> can be shown to be a substantial performance win where it's available.
> (I suspect that's a very big "if".)  If it's not a substantial win,
> I don't think we should accept the change --- the portability risks and
> testing/maintenance costs loom too large for me.
> 

I was considering it because you can use a much larger amount of shared
memory without reconfiguring the kernel.

> BTW, how exactly is mmap a substitute for SysV shared memory?  AFAICT
> it's only defined to map a disk file into your address space, not to
> allow a shared memory region to be set up that's independent of any
> disk file.

It allows no backing store on disk.  It is the BSD solution to SysV
shared memory.  Here are all the BSDi flags:

    MAP_ANON    Map anonymous memory not associated with any specific file.
                The file descriptor used for creating MAP_ANON must be -1.
                The offset parameter is ignored.

    MAP_FIXED   Do not permit the system to select a different address than
                the one specified.  If the specified address cannot be used,
                mmap will fail.  If MAP_FIXED is specified, addr must be a
                multiple of the pagesize.  Use of this option is discouraged.

    MAP_PRIVATE Modifications are private.

    MAP_SHARED  Modifications are shared.

We would use MAP_ANON|MAP_SHARED I guess.
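
For illustration, here is a minimal sketch of what MAP_ANON|MAP_SHARED
might look like in use, assuming a platform that offers anonymous mappings
(some spell the flag MAP_ANONYMOUS).  The helper names anon_shared_create
and anon_shared_roundtrip are hypothetical and do not come from any
PostgreSQL patch; the point is only that a region created before fork()
is visible to the child:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1       /* glibc: expose MAP_ANON in strict modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Some systems spell the flag MAP_ANONYMOUS rather than MAP_ANON. */
#if !defined(MAP_ANON) && defined(MAP_ANONYMOUS)
#define MAP_ANON MAP_ANONYMOUS
#endif

/*
 * Hypothetical helper: create an anonymous shared region of "size" bytes.
 * Returns NULL on failure.
 */
void *
anon_shared_create(size_t size)
{
    void   *p;

    assert(size > 0);
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_ANON | MAP_SHARED, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/*
 * Hypothetical demo: a forked child stores "value" in the region and the
 * parent reads it back once the child has exited.  Returns the value the
 * parent sees, or -1 on error.
 */
int
anon_shared_roundtrip(int value)
{
    int    *shared = anon_shared_create(sizeof(int));
    pid_t   pid;
    int     seen;

    if (shared == NULL)
        return -1;
    *shared = 0;

    pid = fork();
    if (pid < 0)
    {
        munmap(shared, sizeof(int));
        return -1;
    }
    if (pid == 0)
    {
        *shared = value;        /* reaches the parent because of MAP_SHARED */
        _exit(0);
    }
    if (waitpid(pid, NULL, 0) < 0)
    {
        munmap(shared, sizeof(int));
        return -1;
    }
    seen = *shared;
    munmap(shared, sizeof(int));
    return seen;
}
```

Calling anon_shared_roundtrip(42) should hand back 42 when the mapping is
genuinely shared; with MAP_PRIVATE instead, the parent would still see 0.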

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: [PATCHES] A patch for xlog.c

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> It allows no backing store on disk.  It is the BSD solution to SysV
> shared memory.  Here are all the BSDi flags:

>      MAP_ANON    Map anonymous memory not associated with any specific file.
>                  The file descriptor used for creating MAP_ANON must be -1.
>                  The offset parameter is ignored.

Hmm.  Now that I read down to the "nonstandard extensions" part of the
HPUX man page for mmap(), I find

    If MAP_ANONYMOUS is set in flags:

         o    A new memory region is created and initialized to all zeros.
              This memory region can be shared only with descendants of
              the current process.
 

While I've said before that I don't think it's really necessary for
processes that aren't children of the postmaster to access the shared
memory, I'm not sure that I want to go over to a mechanism that makes it
*impossible* for that to be done.  Especially not if the only motivation
is to avoid having to configure the kernel's shared memory settings.

Besides, what makes you think there's not a limit on the size of shmem
allocatable via mmap()?
        regards, tom lane


Re: [PATCHES] A patch for xlog.c

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I have had this item on the TODO list for a while:
>     * Use mmap() rather than SYSV shared memory(?)
> Should I remove it?

It's fine as long as it's got that question mark on it ;-).
I don't say we *shouldn't* do this, I'm just raising questions
that would need to be answered.
        regards, tom lane


Re: [PATCHES] A patch for xlog.c

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > It allows no backing store on disk.  It is the BSD solution to SysV
> > shared memory.  Here are all the BSDi flags:
> 
> >      MAP_ANON    Map anonymous memory not associated with any specific file.
> >                  The file descriptor used for creating MAP_ANON must be -1.
> >                  The offset parameter is ignored.
> 
> Hmm.  Now that I read down to the "nonstandard extensions" part of the
> HPUX man page for mmap(), I find
> 
>      If MAP_ANONYMOUS is set in flags:
> 
>           o    A new memory region is created and initialized to all zeros.
>                This memory region can be shared only with descendants of
>                the current process.
> 
> While I've said before that I don't think it's really necessary for
> processes that aren't children of the postmaster to access the shared
> memory, I'm not sure that I want to go over to a mechanism that makes it
> *impossible* for that to be done.  Especially not if the only motivation
> is to avoid having to configure the kernel's shared memory settings.

Agreed.  It would make that impossible, which is a possible limitation.

> Besides, what makes you think there's not a limit on the size of shmem
> allocatable via mmap()?

I figured mmap() was different from SysV because mmap() is file based.

I have had this item on the TODO list for a while:
* Use mmap() rather than SYSV shared memory(?)

Should I remove it?

 


Re: [PATCHES] A patch for xlog.c

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I have had this item on the TODO list for a while:
> >     * Use mmap() rather than SYSV shared memory(?)
> > Should I remove it?
> 
> It's fine as long as it's got that question mark on it ;-).
> I don't say we *shouldn't* do this, I'm just raising questions
> that would need to be answered.

Yea, it is one of those question mark things.

 


Re: Re: [PATCHES] A patch for xlog.c

From
ncm@zembu.com (Nathan Myers)
Date:
On Sun, Feb 25, 2001 at 11:28:46PM -0500, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > It allows no backing store on disk.  

I.e. it allows you to map memory without an associated inode; the memory
may still be swapped.  Of course, there is no problem with mapping an 
inode too, so that unrelated processes can join in.  Solaris has a flag
to pin the shared pages in RAM so they can't be swapped out.

> > It is the BSD solution to SysV
> > shared memory.  Here are all the BSDi flags:
> 
> >      MAP_ANON    Map anonymous memory not associated with any specific
> >                  file.  The file descriptor used for creating MAP_ANON
> >                  must be -1.  The offset parameter is ignored.
> 
> Hmm.  Now that I read down to the "nonstandard extensions" part of the
> HPUX man page for mmap(), I find
> 
>      If MAP_ANONYMOUS is set in flags:
> 
>           o    A new memory region is created and initialized to all zeros.
>                This memory region can be shared only with descendants of
>                the current process.

This is supported on Linux and BSD, but not on Solaris 7.  It's not 
necessary; you can just map /dev/zero on SysV systems that don't 
have MAP_ANON.
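
A minimal sketch of the /dev/zero approach, assuming the system allows
MAP_SHARED mappings of that device; devzero_shared_create is a made-up
name used only for this example.  As with MAP_ANON, the resulting region
can be passed on to children across fork() but not to unrelated processes:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: obtain "size" bytes of zero-filled shared memory by
 * mapping /dev/zero.  Returns NULL on failure.
 */
void *
devzero_shared_create(size_t size)
{
    int     fd;
    void   *p;

    assert(size > 0);
    fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return NULL;

    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
    return (p == MAP_FAILED) ? NULL : p;
}
```

The caller releases the region with munmap() using the same size.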

> While I've said before that I don't think it's really necessary for
> processes that aren't children of the postmaster to access the shared
> memory, I'm not sure that I want to go over to a mechanism that makes it
> *impossible* for that to be done.  Especially not if the only motivation
> is to avoid having to configure the kernel's shared memory settings.

There are enormous advantages to avoiding the need to configure kernel 
settings.  It makes PG a better citizen.  PG is much easier to drop in 
and use if you don't need attention from the IT department.

But I don't know of any reason to avoid mapping an actual inode,
so using mmap doesn't necessarily mean giving up sharing among
unrelated processes.
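
A sketch of the inode-backed variant, under the assumption that a plain
file at an agreed path is acceptable as the rendezvous point.  The
function name file_shared_attach and its path argument are illustrative
only; any process that can open the file can attach, related or not:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1       /* glibc: expose ftruncate in strict modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: attach to a shared region backed by the file at
 * "path", creating the file and growing it to "size" bytes if needed.
 * Returns NULL on failure.
 */
void *
file_shared_attach(const char *path, size_t size)
{
    int     fd;
    void   *p;

    assert(path != NULL && size > 0);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;

    /* Make sure the file is large enough to back the whole mapping. */
    if (ftruncate(fd, (off_t) size) < 0)
    {
        close(fd);
        return NULL;
    }

    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
    return (p == MAP_FAILED) ? NULL : p;
}
```

Two independent calls with the same path see each other's writes, which
is what lets unrelated processes join in.  When and how often the kernel
writes dirty pages back to the file is system-dependent and not something
this sketch controls.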

> Besides, what makes you think there's not a limit on the size of shmem
> allocatable via mmap()?

I've never seen any mmap limit documented.  Since mmap() is how 
everybody implements shared libraries, such a limit would be equivalent 
to a limit on how much/many shared libraries are used.  mmap() with 
MAP_ANONYMOUS (or its SysV /dev/zero equivalent) is a common, modern 
way to get raw storage for malloc(), so such a limit would be a limit
on malloc() too.

The mmap architecture comes to us from the Mach microkernel memory
manager, backported into BSD and then copied widely.  Since it was
the fundamental mechanism for all memory operations in Mach, arbitrary
limits would make no sense.  That it worked so well is the reason it 
was copied everywhere else, so adding arbitrary limits while copying 
it would be silly.  I don't think we'll see any systems like that.

Nathan Myers
ncm@zembu.com


Re: Re: [PATCHES] A patch for xlog.c

From
The Hermit Hacker
Date:
On Mon, 26 Feb 2001, Nathan Myers wrote:

> > While I've said before that I don't think it's really necessary for
> > processes that aren't children of the postmaster to access the shared
> > memory, I'm not sure that I want to go over to a mechanism that makes it
> > *impossible* for that to be done.  Especially not if the only motivation
> > is to avoid having to configure the kernel's shared memory settings.
>
> There are enormous advantages to avoiding the need to configure kernel
> settings.  It makes PG a better citizen.  PG is much easier to drop in
> and use if you don't need attention from the IT department.

Is there a reason why Oracle still uses shared memory and hasn't moved to
mmap()?  Are there advantages to it that we aren't seeing, or is Oracle
just too much of a behemoth for that sort of overhaul?  Don't go with the
quick answer either ...

> > Besides, what makes you think there's not a limit on the size of shmem
> > allocatable via mmap()?
>
> I've never seen any mmap limit documented.  Since mmap() is how
> everybody implements shared libraries, such a limit would be equivalent
> to a limit on how much/many shared libraries are used.

There are/will be limits based on how an admin sets his/her per user
datasize limits on their OS ...




Re: Re: [PATCHES] A patch for xlog.c

From
Bruce Momjian
Date:
> On Sun, Feb 25, 2001 at 11:28:46PM -0500, Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > It allows no backing store on disk.  
> 
> I.e. it allows you to map memory without an associated inode; the memory
> may still be swapped.  Of course, there is no problem with mapping an 
> inode too, so that unrelated processes can join in.  Solaris has a flag
> to pin the shared pages in RAM so they can't be swapped out.

We don't want to generate i/o to disk just for shared memory
modifications, that is why we can't use a disk file.

> 
> > > It is the BSD solution to SysV
> > > shared memory.  Here are all the BSDi flags:
> > 
> > >      MAP_ANON    Map anonymous memory not associated with any specific
> > >                  file.  The file descriptor used for creating MAP_ANON
> > >                  must be -1.  The offset parameter is ignored.
> > 
> > Hmm.  Now that I read down to the "nonstandard extensions" part of the
> > HPUX man page for mmap(), I find
> > 
> >      If MAP_ANONYMOUS is set in flags:
> > 
> >           o    A new memory region is created and initialized to all zeros.
> >                This memory region can be shared only with descendants of
> >                the current process.
> 
> This is supported on Linux and BSD, but not on Solaris 7.  It's not 
> necessary; you can just map /dev/zero on SysV systems that don't 
> have MAP_ANON.

Oh, really.  Yes, I have seen people do that.

> > While I've said before that I don't think it's really necessary for
> > processes that aren't children of the postmaster to access the shared
> > memory, I'm not sure that I want to go over to a mechanism that makes it
> > *impossible* for that to be done.  Especially not if the only motivation
> > is to avoid having to configure the kernel's shared memory settings.
> 
> There are enormous advantages to avoiding the need to configure kernel 
> settings.  It makes PG a better citizen.  PG is much easier to drop in 
> and use if you don't need attention from the IT department.

One big advantage is that mmap() removes itself when all processes using
it exit, while SysV stays around and has to be cleaned up manually in
some cases.
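
To make the cleanup difference concrete, here is a small sketch of the
SysV lifecycle, with hypothetical helper names sysv_shared_create and
sysv_shared_destroy.  The explicit IPC_RMID step is the part that has no
counterpart with an anonymous mmap() region, which simply goes away with
its last mapping:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Hypothetical helper: create a private SysV segment of "size" bytes and
 * attach it.  Stores the segment id in *shmid_out.  Returns NULL on failure.
 */
void *
sysv_shared_create(size_t size, int *shmid_out)
{
    int     shmid;
    void   *p;

    assert(size > 0 && shmid_out != NULL);
    shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (shmid < 0)
        return NULL;

    p = shmat(shmid, NULL, 0);
    if (p == (void *) -1)
    {
        shmctl(shmid, IPC_RMID, NULL);  /* do not leak the segment */
        return NULL;
    }
    *shmid_out = shmid;
    return p;
}

/*
 * Hypothetical helper: detach and explicitly remove the segment.  Without
 * the IPC_RMID call the segment would remain allocated in the kernel even
 * after every process using it has exited.  Returns 0 on success.
 */
int
sysv_shared_destroy(void *addr, int shmid)
{
    if (shmdt(addr) < 0)
        return -1;
    return shmctl(shmid, IPC_RMID, NULL);
}
```

If a process crashes between create and destroy, the segment is left
behind and has to be removed by hand, for example with ipcrm.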

> But I don't know of any reason to avoid mapping an actual inode,
> so using mmap doesn't necessarily mean giving up sharing among
> unrelated processes.

See above.

> 
> > Besides, what makes you think there's not a limit on the size of shmem
> > allocatable via mmap()?
> 
> I've never seen any mmap limit documented.  Since mmap() is how 
> everybody implements shared libraries, such a limit would be equivalent 
> to a limit on how much/many shared libraries are used.  mmap() with 
> MAP_ANONYMOUS (or its SysV /dev/zero equivalent) is a common, modern 
> way to get raw storage for malloc(), so such a limit would be a limit
> on malloc() too.
> 
> The mmap architecture comes to us from the Mach microkernel memory
> manager, backported into BSD and then copied widely.  Since it was
> the fundamental mechanism for all memory operations in Mach, arbitrary
> limits would make no sense.  That it worked so well is the reason it 
> was copied everywhere else, so adding arbitrary limits while copying 
> it would be silly.  I don't think we'll see any systems like that.

This is encouraging.

 


Re: Re: [PATCHES] A patch for xlog.c

From
Tom Lane
Date:
ncm@zembu.com (Nathan Myers) writes:
> This is supported on Linux and BSD, but not on Solaris 7.  It's not 
> necessary; you can just map /dev/zero on SysV systems that don't 
> have MAP_ANON.

HPUX says:
    The mmap() function is supported for regular files.  Support for any
    other type of file is unspecified.

> But I don't know of any reason to avoid mapping an actual inode,

How about wasted I/O due to the kernel thinking it needs to reflect
writes to the memory region back out to the underlying file?

> Since mmap() is how everybody implements shared libraries,

Now *there's* a sweeping generalization.  Documentation of this
claim, please?

> The mmap architecture comes to us from the Mach microkernel memory
> manager, backported into BSD and then copied widely.

If everyone copied the Mach implementation, why is it they don't even
agree on the spellings of the user-visible flags?


This looks a lot like exchanging the devil we know (SysV shmem) for a
devil we don't know.  Do I need to remind you about, for example, the
mmap bugs in early Linux releases?  (I still vividly remember having to
abandon mmap on a project a few years back that needed to be portable
to Linux.  Perhaps that colors my opinions here.)  I don't think the
problems with shmem are sufficiently large to justify venturing into
a whole new terra incognita of portability issues and kernel bugs.
        regards, tom lane


Re: Re: [PATCHES] A patch for xlog.c

From
Ian Lance Taylor
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> > Since mmap() is how everybody implements shared libraries,
> 
> Now *there's* a sweeping generalization.  Documentation of this
> claim, please?

I've seen a lot of shared library implementations (I used to be the
GNU binutils maintainer), and Nathan is approximately correct.  Most
ELF systems use a dynamic linker inherited from the original SVR4
implementation, which uses mmap.  You can see this by running strace
on an SVR4 system.  The *BSD and GNU dynamic linker implementations
are of course independently derived, but they use mmap too.

mmap is the natural way to implement ELF style shared libraries.  The
basic operation you have to do is to map the shared library into the
process memory space, and then to process a few relocations.  Mapping
the shared library in can be done either using mmap, or using
open/read/close.  For a large file, mmap is going to be much faster
than open/read/close, because it doesn't require actually reading the
file.
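
As a rough illustration of the mmap side of that comparison, the sketch
below maps a whole file read-only in one call.  It is not taken from any
dynamic linker; map_file_readonly is a made-up name, and a real loader
would map individual segments with different protections and then apply
relocations:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map the file at "path" read-only and store its
 * length in *len_out.  No file data is copied at this point; pages are
 * faulted in only when they are first touched.  Returns NULL on failure
 * or if the file is empty.
 */
const void *
map_file_readonly(const char *path, size_t *len_out)
{
    struct stat st;
    void       *p;
    int         fd;

    assert(path != NULL && len_out != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0 || st.st_size == 0)
    {
        close(fd);
        return NULL;
    }

    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t) st.st_size;
    return p;
}
```

The open/read/close alternative would need a buffer of the full file size
and a loop copying every byte into it before the first use.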

There are, of course, many non-ELF shared libraries implementations.
SVR3 does not use mmap.  SunOS does use mmap (SunOS shared libraries
were taken into SVR4 and the ELF standard).  I don't know offhand
about AIX, Digital Unix, or Windows.

mmap is standardized by the most recent version of POSIX.1.

Ian


Re[2]: Re: [PATCHES] A patch for xlog.c

From
jamexu
Date:
Hello Tom,

Tuesday, February 27, 2001, 12:23:25 AM, you wrote:

TL> This looks a lot like exchanging the devil we know (SysV shmem) for a
TL> devil we don't know.  Do I need to remind you about, for example, the
TL> mmap bugs in early Linux releases?  (I still vividly remember having to
TL> abandon mmap on a project a few years back that needed to be portable
TL> to Linux.  Perhaps that colors my opinions here.)  I don't think the
TL> problems with shmem are sufficiently large to justify venturing into
TL> a whole new terra incognita of portability issues and kernel bugs.

TL>                         regards, tom lane

The only problem is that if we need to tune the postmaster to use
large buffers while the system doesn't have that much SysV shared memory,
then on many systems we need to recompile the OS kernel.  This is a small
problem for installing PGSQL in a production environment.

-- 
Best regards,
XuYifeng




Re[2]: Re: [PATCHES] A patch for xlog.c

From
The Hermit Hacker
Date:
On Tue, 27 Feb 2001, jamexu wrote:

> Hello Tom,
>
> Tuesday, February 27, 2001, 12:23:25 AM, you wrote:
>
> TL> This looks a lot like exchanging the devil we know (SysV shmem) for a
> TL> devil we don't know.  Do I need to remind you about, for example, the
> TL> mmap bugs in early Linux releases?  (I still vividly remember having to
> TL> abandon mmap on a project a few years back that needed to be portable
> TL> to Linux.  Perhaps that colors my opinions here.)  I don't think the
> TL> problems with shmem are sufficiently large to justify venturing into
> TL> a whole new terra incognita of portability issues and kernel bugs.
>
> TL>                         regards, tom lane
>
> The only problem is that if we need to tune the postmaster to use
> large buffers while the system doesn't have that much SysV shared memory,
> then on many systems we need to recompile the OS kernel.  This is a small
> problem for installing PGSQL in a production environment.

What?  You don't automatically recompile your OS kernel when you build a
system in the first place??  First step on any OS install of FreeBSD is to
rid myself of the 'extras' that are in the generic kernel, and enable
SharedMemory (even if I'm not using PgSQL on that machine) ...




Re: Re[2]: Re: [PATCHES] A patch for xlog.c

From
Tom Lane
Date:
>> The only problem is that if we need to tune the postmaster to use
>> large buffers while the system doesn't have that much SysV shared memory,
>> then on many systems we need to recompile the OS kernel.  This is a small
>> problem for installing PGSQL in a production environment.

Of course, if you haven't got mmap(), a recompile won't help ...

I'd be somewhat more enthusiastic about mmap if I thought we could
abandon the SysV shmem support completely, but I don't foresee that
happening for a long while yet.
        regards, tom lane


Re[3]: Re: [PATCHES] A patch for xlog.c

From
jamexu
Date:
Hello The,

Tuesday, February 27, 2001, 11:00:05 AM, you wrote:

THH> On Tue, 27 Feb 2001, jamexu wrote:

>> Hello Tom,
>>
>> Tuesday, February 27, 2001, 12:23:25 AM, you wrote:
>>
>> TL> This looks a lot like exchanging the devil we know (SysV shmem) for a
>> TL> devil we don't know.  Do I need to remind you about, for example, the
>> TL> mmap bugs in early Linux releases?  (I still vividly remember having to
>> TL> abandon mmap on a project a few years back that needed to be portable
>> TL> to Linux.  Perhaps that colors my opinions here.)  I don't think the
>> TL> problems with shmem are sufficiently large to justify venturing into
>> TL> a whole new terra incognita of portability issues and kernel bugs.
>>
>> TL>                         regards, tom lane
>>
>> The only problem is that if we need to tune the postmaster to use
>> large buffers while the system doesn't have that much SysV shared memory,
>> then on many systems we need to recompile the OS kernel.  This is a small
>> problem for installing PGSQL in a production environment.

THH> What?  You don't automatically recompile your OS kernel when you build a
THH> system in the first place??  First step on any OS install of FreeBSD is to
THH> rid myself of the 'extras' that are in the generic kernel, and enable
THH> SharedMemory (even if I'm not using PgSQL on that machine) ...

Heihei, why do you think users are always using FreeBSD and not other
UNIX systems?
Your assumption is false.

---
Xu Yifeng




Re[3]: Re: [PATCHES] A patch for xlog.c

From
The Hermit Hacker
Date:
On Tue, 27 Feb 2001, jamexu wrote:

> Hello The,
>
> Tuesday, February 27, 2001, 11:00:05 AM, you wrote:
>
> THH> On Tue, 27 Feb 2001, jamexu wrote:
>
> >> Hello Tom,
> >>
> >> Tuesday, February 27, 2001, 12:23:25 AM, you wrote:
> >>
> >> TL> This looks a lot like exchanging the devil we know (SysV shmem) for a
> >> TL> devil we don't know.  Do I need to remind you about, for example, the
> >> TL> mmap bugs in early Linux releases?  (I still vividly remember having to
> >> TL> abandon mmap on a project a few years back that needed to be portable
> >> TL> to Linux.  Perhaps that colors my opinions here.)  I don't think the
> >> TL> problems with shmem are sufficiently large to justify venturing into
> >> TL> a whole new terra incognita of portability issues and kernel bugs.
> >>
> >> TL>                         regards, tom lane
> >>
> >> The only problem is that if we need to tune the postmaster to use
> >> large buffers while the system doesn't have that much SysV shared memory,
> >> then on many systems we need to recompile the OS kernel.  This is a small
> >> problem for installing PGSQL in a production environment.
>
> THH> What?  You don't automatically recompile your OS kernel when you build a
> THH> system in the first place??  First step on any OS install of FreeBSD is to
> THH> rid myself of the 'extras' that are in the generic kernel, and enable
> THH> SharedMemory (even if I'm not using PgSQL on that machine) ...
>
> Heihei, why do you think users are always using FreeBSD and not other
> UNIX systems?
> Your assumption is false.

I don't ... I personally admin FreeBSD and Solaris boxen ... FreeBSD,
first step is to always recompile the kernel after an install, to get rid
of crud and add Shared Memory ... the Solaris boxes, you add a couple of
lines to /etc/system and reboot, and you have Shared Memory ...

I don't know about other 'commercial OSs', but I'd be shocked if a Linux
admin never does any kernel config cleanup before going production *shrug*





Re: Re[2]: Re: [PATCHES] A patch for xlog.c

From
Bruce Momjian
Date:
> > The only problem is that if we need to tune the postmaster to use
> > large buffers while the system doesn't have that much SysV shared memory,
> > then on many systems we need to recompile the OS kernel.  This is a small
> > problem for installing PGSQL in a production environment.
> 
> What?  You don't automatically recompile your OS kernel when you build a
> system in the first place??  First step on any OS install of FreeBSD is to
> rid myself of the 'extras' that are in the generic kernel, and enable
> SharedMemory (even if I'm not using PgSQL on that machine) ...

He is saying the machine is already in production.  Suppose he has run
PostgreSQL for a few months, then needs to increase number of buffers. 
He can't exceed the kernel limit unless he recompiles.

 


Re: [PATCHES] A patch for xlog.c

From
Thomas Lockhart
Date:
> I don't know about other 'commercial OSs', but I'd be shocked if a Linux
> admin never does any kernel config cleanup before going production *shrug*

oops...
                    - Thomas


Re: Re[3]: Re: [PATCHES] A patch for xlog.c

From
Peter Eisentraut
Date:
The Hermit Hacker writes:

> I don't ... I personally admin FreeBSD and Solaris boxen ... FreeBSD,
> first step is to always recompile the kernel after an install, to get rid
> of crud and add Shared Memory ... the Solaris boxes, you add a couple of
> lines to /etc/system and reboot, and you have Shared Memory ...
>
> I don't know about other 'commercial OSs', but I'd be shocked if a Linux
> admin never does any kernel config cleanup before going production *shrug*

Linux allows you to load and unload kernel modules, while the system is
running, to add and remove stuff as you need it.  But this is moot because
Linux also allows you to increase shared memory (up to the total
addressable memory)  while the system is running.  Recompiling Linux
kernels is a thing of the past with modern distributions.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/



Re: Re[3]: Re: [PATCHES] A patch for xlog.c

From
The Hermit Hacker
Date:
On Tue, 27 Feb 2001, Peter Eisentraut wrote:

> The Hermit Hacker writes:
>
> > I don't ... I personally admin FreeBSD and Solaris boxen ... FreeBSD,
> > first step is to always recompile the kernel after an install, to get rid
> > of crud and add Shared Memory ... the Solaris boxes, you add a couple of
> > lines to /etc/system and reboot, and you have Shared Memory ...
> >
> > I don't know about other 'commercial OSs', but I'd be shocked if a Linux
> > admin never does any kernel config cleanup before going production *shrug*
>
> Linux allows you to load and unload kernel modules, while the system is
> running, to add and remove stuff as you need it.  But this is moot because
> Linux also allows you to increase shared memory (up to the total
> addressable memory)  while the system is running.  Recompiling Linux
> kernels is a thing of the past with modern distributions.

Actually, just found that out for FreeBSD too *sigh*  You do have to
enable SYSV* in the kernel itself, but increasing shared memory and
semaphores is a simple sysctl that can be run while the system is live ...