Thread: [PATCHES] A patch for xlog.c
> I'd be willing to consider using mmap as a compile-time option if it
> can be shown to be a substantial performance win where it's available.
> (I suspect that's a very big "if".) If it's not a substantial win,
> I don't think we should accept the change --- the portability risks and
> testing/maintenance costs loom too large for me.

I was considering it because you can use a much larger amount of shared memory without reconfiguring the kernel.

> BTW, how exactly is mmap a substitute for SysV shared memory? AFAICT
> it's only defined to map a disk file into your address space, not to
> allow a shared memory region to be set up that's independent of any
> disk file.

It allows no backing store on disk. It is the BSD solution to SysV shared memory. Here are all the BSDi flags:

MAP_ANON     Map anonymous memory not associated with any specific file. The file descriptor used for creating MAP_ANON must be -1. The offset parameter is ignored.
MAP_FIXED    Do not permit the system to select a different address than the one specified. If the specified address cannot be used, mmap will fail. If MAP_FIXED is specified, addr must be a multiple of the pagesize. Use of this option is discouraged.
MAP_PRIVATE  Modifications are private.
MAP_SHARED   Modifications are shared.

We would use MAP_ANON|MAP_SHARED I guess.

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
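For readers following along, here is a minimal sketch of what the MAP_ANON|MAP_SHARED combination would look like in practice. It is not taken from any PostgreSQL patch. The helper name anon_shared_roundtrip is hypothetical, and the sketch assumes a system where fork() children inherit the mapping, which is the only kind of sharing an anonymous mapping offers.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* glibc: expose MAP_ANON under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The flag is spelled MAP_ANON on BSD and MAP_ANONYMOUS on HP-UX. */
#ifndef MAP_ANON
#define MAP_ANON MAP_ANONYMOUS
#endif

/*
 * Hypothetical helper: create an anonymous shared mapping, let a forked
 * child store `value` into it, and return what the parent reads back.
 * Returns -1 on any failure, so `value` must be non-negative.
 */
int anon_shared_roundtrip(int value)
{
    assert(value >= 0);

    /* fd must be -1 and the offset is ignored for anonymous memory. */
    int *cell = mmap(NULL, sizeof *cell, PROT_READ | PROT_WRITE,
                     MAP_ANON | MAP_SHARED, -1, 0);
    if (cell == MAP_FAILED)
        return -1;

    *cell = 0;
    pid_t pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    if (pid == 0) {
        *cell = value;          /* child writes through the shared page */
        _exit(0);
    }

    int status = 0;
    int seen = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        seen = *cell;           /* parent sees the child's store */
    munmap(cell, sizeof *cell);
    return seen;
}
```

Calling anon_shared_roundtrip(42) should hand back 42 wherever anonymous shared mappings work. With MAP_PRIVATE in place of MAP_SHARED the parent would still read 0, because the child's store would land in its own private copy of the page.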
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> It allows no backing store on disk. It is the BSD solution to SysV
> share memory. Here are all the BSDi flags:

> MAP_ANON Map anonymous memory not associated with any specific file.
> The file descriptor used for creating MAP_ANON must be -1.
> The offset parameter is ignored.

Hmm. Now that I read down to the "nonstandard extensions" part of the HPUX man page for mmap(), I find

    If MAP_ANONYMOUS is set in flags:

    o A new memory region is created and initialized to all zeros.
      This memory region can be shared only with descendants of
      the current process.

While I've said before that I don't think it's really necessary for processes that aren't children of the postmaster to access the shared memory, I'm not sure that I want to go over to a mechanism that makes it *impossible* for that to be done. Especially not if the only motivation is to avoid having to configure the kernel's shared memory settings.

Besides, what makes you think there's not a limit on the size of shmem allocatable via mmap()?

regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I have had this item on the TODO list for a while: > * Use mmap() rather than SYSV shared memory(?) > Should I remove it? It's fine as long as it's got that question mark on it ;-). I don't say we *shouldn't* do this, I'm just raising questions that would need to be answered. regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > It allows no backing store on disk. It is the BSD solution to SysV
> > share memory. Here are all the BSDi flags:
>
> > MAP_ANON Map anonymous memory not associated with any specific file.
> > The file descriptor used for creating MAP_ANON must be -1.
> > The offset parameter is ignored.
>
> Hmm. Now that I read down to the "nonstandard extensions" part of the
> HPUX man page for mmap(), I find
>
> If MAP_ANONYMOUS is set in flags:
>
> o A new memory region is created and initialized to all zeros.
> This memory region can be shared only with descendants of
> the current process.
>
> While I've said before that I don't think it's really necessary for
> processes that aren't children of the postmaster to access the shared
> memory, I'm not sure that I want to go over to a mechanism that makes it
> *impossible* for that to be done. Especially not if the only motivation
> is to avoid having to configure the kernel's shared memory settings.

Agreed. It would make that impossible, which could turn out to be a limitation.

> Besides, what makes you think there's not a limit on the size of shmem
> allocatable via mmap()?

I figured mmap() was different from SysV because mmap() is file based.

I have had this item on the TODO list for a while:

* Use mmap() rather than SYSV shared memory(?)

Should I remove it?

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
> Bruce Momjian <pgman@candle.pha.pa.us> writes: > > I have had this item on the TODO list for a while: > > * Use mmap() rather than SYSV shared memory(?) > > Should I remove it? > > It's fine as long as it's got that question mark on it ;-). > I don't say we *shouldn't* do this, I'm just raising questions > that would need to be answered. Yea, it is one of those question mark things. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
On Sun, Feb 25, 2001 at 11:28:46PM -0500, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > It allows no backing store on disk. I.e. it allows you to map memory without an associated inode; the memory may still be swapped. Of course, there is no problem with mapping an inode too, so that unrelated processes can join in. Solarix has a flag to pin the shared pages in RAM so they can't be swapped out. > > It is the BSD solution to SysV > > share memory. Here are all the BSDi flags: > > > MAP_ANON Map anonymous memory not associated with any specific > > file. The file descriptor used for creating MAP_ANON > > must be -1. The offset parameter is ignored. > > Hmm. Now that I read down to the "nonstandard extensions" part of the > HPUX man page for mmap(), I find > > If MAP_ANONYMOUS is set in flags: > > o A new memory region is created and initialized to all zeros. > This memory region can be shared only with descendants of > the current process. This is supported on Linux and BSD, but not on Solarix 7. It's not necessary; you can just map /dev/zero on SysV systems that don't have MAP_ANON. > While I've said before that I don't think it's really necessary for > processes that aren't children of the postmaster to access the shared > memory, I'm not sure that I want to go over to a mechanism that makes it > *impossible* for that to be done. Especially not if the only motivation > is to avoid having to configure the kernel's shared memory settings. There are enormous advantages to avoiding the need to configure kernel settings. It makes PG a better citizen. PG is much easier to drop in and use if you don't need attention from the IT department. But I don't know of any reason to avoid mapping an actual inode, so using mmap doesn't necessarily mean giving up sharing among unrelated processes. > Besides, what makes you think there's not a limit on the size of shmem > allocatable via mmap()? I've never seen any mmap limit documented. 
Since mmap() is how everybody implements shared libraries, such a limit would be equivalent to a limit on how much/many shared libraries are used. mmap() with MAP_ANONYMOUS (or its SysV /dev/zero equivalent) is a common, modern way to get raw storage for malloc(), so such a limit would be a limit on malloc() too. The mmap architecture comes to us from the Mach microkernel memory manager, backported into BSD and then copied widely. Since it was the fundamental mechanism for all memory operations in Mach, arbitrary limits would make no sense. That it worked so well is the reason it was copied everywhere else, so adding arbitrary limits while copying it would be silly. I don't think we'll see any systems like that. Nathan Myers ncm@zembu.com
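The point above about mapping an actual inode so that unrelated processes can join in can be sketched as follows. This is only an illustration under stated assumptions: the helper names attach_region and file_backed_roundtrip and the 4096-byte REGION_SIZE are hypothetical, and the two attach_region() calls inside one process merely stand in for two unrelated processes that agree on a path. Note that with a regular file the kernel is free to write modified pages back to that file.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define REGION_SIZE 4096 /* hypothetical region size for the sketch */

/* Map REGION_SIZE bytes of the file at `path`, creating it if needed. */
static char *attach_region(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, REGION_SIZE) != 0) {
        close(fd);
        return NULL;
    }
    char *base = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    return base == MAP_FAILED ? NULL : base;
}

/*
 * Attach the same file twice, write through one mapping and read through
 * the other. Returns 1 if the reader sees the message, 0 if it does not,
 * and -1 if either mapping could not be set up.
 */
int file_backed_roundtrip(const char *path, const char *msg)
{
    assert(strlen(msg) < REGION_SIZE);

    char *writer = attach_region(path);
    char *reader = attach_region(path);
    int result = -1;

    if (writer != NULL && reader != NULL) {
        strcpy(writer, msg);
        result = (strcmp(reader, msg) == 0) ? 1 : 0;
    }
    if (writer != NULL)
        munmap(writer, REGION_SIZE);
    if (reader != NULL)
        munmap(reader, REGION_SIZE);
    unlink(path); /* tidy up the backing file used by the sketch */
    return result;
}
```

The design choice that matters here is the name: any process that can open the path can attach, which is exactly what an anonymous mapping cannot offer.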
On Mon, 26 Feb 2001, Nathan Myers wrote:

> > While I've said before that I don't think it's really necessary for
> > processes that aren't children of the postmaster to access the shared
> > memory, I'm not sure that I want to go over to a mechanism that makes it
> > *impossible* for that to be done. Especially not if the only motivation
> > is to avoid having to configure the kernel's shared memory settings.
>
> There are enormous advantages to avoiding the need to configure kernel
> settings. It makes PG a better citizen. PG is much easier to drop in
> and use if you don't need attention from the IT department.

Is there a reason why Oracle still uses shared memory and hasn't moved to mmap()? Are there advantages to it that we aren't seeing, or is Oracle just too much of a behemoth for that sort of overhaul? Don't go with the quick answer either ...

> > Besides, what makes you think there's not a limit on the size of shmem
> > allocatable via mmap()?
>
> I've never seen any mmap limit documented. Since mmap() is how
> everybody implements shared libraries, such a limit would be equivalent
> to a limit on how much/many shared libraries are used.

There are/will be limits based on how an admin sets his/her per-user datasize limits on their OS ...
> On Sun, Feb 25, 2001 at 11:28:46PM -0500, Tom Lane wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > It allows no backing store on disk. > > I.e. it allows you to map memory without an associated inode; the memory > may still be swapped. Of course, there is no problem with mapping an > inode too, so that unrelated processes can join in. Solarix has a flag > to pin the shared pages in RAM so they can't be swapped out. We don't want to generate i/o to disk just for shared memory modifications, that is why we can't use a disk file. > > > > It is the BSD solution to SysV > > > share memory. Here are all the BSDi flags: > > > > > MAP_ANON Map anonymous memory not associated with any specific > > > file. The file descriptor used for creating MAP_ANON > > > must be -1. The offset parameter is ignored. > > > > Hmm. Now that I read down to the "nonstandard extensions" part of the > > HPUX man page for mmap(), I find > > > > If MAP_ANONYMOUS is set in flags: > > > > o A new memory region is created and initialized to all zeros. > > This memory region can be shared only with descendants of > > the current process. > > This is supported on Linux and BSD, but not on Solarix 7. It's not > necessary; you can just map /dev/zero on SysV systems that don't > have MAP_ANON. Oh, really. Yes, I have seen people do that. > > While I've said before that I don't think it's really necessary for > > processes that aren't children of the postmaster to access the shared > > memory, I'm not sure that I want to go over to a mechanism that makes it > > *impossible* for that to be done. Especially not if the only motivation > > is to avoid having to configure the kernel's shared memory settings. > > There are enormous advantages to avoiding the need to configure kernel > settings. It makes PG a better citizen. PG is much easier to drop in > and use if you don't need attention from the IT department. 
One big advantage is that mmap() removes itself when all processes using it exit, while SysV stays around and has to be cleaned up manually in some cases. > But I don't know of any reason to avoid mapping an actual inode, > so using mmap doesn't necessarily mean giving up sharing among > unrelated processes. See above. > > > Besides, what makes you think there's not a limit on the size of shmem > > allocatable via mmap()? > > I've never seen any mmap limit documented. Since mmap() is how > everybody implements shared libraries, such a limit would be equivalent > to a limit on how much/many shared libraries are used. mmap() with > MAP_ANONYMOUS (or its SysV /dev/zero equivalent) is a common, modern > way to get raw storage for malloc(), so such a limit would be a limit > on malloc() too. > > The mmap architecture comes to us from the Mach microkernel memory > manager, backported into BSD and then copied widely. Since it was > the fundamental mechanism for all memory operations in Mach, arbitrary > limits would make no sense. That it worked so well is the reason it > was copied everywhere else, so adding arbitrary limits while copying > it would be silly. I don't think we'll see any systems like that. This is encouraging. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
ncm@zembu.com (Nathan Myers) writes: > This is supported on Linux and BSD, but not on Solarix 7. It's not > necessary; you can just map /dev/zero on SysV systems that don't > have MAP_ANON. HPUX says: The mmap() function is supported for regular files. Support for any other type of file is unspecified. > But I don't know of any reason to avoid mapping an actual inode, How about wasted I/O due to the kernel thinking it needs to reflect writes to the memory region back out to the underlying file? > Since mmap() is how everybody implements shared libraries, Now *there's* a sweeping generalization. Documentation of this claim, please? > The mmap architecture comes to us from the Mach microkernel memory > manager, backported into BSD and then copied widely. If everyone copied the Mach implementation, why is it they don't even agree on the spellings of the user-visible flags? This looks a lot like exchanging the devil we know (SysV shmem) for a devil we don't know. Do I need to remind you about, for example, the mmap bugs in early Linux releases? (I still vividly remember having to abandon mmap on a project a few years back that needed to be portable to Linux. Perhaps that colors my opinions here.) I don't think the problems with shmem are sufficiently large to justify venturing into a whole new terra incognita of portability issues and kernel bugs. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:

> > Since mmap() is how everybody implements shared libraries,
>
> Now *there's* a sweeping generalization. Documentation of this
> claim, please?

I've seen a lot of shared library implementations (I used to be the GNU binutils maintainer), and Nathan is approximately correct.

Most ELF systems use a dynamic linker inherited from the original SVR4 implementation, which uses mmap. You can see this by running strace on an SVR4 system. The *BSD and GNU dynamic linker implementations are of course independently derived, but they use mmap too.

mmap is the natural way to implement ELF style shared libraries. The basic operation you have to do is to map the shared library into the process memory space, and then to process a few relocations. Mapping the shared library in can be done either using mmap, or using open/read/close. For a large file, mmap is going to be much faster than open/read/close, because it doesn't require actually reading the file.

There are, of course, many non-ELF shared library implementations. SVR3 does not use mmap. SunOS does use mmap (SunOS shared libraries were taken into SVR4 and the ELF standard). I don't know offhand about AIX, Digital Unix, or Windows.

mmap is standardized by the most recent version of POSIX.1.

Ian
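To illustrate why mapping beats open/read/close for a large library file, here is a small sketch that maps a file read-only and inspects only its first four bytes. It is not how any particular dynamic linker is written, and the helper name has_elf_magic is hypothetical. The point is that the kernel faults in only the pages actually touched, so the file size does not determine how much data is read.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map `path` read-only and check for the ELF magic number.
 * Returns 1 if the file starts with "\177ELF", 0 if it does not
 * (or is too short), and -1 if it cannot be opened or mapped.
 */
int has_elf_magic(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size < 4) {       /* too short to hold the magic number */
        close(fd);
        return 0;
    }

    /* Map the whole file; no data is read until a page is touched. */
    const unsigned char *image = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                      MAP_PRIVATE, fd, 0);
    close(fd);
    if (image == MAP_FAILED)
        return -1;

    int is_elf = (memcmp(image, "\177ELF", 4) == 0);
    munmap((void *)image, (size_t)st.st_size);
    return is_elf;
}
```

A read()-based version would have to copy the bytes into a buffer first. A real loader maps individual segments with the protections listed in the program headers rather than the whole file at once.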
Hello Tom,

Tuesday, February 27, 2001, 12:23:25 AM, you wrote:

TL> This looks a lot like exchanging the devil we know (SysV shmem) for a
TL> devil we don't know. Do I need to remind you about, for example, the
TL> mmap bugs in early Linux releases? (I still vividly remember having to
TL> abandon mmap on a project a few years back that needed to be portable
TL> to Linux. Perhaps that colors my opinions here.) I don't think the
TL> problems with shmem are sufficiently large to justify venturing into
TL> a whole new terra incognita of portability issues and kernel bugs.

TL> regards, tom lane

The only problem is that if we need to tune the postmaster to use large buffers while the system does not have that much SysV shared memory, then on many systems we need to recompile the OS kernel. This is a small problem when installing PGSQL in a production environment.

--
Best regards,
XuYifeng
On Tue, 27 Feb 2001, jamexu wrote: > Hello Tom, > > Tuesday, February 27, 2001, 12:23:25 AM, you wrote: > > TL> This looks a lot like exchanging the devil we know (SysV shmem) for a > TL> devil we don't know. Do I need to remind you about, for example, the > TL> mmap bugs in early Linux releases? (I still vividly remember having to > TL> abandon mmap on a project a few years back that needed to be portable > TL> to Linux. Perhaps that colors my opinions here.) I don't think the > TL> problems with shmem are sufficiently large to justify venturing into > TL> a whole new terra incognita of portability issues and kernel bugs. > > TL> regards, tom lane > > the only problem is because if we need to tune Postermaster to use > large buffer while system havn't so many SYSV shared memory, in many > systemes, we need to recompile OS kernel, this is a small problem to install > PGSQL to product environment. What? You don't automatically recompile your OS kernel when you build a system in the first place?? First step on any OS install of FreeBSD is to rid myself of the 'extras' that are in the generic kernel, and enable SharedMemory (even if I'm not using PgSQL on that machine) ...
>> the only problem is because if we need to tune Postermaster to use >> large buffer while system havn't so many SYSV shared memory, in many >> systemes, we need to recompile OS kernel, this is a small problem to install >> PGSQL to product environment. Of course, if you haven't got mmap(), a recompile won't help ... I'd be somewhat more enthusiastic about mmap if I thought we could abandon the SysV shmem support completely, but I don't foresee that happening for a long while yet. regards, tom lane
Hello The,

Tuesday, February 27, 2001, 11:00:05 AM, you wrote:

THH> On Tue, 27 Feb 2001, jamexu wrote:

>> Hello Tom,
>>
>> Tuesday, February 27, 2001, 12:23:25 AM, you wrote:
>>
>> TL> This looks a lot like exchanging the devil we know (SysV shmem) for a
>> TL> devil we don't know. Do I need to remind you about, for example, the
>> TL> mmap bugs in early Linux releases? (I still vividly remember having to
>> TL> abandon mmap on a project a few years back that needed to be portable
>> TL> to Linux. Perhaps that colors my opinions here.) I don't think the
>> TL> problems with shmem are sufficiently large to justify venturing into
>> TL> a whole new terra incognita of portability issues and kernel bugs.
>>
>> TL> regards, tom lane
>>
>> the only problem is because if we need to tune Postermaster to use
>> large buffer while system havn't so many SYSV shared memory, in many
>> systemes, we need to recompile OS kernel, this is a small problem to install
>> PGSQL to product environment.

THH> What? You don't automatically recompile your OS kernel when you build a
THH> system in the first place?? First step on any OS install of FreeBSD is to
THH> rid myself of the 'extras' that are in the generic kernel, and enable
THH> SharedMemory (even if I'm not using PgSQL on that machine) ...

Heihei, why do you think users are always on FreeBSD and not on other UNIX systems? Your assumption is false.

---
Xu Yifeng
On Tue, 27 Feb 2001, jamexu wrote:

> Hello The,
>
> Tuesday, February 27, 2001, 11:00:05 AM, you wrote:
>
> THH> On Tue, 27 Feb 2001, jamexu wrote:
>
> >> Hello Tom,
> >>
> >> Tuesday, February 27, 2001, 12:23:25 AM, you wrote:
> >>
> >> TL> This looks a lot like exchanging the devil we know (SysV shmem) for a
> >> TL> devil we don't know. Do I need to remind you about, for example, the
> >> TL> mmap bugs in early Linux releases? (I still vividly remember having to
> >> TL> abandon mmap on a project a few years back that needed to be portable
> >> TL> to Linux. Perhaps that colors my opinions here.) I don't think the
> >> TL> problems with shmem are sufficiently large to justify venturing into
> >> TL> a whole new terra incognita of portability issues and kernel bugs.
> >>
> >> TL> regards, tom lane
> >>
> >> the only problem is because if we need to tune Postermaster to use
> >> large buffer while system havn't so many SYSV shared memory, in many
> >> systemes, we need to recompile OS kernel, this is a small problem to install
> >> PGSQL to product environment.
>
> THH> What? You don't automatically recompile your OS kernel when you build a
> THH> system in the first place?? First step on any OS install of FreeBSD is to
> THH> rid myself of the 'extras' that are in the generic kernel, and enable
> THH> SharedMemory (even if I'm not using PgSQL on that machine) ...
>
> heihei, why do you think users always using FreeBSD and not other
> UNIX systemes?
> your assume is false.

I don't ... I personally admin FreeBSD and Solaris boxen ... FreeBSD, first step is to always recompile the kernel after an install, to get rid of crud and add Shared Memory ... the Solaris boxes, you add a couple of lines to /etc/system and reboot, and you have Shared Memory ...

I don't know about other 'commercial OSs', but I'd be shocked if a Linux admin never does any kernel config cleanup before going production *shrug*
> > the only problem is because if we need to tune Postermaster to use > > large buffer while system havn't so many SYSV shared memory, in many > > systemes, we need to recompile OS kernel, this is a small problem to install > > PGSQL to product environment. > > What? You don't automatically recompile your OS kernel when you build a > system in the first place?? First step on any OS install of FreeBSD is to > rid myself of the 'extras' that are in the generic kernel, and enable > SharedMemory (even if I'm not using PgSQL on that machine) ... He is saying the machine is already in production. Suppose he has run PostgreSQL for a few months, then needs to increase number of buffers. He can't exceed the kernel limit unless he recompiles. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
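The kernel limit being discussed shows up at the shmget() call. The sketch below is an illustration only, with a hypothetical helper name try_sysv_segment. It requests a private SysV segment of a given size and reports the errno if the kernel refuses. POSIX specifies EINVAL when the requested size exceeds the system-imposed maximum (commonly tuned as SHMMAX).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Try to create, attach, and release a private SysV segment of `size`
 * bytes. Returns 0 on success, or the errno value of the call that failed.
 */
int try_sysv_segment(size_t size)
{
    assert(size > 0);

    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id < 0)
        return errno; /* EINVAL: size above the kernel's maximum */

    void *base = shmat(id, NULL, 0);
    int err = (base == (void *)-1) ? errno : 0;

    /*
     * Mark the segment for removal right away. SysV segments otherwise
     * outlive their creator and must be cleaned up by hand (ipcrm).
     */
    shmctl(id, IPC_RMID, NULL);
    if (err == 0)
        shmdt(base);
    return err;
}
```

A program could call try_sysv_segment() with the intended buffer size at startup and, on EINVAL, tell the administrator which kernel setting to raise instead of failing with a bare error.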
> I don't know about other 'commercial OSs', but I'd be shocked if a Linux
> admin never does any kernel config cleanup before going production *shrug*

oops...

- Thomas
The Hermit Hacker writes:

> I don't ... I personally admin FreeBSD and Solaris boxen ... FreeBSD,
> first step is to always recompile the kernel after an install, to get rid
> of crud and add Shared Memory ... the Solaris boxes, you add a couple of
> lines to /etc/system and reboot, and you have Shared Memory ...
>
> I don't know about other 'commercial OSs', but I'd be shocked if a Linux
> admin never does any kernel config cleanup before going production *shrug*

Linux allows you to load and unload kernel modules, while the system is running, to add and remove stuff as you need it. But this is moot because Linux also allows you to increase shared memory (up to the total addressable memory) while the system is running. Recompiling Linux kernels is a thing of the past with modern distributions.

--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
On Tue, 27 Feb 2001, Peter Eisentraut wrote:

> The Hermit Hacker writes:
>
> > I don't ... I personally admin FreeBSD and Solaris boxen ... FreeBSD,
> > first step is to always recompile the kernel after an install, to get rid
> > of crud and add Shared Memory ... the Solaris boxes, you add a couple of
> > lines to /etc/system and reboot, and you have Shared Memory ...
> >
> > I don't know about other 'commercial OSs', but I'd be shocked if a Linux
> > admin never does any kernel config cleanup before going production *shrug*
>
> Linux allows you to load and unload kernel modules, while the system is
> running, to add and remove stuff as you need it. But this is moot because
> Linux also allows you to increase shared memory (up to the total
> addressable memory) while the system is running. Recompiling Linux
> kernels is a thing of the past with modern distributions.

Actually, just found that out for FreeBSD too *sigh* You do have to enable SYSV* in the kernel itself, but increasing shared memory and semaphores is a simple sysctl that can be run while the system is live ...