Thread: Providing anonymous mmap as an option of sharing memory
Hello All,

I was looking through the source and thought it would be worth seeking opinions on this proposal.

From what I understood so far, the core shared memory handling is done in pgsql/src/backend/port/sysv_shmem.c. It is linked by configure as per the runtime environment. So I need to write another source file which exports the same APIs as above (i.e. all non-static functions in that file) but using mmap, and that would do it for using anon mmap instead of SysV shared memory.

It might seem unnecessary to provide mmap-based shared memory, but this is just one step I was thinking of.

All the shared memory allocations are done in pgsql/src/backend/storage/ipc/shmem.c. I was thinking of creating a structure of all global variables in that file. The global variables would still be in place so that existing code would not break, but the structure would hold database-specific buffering information. Let's call that structure a database context.

That way we can assign different mmaped (anon, of course) regions per database. In the backend, we could just switch the database contexts, i.e. assign global variables from the database context and let the backend write to the appropriate shared memory region. Every database would need at least two shared memory regions: one for operating on its own buffers and another for the system, where it could write to shared catalogs etc. A backend can close the shared memory regions belonging to other databases on startup.

Of course, buffer management alone would not cover database contexts altogether. WAL needs to be lumped in as well (not necessarily, though: if all WAL buffering goes through the system shared region, everything will still work). I don't know if clog and data file handling are affected by this.

If WAL goes in the database context, we can probably provide per-database WAL, which could go well with tablespaces as well. In case of WAL per database, a transaction that operates on a shared catalog would need to flush both the system WAL and the database WAL to commit. Otherwise only flushing the database WAL would do.

This way we can provide a background writer process per database and a common buffer per database, minimising the impact of cross-database load significantly. E.g. vacuum full on one database would not hog another database due to buffer cache pollution. (IO can still saturate, though.) This way we can push the hardware to its limit, which might not be possible right now in some cases.

I was looking for the reason a large number of buffers degrades performance, and the source code browsing spiralled into this thought. So far I haven't figured out any reason why a large number of buffers can degrade performance. Still looking for it.

Comments?

Shridhar
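The "database context" idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the struct layout, the stand-in globals (`CurrentRegionBase`, `CurrentRegionSize`, `CurrentFreeOffset`) and the functions `switch_database_context` and `region_alloc` are hypothetical names, not anything that exists in shmem.c. It shows existing code continuing to use globals while a switch routine points them at a different region.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: per-database bundle of what would otherwise be globals. */
typedef struct DatabaseContext
{
    void   *region_base;   /* start of this database's shared region */
    size_t  region_size;   /* total bytes in the region */
    size_t  free_offset;   /* next unallocated byte in the region */
} DatabaseContext;

/* Stand-ins for the existing globals that unmodified code keeps using. */
void   *CurrentRegionBase = NULL;
size_t  CurrentRegionSize = 0;
size_t  CurrentFreeOffset = 0;

static DatabaseContext *active_context = NULL;

/* Save allocator state into the old context, then load the new one. */
void switch_database_context(DatabaseContext *ctx)
{
    assert(ctx != NULL);
    if (active_context != NULL)
        active_context->free_offset = CurrentFreeOffset;
    CurrentRegionBase = ctx->region_base;
    CurrentRegionSize = ctx->region_size;
    CurrentFreeOffset = ctx->free_offset;
    active_context = ctx;
}

/* Simple bump allocation from whichever region is currently active. */
void *region_alloc(size_t nbytes)
{
    if (CurrentRegionBase == NULL)
        return NULL;                      /* no context selected yet */
    if (nbytes > CurrentRegionSize - CurrentFreeOffset)
        return NULL;                      /* region exhausted */
    void *p = (char *) CurrentRegionBase + CurrentFreeOffset;
    CurrentFreeOffset += nbytes;
    return p;
}
```

A backend would call `switch_database_context(&system_ctx)` before touching shared catalogs and switch back to its own database's context afterwards. The sketch ignores locking, which real shared allocation state would need.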
Shridhar Daithankar <shridhar_daithankar@myrealbox.com> writes:
> I was looking through the source and thought it would be worth to seek
> opinion on this proposal.

This has been discussed and rejected before. See the archives.

regards, tom lane
Tom Lane wrote:
> Shridhar Daithankar <shridhar_daithankar@myrealbox.com> writes:
>
>>I was looking through the source and thought it would be worth to seek
>>opinion on this proposal.
>
> This has been discussed and rejected before. See the archives.

I went through this for details:

http://developer.postgresql.org/cvsweb.cgi/pgsql-server/doc/TODO.detail/mmap

There seem to be two objections to mmap.

1. If a backend from the last crashed postmaster is still running, it might have files etc. open, and that is in general not such a good idea.

2. For replacing stdio for data and WAL files with mmap, mmap does not guarantee the order of IO, which defeats WAL.

I covered only the first point in my post. IMO it is not such an unsolvable problem. If a postmaster crashes hard but leaves a backend running, would it clean up the pid file etc.? I don't think so. So if a postmaster can start from a 'pid-clean' state, then it is guaranteed that no children are left around.

There were issues with Linux not supporting MAP_SHARED and MAP_ANONYMOUS simultaneously, but those are quite old messages, from 1998, talking about Linux 2.0.x. I don't think that is still true, but I need to check.

Too bad FreeBSD's MAP_NOSYNC is not a standard; otherwise it could have been considered even for point 2.

Did I miss something?

Shridhar
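Whether a given platform accepts MAP_SHARED together with MAP_ANONYMOUS can be checked empirically. The following is a minimal sketch, not PostgreSQL code: the function names `anon_shared_alloc` and `anon_shared_mmap_works` are made up for illustration, and the `MAP_ANON` fallback assumes the older BSD spelling is available where `MAP_ANONYMOUS` is not. It maps an anonymous shared region, forks, lets the child write to it, and reports whether the parent sees the write.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON   /* assumption: older BSD-style name */
#endif

/* Map `size` bytes shared with any child forked later; NULL on failure. */
void *anon_shared_alloc(size_t size)
{
    assert(size > 0);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/*
 * Returns 1 if a write made by a forked child is visible to the parent,
 * 0 if it is not, and -1 if mmap, fork or waitpid failed.
 */
int anon_shared_mmap_works(void)
{
    volatile int *flag = anon_shared_alloc(sizeof(int));
    if (flag == NULL)
        return -1;
    *flag = 0;

    pid_t pid = fork();
    if (pid == 0)
    {
        *flag = 42;              /* child writes into the shared page */
        _exit(0);
    }

    int result = -1;
    if (pid > 0 && waitpid(pid, NULL, 0) == pid)
        result = (*flag == 42);  /* parent checks what the child wrote */

    munmap((void *) flag, sizeof(int));
    return result;
}
```

Note that the region is shared only because it was mapped before fork(); nothing here lets an unrelated process attach to it afterwards.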
Shridhar Daithankar <shridhar_daithankar@myrealbox.com> writes:
> I covered only the first point in my post. IMO it is not such an unsolvable
> problem. If a postmaster crashes hard but leaves a backend running,
> would it clean up the pid file etc.? I don't think so. So if a postmaster can
> start from a 'pid-clean' state, then it is guaranteed that no children
> are left around.

And that helps how? The problem is to detect whether there are any
children left from the old postmaster, when what you have to work from
is the pid-file it left behind.

In any case, you're still handwaving away the very real portability
issues around mmap. Linux is not the universe, and Linux+BSD isn't
either.

We might still have considered it, despite the negatives, if anyone had
been able to point to any actual *advantages* of mmap. There are none.
Yes, the SysV shmem API is old and ugly and crufty, but it does what we
need it to do.

regards, tom lane
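One property of SysV segments that bears on the detection problem is that the kernel tracks how many processes are attached to a segment, exposed as `shm_nattch` in `struct shmid_ds`. The sketch below is an illustration only, not the actual PostgreSQL code, and the function names `attached_processes` and `nattch_demo` are hypothetical. It creates a private segment, forks a child that inherits the attachment, and reads the attach count while the child is alive and again after it has exited.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Number of processes attached to segment `shmid`, or -1 on error. */
long attached_processes(int shmid)
{
    struct shmid_ds ds;
    if (shmctl(shmid, IPC_STAT, &ds) < 0)
        return -1;
    return (long) ds.shm_nattch;
}

/* Records the attach count with a live child and after it exits. */
int nattch_demo(long *with_child, long *after_child)
{
    assert(with_child != NULL && after_child != NULL);

    int shmid = shmget(IPC_PRIVATE, 8192, IPC_CREAT | 0600);
    if (shmid < 0)
        return -1;

    int   result = -1;
    int   fds[2];
    void *addr = shmat(shmid, NULL, 0);

    if (addr != (void *) -1 && pipe(fds) == 0)
    {
        pid_t pid = fork();
        if (pid == 0)
        {
            /* Child stays attached until the parent closes the pipe. */
            char c;
            close(fds[1]);
            if (read(fds[0], &c, 1) < 0)
                _exit(1);
            _exit(0);
        }
        close(fds[0]);
        if (pid > 0)
        {
            *with_child = attached_processes(shmid);   /* parent + child */
            close(fds[1]);                             /* release child */
            waitpid(pid, NULL, 0);
            *after_child = attached_processes(shmid);  /* parent only */
            result = 0;
        }
        else
            close(fds[1]);
    }

    if (addr != (void *) -1)
        shmdt(addr);
    shmctl(shmid, IPC_RMID, NULL);   /* remove the segment */
    return result;
}
```

A starting postmaster that finds an old segment with a nonzero attach count can conclude that some process from the previous incarnation is still alive. An anonymous mmap region offers no comparable handle to a process that did not inherit it.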
Tom Lane wrote:
> Shridhar Daithankar <shridhar_daithankar@myrealbox.com> writes:
> > I covered only the first point in my post. IMO it is not such an unsolvable
> > problem. If a postmaster crashes hard but leaves a backend running,
> > would it clean up the pid file etc.? I don't think so. So if a postmaster can
> > start from a 'pid-clean' state, then it is guaranteed that no children
> > are left around.
>
> And that helps how? The problem is to detect whether there are any
> children left from the old postmaster, when what you have to work from
> is the pid-file it left behind.
>
> In any case, you're still handwaving away the very real portability
> issues around mmap. Linux is not the universe, and Linux+BSD isn't
> either.
>
> We might still have considered it, despite the negatives, if anyone had
> been able to point to any actual *advantages* of mmap. There are none.
> Yes, the SysV shmem API is old and ugly and crufty, but it does what we
> need it to do.

Plus many operating systems can lock SysV shmem into RAM to prevent it
from being swapped out.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Shridhar Daithankar wrote:
> There seem to be two objections to mmap.
>
> 1. If a backend from last crashed running postmaster exists then it might
> have file etc. open and that is in general not such a good idea
>
> 2. For replacing stdio for data and WAL files with mmap, mmap does not
> guarantee order of IO which defeats WAL.

[...]

> Did I miss something?

Yes: based on everything that I've read on the subject, you can't change
the size of a shared memory segment allocated by mmap().

It's unlikely that mremap() will propagate changes to all other processes
that share the same area, since it can (if passed the proper flags) return
a starting address that differs from the current starting address.

And since propagation of the shared memory area is done via duplication of
the parent's page tables by the kernel at fork() time, new segments would
not be picked up by existing backends -- only new ones -- and then only if
the postmaster is the process that allocates them.

--
Kevin Brown kevin@sysexperts.com
Tom Lane wrote:
> Shridhar Daithankar <shridhar_daithankar@myrealbox.com> writes:
>
>>I covered only the first point in my post. IMO it is not such an unsolvable
>>problem. If a postmaster crashes hard but leaves a backend running,
>>would it clean up the pid file etc.? I don't think so. So if a postmaster can
>>start from a 'pid-clean' state, then it is guaranteed that no children
>>are left around.
>
> And that helps how? The problem is to detect whether there are any
> children left from the old postmaster, when what you have to work from
> is the pid-file it left behind.

Fine. We need shared memory for that. How about using one 8K page just for detecting that? We don't need to base the shared memory model on that, right? Maybe we can put clog in a shared memory segment, which would serve as a process counter, and move shared buffers to mmap?

> In any case, you're still handwaving away the very real portability
> issues around mmap. Linux is not the universe, and Linux+BSD isn't
> either.

Of the machines I can access here, the following have anon and shared mmap:

[host] ~> uname -a
SunOS host 5.8 Generic_108528-21 sun4u sparc SUNW,Sun-Fire-880 Solaris

[host] ~> uname -a
AIX host 1 5 0001A5CA4C00

[/home/user]uname -a
HP-UX host B.11.00 A 9000/785 2005950738 two-user license

Is that enough support?

> We might still have considered it, despite the negatives, if anyone had
> been able to point to any actual *advantages* of mmap. There are none.
> Yes, the SysV shmem API is old and ugly and crufty, but it does what we
> need it to do.

1) Per-database buffers

PostgreSQL does not perform well with a large number of buffers. Say an installation is configured for 100K buffers and has 5 databases. Now what would happen if each of these databases got its own 100K buffers?

mmap cannot expand shared memory without a server restart. The current implementation of shared memory behaves the same way. Rather than changing it to allocate shared memory as and when required, we could push per-database buffers to improve scalability.

I am thinking of this:

1. Introduce parameter columns in pg_database for the shared memory size (to start with) and the number of live connections to that database. Maybe also a callback to the postmaster daemon to reread the configuration if possible. (In shared memory, maybe?)

2. Provide start and stop server commands which essentially either let a connection happen or not.

Now when somebody modifies the buffer parameters for a database (say via ALTER DATABASE), all they have to do is disconnect and reconnect. If this is the first connection to the database, the parent postmaster should reread the per-database parameters and enforce them. The same can happen with the start/stop commands.

2) No more kernel mucking required

Recent Linux installations provide a sane enough default for SHMMAX, but I am sure plenty of folks would be glad to see that dependency go.

I also want to talk about mmap for file IO, but not in this thread.

Shridhar
Shridhar Daithankar <shridhar_daithankar@myrealbox.com> writes:
> Tom Lane wrote:
>> And that helps how? The problem is to detect whether there are any
>> children left from the old postmaster, when what you have to work from
>> is the pid-file it left behind.

> fine. We need shared memory for that. How about using 1 8K page just for
> detecting that? We don't need to base shared memory model on that, right?

So why should we depend on two kernel APIs when one is sufficient?

You still haven't pointed to any actual advantage that mmap'ing shared
buffers would offer over allocating them with SysV shmem.

regards, tom lane