Thread: Providing anonymous mmap as an option of sharing memory

Providing anonymous mmap as an option of sharing memory

From: Shridhar Daithankar
Date: 25 November 2003, 11:57:12

Hello All,

I was looking through the source and thought it would be worth seeking opinions 
on this proposal.
From what I have understood so far, the core shared memory handling is done in 
pgsql/src/backend/port/sysv_shmem.c. It is linked in by configure as per the 
runtime environment.

So I would need to write another source file which exports the same API as the 
one above (i.e. all non-static functions in that file) but is implemented with 
mmap. That would be enough to use anonymous mmap instead of SysV shared memory.
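
As a minimal sketch of what such a file would build on (not the sysv_shmem.c 
API itself): a region mapped with MAP_SHARED | MAP_ANONYMOUS before fork() is 
shared with every child. The helper names anon_shmem_create and 
anon_shmem_roundtrip are hypothetical, and MAP_ANONYMOUS is assumed to be 
available on the platform.

```c
#define _DEFAULT_SOURCE         /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: map `size` bytes that future children will share. */
static void *
anon_shmem_create(size_t size)
{
    void *p;

    assert(size > 0);
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/*
 * Fork a child that stores `value` in the shared region, then return what
 * the parent reads back from the same region (-1 on any failure).
 */
static int
anon_shmem_roundtrip(int value)
{
    int   *slot = anon_shmem_create(sizeof(int));
    int    status;
    int    seen;
    pid_t  pid;

    if (slot == NULL)
        return -1;
    *slot = 0;

    pid = fork();
    if (pid < 0)
    {
        munmap(slot, sizeof(int));
        return -1;
    }
    if (pid == 0)
    {
        *slot = value;          /* child writes into the shared page */
        _exit(0);
    }

    if (waitpid(pid, &status, 0) < 0)
    {
        munmap(slot, sizeof(int));
        return -1;
    }
    seen = *slot;               /* parent sees the child's write */
    munmap(slot, sizeof(int));
    return seen;
}
```

Unlike a SysV segment, such a region has no key or identifier: an unrelated 
process cannot attach to it or inspect it later.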

It might seem unnecessary to provide mmap-based shared memory, but this is just 
one step of what I was thinking of.

In pgsql/src/backend/storage/ipc/shmem.c, all the shared memory allocations are 
done. I was thinking of creating a structure of all global variables in that 
file. The global variables would still be in place so that existing code would 
not break. But the structure would hold database specific buffering information. 
Let's call that structure database context.

That way we can assign different mmap'ed (anonymous, of course) regions per 
database. In the backend, we could just switch database contexts, i.e. assign 
the global variables from the database context and let the backend write to the 
appropriate shared memory region. Every database would need at least two shared 
memory regions: one for operating on its own buffers and another for the system, 
where it could write to shared catalogs etc. A backend can close the shared 
memory regions belonging to other databases on startup.
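
A rough sketch of the context switch described above. Everything here is 
hypothetical: the DatabaseContext fields, the Current* globals and 
SwitchDatabaseContext are illustrative stand-ins, not names from shmem.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-database bundle of state kept in file-level globals today. */
typedef struct DatabaseContext
{
    void   *region_base;        /* start of this database's shared region */
    size_t  region_size;        /* size of that region in bytes */
    int     n_buffers;          /* number of buffers carved out of it */
} DatabaseContext;

/* Stand-ins for the existing globals, so existing code keeps compiling. */
static void   *CurrentRegionBase = NULL;
static size_t  CurrentRegionSize = 0;
static int     CurrentNBuffers = 0;

/* Point the globals at the region of the given database context. */
static void
SwitchDatabaseContext(const DatabaseContext *ctx)
{
    assert(ctx != NULL);
    CurrentRegionBase = ctx->region_base;
    CurrentRegionSize = ctx->region_size;
    CurrentNBuffers = ctx->n_buffers;
}
```

A backend would hold one context for its own database and one for the system 
region, and switch to the latter before touching shared catalogs.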

Of course, buffer management alone would not cover database contexts altogether. 
WAL needs to be lumped in as well. (Not necessarily, though: if all WAL 
buffering goes through the system shared region, everything will still work.) I 
don't know if clog and data file handling are affected by this. If WAL goes into 
the database context, we can probably provide per-database WAL, which could go 
well with tablespaces as well.

In the case of per-database WAL, operations done on a shared catalog from a 
backend would need to flush both the system WAL and the database WAL to ensure 
that such a transaction commits. Otherwise flushing only the database WAL would 
do.

This way we can provide a background writer process per database and a common 
buffer per database, minimising the impact of cross-database load significantly. 
E.g. a VACUUM FULL on one database would not hog another database through buffer 
cache pollution. (I/O can still saturate, though.) This way we can push the 
hardware to its limit, which might not be possible right now in some cases.

I was looking for the reason a large number of buffers degrades performance, 
and the source code browsing spiralled into this thought. So far I haven't 
figured out any reason why a large number of buffers would degrade performance. 
Still looking for it.

Comments?
 Shridhar

Re: Providing anonymous mmap as an option of sharing memory

From: Tom Lane
Date: 25 November 2003, 12:31:36

Shridhar Daithankar <shridhar_daithankar@myrealbox.com> writes:
> I was looking through the source and thought it would be worth seeking
> opinions on this proposal.

This has been discussed and rejected before.  See the archives.
        regards, tom lane

Re: Providing anonymous mmap as an option of sharing memory

From: Shridhar Daithankar
Date: 26 November 2003, 11:24:38

Tom Lane wrote:
> Shridhar Daithankar <shridhar_daithankar@myrealbox.com> writes:
> 
>>I was looking through the source and thought it would be worth seeking
>>opinions on this proposal.
> This has been discussed and rejected before.  See the archives.

I went through this for details:

http://developer.postgresql.org/cvsweb.cgi/pgsql-server/doc/TODO.detail/mmap

There seem to be two objections to mmap.

1. If a backend left over from the last crashed postmaster still exists, it 
might have files etc. open, and that is in general not such a good idea.

2. For replacing stdio for data and WAL files with mmap: mmap does not guarantee 
the order of I/O, which defeats WAL.

I covered only the first point in my post. IMO it is not such an unsolvable 
problem. If a postmaster crashes hard but leaves a backend running, would it 
clean up the pid file etc.? I don't think so. So if a postmaster can start from 
a 'pid-clean' state, then it is guaranteed that there are no children left 
around.

There were issues with Linux not supporting MAP_SHARED and MAP_ANONYMOUS 
simultaneously, but those are quite old messages, from 1998, talking about Linux 
2.0.x. I don't think that is still true, but I need to check.

Too bad FreeBSD's MAP_NOSYNC is not a standard; otherwise mmap could have been 
considered even for point 2.

Did I miss something?
 Shridhar

Re: Providing anonymous mmap as an option of sharing memory

From: Tom Lane
Date: 26 November 2003, 12:10:24

Shridhar Daithankar <shridhar_daithankar@myrealbox.com> writes:
> I covered only the first point in my post. IMO it is not such an
> unsolvable problem.  If a postmaster crashes hard but leaves a backend
> running, would it clean up the pid file etc.? I don't think so. So if a
> postmaster can start from a 'pid-clean' state, then it is guaranteed
> that there are no children left around.

And that helps how?  The problem is to detect whether there are any
children left from the old postmaster, when what you have to work from
is the pid-file it left behind.

In any case, you're still handwaving away the very real portability
issues around mmap.  Linux is not the universe, and Linux+BSD isn't
either.

We might still have considered it, despite the negatives, if anyone had
been able to point to any actual *advantages* of mmap.  There are none.
Yes, the SysV shmem API is old and ugly and crufty, but it does what we
need it to do.
        regards, tom lane
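
For readers following the detection argument: a SysV segment carries an attach 
count (shm_nattch) that any process can query with shmctl(IPC_STAT), so a new 
postmaster can ask whether anything is still attached to the old segment. The 
sketch below only illustrates that kernel facility; the function names are 
hypothetical and this is not the code from sysv_shmem.c.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>

/* Number of processes currently attached to segment `shmid`, or -1 on error. */
static long
shm_attach_count(int shmid)
{
    struct shmid_ds ds;

    if (shmctl(shmid, IPC_STAT, &ds) < 0)
        return -1;
    return (long) ds.shm_nattch;
}

/*
 * Create a private 8K segment, attach to it, and record the attach count
 * while attached and again after detaching. Returns 0 on success.
 */
static int
shm_attach_demo(long *while_attached, long *after_detach)
{
    int   shmid;
    void *addr;

    assert(while_attached != NULL && after_detach != NULL);

    shmid = shmget(IPC_PRIVATE, 8192, IPC_CREAT | 0600);
    if (shmid < 0)
        return -1;

    addr = shmat(shmid, NULL, 0);
    if (addr == (void *) -1)
    {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }

    *while_attached = shm_attach_count(shmid);
    shmdt(addr);
    *after_detach = shm_attach_count(shmid);

    shmctl(shmid, IPC_RMID, NULL);      /* remove the segment again */
    return 0;
}
```

An anonymous mmap region has no identifier that an unrelated process could 
look up later, so it offers no equivalent query.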

Re: Providing anonymous mmap as an option of sharing memory

From: Bruce Momjian
Date: 26 November 2003, 14:40:03

Tom Lane wrote:
> Shridhar Daithankar <shridhar_daithankar@myrealbox.com> writes:
> > I covered only the first point in my post. IMO it is not such an
> > unsolvable problem.  If a postmaster crashes hard but leaves a backend
> > running, would it clean up the pid file etc.? I don't think so. So if a
> > postmaster can start from a 'pid-clean' state, then it is guaranteed
> > that there are no children left around.
> 
> And that helps how?  The problem is to detect whether there are any
> children left from the old postmaster, when what you have to work from
> is the pid-file it left behind.
> 
> In any case, you're still handwaving away the very real portability
> issues around mmap.  Linux is not the universe, and Linux+BSD isn't
> either.
> 
> We might still have considered it, despite the negatives, if anyone had
> been able to point to any actual *advantages* of mmap.  There are none.
> Yes, the SysV shmem API is old and ugly and crufty, but it does what we
> need it to do.

Plus many operating systems can lock SysV shmem into RAM to prevent it
from being swapped out.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Providing anonymous mmap as an option of sharing memory

From: Kevin Brown
Date: 26 November 2003, 23:18:17

Shridhar Daithankar wrote:
> There seem to be two objections to mmap.
> 
> 1. If a backend left over from the last crashed postmaster still exists,
> it might have files etc. open, and that is in general not such a good idea.
> 
> 2. For replacing stdio for data and WAL files with mmap: mmap does not 
> guarantee the order of I/O, which defeats WAL.

[...]

> Did I miss something?

Yes:  based on everything that I've read on the subject, you can't change
the size of a shared memory segment allocated by mmap().  It's unlikely
that mremap() will propagate changes to all other processes that share the
same area, since it can (if passed the proper flags) return a starting
address that differs from the current starting address.

And since propagation of the shared memory area is done via duplication
of the parent's page tables by the kernel at fork() time, new segments
would not be picked up by existing backends -- only new ones -- and then
only if the postmaster is the process that allocates them.
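
The fork()-time inheritance can be checked with a small sketch: the parent 
maps a new shared region after the child already exists, then the child tests 
whether that address is mapped in its own address space. The function name is 
hypothetical, and the check relies on msync() failing with ENOMEM for an 
unmapped range, as POSIX specifies.

```c
#define _DEFAULT_SOURCE         /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a region the parent maps AFTER fork() is mapped in the
 * already-running child, 0 if it is not, -1 on setup failure.
 */
static int
late_mapping_visible_in_child(void)
{
    long     pagesz = sysconf(_SC_PAGESIZE);
    int      fds[2];
    int      status;
    pid_t    pid;
    void    *late;
    ssize_t  sent = -1;

    assert(pagesz > 0);
    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0)
    {
        void *addr;

        close(fds[1]);
        /* Wait for the parent to tell us where it mapped the new region. */
        if (read(fds[0], &addr, sizeof(addr)) != (ssize_t) sizeof(addr))
            _exit(2);
        /* msync() succeeds only if the range is mapped in this process. */
        _exit(msync(addr, (size_t) pagesz, MS_ASYNC) == 0 ? 1 : 0);
    }

    close(fds[0]);
    late = mmap(NULL, (size_t) pagesz, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (late != MAP_FAILED)
        sent = write(fds[1], &late, sizeof(late));
    close(fds[1]);              /* on failure the child sees EOF and exits 2 */
    (void) sent;

    if (waitpid(pid, &status, 0) < 0)
        status = -1;
    if (late != MAP_FAILED)
        munmap(late, (size_t) pagesz);

    if (status == -1 || !WIFEXITED(status) || WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status);
}
```

The child is expected to report that the late region is not mapped, which is 
the situation an existing backend would be in.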

-- 
Kevin Brown                          kevin@sysexperts.com

Re: Providing anonymous mmap as an option of sharing memory

From: Shridhar Daithankar
Date: 27 November 2003, 02:38:01

Tom Lane wrote:

> Shridhar Daithankar <shridhar_daithankar@myrealbox.com> writes:
> 
>>I covered only the first point in my post. IMO it is not such an
>>unsolvable problem.  If a postmaster crashes hard but leaves a backend
>>running, would it clean up the pid file etc.? I don't think so. So if a
>>postmaster can start from a 'pid-clean' state, then it is guaranteed
>>that there are no children left around.
> 
> 
> And that helps how?  The problem is to detect whether there are any
> children left from the old postmaster, when what you have to work from
> is the pid-file it left behind.

Fine. We need shared memory for that. How about using one 8K page just for 
detecting that? We don't need to base the whole shared memory model on that, 
right?

Maybe we can put clog in a SysV shared memory segment, which would also serve 
as the process counter, and move shared buffers to mmap?

> In any case, you're still handwaving away the very real portability
> issues around mmap.  Linux is not the universe, and Linux+BSD isn't
> either.

Of the machines I can access here, the following have anonymous and shared mmap:

[host] ~> uname -a
SunOS host 5.8 Generic_108528-21 sun4u sparc SUNW,Sun-Fire-880 Solaris

[host] ~> uname -a
AIX host 1 5 0001A5CA4C00

[/home/user]uname -a
HP-UX host B.11.00 A 9000/785 2005950738 two-user license

Is that enough support?

> 
> We might still have considered it, despite the negatives, if anyone had
> been able to point to any actual *advantages* of mmap.  There are none.
> Yes, the SysV shmem API is old and ugly and crufty, but it does what we
> need it to do.

1) Per-database buffers

PostgreSQL does not perform well with a large number of buffers. Say an 
installation is configured for 100K buffers and has 5 databases. Now what would 
happen if each of these databases got its own 100K buffers?

mmap cannot expand shared memory without a server restart. The current 
implementation of shared memory behaves the same way.

Rather than moving it to use shared memory as and when required, we could push 
per-database buffers to improve scalability.

I am thinking of this:

1. Introduce parameter columns in pg_database for the shared memory size (to 
start with) and the number of live connections to that database. Maybe a 
callback to the postmaster daemon to reread the configuration, if possible. (In 
shared memory, maybe?)

2. Provide start and stop server commands which essentially either let a 
connection happen or not.

Now if somebody modifies the buffer parameters for a database (say via ALTER 
DATABASE), all they have to do is disconnect and reconnect. If this is the first 
connection to the database, the parent postmaster should reread the per-database 
parameters and enforce them. The same can happen with the start/stop commands.

2) No more kernel mucking required.

Recent Linux installations provide a sane enough default for SHMMAX, but I am 
sure plenty of folks would be glad to see that dependency go.

I also want to talk about mmap for file IO but not in this thread.
 Shridhar

Re: Providing anonymous mmap as an option of sharing memory

From: Tom Lane
Date: 27 November 2003, 12:42:47

Shridhar Daithankar <shridhar_daithankar@myrealbox.com> writes:
> Tom Lane wrote:
>> And that helps how?  The problem is to detect whether there are any
>> children left from the old postmaster, when what you have to work from
>> is the pid-file it left behind.

> fine. We need shared memory for that. How about using 1 8K page just for 
> detecting that? We don't need to base shared memory model on that, right?

So why should we depend on two kernel APIs when one is sufficient?  You
still haven't pointed to any actual advantage that mmap'ing shared buffers
would offer over allocating them with SysV shmem.
        regards, tom lane