Thread: Posix Shared Mem patch

Posix Shared Mem patch

From
Josh Berkus
Date:
Robert, all:

Last I checked, we had a reasonably acceptable patch to use mostly Posix
Shared mem with a very small sysv ram partition.  Is there anything
keeping this from going into 9.3?  It would eliminate a major
configuration headache for our users.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Posix Shared Mem patch

From
Alvaro Herrera
Date:
Excerpts from Josh Berkus's message of Tue Jun 26 15:49:59 -0400 2012:
> Robert, all:
>
> Last I checked, we had a reasonably acceptable patch to use mostly Posix
> Shared mem with a very small sysv ram partition.  Is there anything
> keeping this from going into 9.3?  It would eliminate a major
> configuration headache for our users.

I don't think that patch was all that reasonable.  It needed work, and
in any case it needs a rebase because it was pretty old.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Excerpts from Josh Berkus's message of Tue Jun 26 15:49:59 -0400 2012:
>> Robert, all:
>>
>> Last I checked, we had a reasonably acceptable patch to use mostly Posix
>> Shared mem with a very small sysv ram partition.  Is there anything
>> keeping this from going into 9.3?  It would eliminate a major
>> configuration headache for our users.
>
> I don't think that patch was all that reasonable.  It needed work, and
> in any case it needs a rebase because it was pretty old.

Yep, agreed.

I'd like to get this fixed too, but it hasn't made it up to the top of
my list of things to worry about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
Josh Berkus
Date:
On 6/26/12 2:13 PM, Robert Haas wrote:
> On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>> Excerpts from Josh Berkus's message of Tue Jun 26 15:49:59 -0400 2012:
>>> Robert, all:
>>>
>>> Last I checked, we had a reasonably acceptable patch to use mostly Posix
>>> Shared mem with a very small sysv ram partition.  Is there anything
>>> keeping this from going into 9.3?  It would eliminate a major
>>> configuration headache for our users.
>>
>> I don't think that patch was all that reasonable.  It needed work, and
>> in any case it needs a rebase because it was pretty old.
> 
> Yep, agreed.
> 
> I'd like to get this fixed too, but it hasn't made it up to the top of
> my list of things to worry about.

Was there a post-AgentM version of the patch, which incorporated the
small SysV RAM partition?  I'm not finding it.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: Posix Shared Mem patch

From
Daniel Farina
Date:
On Tue, Jun 26, 2012 at 2:18 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 6/26/12 2:13 PM, Robert Haas wrote:
>> On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera
>> <alvherre@commandprompt.com> wrote:
>>> Excerpts from Josh Berkus's message of Tue Jun 26 15:49:59 -0400 2012:
>>>> Robert, all:
>>>>
>>>> Last I checked, we had a reasonably acceptable patch to use mostly Posix
>>>> Shared mem with a very small sysv ram partition.  Is there anything
>>>> keeping this from going into 9.3?  It would eliminate a major
>>>> configuration headache for our users.
>>>
>>> I don't think that patch was all that reasonable.  It needed work, and
>>> in any case it needs a rebase because it was pretty old.
>>
>> Yep, agreed.
>>
>> I'd like to get this fixed too, but it hasn't made it up to the top of
>> my list of things to worry about.
>
> Was there a post-AgentM version of the patch, which incorporated the
> small SysV RAM partition?  I'm not finding it.

On that, I used to be of the opinion that this is a good compromise (a
small amount of interlock space, plus mostly posix shmem), but I've
heard since then (I think via AgentM indirectly, but I'm not sure)
that there are cases where even the small SysV segment can cause
problems -- notably when other software tweaks shared memory settings
on behalf of a user, but only leaves just-enough for the software
being installed.  This is most likely on platforms that don't have a
high SysV shmem limit by default, so installers all feel the
prerogative to increase the limit, but there's no great answer for how
to compose a series of such installations.  It only takes one
installer that says "whatever, I'm just catenating stuff to
sysctl.conf that works for me" to sabotage Postgres' ability to start.

So there may be a benefit in finding a way to have no SysV memory at
all.  I wouldn't let perfect be the enemy of good to make progress
here, but it appears this was a witnessed real problem, so it may be
worth reconsidering if there is a way we can safely remove all SysV by
finding an alternative to the nattach mechanic.

-- 
fdr


Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Tue, Jun 26, 2012 at 5:18 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 6/26/12 2:13 PM, Robert Haas wrote:
>> On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera
>> <alvherre@commandprompt.com> wrote:
>>> Excerpts from Josh Berkus's message of Tue Jun 26 15:49:59 -0400 2012:
>>>> Robert, all:
>>>>
>>>> Last I checked, we had a reasonably acceptable patch to use mostly Posix
>>>> Shared mem with a very small sysv ram partition.  Is there anything
>>>> keeping this from going into 9.3?  It would eliminate a major
>>>> configuration headache for our users.
>>>
>>> I don't think that patch was all that reasonable.  It needed work, and
>>> in any case it needs a rebase because it was pretty old.
>>
>> Yep, agreed.
>>
>> I'd like to get this fixed too, but it hasn't made it up to the top of
>> my list of things to worry about.
>
> Was there a post-AgentM version of the patch, which incorporated the
> small SysV RAM partition?  I'm not finding it.

To my knowledge, no.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
Josh Berkus
Date:
> On that, I used to be of the opinion that this is a good compromise (a
> small amount of interlock space, plus mostly posix shmem), but I've
> heard since then (I think via AgentM indirectly, but I'm not sure)
> that there are cases where even the small SysV segment can cause
> problems -- notably when other software tweaks shared memory settings
> on behalf of a user, but only leaves just-enough for the software
> being installed.  This is most likely on platforms that don't have a
> high SysV shmem limit by default, so installers all feel the
> prerogative to increase the limit, but there's no great answer for how
> to compose a series of such installations.  It only takes one
> installer that says "whatever, I'm just catenating stuff to
> sysctl.conf that works for me" to sabotage Postgres' ability to start.

Personally, I see this as rather an extreme case, and aside from AgentM
himself, have never run into it before.  Certainly it would be useful to
not need SysV RAM at all, but it's more important to get a working patch
for 9.3.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Tue, Jun 26, 2012 at 5:44 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
>> On that, I used to be of the opinion that this is a good compromise (a
>> small amount of interlock space, plus mostly posix shmem), but I've
>> heard since then (I think via AgentM indirectly, but I'm not sure)
>> that there are cases where even the small SysV segment can cause
>> problems -- notably when other software tweaks shared memory settings
>> on behalf of a user, but only leaves just-enough for the software
>> being installed.  This is most likely on platforms that don't have a
>> high SysV shmem limit by default, so installers all feel the
>> prerogative to increase the limit, but there's no great answer for how
>> to compose a series of such installations.  It only takes one
>> installer that says "whatever, I'm just catenating stuff to
>> sysctl.conf that works for me" to sabotage Postgres' ability to start.
>
> Personally, I see this as rather an extreme case, and aside from AgentM
> himself, have never run into it before.  Certainly it would be useful to
> not need SysV RAM at all, but it's more important to get a working patch
> for 9.3.

+1.

I'd sort of given up on finding a solution that doesn't involve system
V shmem anyway, but now that I think about it... what about using a
FIFO?  The man page for open on MacOS X says:
   [ENXIO]            O_NONBLOCK and O_WRONLY are set, the file is a FIFO,
                      and no process has it open for reading.

And Linux says:
     ENXIO  O_NONBLOCK | O_WRONLY is set, the named file is a FIFO and no
            process has the file open for reading.  Or, the file is a device
            special file and no corresponding device exists.

And HP/UX says:
         [ENXIO]        O_NDELAY is set, the named file is a FIFO,
                        O_WRONLY is set, and no process has the file open
                        for reading.

So, what about keeping a FIFO in the data directory?  When the
postmaster starts up, it tries to open the file with O_NONBLOCK |
O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather
than O_NONBLOCK).  If that succeeds, it bails out.  If it fails with
anything other than ENXIO, it bails out.  If it fails with exactly
ENXIO, then it opens the pipe with O_RDONLY and arranges to pass the
file descriptor down to all of its children, so that a subsequent open
will fail if it or any of its children are still alive.

This might even be more reliable than what we do right now, because
our current system appears not to be robust against the removal of
postmaster.pid.
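
The FIFO interlock described above can be sketched roughly as follows. This is an illustration in Python, not code from any patch: the name try_claim_fifo is hypothetical, and the read side is opened with O_NONBLOCK (an assumption added here) because a plain O_RDONLY open of a FIFO would block until a writer appears. The sketch does nothing about the window between the ENXIO failure and the read-side open.

```python
import errno
import os


def try_claim_fifo(path):
    """Return a read fd if nobody else holds the FIFO open, else None.

    Hypothetical helper: the caller would keep the returned fd open and
    let its children inherit it across fork().
    """
    if not os.path.exists(path):
        os.mkfifo(path, 0o600)
    try:
        # Succeeds only if some process already has the FIFO open for reading.
        wfd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    except OSError as e:
        if e.errno != errno.ENXIO:
            raise  # anything other than ENXIO: bail out
    else:
        os.close(wfd)
        return None  # a reader exists: another postmaster or child is alive
    # Exactly ENXIO: no reader exists, so become the reader.
    # O_NONBLOCK is an assumption here so that this open does not block.
    return os.open(path, os.O_RDONLY | os.O_NONBLOCK)
```

A second call against the same path returns None for as long as the first caller, or any process that inherited its descriptor, keeps the read end open.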

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
Alvaro Herrera
Date:
Excerpts from Daniel Farina's message of Tue Jun 26 17:40:16 -0400 2012:

> On that, I used to be of the opinion that this is a good compromise (a
> small amount of interlock space, plus mostly posix shmem), but I've
> heard since then (I think via AgentM indirectly, but I'm not sure)
> that there are cases where even the small SysV segment can cause
> problems -- notably when other software tweaks shared memory settings
> on behalf of a user, but only leaves just-enough for the software
> being installed.

This argument is what killed the original patch.  If you want to get
anything done *at all* I think it needs to be dropped.  Changing shmem
implementation is already difficult enough --- you don't need to add the
requirement that the interlocking mechanism be changed simultaneously.
You (or whoever else) can always work on that as a followup patch.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Posix Shared Mem patch

From
Daniel Farina
Date:
On Tue, Jun 26, 2012 at 2:53 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
>
> Excerpts from Daniel Farina's message of Tue Jun 26 17:40:16 -0400 2012:
>
>> On that, I used to be of the opinion that this is a good compromise (a
>> small amount of interlock space, plus mostly posix shmem), but I've
>> heard since then (I think via AgentM indirectly, but I'm not sure)
>> that there are cases where even the small SysV segment can cause
>> problems -- notably when other software tweaks shared memory settings
>> on behalf of a user, but only leaves just-enough for the software
>> being installed.
>
> This argument is what killed the original patch.  If you want to get
> anything done *at all* I think it needs to be dropped.  Changing shmem
> implementation is already difficult enough --- you don't need to add the
> requirement that the interlocking mechanism be changed simultaneously.
> You (or whoever else) can always work on that as a followup patch.

True, but then again, I did very intentionally write:

> Excerpts from Daniel Farina's message of Tue Jun 26 17:40:16 -0400 2012:
>> *I wouldn't let perfect be the enemy of good* to make progress
>> here, but it appears this was a witnessed real problem, so it may
>> be worth reconsidering if there is a way we can safely remove all
>> SysV by finding an alternative to the nattach mechanic.

(Emphasis mine).

I don't think that -hackers at the time gave the zero-shmem rationale
much weight (I also was not that happy about the safety mechanism of
that patch), but upon more reflection (and taking into account *other*
software that may mangle shmem settings) I think it's something at
least worth thinking about again one more time.  What killed the patch
was an attachment to the deemed-less-safe strategy for avoiding bogus
shmem attachments already in it, but I don't seem to recall anyone
putting a whole lot of thought at the time into the zero-shmem case
from what I could read on the list, because a small interlock with
nattach seemed good-enough.

I'm simply suggesting that for additional benefits it may be worth
thinking about getting around nattach and thus SysV shmem, especially
with regard to safety, in an open-ended way.  Maybe there's a solution
(like Robert's FIFO suggestion?) that is not too onerous and can
satisfy everyone.

-- 
fdr


Re: Posix Shared Mem patch

From
"A.M."
Date:
On Jun 26, 2012, at 5:44 PM, Josh Berkus wrote:

>
>> On that, I used to be of the opinion that this is a good compromise (a
>> small amount of interlock space, plus mostly posix shmem), but I've
>> heard since then (I think via AgentM indirectly, but I'm not sure)
>> that there are cases where even the small SysV segment can cause
>> problems -- notably when other software tweaks shared memory settings
>> on behalf of a user, but only leaves just-enough for the software
>> being installed.  This is most likely on platforms that don't have a
>> high SysV shmem limit by default, so installers all feel the
>> prerogative to increase the limit, but there's no great answer for how
>> to compose a series of such installations.  It only takes one
>> installer that says "whatever, I'm just catenating stuff to
>> sysctl.conf that works for me" to sabotage Postgres' ability to start.
>
> Personally, I see this as rather an extreme case, and aside from AgentM
> himself, have never run into it before.  Certainly it would be useful to
> not need SysV RAM at all, but it's more important to get a working patch
> for 9.3.


This can be trivially reproduced if one runs an old (SysV shared
memory-based) postgresql alongside a potentially newer postgresql with a
smaller SysV segment. This can occur with applications that bundle
postgresql as part of the app.

Cheers,
M





Re: Posix Shared Mem patch

From
Josh Berkus
Date:
> This can be trivially reproduced if one runs an old (SysV shared
> memory-based) postgresql alongside a potentially newer postgresql with a
> smaller SysV segment. This can occur with applications that bundle
> postgresql as part of the app.

I'm not saying it doesn't happen at all.  I'm saying it's not the 80%
case.

So let's fix the 80% case with something we feel confident in, and then
revisit the no-sysv interlock as a separate patch.  That way if we can't
fix the interlock issues, we still have a reduced-shmem version of Postgres.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: Posix Shared Mem patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> So, what about keeping a FIFO in the data directory?

Hm, does that work if the data directory is on NFS?  Or some other weird
not-really-Unix file system?

> When the
> postmaster starts up, it tries to open the file with O_NONBLOCK |
> O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather
> than O_NONBLOCK).  If that succeeds, it bails out.  If it fails with
> anything other than ENXIO, it bails out.  If it fails with exactly
> ENXIO, then it opens the pipe with O_RDONLY

... race condition here ...

> and arranges to pass the
> file descriptor down to all of its children, so that a subsequent open
> will fail if it or any of its children are still alive.

This might be made to work, but that doesn't sound quite right in
detail.

I remember we speculated about using an fcntl lock on some file in the
data directory, but that fails because child processes don't inherit
fcntl locks.

In the modern world, it'd be really a step forward if the lock mechanism
worked on shared storage, ie a data directory on NFS or similar could be
locked against all comers not just those on the same node as the
original postmaster.  I don't know how to do that though.

In the meantime, insisting that we solve this problem before we do
anything is a good recipe for ensuring that nothing happens, just
like it hasn't happened for the last half dozen years.  (I see Alvaro
just made the same point.)
        regards, tom lane


Re: Posix Shared Mem patch

From
"A.M."
Date:
On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote:
>
> (Emphasis mine).
>
> I don't think that -hackers at the time gave the zero-shmem rationale
> much weight (I also was not that happy about the safety mechanism of
> that patch), but upon more reflection (and taking into account *other*
> software that may mangle shmem settings) I think it's something at
> least worth thinking about again one more time.  What killed the patch
> was an attachment to the deemed-less-safe strategy for avoiding bogus
> shmem attachments already in it, but I don't seem to recall anyone
> putting a whole lot of thought at the time into the zero-shmem case
> from what I could read on the list, because a small interlock with
> nattach seemed good-enough.
>
> I'm simply suggesting that for additional benefits it may be worth
> thinking about getting around nattach and thus SysV shmem, especially
> with regard to safety, in an open-ended way.  Maybe there's a solution
> (like Robert's FIFO suggestion?) that is not too onerous and can
> satisfy everyone.


I solved this via fcntl locking. I also set up gdb to break in critical
regions to test the interlock and I found no flaw in the design. More eyes
would be welcome, of course.
https://github.com/agentm/postgres/tree/posix_shmem

Cheers,
M





Re: Posix Shared Mem patch

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> So let's fix the 80% case with something we feel confident in, and then
> revisit the no-sysv interlock as a separate patch.  That way if we can't
> fix the interlock issues, we still have a reduced-shmem version of Postgres.

Yes.  Insisting that we have the whole change in one patch is a good way
to prevent any forward progress from happening.  As Alvaro noted, there
are plenty of issues to resolve without trying to change the interlock
mechanism at the same time.
        regards, tom lane


Re: Posix Shared Mem patch

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In the meantime, insisting that we solve this problem before we do
> anything is a good recipe for ensuring that nothing happens, just
> like it hasn't happened for the last half dozen years.  (I see
> Alvaro just made the same point.)

And now so has Josh.

+1 from me, too.

-Kevin


Re: Posix Shared Mem patch

From
Tom Lane
Date:
"A.M." <agentm@themactionfaction.com> writes:
> This can be trivially reproduced if one runs an old (SysV shared
> memory-based) postgresql alongside a potentially newer postgresql with a
> smaller SysV segment. This can occur with applications that bundle
> postgresql as part of the app.

I don't believe that that case is a counterexample to what's being
proposed (namely, grabbing a minimum-size shmem segment, perhaps 1K).
It would only fail if the old postmaster ate up *exactly* SHMMAX worth
of shmem, which is not real likely.  As a data point, on my Mac laptop
with SHMMAX set to 32MB, 9.2 will by default eat up 31624KB, leaving
more than a meg available.  Sure, that isn't enough to start another
old-style postmaster, but it would be plenty of room for one that only
wants 1K.
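
As a quick check of the figures above (a worked calculation only, using the 32MB SHMMAX and 31624KB consumption quoted in this message; the variable names are illustrative):

```python
KB_PER_MB = 1024

shmmax_kb = 32 * KB_PER_MB   # SHMMAX of 32MB, expressed in KB
used_kb = 31624              # default 9.2 consumption quoted above
free_kb = shmmax_kb - used_kb

print(free_kb)               # 1144, i.e. a little over 1MB left for a 1K segment
```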

Even if you actively try to configure the shmem settings to exactly
fill shmmax (which I concede some installation scripts might do),
it's going to be hard to do because of the 8K granularity of the main
knob, shared_buffers.  Moreover, an installation script that did that
would soon learn not to, because of the fact that we don't worry too
much about changing small details of shared memory consumption in minor
releases.
        regards, tom lane


Re: Posix Shared Mem patch

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of Tue Jun 26 18:58:45 -0400 2012:

> Even if you actively try to configure the shmem settings to exactly
> fill shmmax (which I concede some installation scripts might do),
> it's going to be hard to do because of the 8K granularity of the main
> knob, shared_buffers.

Actually it's very easy -- just try to start postmaster on a system with
not enough shmmax and it will tell you how much shmem it wants.  Then
copy that number verbatim in the config file.  This might fail on picky
systems such as MacOSX that require some exact multiple or power of some
other parameter, but it works fine on Linux.

I think the minimum you can request, at least on Linux, is 1 byte.

> Moreover, an installation script that did that
> would soon learn not to, because of the fact that we don't worry too
> much about changing small details of shared memory consumption in minor
> releases.

+1

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Posix Shared Mem patch

From
Tom Lane
Date:
"A.M." <agentm@themactionfaction.com> writes:
> On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote:
>> I'm simply suggesting that for additional benefits it may be worth
>> thinking about getting around nattach and thus SysV shmem, especially
>> with regard to safety, in an open-ended way.

> I solved this via fcntl locking.

No, you didn't, because fcntl locks aren't inherited by child processes.
Too bad, because they'd be a great solution otherwise.
        regards, tom lane


Re: Posix Shared Mem patch

From
"A.M."
Date:
On 06/26/2012 07:30 PM, Tom Lane wrote:
> "A.M." <agentm@themactionfaction.com> writes:
>> On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote:
>>> I'm simply suggesting that for additional benefits it may be worth
>>> thinking about getting around nattach and thus SysV shmem, especially
>>> with regard to safety, in an open-ended way.
>
>> I solved this via fcntl locking.
>
> No, you didn't, because fcntl locks aren't inherited by child processes.
> Too bad, because they'd be a great solution otherwise.
>

You claimed this last time and I replied:
http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php

"I address this race condition by ensuring that a lock-holding violator 
is the postmaster or a postmaster child. If such a condition is 
detected, the child exits immediately without touching the shared 
memory. POSIX shmem is inherited via file descriptors."

This is possible because the locking API allows one to request which PID 
violates the lock. The child expects the lock to be held and checks that 
the PID is the parent. If the lock is not held, that means that the 
postmaster is dead, so the child exits immediately.

Cheers,
M


Re: Posix Shared Mem patch

From
"A.M."
Date:
On 06/26/2012 07:15 PM, Alvaro Herrera wrote:
>
> Excerpts from Tom Lane's message of Tue Jun 26 18:58:45 -0400 2012:
>
>> Even if you actively try to configure the shmem settings to exactly
>> fill shmmax (which I concede some installation scripts might do),
>> it's going to be hard to do because of the 8K granularity of the main
>> knob, shared_buffers.
>
> Actually it's very easy -- just try to start postmaster on a system with
> not enough shmmax and it will tell you how much shmem it wants.  Then
> copy that number verbatim in the config file.  This might fail on picky
> systems such as MacOSX that require some exact multiple or power of some
> other parameter, but it works fine on Linux.
>

Except that we have to account for other installers. A user can install 
an application in the future which clobbers the value and then the 
original application will fail to run. The options to get the first app 
working are:

a) to re-install the first app (potentially preventing the second app 
from running)
b) to have the first app detect the failure and readjust the value 
(guessing what it should be) and potentially forcing a reboot
c) to have the user manually adjust the value and potentially force 
a reboot

The failure usually gets blamed on the first application.

That's why we had to nuke SysV shmem.

Cheers,
M




Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Tue, Jun 26, 2012 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> So, what about keeping a FIFO in the data directory?
>
> Hm, does that work if the data directory is on NFS?  Or some other weird
> not-really-Unix file system?

I would expect NFS to work in general.  We could test that.  Of
course, it's more than possible that there's some bizarre device out
there that purports to be NFS but doesn't actually support mkfifo.
It's difficult to prove a negative.

>> When the
>> postmaster starts up, it tries to open the file with O_NONBLOCK |
>> O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather
>> than O_NONBLOCK).  If that succeeds, it bails out.  If it fails with
>> anything other than ENXIO, it bails out.  If it fails with exactly
>> ENXIO, then it opens the pipe with O_RDONLY
>
> ... race condition here ...

Oh, if someone tries to start two postmasters at the same time?  Hmm.

>> and arranges to pass the
>> file descriptor down to all of its children, so that a subsequent open
>> will fail if it or any of its children are still alive.
>
> This might be made to work, but that doesn't sound quite right in
> detail.
>
> I remember we speculated about using an fcntl lock on some file in the
> data directory, but that fails because child processes don't inherit
> fcntl locks.
>
> In the modern world, it'd be really a step forward if the lock mechanism
> worked on shared storage, ie a data directory on NFS or similar could be
> locked against all comers not just those on the same node as the
> original postmaster.  I don't know how to do that though.

Well, I think that in theory that DOES work.  But I also think it's
often misconfigured.  Which could also be said of NFS in general.

> In the meantime, insisting that we solve this problem before we do
> anything is a good recipe for ensuring that nothing happens, just
> like it hasn't happened for the last half dozen years.  (I see Alvaro
> just made the same point.)

Agreed all around.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
Tom Lane
Date:
"A.M." <agentm@themactionfaction.com> writes:
> On 06/26/2012 07:30 PM, Tom Lane wrote:
>>> I solved this via fcntl locking.

>> No, you didn't, because fcntl locks aren't inherited by child processes.
>> Too bad, because they'd be a great solution otherwise.

> You claimed this last time and I replied:
> http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php

> "I address this race condition by ensuring that a lock-holding violator 
> is the postmaster or a postmaster child. If such a condition is 
> detected, the child exits immediately without touching the shared 
> memory. POSIX shmem is inherited via file descriptors."

> This is possible because the locking API allows one to request which PID 
> violates the lock. The child expects the lock to be held and checks that 
> the PID is the parent. If the lock is not held, that means that the 
> postmaster is dead, so the child exits immediately.

OK, I went back and re-read the original patch, and I now agree that
something like this is possible --- but I don't like the way you did
it. The dependence on particular PIDs seems both unnecessary and risky.

The key concept here seems to be that the postmaster first stakes a
claim on the data directory by exclusive-locking a lock file.  If
successful, it reduces that lock to shared mode (which can be done
atomically, according to the SUS fcntl specification), and then holds
the shared lock until it exits.  Spawned children will not initially
have a lock, but what they can do is attempt to acquire shared lock on
the lock file.  If fail, exit.  If successful, *check to see that the
parent postmaster is still alive* (ie, getppid() != 1).  If so, the
parent must have been continuously holding the lock, and the child has
successfully joined the pool of shared lock holders.  Otherwise bail
out without having changed anything.  It is the "parent is still alive"
check, not any test on individual PIDs, that makes this work.
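
A minimal sketch of the scheme just described, in Python for brevity and not taken from any patch. The names postmaster_lock and child_join are hypothetical. One deliberate deviation, which is an assumption added here: the child compares getppid() with the parent PID it saw at fork time instead of testing getppid() != 1, because an orphaned process is not reparented to PID 1 on every system. It is still a "parent is still alive" check and does not consult the PID reported by F_GETLK.

```python
import fcntl
import os


def postmaster_lock(path):
    """Stake a claim on an existing lock file; return the fd holding the lock.

    Raises OSError if the file is missing or another process holds a lock.
    """
    fd = os.open(path, os.O_RDWR)  # no O_CREAT: the file must already exist
    try:
        # Exclusive and non-blocking: fails if any other process holds a lock.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    # Convert the exclusive lock to a shared one and keep it until exit.
    fcntl.lockf(fd, fcntl.LOCK_SH)
    return fd


def child_join(path, postmaster_pid):
    """Run in a forked child: join the shared-lock holders, or return None."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.lockf(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    # Only after holding the lock, check that the parent is still alive.
    if os.getppid() != postmaster_pid:
        os.close(fd)
        return None
    return fd
```

The order matters: the liveness check runs after the shared lock is acquired, so a parent that is still alive at that point must have held its own lock continuously.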

There are two concrete reasons why I don't care for the
GetPIDHoldingLock() way.  Firstly, the fact that you can get a blocking
PID from F_GETLK isn't an essential part of the concept of file locking
IMO --- it's just an incidental part of this particular API.  May I
remind you that the reason we're stuck on SysV shmem in the first place
is that we decided to depend on an incidental part of that API, namely
nattch?  I would like to not require file locking to have any semantics
more specific than "a process can hold an exclusive or a shared lock on
a file, which is auto-released at process exit".  Secondly, in an NFS
world I don't believe that the returned l_pid value can be trusted for
anything.  If it's a PID from a different machine then it might
accidentally conflict with one on our machine, or not.

Reflecting on this further, it seems to me that the main remaining
failure modes are (1) file locking doesn't work, or (2) idiot DBA
manually removes the lock file.  Both of these could be ameliorated
with some refinements to the basic idea.  For (1), I suggest that
we tweak the startup process (only) to attempt to acquire exclusive lock
on the lock file.  If it succeeds, we know that file locking is broken,
and we can complain.  (This wouldn't help for cases where cross-machine
locking is broken, but I see no practical way to detect that.)
For (2), the problem really is that the proposed patch conflates the PID
file with the lock file, but people are conditioned to think that PID
files are removable.  I suggest that we create a separate, permanently
present file that serves only as the lock file and doesn't ever get
modified (it need have no content other than the string "Don't remove
this!").  It'd be created by initdb, not by individual postmaster runs;
indeed the postmaster should fail if it doesn't find the lock file
already present.  The postmaster PID file should still exist with its
current contents, but it would serve mostly as documentation and as
server-contact information for pg_ctl; it would not be part of the data
directory locking mechanism.

I wonder whether this design can be adapted to Windows?  IIRC we do
not have a bulletproof data directory lock scheme for Windows.
It seems like this makes few enough demands on the lock mechanism
that there ought to be suitable primitives available there too.
        regards, tom lane


Re: Posix Shared Mem patch

From
Tom Lane
Date:
I wrote:
> Reflecting on this further, it seems to me that the main remaining
> failure modes are (1) file locking doesn't work, or (2) idiot DBA
> manually removes the lock file.

Oh, wait, I just remembered the really fatal problem here: to quote from
the SUS fcntl spec,
    All locks associated with a file for a given process are removed
    when a file descriptor for that file is closed by that process or
    the process holding that file descriptor terminates.

That carefully says "a file descriptor", not "the file descriptor
through which the lock was acquired".  Any close() referencing the lock
file will do.  That means that it is possible for perfectly innocent
code --- for example, something that scans all files in the data
directory, as say pg_basebackup might do --- to cause a backend process
to lose its lock.  When we looked at this before, it seemed like a
showstopper.  Even if we carefully taught every directory-scanning loop
in postgres not to touch the lock file, we cannot expect that for
instance a pl/perl function wouldn't accidentally break things.  And
99.999% of the time nobody would notice ... it would just be that last
0.001% of people that would be screwed.
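
A minimal sketch of the hazard, assuming traditional per-process fcntl() record locks as the SUS describes them. All function names here are made up for the demonstration. Because a process cannot see its own locks through F_GETLK, the sketch forks a helper to observe the lock from outside.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Build a struct flock that covers the whole file. */
static struct flock whole_file(short type)
{
    struct flock fl = {0};

    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */
    return fl;
}

/*
 * From a forked helper process, ask whether some other process holds a
 * lock that would conflict with a write lock on 'path'.
 * Returns 1 if so, 0 if not, 2 on error.
 */
static int lock_visible_to_others(const char *path)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return 2;
    if (pid == 0) {
        struct flock fl = whole_file(F_WRLCK);
        int fd = open(path, O_RDWR);

        if (fd < 0 || fcntl(fd, F_GETLK, &fl) < 0)
            _exit(2);
        _exit(fl.l_type == F_UNLCK ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return 2;
    return WEXITSTATUS(status);
}

/*
 * Take a shared lock on 'path', then open and close a second descriptor
 * for the same file, much as a directory scanner might.  *before and
 * *after report whether the lock was visible to another process at each
 * point.  Returns 0 on success, -1 on error.
 */
int demo_lock_lost_on_close(const char *path, int *before, int *after)
{
    struct flock fl = whole_file(F_RDLCK);
    int fd, scan_fd;

    assert(before != NULL && after != NULL);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0 || fcntl(fd, F_SETLK, &fl) < 0)
        return -1;
    *before = lock_visible_to_others(path);

    scan_fd = open(path, O_RDONLY);
    if (scan_fd < 0)
        return -1;
    close(scan_fd);             /* per the SUS text, this drops our lock */

    *after = lock_visible_to_others(path);
    close(fd);
    unlink(path);
    return 0;
}
```

On a platform that follows the quoted wording, *before should come back as 1 and *after as 0, even though the descriptor that acquired the lock was never closed.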

Still, this discussion has yielded a useful advance, which is that we
now see how we might safely make use of lock mechanisms that don't
inherit across fork().  We just need something less broken than fcntl().
        regards, tom lane


Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Tue, Jun 26, 2012 at 6:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> So let's fix the 80% case with something we feel confident in, and then
>> revisit the no-sysv interlock as a separate patch.  That way if we can't
>> fix the interlock issues, we still have a reduced-shmem version of Postgres.
>
> Yes.  Insisting that we have the whole change in one patch is a good way
> to prevent any forward progress from happening.  As Alvaro noted, there
> are plenty of issues to resolve without trying to change the interlock
> mechanism at the same time.

So, here's a patch.  Instead of using POSIX shmem, I just took the
expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
memory.  The sysv shm is still allocated, but it's just a copy of
PGShmemHeader; the "real" shared memory is the anonymous block.  This
won't work if EXEC_BACKEND is defined so it just falls back on
straight sysv shm in that case.

There are obviously some portability issues here - this is documented
not to work on Linux <= 2.4, but it's not clear whether it fails with
some suitable error code or just pretends to work and does the wrong
thing.  I tested that it does compile and work on both Linux 3.2.6 and
MacOS X 10.6.8.  And the comments probably need work and... who knows
what else is wrong.  But, thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Posix Shared Mem patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> So, here's a patch.  Instead of using POSIX shmem, I just took the
> expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
> memory.  The sysv shm is still allocated, but it's just a copy of
> PGShmemHeader; the "real" shared memory is the anonymous block.  This
> won't work if EXEC_BACKEND is defined so it just falls back on
> straight sysv shm in that case.

Um.  I hadn't thought about the EXEC_BACKEND interaction, but that seems
like a bit of a showstopper.  I would not like to give up the ability
to debug EXEC_BACKEND mode on Unixen.

Would Posix shmem help with that at all?  Why did you choose not to
use the Posix API, anyway?
        regards, tom lane


Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> So, here's a patch.  Instead of using POSIX shmem, I just took the
>> expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
>> memory.  The sysv shm is still allocated, but it's just a copy of
>> PGShmemHeader; the "real" shared memory is the anonymous block.  This
>> won't work if EXEC_BACKEND is defined so it just falls back on
>> straight sysv shm in that case.
>
> Um.  I hadn't thought about the EXEC_BACKEND interaction, but that seems
> like a bit of a showstopper.  I would not like to give up the ability
> to debug EXEC_BACKEND mode on Unixen.
>
> Would Posix shmem help with that at all?  Why did you choose not to
> use the Posix API, anyway?

It seemed more complicated.  If we use the POSIX API, we've got to
have code to find a non-colliding name for the shm, and we've got to
arrange to clean it up at process exit.  Anonymous shm doesn't require
a name and goes away automatically when it's no longer in use.

With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to
make it continue to use a full-sized sysv shm.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
Magnus Hagander
Date:
On Wed, Jun 27, 2012 at 3:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "A.M." <agentm@themactionfaction.com> writes:
>> On 06/26/2012 07:30 PM, Tom Lane wrote:
>>>> I solved this via fcntl locking.
>
>>> No, you didn't, because fcntl locks aren't inherited by child processes.
>>> Too bad, because they'd be a great solution otherwise.
>
>> You claimed this last time and I replied:
>> http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php
>
>> "I address this race condition by ensuring that a lock-holding violator
>> is the postmaster or a postmaster child. If such a condition is
>> detected, the child exits immediately without touching the shared
>> memory. POSIX shmem is inherited via file descriptors."
>
>> This is possible because the locking API allows one to request which PID
>> violates the lock. The child expects the lock to be held and checks that
>> the PID is the parent. If the lock is not held, that means that the
>> postmaster is dead, so the child exits immediately.
>
> OK, I went back and re-read the original patch, and I now agree that
> something like this is possible --- but I don't like the way you did
> it. The dependence on particular PIDs seems both unnecessary and risky.
>
> The key concept here seems to be that the postmaster first stakes a
> claim on the data directory by exclusive-locking a lock file.  If
> successful, it reduces that lock to shared mode (which can be done
> atomically, according to the SUS fcntl specification), and then holds
> the shared lock until it exits.  Spawned children will not initially
> have a lock, but what they can do is attempt to acquire shared lock on
> the lock file.  If fail, exit.  If successful, *check to see that the
> parent postmaster is still alive* (ie, getppid() != 1).  If so, the
> parent must have been continuously holding the lock, and the child has
> successfully joined the pool of shared lock holders.  Otherwise bail
> out without having changed anything.  It is the "parent is still alive"
> check, not any test on individual PIDs, that makes this work.
>
> There are two concrete reasons why I don't care for the
> GetPIDHoldingLock() way.  Firstly, the fact that you can get a blocking
> PID from F_GETLK isn't an essential part of the concept of file locking
> IMO --- it's just an incidental part of this particular API.  May I
> remind you that the reason we're stuck on SysV shmem in the first place
> is that we decided to depend on an incidental part of that API, namely
> nattch?  I would like to not require file locking to have any semantics
> more specific than "a process can hold an exclusive or a shared lock on
> a file, which is auto-released at process exit".  Secondly, in an NFS
> world I don't believe that the returned l_pid value can be trusted for
> anything.  If it's a PID from a different machine then it might
> accidentally conflict with one on our machine, or not.
>
> Reflecting on this further, it seems to me that the main remaining
> failure modes are (1) file locking doesn't work, or (2) idiot DBA
> manually removes the lock file.  Both of these could be ameliorated
> with some refinements to the basic idea.  For (1), I suggest that
> we tweak the startup process (only) to attempt to acquire exclusive lock
> on the lock file.  If it succeeds, we know that file locking is broken,
> and we can complain.  (This wouldn't help for cases where cross-machine
> locking is broken, but I see no practical way to detect that.)
> For (2), the problem really is that the proposed patch conflates the PID
> file with the lock file, but people are conditioned to think that PID
> files are removable.  I suggest that we create a separate, permanently
> present file that serves only as the lock file and doesn't ever get
> modified (it need have no content other than the string "Don't remove
> this!").  It'd be created by initdb, not by individual postmaster runs;
> indeed the postmaster should fail if it doesn't find the lock file
> already present.  The postmaster PID file should still exist with its
> current contents, but it would serve mostly as documentation and as
> server-contact information for pg_ctl; it would not be part of the data
> directory locking mechanism.
>
> I wonder whether this design can be adapted to Windows?  IIRC we do
> not have a bulletproof data directory lock scheme for Windows.
> It seems like this makes few enough demands on the lock mechanism
> that there ought to be suitable primitives available there too.

I assume you're saying we need to make changes in the internal API,
right? Because we already have a Windows native implementation of
shared memory that AFAIK works, so if the new Unix stuff can be done
with the same internal APIs, it shouldn't need to be changed. (Sorry,
haven't followed the thread in detail)

If so - can we define exactly what properties it is we *need*?

(A native API worth looking at is e.g.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa365203(v=vs.85).aspx
- but there are probably others as well if that one doesn't do)

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Posix Shared Mem patch

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Wed, Jun 27, 2012 at 3:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder whether this design can be adapted to Windows?  IIRC we do
>> not have a bulletproof data directory lock scheme for Windows.
>> It seems like this makes few enough demands on the lock mechanism
>> that there ought to be suitable primitives available there too.

> I assume you're saying we need to make changes in the internal API,
> right? Because we already have a Windows native implementation of
> shared memory that AFAIK works,

Right, but does it provide honest protection against starting two
postmasters in the same data directory?  Or more to the point,
does it prevent starting a new postmaster when the old postmaster
crashed but there are still orphaned backends making changes?
AFAIR we basically punted on those problems for the Windows port,
for lack of an equivalent to nattch.
        regards, tom lane


Re: Posix Shared Mem patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Would Posix shmem help with that at all?  Why did you choose not to
>> use the Posix API, anyway?

> It seemed more complicated.  If we use the POSIX API, we've got to
> have code to find a non-colliding name for the shm, and we've got to
> arrange to clean it up at process exit.  Anonymous shm doesn't require
> a name and goes away automatically when it's no longer in use.

I see.  Those are pretty good reasons ...

> With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to
> make it continue to use a full-sized sysv shm.

Well, if the ultimate objective is to get out from under the SysV APIs
entirely, we're not going to get there if we still have to have all that
code for the EXEC_BACKEND case.  Maybe it's time to decide that we don't
need to support EXEC_BACKEND on Unix.
        regards, tom lane


Re: Posix Shared Mem patch

From
Stephen Frost
Date:
All,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Would Posix shmem help with that at all?  Why did you choose not to
> >> use the Posix API, anyway?
>
> > It seemed more complicated.  If we use the POSIX API, we've got to
> > have code to find a non-colliding name for the shm, and we've got to
> > arrange to clean it up at process exit.  Anonymous shm doesn't require
> > a name and goes away automatically when it's no longer in use.
>
> I see.  Those are pretty good reasons ...

After talking to Magnus a bit this morning regarding this, it sounds
like what we're doing on Windows is closer to Anonymous shm, except that
they use an intentionally specific name, which also allows them to
detect if any children are still alive by using a "create-if-not-exists"
approach on the shm segment and failing if it still exists.  There were
some corner cases around restarts due to it taking a few seconds for the
Windows kernel to pick up on the fact that all the children are dead and
that the shm segment should go away, but they were able to work around
that, and failure to start is surely much better than possible
corruption.

What this all boils down to is- can you have a shm segment that goes
away when no one is still attached to it, but actually give it a name
and then detect if it already exists atomically on startup on
Linux/Unixes?  If so, perhaps we could use the same mechanism on both..
Thanks,
    Stephen

Re: Posix Shared Mem patch

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Right, but does it provide honest protection against starting two
> postmasters in the same data directory?  Or more to the point,
> does it prevent starting a new postmaster when the old postmaster
> crashed but there are still orphaned backends making changes?
> AFAIR we basically punted on those problems for the Windows port,
> for lack of an equivalent to nattch.

See my other mail, but, after talking to Magnus, it's my understanding
that we had that problem initially, but it was later solved by using a
named shared memory segment which the kernel will clean up when all
children are gone.  That, combined with a 'create-if-not-exists' call,
allows detection of lost children to be done.
Thanks,
    Stephen

Re: Posix Shared Mem patch

From
Magnus Hagander
Date:
On Wed, Jun 27, 2012 at 3:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Wed, Jun 27, 2012 at 3:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I wonder whether this design can be adapted to Windows?  IIRC we do
>>> not have a bulletproof data directory lock scheme for Windows.
>>> It seems like this makes few enough demands on the lock mechanism
>>> that there ought to be suitable primitives available there too.
>
>> I assume you're saying we need to make changes in the internal API,
>> right? Because we already have a Windows native implementation of
>> shared memory that AFAIK works,
>
> Right, but does it provide honest protection against starting two
> postmasters in the same data directory?  Or more to the point,
> does it prevent starting a new postmaster when the old postmaster
> crashed but there are still orphaned backends making changes?
> AFAIR we basically punted on those problems for the Windows port,
> for lack of an equivalent to nattch.

No, we spent a lot of time trying to *fix* it, and IIRC we did.

We create a shared memory segment with a fixed name based on the data
directory. This shared memory segment is inherited by all children. It
will automatically go away only when all processes that have an open
handle to it go away (in fact, it can even take a second or two more,
if they go away by crash and not by cleanup - we have a workaround in
the code for that). But as long as there is an orphaned backend
around, the shared memory segment stays around.

We don't have "nattch". But we do have "nattch>0". Or something like that.

You can work around it if you find two different paths to the same
data directory (e.g. using junctions), but you are really actively
trying to break the system if you do that...


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Posix Shared Mem patch

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Wed, Jun 27, 2012 at 3:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> AFAIR we basically punted on those problems for the Windows port,
>> for lack of an equivalent to nattch.

> No, we spent a lot of time trying to *fix* it, and IIRC we did.

OK, in that case this isn't as interesting as I thought.

If we do go over to a file-locking-based solution on Unix, it might be
worthwhile changing to something similar on Windows.  But it would be
more about reducing coding differences between the platforms than
plugging any real holes.
        regards, tom lane


Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Would Posix shmem help with that at all?  Why did you choose not to
>>> use the Posix API, anyway?
>
>> It seemed more complicated.  If we use the POSIX API, we've got to
>> have code to find a non-colliding name for the shm, and we've got to
>> arrange to clean it up at process exit.  Anonymous shm doesn't require
>> a name and goes away automatically when it's no longer in use.
>
> I see.  Those are pretty good reasons ...
>
>> With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to
>> make it continue to use a full-sized sysv shm.
>
> Well, if the ultimate objective is to get out from under the SysV APIs
> entirely, we're not going to get there if we still have to have all that
> code for the EXEC_BACKEND case.  Maybe it's time to decide that we don't
> need to support EXEC_BACKEND on Unix.

I don't personally see a need to do anything that drastic at this
point.  Admittedly, I rarely compile with EXEC_BACKEND, but I don't
think it's bad to have the option available.  Adjusting shared memory
limits isn't really a big problem for PostgreSQL developers; what
we're trying to avoid is the need for PostgreSQL *users* to concern
themselves with it.  And surely anyone who is using EXEC_BACKEND on
Unix is a developer, not a user.

If and when we come up with a substitute for the nattch interlock,
then this might be worth thinking a bit harder about.  At that point,
if we still want to support EXEC_BACKEND on Unix, then we'd need the
EXEC_BACKEND case at least to use POSIX shm rather than anonymous
shared mmap.  Personally I think that would be not that hard and
probably worth doing, but there doesn't seem to be any point in
writing that code now, because for the simple case of just reducing
the amount of shm that we allocate, an anonymous mapping seems better
all around.

We shouldn't overthink this.  Our shared memory code has allocated a
bunch of crufty hacks over the years to work around various
platform-specific issues, but it's still not a lot of code, so I don't
see any reason to worry unduly about making a surgical fix without
having a master plan.  Nothing we want to do down the road will
require moving the earth.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Wed, Jun 27, 2012 at 9:52 AM, Stephen Frost <sfrost@snowman.net> wrote:
> What this all boils down to is- can you have a shm segment that goes
> away when no one is still attached to it, but actually give it a name
> and then detect if it already exists atomically on startup on
> Linux/Unixes?  If so, perhaps we could use the same mechanism on both..

As I understand it, no.  You can either have anonymous shared
mappings, which go away when no longer in use but do not have a name.
Or you can have POSIX or sysv shm, which have a name but do not
automatically go away when no longer in use.  There seems to be no
method for setting up a segment that both has a name and goes away
automatically.  POSIX shm in particular tries to "look like a file",
whereas anonymous memory tries to look more like malloc (except that
you can share the mapping with child processes).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
"A.M."
Date:
On Jun 27, 2012, at 7:34 AM, Robert Haas wrote:

> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> So, here's a patch.  Instead of using POSIX shmem, I just took the
>>> expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
>>> memory.  The sysv shm is still allocated, but it's just a copy of
>>> PGShmemHeader; the "real" shared memory is the anonymous block.  This
>>> won't work if EXEC_BACKEND is defined so it just falls back on
>>> straight sysv shm in that case.
>>
>> Um.  I hadn't thought about the EXEC_BACKEND interaction, but that seems
>> like a bit of a showstopper.  I would not like to give up the ability
>> to debug EXEC_BACKEND mode on Unixen.
>>
>> Would Posix shmem help with that at all?  Why did you choose not to
>> use the Posix API, anyway?
>
> It seemed more complicated.  If we use the POSIX API, we've got to
> have code to find a non-colliding name for the shm, and we've got to
> arrange to clean it up at process exit.  Anonymous shm doesn't require
> a name and goes away automatically when it's no longer in use.
>
> With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to
> make it continue to use a full-sized sysv shm.
>

I solved this by unlinking the POSIX shared memory segment immediately
after creation. The file descriptor to the shared memory is inherited,
so, by definition, only the postmaster children can access the memory.
This ensures that shared memory cleanup is immediate after the
postmaster and all children close, as well. The fcntl locking is not
required to protect the POSIX shared memory; it can protect itself.
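
A minimal sketch of that create-then-unlink idea, assuming the standard shm_open()/shm_unlink() calls. The function name is hypothetical, and on some older systems the program may need to be linked with -lrt.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a POSIX shared memory object, remove its name right away, and
 * size it.  After this returns, the object can be reached only through
 * the returned descriptor (or descriptors inherited from it), and it is
 * destroyed once every descriptor and mapping for it is gone.
 * Returns the descriptor, or -1 on failure.
 */
int create_unlinked_posix_shm(const char *name, size_t size)
{
    int fd;

    assert(name != NULL && size > 0);

    /* O_EXCL makes the create fail if the name is already in use. */
    fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    if (shm_unlink(name) < 0 || ftruncate(fd, (off_t) size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Children forked afterwards can mmap() the inherited descriptor with MAP_SHARED. The name only has to be unique for the brief window between shm_open() and shm_unlink().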

Cheers,
M





Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Would Posix shmem help with that at all?  Why did you choose not to
>>> use the Posix API, anyway?
>
>> It seemed more complicated.  If we use the POSIX API, we've got to
>> have code to find a non-colliding name for the shm, and we've got to
>> arrange to clean it up at process exit.  Anonymous shm doesn't require
>> a name and goes away automatically when it's no longer in use.
>
> I see.  Those are pretty good reasons ...

So, should we do it this way?

I did a little research and discovered that Linux 2.3.51 (released
3/11/2000) apparently returns EINVAL for MAP_SHARED|MAP_ANONYMOUS.
That combination is documented to work beginning in Linux 2.4.0.  How
worried should we be about people trying to run PostgreSQL 9.3 on
pre-2.4 kernels?  If we want to worry about it, we could try mapping a
one-page shared MAP_SHARED|MAP_ANONYMOUS segment first.  If that
works, we could assume that we have a working MAP_SHARED|MAP_ANONYMOUS
facility and try to allocate the whole segment plus a minimal sysv
shm.  If the single page allocation fails with EINVAL, we could fall
back to allocating the entire segment as sysv shm.

A related question is - if we do this - should we enable it only on
ports where we've verified that it works, or should we just turn it on
everywhere and fix breakage if/when it's reported?  I lean toward the
latter.

If we find that there are platforms where (a) mmap is not supported or
(b) MAP_SHARED|MAP_ANON works but has the wrong semantics, we could
either shut off this optimization on those platforms by fiat, or we
could test not only that the call succeeds, but that it works
properly: create a one-page mapping and fork a child process; in the
child, write to the mapping and exit; in the parent, wait for the
child to exit and then test that we can read back the correct
contents.  This would protect against a hypothetical system where the
flags are accepted but fail to produce the correct behavior.  I'm
inclined to think this is over-engineering in the absence of evidence
that there are platforms that work this way.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
Magnus Hagander
Date:
On Thu, Jun 28, 2012 at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> Would Posix shmem help with that at all?  Why did you choose not to
>>>> use the Posix API, anyway?
>>
>>> It seemed more complicated.  If we use the POSIX API, we've got to
>>> have code to find a non-colliding name for the shm, and we've got to
>>> arrange to clean it up at process exit.  Anonymous shm doesn't require
>>> a name and goes away automatically when it's no longer in use.
>>
>> I see.  Those are pretty good reasons ...
>
> So, should we do it this way?
>
> I did a little research and discovered that Linux 2.3.51 (released
> 3/11/2000) apparently returns EINVAL for MAP_SHARED|MAP_ANONYMOUS.
> That combination is documented to work beginning in Linux 2.4.0.  How
> worried should we be about people trying to run PostgreSQL 9.3 on
> pre-2.4 kernels?  If we want to worry about it, we could try mapping a
> one-page shared MAP_SHARED|MAP_ANONYMOUS segment first.  If that
> works, we could assume that we have a working MAP_SHARED|MAP_ANONYMOUS
> facility and try to allocate the whole segment plus a minimal sysv
> shm.  If the single page allocation fails with EINVAL, we could fall
> back to allocating the entire segment as sysv shm.

Do we really need a runtime check for that? Isn't a configure check
enough? If they *do* deploy postgresql 9.3 on something that old,
they're building from source anyway...


> A related question is - if we do this - should we enable it only on
> ports where we've verified that it works, or should we just turn it on
> everywhere and fix breakage if/when it's reported?  I lean toward the
> latter.

Depends on the amount of expected breakage, but I'd lean towards the
latter as well.


> If we find that there are platforms where (a) mmap is not supported or
> (b) MAP_SHARED|MAP_ANON works but has the wrong semantics, we could
> either shut off this optimization on those platforms by fiat, or we
> could test not only that the call succeeds, but that it works
> properly: create a one-page mapping and fork a child process; in the
> child, write to the mapping and exit; in the parent, wait for the
> child to exit and then test that we can read back the correct
> contents.  This would protect against a hypothetical system where the
> flags are accepted but fail to produce the correct behavior.  I'm
> inclined to think this is over-engineering in the absence of evidence
> that there are platforms that work this way.

Could we actually turn *that* into a configure test, or will that be
too complex?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Thu, Jun 28, 2012 at 7:05 AM, Magnus Hagander <magnus@hagander.net> wrote:
> Do we really need a runtime check for that? Isn't a configure check
> enough? If they *do* deploy postgresql 9.3 on something that old,
> they're building from source anyway...
[...]
>
> Could we actually turn *that* into a configure test, or will that be
> too complex?

I don't see why we *couldn't* make either of those things into a
configure test, but it seems more complicated than a runtime test and
less accurate, so I guess I'd be in favor of doing them at runtime or
not at all.

Actually, the try-a-one-page-mapping-and-see-if-you-get-EINVAL test is
so simple that I really can't see any reason not to insert that
defense.  The fork-and-check-whether-it-really-works test is probably
excess paranoia until we determine whether that's really a danger
anywhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
Jon Nelson
Date:
On Thu, Jun 28, 2012 at 6:05 AM, Magnus Hagander <magnus@hagander.net> wrote:
> On Thu, Jun 28, 2012 at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Robert Haas <robertmhaas@gmail.com> writes:
>>>> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>>> Would Posix shmem help with that at all?  Why did you choose not to
>>>>> use the Posix API, anyway?
>>>
>>>> It seemed more complicated.  If we use the POSIX API, we've got to
>>>> have code to find a non-colliding name for the shm, and we've got to
>>>> arrange to clean it up at process exit.  Anonymous shm doesn't require
>>>> a name and goes away automatically when it's no longer in use.
>>>
>>> I see.  Those are pretty good reasons ...
>>
>> So, should we do it this way?
>>
>> I did a little research and discovered that Linux 2.3.51 (released
>> 3/11/2000) apparently returns EINVAL for MAP_SHARED|MAP_ANONYMOUS.
>> That combination is documented to work beginning in Linux 2.4.0.  How
>> worried should we be about people trying to run PostgreSQL 9.3 on
>> pre-2.4 kernels?  If we want to worry about it, we could try mapping a
>> one-page shared MAP_SHARED|MAP_ANONYMOUS segment first.  If that
>> works, we could assume that we have a working MAP_SHARED|MAP_ANONYMOUS
>> facility and try to allocate the whole segment plus a minimal sysv
>> shm.  If the single page allocation fails with EINVAL, we could fall
>> back to allocating the entire segment as sysv shm.

Why not just mmap /dev/zero (MAP_SHARED but not MAP_ANONYMOUS)?  I
seem to think that's what I did when I needed this functionality oh so
many moons ago.

--
Jon


Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Thu, Jun 28, 2012 at 9:47 AM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> Why not just mmap /dev/zero (MAP_SHARED but not MAP_ANONYMOUS)?  I
> seem to think that's what I did when I needed this functionality oh so
> many moons ago.

From the reading I've done on this topic, that seems to be a trick
invented on Solaris that is considered grotty and awful by everyone
else.  The thing is that you want the mapping to be shared with the
processes that inherit the mapping from you.  You do *NOT* want the
mapping to be shared with EVERYONE who has mapped that file for any
reason, which is the usual meaning of MAP_SHARED on a file.  Maybe
this happens to work correctly on some or all platforms, but I would
want to have some convincing evidence that it's more widely supported
(with the correct semantics) than MAP_ANON before relying on it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Thu, Jun 28, 2012 at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> A related question is - if we do this - should we enable it only on
>> ports where we've verified that it works, or should we just turn it on
>> everywhere and fix breakage if/when it's reported?  I lean toward the
>> latter.

> Depends on the amount of expected breakage, but I'd lean towards the
> latter as well.

If we don't turn it on, we won't find out whether it works.  I'd say try
it first and then back off if that proves necessary.  I'd just as soon
not see us write any fallback logic without evidence that it's needed.

FWIW, even my pet dinosaur HP-UX 10.20 box appears to support
mmap(MAP_SHARED|MAP_ANONYMOUS) --- at least the mmap man page documents
both flags.  I find it really pretty hard to believe that there are any
machines out there that haven't got this and yet might be expected to
run PG 9.3+.  We should not go into it with an expectation of failure,
anyway.
        regards, tom lane


Re: Posix Shared Mem patch

From
Jon Nelson
Date:
On Thu, Jun 28, 2012 at 8:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 28, 2012 at 9:47 AM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>> Why not just mmap /dev/zero (MAP_SHARED but not MAP_ANONYMOUS)?  I
>> seem to think that's what I did when I needed this functionality oh so
>> many moons ago.
>
> From the reading I've done on this topic, that seems to be a trick
> invented on Solaris that is considered grotty and awful by everyone
> else.  The thing is that you want the mapping to be shared with the
> processes that inherit the mapping from you.  You do *NOT* want the
> mapping to be shared with EVERYONE who has mapped that file for any
> reason, which is the usual meaning of MAP_SHARED on a file.  Maybe
> this happens to work correctly on some or all platforms, but I would
> want to have some convincing evidence that it's more widely supported
> (with the correct semantics) than MAP_ANON before relying on it.

When I did this (I admit, it was on Linux but it was a long time ago)
only the inherited file descriptor + mmap structure mattered -
modifications were private to the process and its children - other
apps always saw their "own" /dev/zero. A quick google suggests that -
according to qnx, sco, and some others - mmap'ing /dev/zero retains
the expected privacy. Given how /dev/zero works I'd be very surprised
if it was otherwise.

I would love to see links that suggest that /dev/zero is nasty (or, in
fact, in any way fundamentally different than MAP_ANON) -
feel free to send them to me privately to avoid polluting the list.

--
Jon


Re: Posix Shared Mem patch

From
Tom Lane
Date:
... btw, I rather imagine that Robert has already noticed this, but OS X
(and presumably other BSDen) spells the flag "MAP_ANON" not
"MAP_ANONYMOUS".  I also find this rather interesting flag there:
    MAP_HASSEMAPHORE  Notify the kernel that the region may contain sema-
                      phores and that special handling may be necessary.

By "semaphore" I suspect they mean "spinlock", so we'd better turn this
flag on where it exists.
        regards, tom lane


Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Thu, Jun 28, 2012 at 10:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ... btw, I rather imagine that Robert has already noticed this, but OS X
> (and presumably other BSDen) spells the flag "MAP_ANON" not
> "MAP_ANONYMOUS".  I also find this rather interesting flag there:
>
>     MAP_HASSEMAPHORE  Notify the kernel that the region may contain sema-
>                       phores and that special handling may be necessary.
>
> By "semaphore" I suspect they mean "spinlock", so we'd better turn this
> flag on where it exists.

Sounds fine to me.  Since no one seems opposed to the basic approach,
and everyone (I assume) will be happier to reduce the impact of
dealing with shared memory limits, I went ahead and committed a
cleaned-up version of the previous patch.  Let's see what the
build-farm thinks.

Assuming things go well, there are a number of follow-on things that
we need to do to finish this up:

1. Update the documentation.  I skipped this for now, because I think
that what we write there is going to be heavily dependent on how
portable this turns out to be, which we don't know yet.  Also, it's
not exactly clear to me what the documentation should say if this does
turn out to work everywhere.  Much of section 17.4 will become
irrelevant to most users, but I doubt we'd just want to remove it; it
could still matter for people running EXEC_BACKEND or running many
postmasters on the same machine or, of course, people running on
platforms where this just doesn't work, if there are any.

2. Update the HINT messages when shared memory allocation fails.
Maybe the new most-common-failure mode there will be too many
postmasters running on the same machine?  We might need to wait for
some field reports before adjusting this.

3. Consider adjusting the logic inside initdb.  If this works
everywhere, the code for determining how to set shared_buffers should
become pretty much irrelevant.  Even if it only works some places, we
could add 64MB or 128MB or whatever to the list of values we probe, so
that people won't get quite such a sucky configuration out of the box.
Of course there's no number here that will be good for everyone.

and of course

4. Fix any platforms that are now horribly broken.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
Thom Brown
Date:
On 28 June 2012 16:26, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 28, 2012 at 10:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... btw, I rather imagine that Robert has already noticed this, but OS X
>> (and presumably other BSDen) spells the flag "MAP_ANON" not
>> "MAP_ANONYMOUS".  I also find this rather interesting flag there:
>>
>>     MAP_HASSEMAPHORE  Notify the kernel that the region may contain sema-
>>                       phores and that special handling may be necessary.
>>
>> By "semaphore" I suspect they mean "spinlock", so we'd better turn this
>> flag on where it exists.
>
> Sounds fine to me.  Since no one seems opposed to the basic approach,
> and everyone (I assume) will be happier to reduce the impact of
> dealing with shared memory limits, I went ahead and committed a
> cleaned-up version of the previous patch.  Let's see what the
> build-farm thinks.
>
> Assuming things go well, there are a number of follow-on things that
> we need to do to finish this up:
>
> 1. Update the documentation.  I skipped this for now, because I think
> that what we write there is going to be heavily dependent on how
> portable this turns out to be, which we don't know yet.  Also, it's
> not exactly clear to me what the documentation should say if this does
> turn out to work everywhere.  Much of section 17.4 will become
> irrelevant to most users, but I doubt we'd just want to remove it; it
> could still matter for people running EXEC_BACKEND or running many
> postmasters on the same machine or, of course, people running on
> platforms where this just doesn't work, if there are any.
>
> 2. Update the HINT messages when shared memory allocation fails.
> Maybe the new most-common-failure mode there will be too many
> postmasters running on the same machine?  We might need to wait for
> some field reports before adjusting this.
>
> 3. Consider adjusting the logic inside initdb.  If this works
> everywhere, the code for determining how to set shared_buffers should
> become pretty much irrelevant.  Even if it only works some places, we
> could add 64MB or 128MB or whatever to the list of values we probe, so
> that people won't get quite such a sucky configuration out of the box.
>  Of course there's no number here that will be good for everyone.
>
> and of course
>
> 4. Fix any platforms that are now horribly broken.

On 64-bit Linux, if I allocate more shared buffers than the system is
capable of reserving, it doesn't start.  This is expected, but there's
no error logged anywhere (actually, nothing logged at all), and the
postmaster.pid file is left behind after this failure.

--
Thom


Re: Posix Shared Mem patch

From
Jeff Janes
Date:
On Thu, Jun 28, 2012 at 8:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> 3. Consider adjusting the logic inside initdb.  If this works
> everywhere, the code for determining how to set shared_buffers should
> become pretty much irrelevant.  Even if it only works some places, we
> could add 64MB or 128MB or whatever to the list of values we probe, so
> that people won't get quite such a sucky configuration out of the box.
>  Of course there's no number here that will be good for everyone.

This seems independent of the type of shared memory used and the
limits on it.  If it tried 64MB or 128MB and discovered that it
couldn't obtain that much shared memory, it automatically climbs down
to smaller values until it finds one that works.  I think the
impediment to adopting larger defaults is not what happens if it can't
get that much shared memory, but rather what happens if the machine
doesn't have that much physical memory.  The test server will still
start (and so there will be no climb-down), leaving a default which is
valid but just has horrid performance.
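
To make the climb-down concrete, here is a rough sketch of that kind of probing loop. It is not initdb's actual code: choose_shared_buffers_mb, probe_fn and probe_with_32mb_limit are all made-up names, and the probe is a stand-in for "start a test server with this setting".

```c
#include <stddef.h>

/* Hypothetical probe: returns nonzero if a test start with `mb` succeeds. */
typedef int (*probe_fn) (size_t mb);

/*
 * Try candidate sizes from largest to smallest and return the first one the
 * probe accepts, or 0 if none works.  `candidates` is assumed to be sorted
 * in descending order.
 */
size_t
choose_shared_buffers_mb(const size_t *candidates, size_t n, probe_fn probe)
{
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (probe(candidates[i]))
            return candidates[i];
    }
    return 0;
}

/* Stand-in probe that pretends the kernel refuses anything above 32MB. */
int
probe_with_32mb_limit(size_t mb)
{
    return mb <= 32;
}
```

With candidates {128, 64, 32, 16} and the stand-in probe this settles on 32.  The loop only notices a failed allocation; a probe that succeeds on a box with too little physical RAM is accepted all the same, which is the concern raised above.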

Cheers,

Jeff


Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Thu, Jun 28, 2012 at 12:13 PM, Thom Brown <thom@linux.com> wrote:
> On 64-bit Linux, if I allocate more shared buffers than the system is
> capable of reserving, it doesn't start.  This is expected, but there's
> no error logged anywhere (actually, nothing logged at all), and the
> postmaster.pid file is left behind after this failure.

Fixed.

However, I discovered something unpleasant.  With the new code, on
MacOS X, if you set shared_buffers to say 3200GB, the server happily
starts up.  Or at least the shared memory allocation goes through just
fine.  The postmaster then sits there apparently forever without
emitting any log messages, which I eventually discovered was because
it's busy initializing a billion or so spinlocks.

I'm pretty sure that this machine does not have >3TB of virtual
memory, even counting swap.  So that means that MacOS X has absolutely
no common sense whatsoever as far as anonymous shared memory
allocations go.  Not sure exactly what to do about that.  Linux is
more sensible, at least on the system I tested, and fails cleanly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
Magnus Hagander
Date:
On Thu, Jun 28, 2012 at 7:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 28, 2012 at 12:13 PM, Thom Brown <thom@linux.com> wrote:
>> On 64-bit Linux, if I allocate more shared buffers than the system is
>> capable of reserving, it doesn't start.  This is expected, but there's
>> no error logged anywhere (actually, nothing logged at all), and the
>> postmaster.pid file is left behind after this failure.
>
> Fixed.
>
> However, I discovered something unpleasant.  With the new code, on
> MacOS X, if you set shared_buffers to say 3200GB, the server happily
> starts up.  Or at least the shared memory allocation goes through just
> fine.  The postmaster then sits there apparently forever without
> emitting any log messages, which I eventually discovered was because
> it's busy initializing a billion or so spinlocks.
>
> I'm pretty sure that this machine does not have >3TB of virtual
> memory, even counting swap.  So that means that MacOS X has absolutely
> no common sense whatsoever as far as anonymous shared memory
> allocations go.  Not sure exactly what to do about that.  Linux is
> more sensible, at least on the system I tested, and fails cleanly.

What happens if you mlock() it into memory - does that fail quickly?
Is that not something we might want to do *anyway*?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Posix Shared Mem patch

From
Andres Freund
Date:
On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote:
> On Thu, Jun 28, 2012 at 7:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Jun 28, 2012 at 12:13 PM, Thom Brown <thom@linux.com> wrote:
> >> On 64-bit Linux, if I allocate more shared buffers than the system is
> >> capable of reserving, it doesn't start.  This is expected, but there's
> >> no error logged anywhere (actually, nothing logged at all), and the
> >> postmaster.pid file is left behind after this failure.
> > 
> > Fixed.
> > 
> > However, I discovered something unpleasant.  With the new code, on
> > MacOS X, if you set shared_buffers to say 3200GB, the server happily
> > starts up.  Or at least the shared memory allocation goes through just
> > fine.  The postmaster then sits there apparently forever without
> > emitting any log messages, which I eventually discovered was because
> > it's busy initializing a billion or so spinlocks.
> > 
> > I'm pretty sure that this machine does not have >3TB of virtual
> > memory, even counting swap.  So that means that MacOS X has absolutely
> > no common sense whatsoever as far as anonymous shared memory
> > allocations go.  Not sure exactly what to do about that.  Linux is
> > more sensible, at least on the system I tested, and fails cleanly.
> 
> What happens if you mlock() it into memory - does that fail quickly?
> Is that not something we might want to do *anyway*?
You normally can only mlock() minor amounts of memory without changing
settings. Requiring users to change that setting (aside from mlocking being a bad
idea imo) would run contrary to the point of the patch, wouldn't it? ;)

Andres
--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: Posix Shared Mem patch

From
Magnus Hagander
Date:
On Thu, Jun 28, 2012 at 7:27 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote:
>> On Thu, Jun 28, 2012 at 7:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> > On Thu, Jun 28, 2012 at 12:13 PM, Thom Brown <thom@linux.com> wrote:
>> >> On 64-bit Linux, if I allocate more shared buffers than the system is
>> >> capable of reserving, it doesn't start.  This is expected, but there's
>> >> no error logged anywhere (actually, nothing logged at all), and the
>> >> postmaster.pid file is left behind after this failure.
>> >
>> > Fixed.
>> >
>> > However, I discovered something unpleasant.  With the new code, on
>> > MacOS X, if you set shared_buffers to say 3200GB, the server happily
>> > starts up.  Or at least the shared memory allocation goes through just
>> > fine.  The postmaster then sits there apparently forever without
>> > emitting any log messages, which I eventually discovered was because
>> > it's busy initializing a billion or so spinlocks.
>> >
>> > I'm pretty sure that this machine does not have >3TB of virtual
>> > memory, even counting swap.  So that means that MacOS X has absolutely
>> > no common sense whatsoever as far as anonymous shared memory
>> > allocations go.  Not sure exactly what to do about that.  Linux is
>> > more sensible, at least on the system I tested, and fails cleanly.
>>
>> What happens if you mlock() it into memory - does that fail quickly?
>> Is that not something we might want to do *anyway*?
> You normally can only mlock() minor amounts of memory without changing
> settings. Requiring users to change that setting (aside from mlocking being a bad
> idea imo) would run contrary to the point of the patch, wouldn't it? ;)

It would. I wasn't aware of that limitation :)

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Posix Shared Mem patch

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Thu, Jun 28, 2012 at 7:27 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote:
>>> What happens if you mlock() it into memory - does that fail quickly?
>>> Is that not something we might want to do *anyway*?

>> You normally can only mlock() minor amounts of memory without changing
>> settings. Requiring users to change that setting (aside from mlocking being a bad
>> idea imo) would run contrary to the point of the patch, wouldn't it? ;)

> It would. I wasn't aware of that limitation :)

The OSX man page says that mlock should give EAGAIN for a permissions
failure (ie, exceeding the rlimit) but
    [ENOMEM]           Some portion of the indicated address range is not
                       allocated.  There was an error faulting/mapping a
                       page.

It might be helpful to try mlock (if available, which it isn't
everywhere) and complain about ENOMEM but not other errors.  Of course,
if the kernel checks rlimit first, we won't learn anything ...

I think it *would* be a good idea to mlock if we could.  Setting shmem
large enough that it swaps has always been horrible for performance,
and in sysv-land there's no way to prevent that.  But we can't error
out on permissions failure.
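
As a rough sketch of the "complain about ENOMEM but not other errors" idea (the names lock_result and try_lock_region are hypothetical, and this assumes mlock() is available at all):

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

typedef enum
{
    LOCK_OK,            /* region is now locked in RAM */
    LOCK_NO_MEMORY,     /* ENOMEM: worth reporting to the user */
    LOCK_IGNORED        /* EAGAIN, EPERM, etc.: carry on silently */
} lock_result;

/*
 * Hypothetical helper: try to lock a region and classify the outcome.
 * Permission-style failures are deliberately not treated as errors.
 */
lock_result
try_lock_region(void *addr, size_t len)
{
    if (mlock(addr, len) == 0)
        return LOCK_OK;
    if (errno == ENOMEM)
        return LOCK_NO_MEMORY;
    return LOCK_IGNORED;
}
```

One caveat with this classification: as far as I know, Linux also reports ENOMEM when an unprivileged caller exceeds RLIMIT_MEMLOCK, so on that platform the two cases cannot be told apart from errno alone.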
        regards, tom lane


Re: Posix Shared Mem patch

From
Andres Freund
Date:
On Thursday, June 28, 2012 07:43:16 PM Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > On Thu, Jun 28, 2012 at 7:27 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote:
> >>> What happens if you mlock() it into memory - does that fail quickly?
> >>> Is that not something we might want to do *anyway*?
> >> 
> >> You normally can only mlock() minor amounts of memory without changing
> >> settings. Requiring users to change that setting (aside from mlocking
> >> being a bad idea imo) would run contrary to the point of the patch,
> >> wouldn't it? ;)
> > 
> > It would. I wasn't aware of that limitation :)
> 
> The OSX man page says that mlock should give EAGAIN for a permissions
> failure (ie, exceeding the rlimit) but
> 
>      [ENOMEM]           Some portion of the indicated address range is not
>                         allocated.  There was an error faulting/mapping a
>                         page.
> 
> It might be helpful to try mlock (if available, which it isn't
> everywhere) and complain about ENOMEM but not other errors.  Of course,
> if the kernel checks rlimit first, we won't learn anything ...
> 
> I think it *would* be a good idea to mlock if we could.  Setting shmem
> large enough that it swaps has always been horrible for performance,
> and in sysv-land there's no way to prevent that.  But we can't error
> out on permissions failure.
It's also a very good method to get into hard to diagnose OOM situations
though. Unless the machine is set up very carefully and only runs postgres I
don't think it's acceptable to do that.

Andres
--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: Posix Shared Mem patch

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On Thursday, June 28, 2012 07:43:16 PM Tom Lane wrote:
>> I think it *would* be a good idea to mlock if we could.  Setting shmem
>> large enough that it swaps has always been horrible for performance,
>> and in sysv-land there's no way to prevent that.  But we can't error
>> out on permissions failure.

> It's also a very good method to get into hard to diagnose OOM situations
> though. Unless the machine is set up very carefully and only runs postgres I
> don't think it's acceptable to do that.

Well, the permissions angle is actually a good thing here.  There is
pretty much no risk of the mlock succeeding on a box that hasn't been
specially configured --- and, in most cases, I think you'd need root
cooperation to raise postgres' RLIMIT_MEMLOCK.  So I think we could try
to mlock without having any effect for 99% of users.  The 1% who are
smart enough to raise the rlimit to something suitable would get better,
or at least more predictable, performance.
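
For illustration, checking whether a region of a given size would even fit under the soft limit could be sketched like this. The function name fits_memlock_limit is hypothetical; it assumes RLIMIT_MEMLOCK is defined on the platform, and it ignores the fact that privileged processes may not be bound by the limit.

```c
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical check: 1 if `len` bytes fit under the soft RLIMIT_MEMLOCK,
 * 0 if they do not, -1 if the limit cannot be read.
 */
int
fits_memlock_limit(size_t len)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;
    return ((rlim_t) len <= rl.rlim_cur) ? 1 : 0;
}
```

On a box that has not been specially configured, this would typically report that a multi-gigabyte shared memory segment does not fit, so an mlock attempt there would simply fail and change nothing.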
        regards, tom lane


Re: Posix Shared Mem patch

From
Andres Freund
Date:
On Thursday, June 28, 2012 08:00:06 PM Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On Thursday, June 28, 2012 07:43:16 PM Tom Lane wrote:
> >> I think it *would* be a good idea to mlock if we could.  Setting shmem
> >> large enough that it swaps has always been horrible for performance,
> >> and in sysv-land there's no way to prevent that.  But we can't error
> >> out on permissions failure.
> > 
> > It's also a very good method to get into hard to diagnose OOM situations
> > though. Unless the machine is set up very carefully and only runs postgres I
> > don't think it's acceptable to do that.
> 
> Well, the permissions angle is actually a good thing here.  There is
> pretty much no risk of the mlock succeeding on a box that hasn't been
> specially configured --- and, in most cases, I think you'd need root
> cooperation to raise postgres' RLIMIT_MEMLOCK.  So I think we could try
> to mlock without having any effect for 99% of users.  The 1% who are
> smart enough to raise the rlimit to something suitable would get better,
> or at least more predictable, performance.
The heightened limit might just as well be targeted at another application and be
set up a bit too widely. I agree that it is useful, but I think it requires its
own setting, defaulting to off. Especially as there is no experience with
running a larger pg instance that way.

Greetings,

Andres, for once the conservative one, Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: Posix Shared Mem patch

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On Thursday, June 28, 2012 08:00:06 PM Tom Lane wrote:
>> Well, the permissions angle is actually a good thing here.  There is
>> pretty much no risk of the mlock succeeding on a box that hasn't been
>> specially configured --- and, in most cases, I think you'd need root
>> cooperation to raise postgres' RLIMIT_MEMLOCK.  So I think we could try
>> to mlock without having any effect for 99% of users.  The 1% who are
>> smart enough to raise the rlimit to something suitable would get better,
>> or at least more predictable, performance.

> The heightened limit might just as well be targeted at another application and be
> set up a bit too widely. I agree that it is useful, but I think it requires its
> own setting, defaulting to off. Especially as there is no experience with
> running a larger pg instance that way.

[ shrug... ]  I think you're inventing things to be afraid of, and
ignoring a very real problem that mlock could fix.
        regards, tom lane


Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Thu, Jun 28, 2012 at 1:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Thu, Jun 28, 2012 at 7:27 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote:
>>>> What happens if you mlock() it into memory - does that fail quickly?
>>>> Is that not something we might want to do *anyway*?
>
>>> You normally can only mlock() minor amounts of memory without changing
>>> settings. Requiring users to change that setting (aside from mlocking being a bad
>>> idea imo) would run contrary to the point of the patch, wouldn't it? ;)
>
>> It would. I wasn't aware of that limitation :)
>
> The OSX man page says that mlock should give EAGAIN for a permissions
> failure (ie, exceeding the rlimit) but
>
>     [ENOMEM]           Some portion of the indicated address range is not
>                        allocated.  There was an error faulting/mapping a
>                        page.
>
> It might be helpful to try mlock (if available, which it isn't
> everywhere) and complain about ENOMEM but not other errors.  Of course,
> if the kernel checks rlimit first, we won't learn anything ...

I tried this.  At least on my fairly vanilla MacOS X desktop, an mlock
for a larger amount of memory than was conveniently on hand (4GB, on a
4GB box) neither succeeded nor failed in a timely fashion but instead
progressively hung the machine, apparently trying to progressively
push every available page out to swap.  After 5 minutes or so I could
no longer move the mouse.  After about 20 minutes I gave up and hit
the reset button.  So there's apparently no value to this as a
diagnostic tool, at least on this platform.

> I think it *would* be a good idea to mlock if we could.  Setting shmem
> large enough that it swaps has always been horrible for performance,
> and in sysv-land there's no way to prevent that.  But we can't error
> out on permissions failure.

I wouldn't mind having an option, but I think there'd have to be a way
to turn it off for people trying to cram as many lightly-used VMs as
possible onto a single server.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I tried this.  At least on my fairly vanilla MacOS X desktop, an mlock
> for a larger amount of memory than was conveniently on hand (4GB, on a
> 4GB box) neither succeeded nor failed in a timely fashion but instead
> progressively hung the machine, apparently trying to progressively
> push every available page out to swap.  After 5 minutes or so I could
> no longer move the mouse.  After about 20 minutes I gave up and hit
> the reset button.  So there's apparently no value to this as a
> diagnostic tool, at least on this platform.

Fun.  I wonder if other BSDen are as brain-dead as OSX on this point.

It'd probably at least be worth filing a bug report with Apple about it.
        regards, tom lane


Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Thu, Jun 28, 2012 at 2:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I tried this.  At least on my fairly vanilla MacOS X desktop, an mlock
>> for a larger amount of memory than was conveniently on hand (4GB, on a
>> 4GB box) neither succeeded nor failed in a timely fashion but instead
>> progressively hung the machine, apparently trying to progressively
>> push every available page out to swap.  After 5 minutes or so I could
>> no longer move the mouse.  After about 20 minutes I gave up and hit
>> the reset button.  So there's apparently no value to this as a
>> diagnostic tool, at least on this platform.
>
> Fun.  I wonder if other BSDen are as brain-dead as OSX on this point.
>
> It'd probably at least be worth filing a bug report with Apple about it.

Just for fun, I tried writing a program that does power-of-two-sized
malloc requests.

The first one that failed - on my 4GB Mac, remember - was for
140737488355328 bytes.  Yeah, that's right: 128 TB.
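
A test program along these lines might look roughly like the sketch below. This is a reconstruction for illustration, not the program that was actually run, and the function name first_failing_pow2_malloc is made up. Nothing is ever written to the allocations, so it only measures what malloc is willing to promise.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Request 1, 2, 4, ... bytes from malloc and return the first size that is
 * refused, or 0 if every power of two representable in size_t succeeded.
 */
size_t
first_failing_pow2_malloc(void)
{
    size_t size;

    for (size = 1; size != 0; size <<= 1)
    {
        /* volatile keeps the compiler from optimizing the malloc away */
        void *volatile p = malloc(size);

        if (p == NULL)
            return size;
        free(p);
    }
    return 0;
}
```

The result depends entirely on the platform's overcommit behaviour and address space layout, so the number it returns says little about how much memory is actually usable.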

According to the Google, there is absolutely no way of getting MacOS X
not to overcommit like crazy.  You can read the amount of system
memory by using sysctl() to fetch hw.memsize, but it's not really
clear how much that helps.  We could refuse to start up if the shared
memory allocation is >= hw.memsize, but even an amount slightly less
than that seems like enough to send the machine into a tailspin, so
I'm not sure that really gets us anywhere.
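
For what it's worth, the hw.memsize check could be sketched as below. The function names are hypothetical. The MacOS X branch uses sysctlbyname("hw.memsize"); the fallback branch assumes the platform offers the non-standard _SC_PHYS_PAGES sysconf name, and anything else reports 0, meaning "unknown".

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Physical RAM in bytes, or 0 if this platform gives us no way to ask. */
uint64_t
physical_memory_bytes(void)
{
#ifdef __APPLE__
    uint64_t memsize = 0;
    size_t   len = sizeof(memsize);

    if (sysctlbyname("hw.memsize", &memsize, &len, NULL, 0) != 0)
        return 0;
    return memsize;
#elif defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);

    if (pages <= 0 || page_size <= 0)
        return 0;
    return (uint64_t) pages * (uint64_t) page_size;
#else
    return 0;
#endif
}

/* Nonzero if the request is at least as large as known physical RAM. */
int
request_exceeds_physical_memory(uint64_t request_bytes)
{
    uint64_t phys = physical_memory_bytes();

    return phys != 0 && request_bytes >= phys;
}
```

This only catches requests at or above total RAM; a request slightly below that would pass the check and could still send the machine into a tailspin.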

One idea I had was to LOG the size of the shared memory allocation
just before allocating it.  That way, if your system goes into the
tank, there will at least be something in the log.  But that would be
useless chatter for most users.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
Josh Berkus
Date:
> According to the Google, there is absolutely no way of getting MacOS X
> not to overcommit like crazy.  

Well, this is one of a long list of broken things about OSX.  If you
want to see *real* breakage, do some IO performance testing of HFS+.

FWIW, I have this issue with Mac desktop applications on my MacBook,
which will happily memory leak until I run out of swap space.

> You can read the amount of system
> memory by using sysctl() to fetch hw.memsize, but it's not really
> clear how much that helps.  We could refuse to start up if the shared
> memory allocation is >= hw.memsize, but even an amount slightly less
> than that seems like enough to send the machine into a tailspin, so
> I'm not sure that really gets us anywhere.

I still think it would help.  User errors in allocating shmem are more
likely to be order-of-magnitude errors ("I meant 500MB, not 500GB!")
than to be off by 20% of RAM.

> One idea I had was to LOG the size of the shared memory allocation
> just before allocating it.  That way, if your system goes into the
> tank, there will at least be something in the log.  But that would be
> useless chatter for most users.

Yes, but it would provide mailing list, IRC and StackExchange quick answers.

"I started up PostgreSQL and my MacBook crashed."

"Find the file postgres.log.  What's the last 10 lines?"

So neither of those things *fixes* the problem ... ultimately, it's
Apple's problem and we can't fix it ... but both of them make it
somewhat better.

The other thing which will avoid the problem for most Mac users is if we
simply allocate 10% of RAM at initdb as a default.  If we do that, then
90% of users will never touch Shmem themselves, and not have the
opportunity to mess up.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: Posix Shared Mem patch

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> The other thing which will avoid the problem for most Mac users is if we
> simply allocate 10% of RAM at initdb as a default.  If we do that, then
> 90% of users will never touch Shmem themselves, and not have the
> opportunity to mess up.

If we could do that on *all* platforms, I might be for it, but we only
know how to get that number on some platforms.  There's also the issue
of whether we really want to assume that the machine is dedicated to
Postgres, which IMO is an implicit assumption of any default that scales
itself to physical RAM.

For the moment I think we should just allow initdb to scale up a little
bit more than where it is now, perhaps 128MB instead of 32.
        regards, tom lane


Re: Posix Shared Mem patch

From
Josh Berkus
Date:
Tom,

> If we could do that on *all* platforms, I might be for it, but we only
> know how to get that number on some platforms. 

I don't see what's wrong with using it where we can get it, and not
using it where we can't.

>  There's also the issue
> of whether we really want to assume that the machine is dedicated to
> Postgres, which IMO is an implicit assumption of any default that scales
> itself to physical RAM.

10% isn't assuming dedicated.  Assuming dedicated would be 20% or 25%.

I was thinking "10%, with a ceiling of 512MB".
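
Spelled out as arithmetic, that rule would be something like the following sketch (default_shared_buffers_bytes is a made-up name, and the physical RAM figure is assumed to come from elsewhere):

```c
#include <stdint.h>

#define MB (UINT64_C(1024) * 1024)

/* 10% of physical RAM, capped at 512MB. */
uint64_t
default_shared_buffers_bytes(uint64_t physical_ram_bytes)
{
    uint64_t tenth = physical_ram_bytes / 10;
    uint64_t ceiling = 512 * MB;

    return (tenth < ceiling) ? tenth : ceiling;
}
```

So a 4GB machine would get roughly 409MB, and anything with more than 5GB of RAM would hit the 512MB ceiling.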

> For the moment I think we should just allow initdb to scale up a little
> bit more than where it is now, perhaps 128MB instead of 32.

I wouldn't be opposed to that.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: Posix Shared Mem patch

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> If we could do that on *all* platforms, I might be for it, but we only
>> know how to get that number on some platforms. 

> I don't see what's wrong with using it where we can get it, and not
> using it where we can't.

Because then we still need to define, and document, a sensible behavior
on the machines where we can't get it.  And document that we do it two
different ways, and document which machines we do it which way on.

>> There's also the issue
>> of whether we really want to assume that the machine is dedicated to
>> Postgres, which IMO is an implicit assumption of any default that scales
>> itself to physical RAM.

> 10% isn't assuming dedicated.

Really?
        regards, tom lane


Re: Posix Shared Mem patch

From
Josh Berkus
Date:
>> 10% isn't assuming dedicated.
> 
> Really?

Yes.  As I said, the allocation for dedicated PostgreSQL servers is
usually 20% to 25%, up to 8GB.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: Posix Shared Mem patch

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>>> 10% isn't assuming dedicated.

>> Really?

> Yes.  As I said, the allocation for dedicated PostgreSQL servers is
> usually 20% to 25%, up to 8GB.

Any percentage is assuming dedicated, IMO.  25% might be the more common
number, but you're still assuming that you can have your pick of the
machine's resources.

My idea of "not dedicated" is "I can launch a dozen postmasters on this
machine, and other services too, and it'll be okay as long as they're
not doing too much".
        regards, tom lane


Re: Posix Shared Mem patch

From
Josh Berkus
Date:
> My idea of "not dedicated" is "I can launch a dozen postmasters on this
> machine, and other services too, and it'll be okay as long as they're
> not doing too much".

Oh, 128MB then?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: Posix Shared Mem patch

From
Andres Freund
Date:
Hi All,

In a *very* quick patch I tested using huge pages/MAP_HUGETLB for the mmap'ed 
memory.
That gives around 9.5% performance benefit in a read-only pgbench run (-n -S
-j 64 -c 64 -T 10 -M prepared, scale 200, 6GB s_b, 8 cores, 24GB mem).

It also saves a bunch of memory per process due to the smaller page table 
(shared_buffers 6GB):
cat /proc/$pid_of_pg_backend/status |grep VmPTE
VmPTE:        6252 kB
vs
VmPTE:          60 kB

Additionally it has the advantage that top/ps/... output under linux now looks
like:
  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
10603 andres    20   0 6381m 4924 1952 R    21  0.0   0:28.04 postgres

i.e. RES now actually shows something usable... Which is rather nice imo.

I don't have the time atm to make this into something usable; maybe somebody
else wants to pick it up? It looks well worth investing some time.

Because of the required setup we sure cannot make this the default but...
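
A minimal sketch of the idea, assuming Linux: request the anonymous shared
mapping with MAP_HUGETLB first, and fall back to ordinary pages if the kernel
refuses (for example because no huge pages have been reserved). The function
name is hypothetical and this is not the quick patch described above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

/*
 * Map `size` bytes of shared anonymous memory, preferring huge pages.
 * *used_huge is set to 1 if the huge-page mapping succeeded, else 0.
 * Returns MAP_FAILED if neither attempt works.
 */
void *
map_shared_prefer_huge(size_t size, int *used_huge)
{
    void *ptr;

    assert(size > 0 && used_huge != NULL);
    *used_huge = 0;

#ifdef MAP_HUGETLB
    /* Fails unless the administrator has reserved huge pages. */
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (ptr != MAP_FAILED)
    {
        *used_huge = 1;
        return ptr;
    }
#endif

    /* Fallback: the same mapping backed by normal-sized pages. */
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return ptr;
}
```

In practice the size should be a multiple of the huge page size, since
unmapping a huge-page region with an unaligned length can fail.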

Greetings,

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Posix Shared Mem patch

From
Merlin Moncure
Date:
On Fri, Jun 29, 2012 at 2:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi All,
>
> In a *very* quick patch I tested using huge pages/MAP_HUGETLB for the mmap'ed
> memory.
> That gives around 9.5% performance benefit in a read-only pgbench run (-n -S
> -j 64 -c 64 -T 10 -M prepared, scale 200, 6GB s_b, 8 cores, 24GB mem).
>
> It also saves a bunch of memory per process due to the smaller page table
> (shared_buffers 6GB):
> cat /proc/$pid_of_pg_backend/status |grep VmPTE
> VmPTE:      6252 kB
> vs
> VmPTE:        60 kB
>
> Additionally it has the advantage that top/ps/... output under linux now looks
> like:
>  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
> 10603 andres    20   0 6381m 4924 1952 R    21  0.0   0:28.04 postgres
>
> i.e. RES now actually shows something usable... Which is rather nice imo.
>
> I don't have the time atm to make this into something usable; maybe somebody
> else wants to pick it up? It looks well worth investing some time.
>
> Because of the required setup we sure cannot make this the default but...

... those results are just spectacular (IMO). nice!

merlin


Re: Posix Shared Mem patch

From
Daniel Farina
Date:
On Fri, Jun 29, 2012 at 1:00 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, Jun 29, 2012 at 2:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Hi All,
>>
>> In a *very* quick patch I tested using huge pages/MAP_HUGETLB for the mmap'ed
>> memory.
>> That gives around 9.5% performance benefit in a read-only pgbench run (-n -S
>> -j 64 -c 64 -T 10 -M prepared, scale 200, 6GB s_b, 8 cores, 24GB mem).
>>
>> It also saves a bunch of memory per process due to the smaller page table
>> (shared_buffers 6GB):
>> cat /proc/$pid_of_pg_backend/status |grep VmPTE
>> VmPTE:      6252 kB
>> vs
>> VmPTE:        60 kB
> ... those results are just spectacular (IMO). nice!

That is super awesome.  Smallish databases with a high number of
connections actually spend a considerable fraction of their
otherwise-available-for-buffer-cache space on page tables in common
cases currently.

-- 
fdr


Re: Posix Shared Mem patch

From
Bruce Momjian
Date:
On Fri, Jun 29, 2012 at 04:03:40PM -0700, Daniel Farina wrote:
> On Fri, Jun 29, 2012 at 1:00 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> > On Fri, Jun 29, 2012 at 2:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> Hi All,
> >>
> >> In a *very* quick patch I tested using huge pages/MAP_HUGETLB for the mmap'ed
> >> memory.
> >> That gives around 9.5% performance benefit in a read-only pgbench run (-n -S
> >> -j 64 -c 64 -T 10 -M prepared, scale 200, 6GB s_b, 8 cores, 24GB mem).
> >>
> >> It also saves a bunch of memory per process due to the smaller page table
> >> (shared_buffers 6GB):
> >> cat /proc/$pid_of_pg_backend/status |grep VmPTE
> >> VmPTE:      6252 kB
> >> vs
> >> VmPTE:        60 kB
> > ... those results are just spectacular (IMO). nice!
> 
> That is super awesome.  Smallish databases with a high number of
> connections actually spend a considerable fraction of their
> otherwise-available-for-buffer-cache space on page tables in common
> cases currently.

I thought newer Linux kernels did huge pages automatically?  What Linux
kernel is this?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Fri, Jun 29, 2012 at 2:31 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> My idea of "not dedicated" is "I can launch a dozen postmasters on this
>> machine, and other services too, and it'll be okay as long as they're
>> not doing too much".
>
> Oh, 128MB then?

Proposed patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Thu, Jun 28, 2012 at 11:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Assuming things go well, there are a number of follow-on things that
> we need to do to finish this up:
>
> 1. Update the documentation.  I skipped this for now, because I think
> that what we write there is going to be heavily dependent on how
> portable this turns out to be, which we don't know yet.  Also, it's
> not exactly clear to me what the documentation should say if this does
> turn out to work everywhere.  Much of section 17.4 will become
> irrelevant to most users, but I doubt we'd just want to remove it; it
> could still matter for people running EXEC_BACKEND or running many
> postmasters on the same machine or, of course, people running on
> platforms where this just doesn't work, if there are any.

Here's a patch that attempts to begin the work of adjusting the
documentation for this brave new world.  I am guessing that there may
be other places in the documentation that also require updating, and
this page probably needs more work, but it's a start.

> 2. Update the HINT messages when shared memory allocation fails.
> Maybe the new most-common-failure mode there will be too many
> postmasters running on the same machine?  We might need to wait for
> some field reports before adjusting this.

I think this is mostly a matter of removing the text that says "fix
this by reducing shmem-related parameters" from the relevant hint
messages.

> 3. Consider adjusting the logic inside initdb.  If this works
> everywhere, the code for determining how to set shared_buffers should
> become pretty much irrelevant.  Even if it only works some places, we
> could add 64MB or 128MB or whatever to the list of values we probe, so
> that people won't get quite such a sucky configuration out of the box.
>  Of course there's no number here that will be good for everyone.

I posted a patch for this one last night.
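
As a rough illustration of item 3, the probing idea amounts to trying
candidate sizes from largest to smallest and keeping the first one that
works. The sketch below is not initdb's code: the probe callback and the
stand-in fits_under_100mb() are hypothetical, and a real probe would try to
start a server with the candidate setting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical probe: nonzero if a server could start with this setting. */
typedef int (*probe_fn) (int shared_buffers_mb);

/*
 * Walk the candidate list (sorted largest first) and return the first
 * value the probe accepts, or -1 if none of them works.
 */
int
pick_shared_buffers_mb(const int *candidates_mb, size_t n, probe_fn probe)
{
    size_t i;

    assert(candidates_mb != NULL && probe != NULL);
    for (i = 0; i < n; i++)
    {
        if (probe(candidates_mb[i]))
            return candidates_mb[i];
    }
    return -1;
}

/* Stand-in probe for demonstration: pretend the kernel allows < 100MB. */
int
fits_under_100mb(int shared_buffers_mb)
{
    return shared_buffers_mb < 100;
}
```

With candidates {128, 64, 32, 16} and the stand-in probe, the loop settles on
64. Adding larger values to the front of the list only helps on systems where
the larger allocation actually succeeds.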

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Posix Shared Mem patch

From
Andres Freund
Date:
On Wednesday, June 27, 2012 05:28:14 AM Robert Haas wrote:
> On Tue, Jun 26, 2012 at 6:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Josh Berkus <josh@agliodbs.com> writes:
> >> So let's fix the 80% case with something we feel confident in, and then
> >> revisit the no-sysv interlock as a separate patch.  That way if we can't
> >> fix the interlock issues, we still have a reduced-shmem version of
> >> Postgres.
> > 
> > Yes.  Insisting that we have the whole change in one patch is a good way
> > to prevent any forward progress from happening.  As Alvaro noted, there
> > are plenty of issues to resolve without trying to change the interlock
> > mechanism at the same time.
> 
> So, here's a patch.  Instead of using POSIX shmem, I just took the
> expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
> memory.  The sysv shm is still allocated, but it's just a copy of
> PGShmemHeader; the "real" shared memory is the anonymous block.  This
> won't work if EXEC_BACKEND is defined so it just falls back on
> straight sysv shm in that case.
> 
> There are obviously some portability issues here - this is documented
> not to work on Linux <= 2.4, but it's not clear whether it fails with
> some suitable error code or just pretends to work and does the wrong
> thing.  I tested that it does compile and work on both Linux 3.2.6 and
> MacOS X 10.6.8.  And the comments probably need work and... who knows
> what else is wrong.  But, thoughts?
Btw, RhodiumToad/Andrew Gierth on irc talked about a reason why sysv shared 
memory might be advantageous on some platforms. E.g. on freebsd there is the 
kern.ipc.shm_use_phys setting which prevents paging out shared memory and also 
seems to make tlb translation cheaper. There does not seem to exist an 
alternative for anonymous mmap.
So maybe we should make that a config option? 
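
To make the quoted approach concrete, here is a minimal sketch of a
MAP_SHARED|MAP_ANONYMOUS mapping shared between a parent and a forked child.
The function name is hypothetical. It also shows why the EXEC_BACKEND case
has to fall back on sysv shm: the mapping has no name, so it can only be
inherited across fork(), not re-attached after exec.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

/*
 * Have a forked child store `value` in anonymous shared memory and
 * return what the parent reads back, or -1 on error.
 */
int
share_int_across_fork(int value)
{
    int    *shared;
    int     result = -1;
    pid_t   pid;

    assert(value >= 0);     /* -1 is reserved for errors */

    shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid = fork();
    if (pid == 0)
    {
        *shared = value;    /* child writes via the inherited mapping */
        _exit(0);
    }
    if (pid > 0 && waitpid(pid, NULL, 0) == pid)
        result = *shared;   /* parent observes the child's write */

    munmap(shared, sizeof(int));
    return result;
}
```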

Greetings,

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: Posix Shared Mem patch

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> Btw, RhodiumToad/Andrew Gierth on irc talked about a reason why sysv shared 
> memory might be advantageous on some platforms. E.g. on freebsd there is the 
> kern.ipc.shm_use_phys setting which prevents paging out shared memory and also 
> seems to make tlb translation cheaper. There does not seem to exist an 
> alternative for anonymous mmap.

Isn't that mlock()?

> So maybe we should make that a config option? 

I'd really rather not.  If we're going to go in this direction, we
should just go there.
        regards, tom lane


Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Tue, Jul 3, 2012 at 11:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Btw, RhodiumToad/Andrew Gierth on irc talked about a reason why sysv shared
> memory might be advantageous on some platforms. E.g. on freebsd there is the
> kern.ipc.shm_use_phys setting which prevents paging out shared memory and also
> seems to make tlb translation cheaper. There does not seem to exist an
> alternative for anonymous mmap.
> So maybe we should make that a config option?

Yeah, I was noticing some notes to that effect in the documentation
this morning.  I think the alternative for anonymous mmap is mlock().
However, that can hit kernel limits of its own.  I'm not sure what the
best thing to do about this is.  I think most users will want mlock...
but maybe not all?  So we end up with one option for whether to use
mlock and another for whether to use more or less System V shm?
Sounds confusing.
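
A minimal sketch of the mlock() alternative, with a hypothetical function
name: map the anonymous shared block, then try to pin it, treating a failed
lock as non-fatal. The kernel limits mentioned above typically show up here
as ENOMEM or EPERM when the request exceeds RLIMIT_MEMLOCK for an
unprivileged process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

/*
 * Map `size` bytes of shared anonymous memory and try to lock it in RAM.
 * Returns the mapping (or MAP_FAILED). *lock_errno is 0 if mlock()
 * succeeded, otherwise the errno it reported; the mapping stays usable
 * either way, just without the no-paging guarantee.
 */
void *
map_shared_try_lock(size_t size, int *lock_errno)
{
    void *ptr;

    assert(size > 0 && lock_errno != NULL);
    *lock_errno = 0;

    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED)
        return MAP_FAILED;

    if (mlock(ptr, size) != 0)
        *lock_errno = errno;
    return ptr;
}
```

Whether a failed lock should be an error, a warning, or silently ignored is
exactly the kind of policy question that would need a setting.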

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Posix Shared Mem patch

From
Magnus Hagander
Date:
On Tue, Jul 3, 2012 at 5:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On Wednesday, June 27, 2012 05:28:14 AM Robert Haas wrote:
>> On Tue, Jun 26, 2012 at 6:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Josh Berkus <josh@agliodbs.com> writes:
>> >> So let's fix the 80% case with something we feel confident in, and then
>> >> revisit the no-sysv interlock as a separate patch.  That way if we can't
>> >> fix the interlock issues, we still have a reduced-shmem version of
>> >> Postgres.
>> >
>> > Yes.  Insisting that we have the whole change in one patch is a good way
>> > to prevent any forward progress from happening.  As Alvaro noted, there
>> > are plenty of issues to resolve without trying to change the interlock
>> > mechanism at the same time.
>>
>> So, here's a patch.  Instead of using POSIX shmem, I just took the
>> expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
>> memory.  The sysv shm is still allocated, but it's just a copy of
>> PGShmemHeader; the "real" shared memory is the anonymous block.  This
>> won't work if EXEC_BACKEND is defined so it just falls back on
>> straight sysv shm in that case.
>>
>> There are obviously some portability issues here - this is documented
>> not to work on Linux <= 2.4, but it's not clear whether it fails with
>> some suitable error code or just pretends to work and does the wrong
>> thing.  I tested that it does compile and work on both Linux 3.2.6 and
>> MacOS X 10.6.8.  And the comments probably need work and... who knows
>> what else is wrong.  But, thoughts?
> Btw, RhodiumToad/Andrew Gierth on irc talked about a reason why sysv shared
> memory might be advantageous on some platforms. E.g. on freebsd there is the
> kern.ipc.shm_use_phys setting which prevents paging out shared memory and also
> seems to make tlb translation cheaper. There does not seem to exist an
> alternative for anonymous mmap.
> So maybe we should make that a config option?

Interesting to see that FreeBSD does this - while at the same time
refusing to fix the use of sysv shared memory under their own jails
system (afaik, at least). They seem to be quite undecided on if it's a
feature to remove or a feature to expand on :O Not sure I'd trust that
to stick around...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Posix Shared Mem patch

From
Andres Freund
Date:
On Tuesday, July 03, 2012 05:41:09 PM Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > Btw, RhodiumToad/Andrew Gierth on irc talked about a reason why sysv
> > shared memory might be advantageous on some platforms. E.g. on freebsd
> > there is the kern.ipc.shm_use_phys setting which prevents paging out
> > shared memory and also seems to make tlb translation cheaper. There does
> > not seem to exist an alternative for anonymous mmap.
> Isn't that mlock()?
Similar, at least, yes. I think it might also make the virtual/physical 
translation more direct, but that is just the impression from a very short 
search.

> > So maybe we should make that a config option?
> I'd really rather not.  If we're going to go in this direction, we
> should just go there.
I don't really care, just wanted to bring up that at least one experienced 
user would be disappointed ;). As the old implementation needs to stay around 
for EXEC_BACKEND anyway, the price doesn't seem to be too high.

Andres
-- 
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: Posix Shared Mem patch

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On Tuesday, July 03, 2012 05:41:09 PM Tom Lane wrote:
>> I'd really rather not.  If we're going to go in this direction, we
>> should just go there.

> I don't really care, just wanted to bring up that at least one experienced 
> user would be disappointed ;). As the old implementation needs to stay around
> for EXEC_BACKEND anyway, the price doesn't seem to be too high.

Well, my feeling is that sooner or later, perhaps sooner, we are going
to want to be out from under SysV shmem (and semaphores) entirely.
The Linux kernel guys keep threatening to drop support for the feature.
So I think that exposing any knobs about this, or encouraging people
to rely on corner-case properties of SysV on their platform, is just
going to create more pain when we have to pull the plug.
        regards, tom lane


Re: Posix Shared Mem patch

From
Josh Kupershmidt
Date:
On Tue, Jul 3, 2012 at 6:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Here's a patch that attempts to begin the work of adjusting the
> documentation for this brave new world.  I am guessing that there may
> be other places in the documentation that also require updating, and
> this page probably needs more work, but it's a start.

I think the boilerplate warnings in config.sgml about needing to raise
the SysV parameters can go away; patch attached.

Josh

Attachment

Re: Posix Shared Mem patch

From
Robert Haas
Date:
On Tue, Jul 3, 2012 at 1:46 PM, Josh Kupershmidt <schmiddy@gmail.com> wrote:
> On Tue, Jul 3, 2012 at 6:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Here's a patch that attempts to begin the work of adjusting the
>> documentation for this brave new world.  I am guessing that there may
>> be other places in the documentation that also require updating, and
>> this page probably needs more work, but it's a start.
>
> I think the boilerplate warnings in config.sgml about needing to raise
> the SysV parameters can go away; patch attached.

Thanks, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company