Thread: Posix Shared Mem patch
Robert, all: Last I checked, we had a reasonably acceptable patch to use mostly Posix Shared mem with a very small sysv ram partition. Is there anything keeping this from going into 9.3? It would eliminate a major configuration headache for our users. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012: > Robert, all: > > Last I checked, we had a reasonably acceptable patch to use mostly Posix > Shared mem with a very small sysv ram partition. Is there anything > keeping this from going into 9.3? It would eliminate a major > configuration headache for our users. I don't think that patch was all that reasonable. It needed work, and in any case it needs a rebase because it was pretty old. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012: >> Robert, all: >> >> Last I checked, we had a reasonably acceptable patch to use mostly Posix >> Shared mem with a very small sysv ram partition. Is there anything >> keeping this from going into 9.3? It would eliminate a major >> configuration headache for our users. > > I don't think that patch was all that reasonable. It needed work, and > in any case it needs a rebase because it was pretty old. Yep, agreed. I'd like to get this fixed too, but it hasn't made it up to the top of my list of things to worry about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 6/26/12 2:13 PM, Robert Haas wrote: > On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera > <alvherre@commandprompt.com> wrote: >> Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012: >>> Robert, all: >>> >>> Last I checked, we had a reasonably acceptable patch to use mostly Posix >>> Shared mem with a very small sysv ram partition. Is there anything >>> keeping this from going into 9.3? It would eliminate a major >>> configuration headache for our users. >> >> I don't think that patch was all that reasonable. It needed work, and >> in any case it needs a rebase because it was pretty old. > > Yep, agreed. > > I'd like to get this fixed too, but it hasn't made it up to the top of > my list of things to worry about. Was there a post-AgentM version of the patch, which incorporated the small SySV RAM partition? I'm not finding it. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Jun 26, 2012 at 2:18 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 6/26/12 2:13 PM, Robert Haas wrote: >> On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera >> <alvherre@commandprompt.com> wrote: >>> Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012: >>>> Robert, all: >>>> >>>> Last I checked, we had a reasonably acceptable patch to use mostly Posix >>>> Shared mem with a very small sysv ram partition. Is there anything >>>> keeping this from going into 9.3? It would eliminate a major >>>> configuration headache for our users. >>> >>> I don't think that patch was all that reasonable. It needed work, and >>> in any case it needs a rebase because it was pretty old. >> >> Yep, agreed. >> >> I'd like to get this fixed too, but it hasn't made it up to the top of >> my list of things to worry about. > > Was there a post-AgentM version of the patch, which incorporated the > small SySV RAM partition? I'm not finding it. On that, I used to be of the opinion that this is a good compromise (a small amount of interlock space, plus mostly posix shmem), but I've heard since then (I think via AgentM indirectly, but I'm not sure) that there are cases where even the small SysV segment can cause problems -- notably when other software tweaks shared memory settings on behalf of a user, but only leaves just-enough for the software being installed. This is most likely on platforms that don't have a high SysV shmem limit by default, so installers all feel the prerogative to increase the limit, but there's no great answer for how to compose a series of such installations. It only takes one installer that says "whatever, I'm just catenating stuff to sysctl.conf that works for me" to sabotage Postgres' ability to start. So there may be a benefit in finding a way to have no SysV memory at all. 
I wouldn't let perfect be the enemy of good to make progress here, but it appears this was a witnessed real problem, so it may be worth reconsidering if there is a way we can safely remove all SysV by finding an alternative to the nattach mechanic. -- fdr
On Tue, Jun 26, 2012 at 5:18 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 6/26/12 2:13 PM, Robert Haas wrote: >> On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera >> <alvherre@commandprompt.com> wrote: >>> Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012: >>>> Robert, all: >>>> >>>> Last I checked, we had a reasonably acceptable patch to use mostly Posix >>>> Shared mem with a very small sysv ram partition. Is there anything >>>> keeping this from going into 9.3? It would eliminate a major >>>> configuration headache for our users. >>> >>> I don't think that patch was all that reasonable. It needed work, and >>> in any case it needs a rebase because it was pretty old. >> >> Yep, agreed. >> >> I'd like to get this fixed too, but it hasn't made it up to the top of >> my list of things to worry about. > > Was there a post-AgentM version of the patch, which incorporated the > small SySV RAM partition? I'm not finding it. To my knowledge, no. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On that, I used to be of the opinion that this is a good compromise (a > small amount of interlock space, plus mostly posix shmem), but I've > heard since then (I think via AgentM indirectly, but I'm not sure) > that there are cases where even the small SysV segment can cause > problems -- notably when other software tweaks shared memory settings > on behalf of a user, but only leaves just-enough for the software > being installed. This is most likely on platforms that don't have a > high SysV shmem limit by default, so installers all feel the > prerogative to increase the limit, but there's no great answer for how > to compose a series of such installations. It only takes one > installer that says "whatever, I'm just catenating stuff to > sysctl.conf that works for me" to sabotage Postgres' ability to start. Personally, I see this as rather an extreme case, and aside from AgentM himself, have never run into it before. Certainly it would be useful to not need SysV RAM at all, but it's more important to get a working patch for 9.3. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Tue, Jun 26, 2012 at 5:44 PM, Josh Berkus <josh@agliodbs.com> wrote: > >> On that, I used to be of the opinion that this is a good compromise (a >> small amount of interlock space, plus mostly posix shmem), but I've >> heard since then (I think via AgentM indirectly, but I'm not sure) >> that there are cases where even the small SysV segment can cause >> problems -- notably when other software tweaks shared memory settings >> on behalf of a user, but only leaves just-enough for the software >> being installed. This is most likely on platforms that don't have a >> high SysV shmem limit by default, so installers all feel the >> prerogative to increase the limit, but there's no great answer for how >> to compose a series of such installations. It only takes one >> installer that says "whatever, I'm just catenating stuff to >> sysctl.conf that works for me" to sabotage Postgres' ability to start. > > Personally, I see this as rather an extreme case, and aside from AgentM > himself, have never run into it before. Certainly it would be useful to > not need SysV RAM at all, but it's more important to get a working patch > for 9.3. +1. I'd sort of given up on finding a solution that doesn't involve system V shmem anyway, but now that I think about it... what about using a FIFO? The man page for open on MacOS X says: [ENXIO] O_NONBLOCK and O_WRONLY are set, the file is a FIFO, and no process has it open for reading. And Linux says: ENXIO O_NONBLOCK | O_WRONLY is set, the named file is a FIFO and no process has the file open for reading. Or, the file is a device special file and no corresponding device exists. And HP/UX says: [ENXIO] O_NDELAY is set, the named file is a FIFO, O_WRONLY is set, and no process has the file open for reading. So, what about keeping a FIFO in the data directory? When the postmaster starts up, it tries to open the file with O_NONBLOCK | O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather than O_NONBLOCK).
If that succeeds, it bails out. If it fails with anything other than ENXIO, it bails out. If it fails with exactly ENXIO, then it opens the pipe with O_RDONLY and arranges to pass the file descriptor down to all of its children, so that a subsequent open will fail if it or any of its children are still alive. This might even be more reliable than what we do right now, because our current system appears not to be robust against the removal of postmaster.pid. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
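Robert's FIFO probe might look like this in outline (a sketch with a hypothetical helper name; EINTR handling and the startup race discussed downthread are not addressed):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Probe a FIFO used as a data-directory interlock.  Returns the read-side
 * file descriptor (>= 0) if we claimed the interlock, -1 if some live
 * process already has the FIFO open for reading, -2 on an unexpected error.
 */
int
claim_fifo_interlock(const char *path)
{
	int			fd = open(path, O_WRONLY | O_NONBLOCK);

	if (fd >= 0)
	{
		/* Some process has the FIFO open for reading: already claimed. */
		close(fd);
		return -1;
	}
	if (errno != ENXIO)
		return -2;				/* unexpected failure: bail out */

	/*
	 * ENXIO: no reader exists.  Become the reader.  The postmaster would
	 * pass this fd down to every child, so the write-side probe above
	 * keeps failing as long as any of them is alive.
	 */
	return open(path, O_RDONLY | O_NONBLOCK);
}
```

Note that O_RDONLY | O_NONBLOCK on a FIFO succeeds immediately even with no writer present, which is what lets the claimant become the persistent reader.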
Excerpts from Daniel Farina's message of mar jun 26 17:40:16 -0400 2012: > On that, I used to be of the opinion that this is a good compromise (a > small amount of interlock space, plus mostly posix shmem), but I've > heard since then (I think via AgentM indirectly, but I'm not sure) > that there are cases where even the small SysV segment can cause > problems -- notably when other software tweaks shared memory settings > on behalf of a user, but only leaves just-enough for the software > being installed. This argument is what killed the original patch. If you want to get anything done *at all* I think it needs to be dropped. Changing shmem implementation is already difficult enough --- you don't need to add the requirement that the interlocking mechanism be changed simultaneously. You (or whoever else) can always work on that as a followup patch. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, Jun 26, 2012 at 2:53 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > > Excerpts from Daniel Farina's message of mar jun 26 17:40:16 -0400 2012: > >> On that, I used to be of the opinion that this is a good compromise (a >> small amount of interlock space, plus mostly posix shmem), but I've >> heard since then (I think via AgentM indirectly, but I'm not sure) >> that there are cases where even the small SysV segment can cause >> problems -- notably when other software tweaks shared memory settings >> on behalf of a user, but only leaves just-enough for the software >> being installed. > > This argument is what killed the original patch. If you want to get > anything done *at all* I think it needs to be dropped. Changing shmem > implementation is already difficult enough --- you don't need to add the > requirement that the interlocking mechanism be changed simultaneously. > You (or whoever else) can always work on that as a followup patch. True, but then again, I did very intentionally write: > Excerpts from Daniel Farina's message of mar jun 26 17:40:16 -0400 2012: >> *I wouldn't let perfect be the enemy of good* to make progress >> here, but it appears this was a witnessed real problem, so it may >> be worth reconsidering if there is a way we can safely remove all >> SysV by finding an alternative to the nattach mechanic. (Emphasis mine). I don't think that -hackers at the time gave the zero-shmem rationale much weight (I also was not that happy about the safety mechanism of that patch), but upon more reflection (and taking into account *other* software that may mangle shmem settings) I think it's something at least worth thinking about again one more time. 
What killed the patch was an attachment to the deemed-less-safe strategy for avoiding bogus shmem attachments already in it, but I don't seem to recall anyone putting a whole lot of thought at the time into the zero-shmem case from what I could read on the list, because a small interlock with nattach seemed good-enough. I'm simply suggesting that for additional benefits it may be worth thinking about getting around nattach and thus SysV shmem, especially with regard to safety, in an open-ended way. Maybe there's a solution (like Robert's FIFO suggestion?) that is not too onerous and can satisfy everyone. -- fdr
On Jun 26, 2012, at 5:44 PM, Josh Berkus wrote: > >> On that, I used to be of the opinion that this is a good compromise (a >> small amount of interlock space, plus mostly posix shmem), but I've >> heard since then (I think via AgentM indirectly, but I'm not sure) >> that there are cases where even the small SysV segment can cause >> problems -- notably when other software tweaks shared memory settings >> on behalf of a user, but only leaves just-enough for the software >> being installed. This is most likely on platforms that don't have a >> high SysV shmem limit by default, so installers all feel the >> prerogative to increase the limit, but there's no great answer for how >> to compose a series of such installations. It only takes one >> installer that says "whatever, I'm just catenating stuff to >> sysctl.conf that works for me" to sabotage Postgres' ability to start. > > Personally, I see this as rather an extreme case, and aside from AgentM > himself, have never run into it before. Certainly it would be useful to > not need SysV RAM at all, but it's more important to get a working patch > for 9.3. This can be trivially reproduced if one runs an old (SysV shared memory-based) postgresql alongside a potentially newer postgresql with a smaller SysV segment. This can occur with applications that bundle postgresql as part of the app. Cheers, M
> This can be trivially reproduced if one runs an old (SysV shared memory-based) postgresql alongside a potentially newer postgresql with a smaller SysV segment. This can occur with applications that bundle postgresql as part of the app. I'm not saying it doesn't happen at all. I'm saying it's not the 80% case. So let's fix the 80% case with something we feel confident in, and then revisit the no-sysv interlock as a separate patch. That way if we can't fix the interlock issues, we still have a reduced-shmem version of Postgres. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Robert Haas <robertmhaas@gmail.com> writes: > So, what about keeping a FIFO in the data directory? Hm, does that work if the data directory is on NFS? Or some other weird not-really-Unix file system? > When the > postmaster starts up, it tries to open the file with O_NONBLOCK | > O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather > than O_NONBLOCK). If that succeeds, it bails out. If it fails with > anything other than ENXIO, it bails out. If it fails with exactly > ENXIO, then it opens the pipe with O_RDONLY ... race condition here ... > and arranges to pass the > file descriptor down to all of its children, so that a subsequent open > will fail if it or any of its children are still alive. This might be made to work, but that doesn't sound quite right in detail. I remember we speculated about using an fcntl lock on some file in the data directory, but that fails because child processes don't inherit fcntl locks. In the modern world, it'd be really a step forward if the lock mechanism worked on shared storage, ie a data directory on NFS or similar could be locked against all comers not just those on the same node as the original postmaster. I don't know how to do that though. In the meantime, insisting that we solve this problem before we do anything is a good recipe for ensuring that nothing happens, just like it hasn't happened for the last half dozen years. (I see Alvaro just made the same point.) regards, tom lane
On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote: > > (Emphasis mine). > > I don't think that -hackers at the time gave the zero-shmem rationale > much weight (I also was not that happy about the safety mechanism of > that patch), but upon more reflection (and taking into account *other* > software that may mangle shmem settings) I think it's something at > least worth thinking about again one more time. What killed the patch > was an attachment to the deemed-less-safe strategy for avoiding bogus > shmem attachments already in it, but I don't seem to recall anyone > putting a whole lot of thought at the time into the zero-shmem case > from what I could read on the list, because a small interlock with > nattach seemed good-enough. > > I'm simply suggesting that for additional benefits it may be worth > thinking about getting around nattach and thus SysV shmem, especially > with regard to safety, in an open-ended way. Maybe there's a solution > (like Robert's FIFO suggestion?) that is not too onerous and can > satisfy everyone. I solved this via fcntl locking. I also set up gdb to break in critical regions to test the interlock and I found no flaw in the design. More eyes would be welcome, of course. https://github.com/agentm/postgres/tree/posix_shmem Cheers, M
Josh Berkus <josh@agliodbs.com> writes: > So let's fix the 80% case with something we feel confident in, and then > revisit the no-sysv interlock as a separate patch. That way if we can't > fix the interlock issues, we still have a reduced-shmem version of Postgres. Yes. Insisting that we have the whole change in one patch is a good way to prevent any forward progress from happening. As Alvaro noted, there are plenty of issues to resolve without trying to change the interlock mechanism at the same time. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > In the meantime, insisting that we solve this problem before we do > anything is a good recipe for ensuring that nothing happens, just > like it hasn't happened for the last half dozen years. (I see > Alvaro just made the same point.) And now so has Josh. +1 from me, too. -Kevin
"A.M." <agentm@themactionfaction.com> writes: > This can be trivially reproduced if one runs an old (SysV shared memory-based) postgresql alongside a potentially newer postgresql with a smaller SysV segment. This can occur with applications that bundle postgresql as part of the app. I don't believe that that case is a counterexample to what's being proposed (namely, grabbing a minimum-size shmem segment, perhaps 1K). It would only fail if the old postmaster ate up *exactly* SHMMAX worth of shmem, which is not real likely. As a data point, on my Mac laptop with SHMMAX set to 32MB, 9.2 will by default eat up 31624KB, leaving more than a meg available. Sure, that isn't enough to start another old-style postmaster, but it would be plenty of room for one that only wants 1K. Even if you actively try to configure the shmem settings to exactly fill shmmax (which I concede some installation scripts might do), it's going to be hard to do because of the 8K granularity of the main knob, shared_buffers. Moreover, an installation script that did that would soon learn not to, because of the fact that we don't worry too much about changing small details of shared memory consumption in minor releases. regards, tom lane
Excerpts from Tom Lane's message of mar jun 26 18:58:45 -0400 2012: > Even if you actively try to configure the shmem settings to exactly > fill shmmax (which I concede some installation scripts might do), > it's going to be hard to do because of the 8K granularity of the main > knob, shared_buffers. Actually it's very easy -- just try to start postmaster on a system with not enough shmmax and it will tell you how much shmem it wants. Then copy that number verbatim in the config file. This might fail on picky systems such as MacOSX that require some exact multiple or power of some other parameter, but it works fine on Linux. I think the minimum you can request, at least on Linux, is 1 byte. > Moreover, an installation script that did that > would soon learn not to, because of the fact that we don't worry too > much about changing small details of shared memory consumption in minor > releases. +1 -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
"A.M." <agentm@themactionfaction.com> writes: > On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote: >> I'm simply suggesting that for additional benefits it may be worth >> thinking about getting around nattach and thus SysV shmem, especially >> with regard to safety, in an open-ended way. > I solved this via fcntl locking. No, you didn't, because fcntl locks aren't inherited by child processes. Too bad, because they'd be a great solution otherwise. regards, tom lane
On 06/26/2012 07:30 PM, Tom Lane wrote: > "A.M." <agentm@themactionfaction.com> writes: >> On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote: >>> I'm simply suggesting that for additional benefits it may be worth >>> thinking about getting around nattach and thus SysV shmem, especially >>> with regard to safety, in an open-ended way. > >> I solved this via fcntl locking. > > No, you didn't, because fcntl locks aren't inherited by child processes. > Too bad, because they'd be a great solution otherwise. > You claimed this last time and I replied: http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php "I address this race condition by ensuring that a lock-holding violator is the postmaster or a postmaster child. If such a condition is detected, the child exits immediately without touching the shared memory. POSIX shmem is inherited via file descriptors." This is possible because the locking API allows one to request which PID violates the lock. The child expects the lock to be held and checks that the PID is the parent. If the lock is not held, that means that the postmaster is dead, so the child exits immediately. Cheers, M
On 06/26/2012 07:15 PM, Alvaro Herrera wrote: > > Excerpts from Tom Lane's message of mar jun 26 18:58:45 -0400 2012: > >> Even if you actively try to configure the shmem settings to exactly >> fill shmmax (which I concede some installation scripts might do), >> it's going to be hard to do because of the 8K granularity of the main >> knob, shared_buffers. > > Actually it's very easy -- just try to start postmaster on a system with > not enough shmmax and it will tell you how much shmem it wants. Then > copy that number verbatim in the config file. This might fail on picky > systems such as MacOSX that require some exact multiple or power of some > other parameter, but it works fine on Linux. > Except that we have to account for other installers. A user can install an application in the future which clobbers the value and then the original application will fail to run. The options to get the first app working are: a) to re-install the first app (potentially preventing the second app from running) b) to have the first app detect the failure and readjust the value (guessing what it should be) and potentially forcing a reboot c) to have the user manually adjust the value and potentially force a reboot The failure usually gets blamed on the first application. That's why we had to nuke SysV shmem. Cheers, M
On Tue, Jun 26, 2012 at 6:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> So, what about keeping a FIFO in the data directory? > > Hm, does that work if the data directory is on NFS? Or some other weird > not-really-Unix file system? I would expect NFS to work in general. We could test that. Of course, it's more than possible that there's some bizarre device out there that purports to be NFS but doesn't actually support mkfifo. It's difficult to prove a negative. >> When the >> postmaster starts up, it tries to open the file with O_NONBLOCK | >> O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather >> than O_NONBLOCK). If that succeeds, it bails out. If it fails with >> anything other than ENXIO, it bails out. If it fails with exactly >> ENXIO, then it opens the pipe with O_RDONLY > > ... race condition here ... Oh, if someone tries to start two postmasters at the same time? Hmm. >> and arranges to pass the >> file descriptor down to all of its children, so that a subsequent open >> will fail if it or any of its children are still alive. > > This might be made to work, but that doesn't sound quite right in > detail. > > I remember we speculated about using an fcntl lock on some file in the > data directory, but that fails because child processes don't inherit > fcntl locks. > > In the modern world, it'd be really a step forward if the lock mechanism > worked on shared storage, ie a data directory on NFS or similar could be > locked against all comers not just those on the same node as the > original postmaster. I don't know how to do that though. Well, I think that in theory that DOES work. But I also think it's often misconfigured. Which could also be said of NFS in general. > In the meantime, insisting that we solve this problem before we do > anything is a good recipe for ensuring that nothing happens, just > like it hasn't happened for the last half dozen years. 
(I see Alvaro > just made the same point.) Agreed all around. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
"A.M." <agentm@themactionfaction.com> writes: > On 06/26/2012 07:30 PM, Tom Lane wrote: >>> I solved this via fcntl locking. >> No, you didn't, because fcntl locks aren't inherited by child processes. >> Too bad, because they'd be a great solution otherwise. > You claimed this last time and I replied: > http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php > "I address this race condition by ensuring that a lock-holding violator > is the postmaster or a postmaster child. If such a condition is > detected, the child exits immediately without touching the shared > memory. POSIX shmem is inherited via file descriptors." > This is possible because the locking API allows one to request which PID > violates the lock. The child expects the lock to be held and checks that > the PID is the parent. If the lock is not held, that means that the > postmaster is dead, so the child exits immediately. OK, I went back and re-read the original patch, and I now agree that something like this is possible --- but I don't like the way you did it. The dependence on particular PIDs seems both unnecessary and risky. The key concept here seems to be that the postmaster first stakes a claim on the data directory by exclusive-locking a lock file. If successful, it reduces that lock to shared mode (which can be done atomically, according to the SUS fcntl specification), and then holds the shared lock until it exits. Spawned children will not initially have a lock, but what they can do is attempt to acquire shared lock on the lock file. If fail, exit. If successful, *check to see that the parent postmaster is still alive* (ie, getppid() != 1). If so, the parent must have been continuously holding the lock, and the child has successfully joined the pool of shared lock holders. Otherwise bail out without having changed anything. It is the "parent is still alive" check, not any test on individual PIDs, that makes this work. 
There are two concrete reasons why I don't care for the GetPIDHoldingLock() way. Firstly, the fact that you can get a blocking PID from F_GETLK isn't an essential part of the concept of file locking IMO --- it's just an incidental part of this particular API. May I remind you that the reason we're stuck on SysV shmem in the first place is that we decided to depend on an incidental part of that API, namely nattch? I would like to not require file locking to have any semantics more specific than "a process can hold an exclusive or a shared lock on a file, which is auto-released at process exit". Secondly, in an NFS world I don't believe that the returned l_pid value can be trusted for anything. If it's a PID from a different machine then it might accidentally conflict with one on our machine, or not. Reflecting on this further, it seems to me that the main remaining failure modes are (1) file locking doesn't work, or (2) idiot DBA manually removes the lock file. Both of these could be ameliorated with some refinements to the basic idea. For (1), I suggest that we tweak the startup process (only) to attempt to acquire exclusive lock on the lock file. If it succeeds, we know that file locking is broken, and we can complain. (This wouldn't help for cases where cross-machine locking is broken, but I see no practical way to detect that.) For (2), the problem really is that the proposed patch conflates the PID file with the lock file, but people are conditioned to think that PID files are removable. I suggest that we create a separate, permanently present file that serves only as the lock file and doesn't ever get modified (it need have no content other than the string "Don't remove this!"). It'd be created by initdb, not by individual postmaster runs; indeed the postmaster should fail if it doesn't find the lock file already present. 
The postmaster PID file should still exist with its current contents, but it would serve mostly as documentation and as server-contact information for pg_ctl; it would not be part of the data directory locking mechanism. I wonder whether this design can be adapted to Windows? IIRC we do not have a bulletproof data directory lock scheme for Windows. It seems like this makes few enough demands on the lock mechanism that there ought to be suitable primitives available there too. regards, tom lane
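The locking dance Tom describes might be sketched like this (hypothetical function names; EINTR handling and error reporting elided, and the lock file is assumed to already exist, as it would after initdb):

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int
set_lock(int fd, short locktype)
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = locktype;		/* F_WRLCK, F_RDLCK, or F_UNLCK */
	fl.l_whence = SEEK_SET;		/* l_start = l_len = 0: whole file */
	return fcntl(fd, F_SETLK, &fl);
}

/* Postmaster: stake an exclusive claim, then downgrade to shared. */
int
postmaster_lock_data_dir(const char *lockfile)
{
	int			fd = open(lockfile, O_RDWR);

	if (fd < 0)
		return -1;
	if (set_lock(fd, F_WRLCK) < 0)
	{
		close(fd);				/* another postmaster owns the directory */
		return -1;
	}
	/* Per SUS, converting an existing lock's type is atomic. */
	if (set_lock(fd, F_RDLCK) < 0)
	{
		close(fd);
		return -1;
	}
	return fd;					/* hold this fd (and lock) until exit */
}

/* Child: join the pool of shared lock holders, or bail out. */
int
child_join_lock(const char *lockfile)
{
	int			fd = open(lockfile, O_RDWR);

	if (fd < 0 || set_lock(fd, F_RDLCK) < 0)
		return -1;
	if (getppid() == 1)
	{
		/* Parent died before we got the lock; touch nothing shared. */
		close(fd);
		return -1;
	}
	return fd;
}
```

The getppid() test is what substitutes for any dependence on particular PIDs: if the shared lock was acquired and the parent is still alive, the parent must have held the lock continuously.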
I wrote: > Reflecting on this further, it seems to me that the main remaining > failure modes are (1) file locking doesn't work, or (2) idiot DBA > manually removes the lock file. Oh, wait, I just remembered the really fatal problem here: to quote from the SUS fcntl spec, All locks associated with a file for a given process are removed when a file descriptor for that file is closed by that process or the process holding that file descriptor terminates. That carefully says "a file descriptor", not "the file descriptor through which the lock was acquired". Any close() referencing the lock file will do. That means that it is possible for perfectly innocent code --- for example, something that scans all files in the data directory, as say pg_basebackup might do --- to cause a backend process to lose its lock. When we looked at this before, it seemed like a showstopper. Even if we carefully taught every directory-scanning loop in postgres not to touch the lock file, we cannot expect that for instance a pl/perl function wouldn't accidentally break things. And 99.999% of the time nobody would notice ... it would just be that last 0.001% of people that would be screwed. Still, this discussion has yielded a useful advance, which is that we now see how we might safely make use of lock mechanisms that don't inherit across fork(). We just need something less broken than fcntl(). regards, tom lane
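The pitfall in that SUS wording is easy to demonstrate: take a lock, then let any unrelated code open and close the same file (a sketch with hypothetical names; the forked probe asks whether some other process still holds a lock):

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static int
try_lock(int fd, short locktype)
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = locktype;
	fl.l_whence = SEEK_SET;		/* l_start = l_len = 0: whole file */
	return fcntl(fd, F_SETLK, &fl);
}

/* Fork a probe: does some *other* process hold a lock on this file? */
static int
file_is_locked(const char *path)
{
	pid_t		pid = fork();
	int			status;

	if (pid == 0)
	{
		int			fd = open(path, O_RDWR);

		/* exit 1 if an exclusive lock is refused, i.e. file is locked */
		_exit(try_lock(fd, F_WRLCK) < 0);
	}
	waitpid(pid, &status, 0);
	return WEXITSTATUS(status);
}
```

The probe has to fork because record locks never conflict within a single process; the test below shows the lock silently vanishing after an "innocent" open()/close() of the same file.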
On Tue, Jun 26, 2012 at 6:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Josh Berkus <josh@agliodbs.com> writes: >> So let's fix the 80% case with something we feel confident in, and then >> revisit the no-sysv interlock as a separate patch. That way if we can't >> fix the interlock issues, we still have a reduced-shmem version of Postgres. > > Yes. Insisting that we have the whole change in one patch is a good way > to prevent any forward progress from happening. As Alvaro noted, there > are plenty of issues to resolve without trying to change the interlock > mechanism at the same time. So, here's a patch. Instead of using POSIX shmem, I just took the expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS memory. The sysv shm is still allocated, but it's just a copy of PGShmemHeader; the "real" shared memory is the anonymous block. This won't work if EXEC_BACKEND is defined so it just falls back on straight sysv shm in that case. There are obviously some portability issues here - this is documented not to work on Linux <= 2.4, but it's not clear whether it fails with some suitable error code or just pretends to work and does the wrong thing. I tested that it does compile and work on both Linux 3.2.6 and MacOS X 10.6.8. And the comments probably need work and... who knows what else is wrong. But, thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > So, here's a patch. Instead of using POSIX shmem, I just took the > expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS > memory. The sysv shm is still allocated, but it's just a copy of > PGShmemHeader; the "real" shared memory is the anonymous block. This > won't work if EXEC_BACKEND is defined so it just falls back on > straight sysv shm in that case. Um. I hadn't thought about the EXEC_BACKEND interaction, but that seems like a bit of a showstopper. I would not like to give up the ability to debug EXEC_BACKEND mode on Unixen. Would Posix shmem help with that at all? Why did you choose not to use the Posix API, anyway? regards, tom lane
On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> So, here's a patch. Instead of using POSIX shmem, I just took the >> expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS >> memory. The sysv shm is still allocated, but it's just a copy of >> PGShmemHeader; the "real" shared memory is the anonymous block. This >> won't work if EXEC_BACKEND is defined so it just falls back on >> straight sysv shm in that case. > > Um. I hadn't thought about the EXEC_BACKEND interaction, but that seems > like a bit of a showstopper. I would not like to give up the ability > to debug EXEC_BACKEND mode on Unixen. > > Would Posix shmem help with that at all? Why did you choose not to > use the Posix API, anyway? It seemed more complicated. If we use the POSIX API, we've got to have code to find a non-colliding name for the shm, and we've got to arrange to clean it up at process exit. Anonymous shm doesn't require a name and goes away automatically when it's no longer in use. With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to make it continue to use a full-sized sysv shm. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 27, 2012 at 3:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "A.M." <agentm@themactionfaction.com> writes: >> On 06/26/2012 07:30 PM, Tom Lane wrote: >>>> I solved this via fcntl locking. > >>> No, you didn't, because fcntl locks aren't inherited by child processes. >>> Too bad, because they'd be a great solution otherwise. > >> You claimed this last time and I replied: >> http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php > >> "I address this race condition by ensuring that a lock-holding violator >> is the postmaster or a postmaster child. If such a condition is >> detected, the child exits immediately without touching the shared >> memory. POSIX shmem is inherited via file descriptors." > >> This is possible because the locking API allows one to request which PID >> violates the lock. The child expects the lock to be held and checks that >> the PID is the parent. If the lock is not held, that means that the >> postmaster is dead, so the child exits immediately. > > OK, I went back and re-read the original patch, and I now agree that > something like this is possible --- but I don't like the way you did > it. The dependence on particular PIDs seems both unnecessary and risky. > > The key concept here seems to be that the postmaster first stakes a > claim on the data directory by exclusive-locking a lock file. If > successful, it reduces that lock to shared mode (which can be done > atomically, according to the SUS fcntl specification), and then holds > the shared lock until it exits. Spawned children will not initially > have a lock, but what they can do is attempt to acquire shared lock on > the lock file. If fail, exit. If successful, *check to see that the > parent postmaster is still alive* (ie, getppid() != 1). If so, the > parent must have been continuously holding the lock, and the child has > successfully joined the pool of shared lock holders. Otherwise bail > out without having changed anything. 
It is the "parent is still alive" > check, not any test on individual PIDs, that makes this work. > > There are two concrete reasons why I don't care for the > GetPIDHoldingLock() way. Firstly, the fact that you can get a blocking > PID from F_GETLK isn't an essential part of the concept of file locking > IMO --- it's just an incidental part of this particular API. May I > remind you that the reason we're stuck on SysV shmem in the first place > is that we decided to depend on an incidental part of that API, namely > nattch? I would like to not require file locking to have any semantics > more specific than "a process can hold an exclusive or a shared lock on > a file, which is auto-released at process exit". Secondly, in an NFS > world I don't believe that the returned l_pid value can be trusted for > anything. If it's a PID from a different machine then it might > accidentally conflict with one on our machine, or not. > > Reflecting on this further, it seems to me that the main remaining > failure modes are (1) file locking doesn't work, or (2) idiot DBA > manually removes the lock file. Both of these could be ameliorated > with some refinements to the basic idea. For (1), I suggest that > we tweak the startup process (only) to attempt to acquire exclusive lock > on the lock file. If it succeeds, we know that file locking is broken, > and we can complain. (This wouldn't help for cases where cross-machine > locking is broken, but I see no practical way to detect that.) > For (2), the problem really is that the proposed patch conflates the PID > file with the lock file, but people are conditioned to think that PID > files are removable. I suggest that we create a separate, permanently > present file that serves only as the lock file and doesn't ever get > modified (it need have no content other than the string "Don't remove > this!"). 
It'd be created by initdb, not by individual postmaster runs; > indeed the postmaster should fail if it doesn't find the lock file > already present. The postmaster PID file should still exist with its > current contents, but it would serve mostly as documentation and as > server-contact information for pg_ctl; it would not be part of the data > directory locking mechanism. > > I wonder whether this design can be adapted to Windows? IIRC we do > not have a bulletproof data directory lock scheme for Windows. > It seems like this makes few enough demands on the lock mechanism > that there ought to be suitable primitives available there too. I assume you're saying we need to make changes in the internal API, right? Because we already have a Windows-native implementation of shared memory that AFAIK works, so if the new Unix stuff can be done with the same internal APIs, it shouldn't need to be changed. (Sorry, haven't followed the thread in detail) If so - can we define exactly what properties it is we *need*? (A native API worth looking at is e.g. http://msdn.microsoft.com/en-us/library/windows/desktop/aa365203(v=vs.85).aspx - but there are probably others as well if that one doesn't do) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > On Wed, Jun 27, 2012 at 3:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I wonder whether this design can be adapted to Windows? IIRC we do >> not have a bulletproof data directory lock scheme for Windows. >> It seems like this makes few enough demands on the lock mechanism >> that there ought to be suitable primitives available there too. > I assume you're saying we need to make changes in the internal API, > right? Because we already have a Windows-native implementation of > shared memory that AFAIK works, Right, but does it provide honest protection against starting two postmasters in the same data directory? Or more to the point, does it prevent starting a new postmaster when the old postmaster crashed but there are still orphaned backends making changes? AFAIR we basically punted on those problems for the Windows port, for lack of an equivalent to nattch. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Would Posix shmem help with that at all? Why did you choose not to >> use the Posix API, anyway? > It seemed more complicated. If we use the POSIX API, we've got to > have code to find a non-colliding name for the shm, and we've got to > arrange to clean it up at process exit. Anonymous shm doesn't require > a name and goes away automatically when it's no longer in use. I see. Those are pretty good reasons ... > With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to > make it continue to use a full-sized sysv shm. Well, if the ultimate objective is to get out from under the SysV APIs entirely, we're not going to get there if we still have to have all that code for the EXEC_BACKEND case. Maybe it's time to decide that we don't need to support EXEC_BACKEND on Unix. regards, tom lane
All, * Tom Lane (tgl@sss.pgh.pa.us) wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Would Posix shmem help with that at all? Why did you choose not to > >> use the Posix API, anyway? > > > It seemed more complicated. If we use the POSIX API, we've got to > > have code to find a non-colliding name for the shm, and we've got to > > arrange to clean it up at process exit. Anonymous shm doesn't require > > a name and goes away automatically when it's no longer in use. > > I see. Those are pretty good reasons ... After talking to Magnus a bit this morning regarding this, it sounds like what we're doing on Windows is closer to Anonymous shm, except that they use an intentionally specific name, which also allows them to detect if any children are still alive by using a "create-if-not-exists" approach on the shm segment and failing if it still exists. There were some corner cases around restarts due to it taking a few seconds for the Windows kernel to pick up on the fact that all the children are dead and that the shm segment should go away, but they were able to work around that, and failure to start is surely much better than possible corruption. What this all boils down to is- can you have a shm segment that goes away when no one is still attached to it, but actually give it a name and then detect if it already exists atomically on startup on Linux/Unixes? If so, perhaps we could use the same mechanism on both.. Thanks, Stephen
* Tom Lane (tgl@sss.pgh.pa.us) wrote: > Right, but does it provide honest protection against starting two > postmasters in the same data directory? Or more to the point, > does it prevent starting a new postmaster when the old postmaster > crashed but there are still orphaned backends making changes? > AFAIR we basically punted on those problems for the Windows port, > for lack of an equivalent to nattch. See my other mail, but, after talking to Magnus, it's my understanding that we had that problem initially, but it was later solved by using a named shared memory segment which the kernel will clean up when all children are gone. That, combined with a 'create-if-not-exists' call, allows detection of lost children to be done. Thanks, Stephen
On Wed, Jun 27, 2012 at 3:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Magnus Hagander <magnus@hagander.net> writes: >> On Wed, Jun 27, 2012 at 3:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I wonder whether this design can be adapted to Windows? IIRC we do >>> not have a bulletproof data directory lock scheme for Windows. >>> It seems like this makes few enough demands on the lock mechanism >>> that there ought to be suitable primitives available there too. > >> I assume you're saying we need to make changes in the internal API, >> right? Because we already have a Windows-native implementation of >> shared memory that AFAIK works, > > Right, but does it provide honest protection against starting two > postmasters in the same data directory? Or more to the point, > does it prevent starting a new postmaster when the old postmaster > crashed but there are still orphaned backends making changes? > AFAIR we basically punted on those problems for the Windows port, > for lack of an equivalent to nattch. No, we spent a lot of time trying to *fix* it, and IIRC we did. We create a shared memory segment with a fixed name based on the data directory. This shared memory segment is inherited by all children. It will automatically go away only when all processes that have an open handle to it go away (in fact, it can even take a second or two more, if they go away by crash and not by cleanup - we have a workaround in the code for that). But as long as there is an orphaned backend around, the shared memory segment stays around. We don't have "nattch". But we do have "nattch">0". Or something like that. You can work around it if you find two different paths to the same data directory (e.g. using junctions), but you are really actively trying to break the system if you do that... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > On Wed, Jun 27, 2012 at 3:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> AFAIR we basically punted on those problems for the Windows port, >> for lack of an equivalent to nattch. > No, we spent a lot of time trying to *fix* it, and IIRC we did. OK, in that case this isn't as interesting as I thought. If we do go over to a file-locking-based solution on Unix, it might be worthwhile changing to something similar on Windows. But it would be more about reducing coding differences between the platforms than plugging any real holes. regards, tom lane
On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Would Posix shmem help with that at all? Why did you choose not to >>> use the Posix API, anyway? > >> It seemed more complicated. If we use the POSIX API, we've got to >> have code to find a non-colliding name for the shm, and we've got to >> arrange to clean it up at process exit. Anonymous shm doesn't require >> a name and goes away automatically when it's no longer in use. > > I see. Those are pretty good reasons ... > >> With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to >> make it continue to use a full-sized sysv shm. > > Well, if the ultimate objective is to get out from under the SysV APIs > entirely, we're not going to get there if we still have to have all that > code for the EXEC_BACKEND case. Maybe it's time to decide that we don't > need to support EXEC_BACKEND on Unix. I don't personally see a need to do anything that drastic at this point. Admittedly, I rarely compile with EXEC_BACKEND, but I don't think it's bad to have the option available. Adjusting shared memory limits isn't really a big problem for PostgreSQL developers; what we're trying to avoid is the need for PostgreSQL *users* to concern themselves with it. And surely anyone who is using EXEC_BACKEND on Unix is a developer, not a user. If and when we come up with a substitute for the nattch interlock, then this might be worth thinking a bit harder about. At that point, if we still want to support EXEC_BACKEND on Unix, then we'd need the EXEC_BACKEND case at least to use POSIX shm rather than anonymous shared mmap. Personally I think that would be not that hard and probably worth doing, but there doesn't seem to be any point in writing that code now, because for the simple case of just reducing the amount of shm that we allocate, an anonymous mapping seems better all around. 
We shouldn't overthink this. Our shared memory code has allocated a bunch of crufty hacks over the years to work around various platform-specific issues, but it's still not a lot of code, so I don't see any reason to worry unduly about making a surgical fix without having a master plan. Nothing we want to do down the road will require moving the earth. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 27, 2012 at 9:52 AM, Stephen Frost <sfrost@snowman.net> wrote: > What this all boils down to is- can you have a shm segment that goes > away when no one is still attached to it, but actually give it a name > and then detect if it already exists atomically on startup on > Linux/Unixes? If so, perhaps we could use the same mechanism on both.. As I understand it, no. You can either have anonymous shared mappings, which go away when no longer in use but do not have a name. Or you can have POSIX or sysv shm, which have a name but do not automatically go away when no longer in use. There seems to be no method for setting up a segment that both has a name and goes away automatically. POSIX shm in particular tries to "look like a file", whereas anonymous memory tries to look more like malloc (except that you can share the mapping with child processes). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jun 27, 2012, at 7:34 AM, Robert Haas wrote: > On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> So, here's a patch. Instead of using POSIX shmem, I just took the >>> expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS >>> memory. The sysv shm is still allocated, but it's just a copy of >>> PGShmemHeader; the "real" shared memory is the anonymous block. This >>> won't work if EXEC_BACKEND is defined so it just falls back on >>> straight sysv shm in that case. >> >> Um. I hadn't thought about the EXEC_BACKEND interaction, but that seems >> like a bit of a showstopper. I would not like to give up the ability >> to debug EXEC_BACKEND mode on Unixen. >> >> Would Posix shmem help with that at all? Why did you choose not to >> use the Posix API, anyway? > > It seemed more complicated. If we use the POSIX API, we've got to > have code to find a non-colliding name for the shm, and we've got to > arrange to clean it up at process exit. Anonymous shm doesn't require > a name and goes away automatically when it's no longer in use. > > With respect to EXEC_BACKEND, I wasn't proposing to kill it, just to > make it continue to use a full-sized sysv shm. > I solved this by unlinking the posix shared memory segment immediately after creation. The file descriptor to the shared memory is inherited, so, by definition, only the postmaster children can access the memory. This ensures that shared memory cleanup is immediate after the postmaster and all children close, as well. The fcntl locking is not required to protect the posix shared memory - it can protect itself. Cheers, M
On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Would Posix shmem help with that at all? Why did you choose not to >>> use the Posix API, anyway? > >> It seemed more complicated. If we use the POSIX API, we've got to >> have code to find a non-colliding name for the shm, and we've got to >> arrange to clean it up at process exit. Anonymous shm doesn't require >> a name and goes away automatically when it's no longer in use. > > I see. Those are pretty good reasons ... So, should we do it this way? I did a little research and discovered that Linux 2.3.51 (released 3/11/2000) apparently returns EINVAL for MAP_SHARED|MAP_ANONYMOUS. That combination is documented to work beginning in Linux 2.4.0. How worried should we be about people trying to run PostgreSQL 9.3 on pre-2.4 kernels? If we want to worry about it, we could try mapping a one-page shared MAP_SHARED|MAP_ANONYMOUS segment first. If that works, we could assume that we have a working MAP_SHARED|MAP_ANONYMOUS facility and try to allocate the whole segment plus a minimal sysv shm. If the single page allocation fails with EINVAL, we could fall back to allocating the entire segment as sysv shm. A related question is - if we do this - should we enable it only on ports where we've verified that it works, or should we just turn it on everywhere and fix breakage if/when it's reported? I lean toward the latter. If we find that there are platforms where (a) mmap is not supported or (b) MAP_SHARED|MAP_ANON works but has the wrong semantics, we could either shut off this optimization on those platforms by fiat, or we could test not only that the call succeeds, but that it works properly: create a one-page mapping and fork a child process; in the child, write to the mapping and exit; in the parent, wait for the child to exit and then test that we can read back the correct contents. 
This would protect against a hypothetical system where the flags are accepted but fail to produce the correct behavior. I'm inclined to think this is over-engineering in the absence of evidence that there are platforms that work this way. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 28, 2012 at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> Would Posix shmem help with that at all? Why did you choose not to >>>> use the Posix API, anyway? >> >>> It seemed more complicated. If we use the POSIX API, we've got to >>> have code to find a non-colliding name for the shm, and we've got to >>> arrange to clean it up at process exit. Anonymous shm doesn't require >>> a name and goes away automatically when it's no longer in use. >> >> I see. Those are pretty good reasons ... > > So, should we do it this way? > > I did a little research and discovered that Linux 2.3.51 (released > 3/11/2000) apparently returns EINVAL for MAP_SHARED|MAP_ANONYMOUS. > That combination is documented to work beginning in Linux 2.4.0. How > worried should we be about people trying to run PostgreSQL 9.3 on > pre-2.4 kernels? If we want to worry about it, we could try mapping a > one-page shared MAP_SHARED|MAP_ANONYMOUS segment first. If that > works, we could assume that we have a working MAP_SHARED|MAP_ANONYMOUS > facility and try to allocate the whole segment plus a minimal sysv > shm. If the single page allocation fails with EINVAL, we could fall > back to allocating the entire segment as sysv shm. Do we really need a runtime check for that? Isn't a configure check enough? If they *do* deploy postgresql 9.3 on something that old, they're building from source anyway... > A related question is - if we do this - should we enable it only on > ports where we've verified that it works, or should we just turn it on > everywhere and fix breakage if/when it's reported? I lean toward the > latter. Depends on the amount of expected breakage, but I'd lean towards the latter as well. 
> If we find that there are platforms where (a) mmap is not supported or > (b) MAP_SHARED|MAP_ANON works but has the wrong semantics, we could > either shut off this optimization on those platforms by fiat, or we > could test not only that the call succeeds, but that it works > properly: create a one-page mapping and fork a child process; in the > child, write to the mapping and exit; in the parent, wait for the > child to exit and then test that we can read back the correct > contents. This would protect against a hypothetical system where the > flags are accepted but fail to produce the correct behavior. I'm > inclined to think this is over-engineering in the absence of evidence > that there are platforms that work this way. Could we actually turn *that* into a configure test, or will that be too complex? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Thu, Jun 28, 2012 at 7:05 AM, Magnus Hagander <magnus@hagander.net> wrote: > Do we really need a runtime check for that? Isn't a configure check > enough? If they *do* deploy postgresql 9.3 on something that old, > they're building from source anyway... [...] > > Could we actually turn *that* into a configure test, or will that be > too complex? I don't see why we *couldn't* make either of those things into a configure test, but it seems more complicated than a runtime test and less accurate, so I guess I'd be in favor of doing them at runtime or not at all. Actually, the try-a-one-page-mapping-and-see-if-you-get-EINVAL test is so simple that I really can't see any reason not to insert that defense. The fork-and-check-whether-it-really-works test is probably excess paranoia until we determine whether that's really a danger anywhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 28, 2012 at 6:05 AM, Magnus Hagander <magnus@hagander.net> wrote: > On Thu, Jun 28, 2012 at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Jun 27, 2012 at 9:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Robert Haas <robertmhaas@gmail.com> writes: >>>> On Wed, Jun 27, 2012 at 12:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>>> Would Posix shmem help with that at all? Why did you choose not to >>>>> use the Posix API, anyway? >>> >>>> It seemed more complicated. If we use the POSIX API, we've got to >>>> have code to find a non-colliding name for the shm, and we've got to >>>> arrange to clean it up at process exit. Anonymous shm doesn't require >>>> a name and goes away automatically when it's no longer in use. >>> >>> I see. Those are pretty good reasons ... >> >> So, should we do it this way? >> >> I did a little research and discovered that Linux 2.3.51 (released >> 3/11/2000) apparently returns EINVAL for MAP_SHARED|MAP_ANONYMOUS. >> That combination is documented to work beginning in Linux 2.4.0. How >> worried should we be about people trying to run PostgreSQL 9.3 on >> pre-2.4 kernels? If we want to worry about it, we could try mapping a >> one-page shared MAP_SHARED|MAP_ANONYMOUS segment first. If that >> works, we could assume that we have a working MAP_SHARED|MAP_ANONYMOUS >> facility and try to allocate the whole segment plus a minimal sysv >> shm. If the single page allocation fails with EINVAL, we could fall >> back to allocating the entire segment as sysv shm. Why not just mmap /dev/zero (MAP_SHARED but not MAP_ANONYMOUS)? I seem to think that's what I did when I needed this functionality oh so many moons ago. -- Jon
On Thu, Jun 28, 2012 at 9:47 AM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > Why not just mmap /dev/zero (MAP_SHARED but not MAP_ANONYMOUS)? I > seem to think that's what I did when I needed this functionality oh so > many moons ago. From the reading I've done on this topic, that seems to be a trick invented on Solaris that is considered grotty and awful by everyone else. The thing is that you want the mapping to be shared with the processes that inherit the mapping from you. You do *NOT* want the mapping to be shared with EVERYONE who has mapped that file for any reason, which is the usual meaning of MAP_SHARED on a file. Maybe this happens to work correctly on some or all platforms, but I would want to have some convincing evidence that it's more widely supported (with the correct semantics) than MAP_ANON before relying on it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Magnus Hagander <magnus@hagander.net> writes: > On Thu, Jun 28, 2012 at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> A related question is - if we do this - should we enable it only on >> ports where we've verified that it works, or should we just turn it on >> everywhere and fix breakage if/when it's reported? I lean toward the >> latter. > Depends on the amount of expected breakage, but I'd lean towards the > latter as well. If we don't turn it on, we won't find out whether it works. I'd say try it first and then back off if that proves necessary. I'd just as soon not see us write any fallback logic without evidence that it's needed. FWIW, even my pet dinosaur HP-UX 10.20 box appears to support mmap(MAP_SHARED|MAP_ANONYMOUS) --- at least the mmap man page documents both flags. I find it really pretty hard to believe that there are any machines out there that haven't got this and yet might be expected to run PG 9.3+. We should not go into it with an expectation of failure, anyway. regards, tom lane
On Thu, Jun 28, 2012 at 8:57 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 28, 2012 at 9:47 AM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: >> Why not just mmap /dev/zero (MAP_SHARED but not MAP_ANONYMOUS)? I >> seem to think that's what I did when I needed this functionality oh so >> many moons ago. > > From the reading I've done on this topic, that seems to be a trick > invented on Solaris that is considered grotty and awful by everyone > else. The thing is that you want the mapping to be shared with the > processes that inherit the mapping from you. You do *NOT* want the > mapping to be shared with EVERYONE who has mapped that file for any > reason, which is the usual meaning of MAP_SHARED on a file. Maybe > this happens to work correctly on some or all platforms, but I would > want to have some convincing evidence that it's more widely supported > (with the correct semantics) than MAP_ANON before relying on it. When I did this (I admit, it was on Linux but it was a long time ago) only the inherited file descriptor + mmap structure mattered - modifications were private to the process and its children - other apps always saw their "own" /dev/zero. A quick google suggests that - according to qnx, sco, and some others - mmap'ing /dev/zero retains the expected privacy. Given how /dev/zero works I'd be very surprised if it was otherwise. I would love to see links that suggest that /dev/zero is nasty (or, in fact, in any way fundamentally different than mmap'ing /dev/zero) - feel free to send them to me privately to avoid polluting the list. -- Jon
... btw, I rather imagine that Robert has already noticed this, but OS X (and presumably other BSDen) spells the flag "MAP_ANON" not "MAP_ANONYMOUS". I also find this rather interesting flag there: MAP_HASSEMAPHORE Notify the kernel that the region may contain semaphores and that special handling may be necessary. By "semaphore" I suspect they mean "spinlock", so we'd better turn this flag on where it exists. regards, tom lane
On Thu, Jun 28, 2012 at 10:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > ... btw, I rather imagine that Robert has already noticed this, but OS X > (and presumably other BSDen) spells the flag "MAP_ANON" not > "MAP_ANONYMOUS". I also find this rather interesting flag there: > > MAP_HASSEMAPHORE Notify the kernel that the region may contain sema- > phores and that special handling may be necessary. > > By "semaphore" I suspect they mean "spinlock", so we'd better turn this > flag on where it exists. Sounds fine to me. Since no one seems opposed to the basic approach, and everyone (I assume) will be happier to reduce the impact of dealing with shared memory limits, I went ahead and committed a cleaned-up version of the previous patch. Let's see what the build-farm thinks. Assuming things go well, there are a number of follow-on things that we need to do finish this up: 1. Update the documentation. I skipped this for now, because I think that what we write there is going to be heavily dependent on how portable this turns out to be, which we don't know yet. Also, it's not exactly clear to me what the documentation should say if this does turn out to work everywhere. Much of section 17.4 will become irrelevant to most users, but I doubt we'd just want to remove it; it could still matter for people running EXEC_BACKEND or running many postmasters on the same machine or, of course, people running on platforms where this just doesn't work, if there are any. 2. Update the HINT messages when shared memory allocation fails. Maybe the new most-common-failure mode there will be too many postmasters running on the same machine? We might need to wait for some field reports before adjusting this. 3. Consider adjusting the logic inside initdb. If this works everywhere, the code for determining how to set shared_buffers should become pretty much irrelevant. 
If this works everywhere, the code for determining how to set shared_buffers should become pretty much irrelevant. Even if it only works some places, we could add 64MB or 128MB or whatever to the list of values we probe, so that people won't get quite such a sucky configuration out of the box. Of course there's no number here that will be good for everyone. and of course 4. Fix any platforms that are now horribly broken. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 28 June 2012 16:26, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 28, 2012 at 10:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ... btw, I rather imagine that Robert has already noticed this, but OS X >> (and presumably other BSDen) spells the flag "MAP_ANON" not >> "MAP_ANONYMOUS". I also find this rather interesting flag there: >> >> MAP_HASSEMAPHORE Notify the kernel that the region may contain sema- >> phores and that special handling may be necessary. >> >> By "semaphore" I suspect they mean "spinlock", so we'd better turn this >> flag on where it exists. > > Sounds fine to me. Since no one seems opposed to the basic approach, > and everyone (I assume) will be happier to reduce the impact of > dealing with shared memory limits, I went ahead and committed a > cleaned-up version of the previous patch. Let's see what the > build-farm thinks. > > Assuming things go well, there are a number of follow-on things that > we need to do finish this up: > > 1. Update the documentation. I skipped this for now, because I think > that what we write there is going to be heavily dependent on how > portable this turns out to be, which we don't know yet. Also, it's > not exactly clear to me what the documentation should say if this does > turn out to work everywhere. Much of section 17.4 will become > irrelevant to most users, but I doubt we'd just want to remove it; it > could still matter for people running EXEC_BACKEND or running many > postmasters on the same machine or, of course, people running on > platforms where this just doesn't work, if there are any. > > 2. Update the HINT messages when shared memory allocation fails. > Maybe the new most-common-failure mode there will be too many > postmasters running on the same machine? We might need to wait for > some field reports before adjusting this. > > 3. Consider adjusting the logic inside initdb. 
If this works > everywhere, the code for determining how to set shared_buffers should > become pretty much irrelevant. Even if it only works some places, we > could add 64MB or 128MB or whatever to the list of values we probe, so > that people won't get quite such a sucky configuration out of the box. > Of course there's no number here that will be good for everyone. > > and of course > > 4. Fix any platforms that are now horribly broken. On 64-bit Linux, if I allocate more shared buffers than the system is capable of reserving, it doesn't start. This is expected, but there's no error logged anywhere (actually, nothing logged at all), and the postmaster.pid file is left behind after this failure. -- Thom
On Thu, Jun 28, 2012 at 8:26 AM, Robert Haas <robertmhaas@gmail.com> wrote: > 3. Consider adjusting the logic inside initdb. If this works > everywhere, the code for determining how to set shared_buffers should > become pretty much irrelevant. Even if it only works some places, we > could add 64MB or 128MB or whatever to the list of values we probe, so > that people won't get quite such a sucky configuration out of the box. > Of course there's no number here that will be good for everyone. This seems independent of the type of shared memory used and the limits on it. If it tries 64MB or 128MB and discovers that it can't obtain that much shared memory, it automatically climbs down to smaller values until it finds one that works. I think the impediment to adopting larger defaults is not what happens if it can't get that much shared memory, but rather what happens if the machine doesn't have that much physical memory. The test server will still start (and so there will be no climb-down), leaving a default which is valid but just has horrid performance. Cheers, Jeff
On Thu, Jun 28, 2012 at 12:13 PM, Thom Brown <thom@linux.com> wrote: > On 64-bit Linux, if I allocate more shared buffers than the system is > capable of reserving, it doesn't start. This is expected, but there's > no error logged anywhere (actually, nothing logged at all), and the > postmaster.pid file is left behind after this failure. Fixed. However, I discovered something unpleasant. With the new code, on MacOS X, if you set shared_buffers to say 3200GB, the server happily starts up. Or at least the shared memory allocation goes through just fine. The postmaster then sits there apparently forever without emitting any log messages, which I eventually discovered was because it's busy initializing a billion or so spinlocks. I'm pretty sure that this machine does not have >3TB of virtual memory, even counting swap. So that means that MacOS X has absolutely no common sense whatsoever as far as anonymous shared memory allocations go. Not sure exactly what to do about that. Linux is more sensible, at least on the system I tested, and fails cleanly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 28, 2012 at 7:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Jun 28, 2012 at 12:13 PM, Thom Brown <thom@linux.com> wrote: >> On 64-bit Linux, if I allocate more shared buffers than the system is >> capable of reserving, it doesn't start. This is expected, but there's >> no error logged anywhere (actually, nothing logged at all), and the >> postmaster.pid file is left behind after this failure. > > Fixed. > > However, I discovered something unpleasant. With the new code, on > MacOS X, if you set shared_buffers to say 3200GB, the server happily > starts up. Or at least the shared memory allocation goes through just > fine. The postmaster then sits there apparently forever without > emitting any log messages, which I eventually discovered was because > it's busy initializing a billion or so spinlocks. > > I'm pretty sure that this machine does not have >3TB of virtual > memory, even counting swap. So that means that MacOS X has absolutely > no common sense whatsoever as far as anonymous shared memory > allocations go. Not sure exactly what to do about that. Linux is > more sensible, at least on the system I tested, and fails cleanly. What happens if you mlock() it into memory - does that fail quickly? Is that not something we might want to do *anyway*? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote: > On Thu, Jun 28, 2012 at 7:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jun 28, 2012 at 12:13 PM, Thom Brown <thom@linux.com> wrote: > >> On 64-bit Linux, if I allocate more shared buffers than the system is > >> capable of reserving, it doesn't start. This is expected, but there's > >> no error logged anywhere (actually, nothing logged at all), and the > >> postmaster.pid file is left behind after this failure. > > > > Fixed. > > > > However, I discovered something unpleasant. With the new code, on > > MacOS X, if you set shared_buffers to say 3200GB, the server happily > > starts up. Or at least the shared memory allocation goes through just > > fine. The postmaster then sits there apparently forever without > > emitting any log messages, which I eventually discovered was because > > it's busy initializing a billion or so spinlocks. > > > > I'm pretty sure that this machine does not have >3TB of virtual > > memory, even counting swap. So that means that MacOS X has absolutely > > no common sense whatsoever as far as anonymous shared memory > > allocations go. Not sure exactly what to do about that. Linux is > > more sensible, at least on the system I tested, and fails cleanly. > > What happens if you mlock() it into memory - does that fail quickly? > Is that not something we might want to do *anyway*? You normally can only mlock() minor amounts of memory without changing settings. Requiring users to change that setting (aside from the fact that mlocking would be a bad idea imo) would run contrary to the point of the patch, wouldn't it? ;) Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 28, 2012 at 7:27 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote: >> On Thu, Jun 28, 2012 at 7:15 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> > On Thu, Jun 28, 2012 at 12:13 PM, Thom Brown <thom@linux.com> wrote: >> >> On 64-bit Linux, if I allocate more shared buffers than the system is >> >> capable of reserving, it doesn't start. This is expected, but there's >> >> no error logged anywhere (actually, nothing logged at all), and the >> >> postmaster.pid file is left behind after this failure. >> > >> > Fixed. >> > >> > However, I discovered something unpleasant. With the new code, on >> > MacOS X, if you set shared_buffers to say 3200GB, the server happily >> > starts up. Or at least the shared memory allocation goes through just >> > fine. The postmaster then sits there apparently forever without >> > emitting any log messages, which I eventually discovered was because >> > it's busy initializing a billion or so spinlocks. >> > >> > I'm pretty sure that this machine does not have >3TB of virtual >> > memory, even counting swap. So that means that MacOS X has absolutely >> > no common sense whatsoever as far as anonymous shared memory >> > allocations go. Not sure exactly what to do about that. Linux is >> > more sensible, at least on the system I tested, and fails cleanly. >> >> What happens if you mlock() it into memory - does that fail quickly? >> Is that not something we might want to do *anyway*? > You normally can only mlock() mminor amounts of memory without changing > settings. Requiring to change that setting (aside that mlocking would be a bad > idea imo) would run contrary to the point of the patch, wouldn't it? ;) It would. I wasn't aware of that limitation :) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > On Thu, Jun 28, 2012 at 7:27 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote: >>> What happens if you mlock() it into memory - does that fail quickly? >>> Is that not something we might want to do *anyway*? >> You normally can only mlock() minor amounts of memory without changing >> settings. Requiring users to change that setting (aside from the fact that mlocking >> would be a bad idea imo) would run contrary to the point of the patch, wouldn't it? ;) > It would. I wasn't aware of that limitation :) The OSX man page says that mlock should give EAGAIN for a permissions failure (ie, exceeding the rlimit) but [ENOMEM] Some portion of the indicated address range is not allocated. There was an error faulting/mapping a page. It might be helpful to try mlock (if available, which it isn't everywhere) and complain about ENOMEM but not other errors. Of course, if the kernel checks rlimit first, we won't learn anything ... I think it *would* be a good idea to mlock if we could. Setting shmem large enough that it swaps has always been horrible for performance, and in sysv-land there's no way to prevent that. But we can't error out on permissions failure. regards, tom lane
On Thursday, June 28, 2012 07:43:16 PM Tom Lane wrote: > Magnus Hagander <magnus@hagander.net> writes: > > On Thu, Jun 28, 2012 at 7:27 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote: > >>> What happens if you mlock() it into memory - does that fail quickly? > >>> Is that not something we might want to do *anyway*? > >> > >> You normally can only mlock() minor amounts of memory without changing > >> settings. Requiring users to change that setting (aside from the fact that mlocking would be > >> a bad idea imo) would run contrary to the point of the patch, wouldn't > >> it? ;) > > > > It would. I wasn't aware of that limitation :) > > The OSX man page says that mlock should give EAGAIN for a permissions > failure (ie, exceeding the rlimit) but > > [ENOMEM] Some portion of the indicated address range is not > allocated. There was an error faulting/mapping a > page. > > It might be helpful to try mlock (if available, which it isn't > everywhere) and complain about ENOMEM but not other errors. Of course, > if the kernel checks rlimit first, we won't learn anything ... > > I think it *would* be a good idea to mlock if we could. Setting shmem > large enough that it swaps has always been horrible for performance, > and in sysv-land there's no way to prevent that. But we can't error > out on permissions failure. It's also a very good method of getting into hard-to-diagnose OOM situations, though. Unless the machine is set up very carefully and only runs postgres, I don't think it's acceptable to do that. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On Thursday, June 28, 2012 07:43:16 PM Tom Lane wrote: >> I think it *would* be a good idea to mlock if we could. Setting shmem >> large enough that it swaps has always been horrible for performance, >> and in sysv-land there's no way to prevent that. But we can't error >> out on permissions failure. > Its also a very good method to get into hard to diagnose OOM situations > though. Unless the machine is setup very careful and only runs postgres I > don't think its acceptable to do that. Well, the permissions angle is actually a good thing here. There is pretty much no risk of the mlock succeeding on a box that hasn't been specially configured --- and, in most cases, I think you'd need root cooperation to raise postgres' RLIMIT_MEMLOCK. So I think we could try to mlock without having any effect for 99% of users. The 1% who are smart enough to raise the rlimit to something suitable would get better, or at least more predictable, performance. regards, tom lane
On Thursday, June 28, 2012 08:00:06 PM Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On Thursday, June 28, 2012 07:43:16 PM Tom Lane wrote: > >> I think it *would* be a good idea to mlock if we could. Setting shmem > >> large enough that it swaps has always been horrible for performance, > >> and in sysv-land there's no way to prevent that. But we can't error > >> out on permissions failure. > > > > It's also a very good method of getting into hard-to-diagnose OOM situations, > > though. Unless the machine is set up very carefully and only runs postgres, I > > don't think it's acceptable to do that. > > Well, the permissions angle is actually a good thing here. There is > pretty much no risk of the mlock succeeding on a box that hasn't been > specially configured --- and, in most cases, I think you'd need root > cooperation to raise postgres' RLIMIT_MEMLOCK. So I think we could try > to mlock without having any effect for 99% of users. The 1% who are > smart enough to raise the rlimit to something suitable would get better, > or at least more predictable, performance. The heightened limit might just as well be targeted at another application, or be set up a bit too widely. I agree that it is useful, but I think it requires its own setting, defaulting to off, especially as there is no experience yet with running a larger pg instance that way. Greetings, Andres, for once the conservative one, Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On Thursday, June 28, 2012 08:00:06 PM Tom Lane wrote: >> Well, the permissions angle is actually a good thing here. There is >> pretty much no risk of the mlock succeeding on a box that hasn't been >> specially configured --- and, in most cases, I think you'd need root >> cooperation to raise postgres' RLIMIT_MEMLOCK. So I think we could try >> to mlock without having any effect for 99% of users. The 1% who are >> smart enough to raise the rlimit to something suitable would get better, >> or at least more predictable, performance. > The heightened limit might just as well target at another application and be > setup a bit to widely. I agree that it is useful, but I think it requires its > own setting, defaulting to off. Especially as there are no experiences with > running a larger pg instance that way. [ shrug... ] I think you're inventing things to be afraid of, and ignoring a very real problem that mlock could fix. regards, tom lane
On Thu, Jun 28, 2012 at 1:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Magnus Hagander <magnus@hagander.net> writes: >> On Thu, Jun 28, 2012 at 7:27 PM, Andres Freund <andres@2ndquadrant.com> wrote: >>> On Thursday, June 28, 2012 07:19:46 PM Magnus Hagander wrote: >>>> What happens if you mlock() it into memory - does that fail quickly? >>>> Is that not something we might want to do *anyway*? > >>> You normally can only mlock() mminor amounts of memory without changing >>> settings. Requiring to change that setting (aside that mlocking would be a bad >>> idea imo) would run contrary to the point of the patch, wouldn't it? ;) > >> It would. I wasn't aware of that limitation :) > > The OSX man page says that mlock should give EAGAIN for a permissions > failure (ie, exceeding the rlimit) but > > [ENOMEM] Some portion of the indicated address range is not > allocated. There was an error faulting/mapping a > page. > > It might be helpful to try mlock (if available, which it isn't > everywhere) and complain about ENOMEM but not other errors. If course, > if the kernel checks rlimit first, we won't learn anything ... I tried this. At least on my fairly vanilla MacOS X desktop, an mlock for a larger amount of memory than was conveniently on hand (4GB, on a 4GB box) neither succeeded nor failed in a timely fashion but instead progressively hung the machine, apparently trying to progressively push every available page out to swap. After 5 minutes or so I could no longer move the mouse. After about 20 minutes I gave up and hit the reset button. So there's apparently no value to this as a diagnostic tool, at least on this platform. > I think it *would* be a good idea to mlock if we could. Setting shmem > large enough that it swaps has always been horrible for performance, > and in sysv-land there's no way to prevent that. But we can't error > out on permissions failure. 
I wouldn't mind having an option, but I think there'd have to be a way to turn it off for people trying to cram as many lightly-used VMs as possible onto a single server. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I tried this. At least on my fairly vanilla MacOS X desktop, an mlock > for a larger amount of memory than was conveniently on hand (4GB, on a > 4GB box) neither succeeded nor failed in a timely fashion but instead > progressively hung the machine, apparently trying to progressively > push every available page out to swap. After 5 minutes or so I could > no longer move the mouse. After about 20 minutes I gave up and hit > the reset button. So there's apparently no value to this as a > diagnostic tool, at least on this platform. Fun. I wonder if other BSDen are as brain-dead as OSX on this point. It'd probably at least be worth filing a bug report with Apple about it. regards, tom lane
On Thu, Jun 28, 2012 at 2:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I tried this. At least on my fairly vanilla MacOS X desktop, an mlock >> for a larger amount of memory than was conveniently on hand (4GB, on a >> 4GB box) neither succeeded nor failed in a timely fashion but instead >> progressively hung the machine, apparently trying to progressively >> push every available page out to swap. After 5 minutes or so I could >> no longer move the mouse. After about 20 minutes I gave up and hit >> the reset button. So there's apparently no value to this as a >> diagnostic tool, at least on this platform. > > Fun. I wonder if other BSDen are as brain-dead as OSX on this point. > > It'd probably at least be worth filing a bug report with Apple about it. Just for fun, I tried writing a program that does power-of-two-sized malloc requests. The first one that failed - on my 4GB Mac, remember - was for 140737488355328 bytes. Yeah, that's right: 128 TB. According to the Google, there is absolutely no way of getting MacOS X not to overcommit like crazy. You can read the amount of system memory by using sysctl() to fetch hw.memsize, but it's not really clear how much that helps. We could refuse to start up if the shared memory allocation is >= hw.memsize, but even an amount slightly less than that seems like enough to send the machine into a tailspin, so I'm not sure that really gets us anywhere. One idea I had was to LOG the size of the shared memory allocation just before allocating it. That way, if your system goes into the tank, there will at least be something in the log. But that would be useless chatter for most users. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
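A best-effort check along the lines Robert mentions might look like this sketch (hypothetical `physical_memory_bytes`; `hw.memsize` is the OS X sysctl he names, with a sysconf() fallback where one is available):

```c
#include <stdint.h>
#include <unistd.h>

#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Best-effort physical memory size, for sanity-checking a requested
 * shared memory allocation before the kernel cheerfully overcommits.
 * Returns 0 if the platform offers no obvious answer. */
uint64_t
physical_memory_bytes(void)
{
#ifdef __APPLE__
    uint64_t memsize = 0;
    size_t len = sizeof(memsize);

    if (sysctlbyname("hw.memsize", &memsize, &len, NULL, 0) == 0)
        return memsize;
    return 0;
#elif defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
    long pages = sysconf(_SC_PHYS_PAGES);
    long pagesize = sysconf(_SC_PAGESIZE);

    if (pages > 0 && pagesize > 0)
        return (uint64_t) pages * (uint64_t) pagesize;
    return 0;
#else
    return 0;
#endif
}
```

As noted above, even a request somewhat below this number can tank the machine, so at best this supports a warning threshold (or Josh's order-of-magnitude check) rather than a hard refusal.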
> According to the Google, there is absolutely no way of getting MacOS X > not to overcommit like crazy. Well, this is one of a long list of broken things about OSX. If you want to see *real* breakage, do some IO performance testing of HFS+. FWIW, I have this issue with Mac desktop applications on my MacBook, which will happily leak memory until I run out of swap space. > You can read the amount of system > memory by using sysctl() to fetch hw.memsize, but it's not really > clear how much that helps. We could refuse to start up if the shared > memory allocation is >= hw.memsize, but even an amount slightly less > than that seems like enough to send the machine into a tailspin, so > I'm not sure that really gets us anywhere. I still think it would help. User errors in allocating shmem are more likely to be order-of-magnitude errors ("I meant 500MB, not 500GB!") than matters of being 20% of RAM over. > One idea I had was to LOG the size of the shared memory allocation > just before allocating it. That way, if your system goes into the > tank, there will at least be something in the log. But that would be > useless chatter for most users. Yes, but it would enable quick answers on the mailing lists, IRC, and StackExchange. "I started up PostgreSQL and my MacBook crashed." "Find the file postgres.log. What's the last 10 lines?" So neither of those things *fixes* the problem ... ultimately, it's Apple's problem and we can't fix it ... but both of them make it somewhat better. The other thing which will avoid the problem for most Mac users is if we simply allocate 10% of RAM at initdb as a default. If we do that, then 90% of users will never touch shmem themselves, and not have the opportunity to mess up. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > The other thing which will avoid the problem for most Mac users is if we > simply allocate 10% of RAM at initdb as a default. If we do that, then > 90% of users will never touch Shmem themselves, and not have the > opportunity to mess up. If we could do that on *all* platforms, I might be for it, but we only know how to get that number on some platforms. There's also the issue of whether we really want to assume that the machine is dedicated to Postgres, which IMO is an implicit assumption of any default that scales itself to physical RAM. For the moment I think we should just allow initdb to scale up a little bit more than where it is now, perhaps 128MB instead of 32. regards, tom lane
Tom, > If we could do that on *all* platforms, I might be for it, but we only > know how to get that number on some platforms. I don't see what's wrong with using it where we can get it, and not using it where we can't. > There's also the issue > of whether we really want to assume that the machine is dedicated to > Postgres, which IMO is an implicit assumption of any default that scales > itself to physical RAM. 10% isn't assuming dedicated. Assuming dedicated would be 20% or 25%. I was thinking "10%, with a ceiling of 512MB". > For the moment I think we should just allow initdb to scale up a little > bit more than where it is now, perhaps 128MB instead of 32. I wouldn't be opposed to that. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: >> If we could do that on *all* platforms, I might be for it, but we only >> know how to get that number on some platforms. > I don't see what's wrong with using it where we can get it, and not > using it where we can't. Because then we still need to define, and document, a sensible behavior on the machines where we can't get it. And document that we do it two different ways, and document which machines we do it which way on. >> There's also the issue >> of whether we really want to assume that the machine is dedicated to >> Postgres, which IMO is an implicit assumption of any default that scales >> itself to physical RAM. > 10% isn't assuming dedicated. Really? regards, tom lane
>> 10% isn't assuming dedicated. > > Really? Yes. As I said, the allocation for dedicated PostgreSQL servers is usually 20% to 25%, up to 8GB. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: >>> 10% isn't assuming dedicated. >> Really? > Yes. As I said, the allocation for dedicated PostgreSQL servers is > usually 20% to 25%, up to 8GB. Any percentage is assuming dedicated, IMO. 25% might be the more common number, but you're still assuming that you can have your pick of the machine's resources. My idea of "not dedicated" is "I can launch a dozen postmasters on this machine, and other services too, and it'll be okay as long as they're not doing too much". regards, tom lane
> My idea of "not dedicated" is "I can launch a dozen postmasters on this > machine, and other services too, and it'll be okay as long as they're > not doing too much". Oh, 128MB then? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Hi All, In a *very* quick patch I tested using huge pages/MAP_HUGETLB for the mmap'ed memory. That gives around 9.5% performance benefit in a read-only pgbench run (-n -S -j 64 -c 64 -T 10 -M prepared, scale 200, 6GB s_b, 8 cores, 24GB mem). It also saves a bunch of memory per process due to the smaller page table (shared_buffers 6GB): cat /proc/$pid_of_pg_backend/status |grep VmPTE VmPTE: 6252 kB vs VmPTE: 60 kB Additionally it has the advantage that top/ps/... output under linux now looks like: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 10603 andres 20 0 6381m 4924 1952 R 21 0.0 0:28.04 postgres i.e. RES now actually shows something usable... Which is rather nice imo. I don't have the time atm to make this into something usable; maybe somebody else wants to pick it up? Looks pretty worthwhile to invest some time. Because of the required setup we sure cannot make this the default but... Greetings, Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 29, 2012 at 2:52 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Hi All, > > In a *very* quick patch I tested using huge pages/MAP_HUGETLB for the mmap'ed > memory. > That gives around 9.5% performance benefit in a read-only pgbench run (-n -S - > j 64 -c 64 -T 10 -M prepared, scale 200, 6GB s_b, 8 cores, 24GB mem). > > It also saves a bunch of memory per process due to the smaller page table > (shared_buffers 6GB): > cat /proc/$pid_of_pg_backend/status |grep VmPTE > VmPTE: 6252 kB > vs > VmPTE: 60 kB > > Additionally it has the advantage that top/ps/... output under linux now looks > like: > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 10603 andres 20 0 6381m 4924 1952 R 21 0.0 0:28.04 postgres > > i.e. RES now actually shows something usable... Which is rather nice imo. > > I don't have the time atm into making this something useable, maybe somebody > else want to pick it up? Looks pretty worthwile investing some time. > > Because of the required setup we sure cannot make this the default but... ... those results are just spectacular (IMO). nice! merlin
On Fri, Jun 29, 2012 at 1:00 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Fri, Jun 29, 2012 at 2:52 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> Hi All, >> >> In a *very* quick patch I tested using huge pages/MAP_HUGETLB for the mmap'ed >> memory. >> That gives around 9.5% performance benefit in a read-only pgbench run (-n -S - >> j 64 -c 64 -T 10 -M prepared, scale 200, 6GB s_b, 8 cores, 24GB mem). >> >> It also saves a bunch of memory per process due to the smaller page table >> (shared_buffers 6GB): >> cat /proc/$pid_of_pg_backend/status |grep VmPTE >> VmPTE: 6252 kB >> vs >> VmPTE: 60 kB > ... those results are just spectacular (IMO). nice! That is super awesome. Smallish databases with a high number of connections actually spend a considerable fraction of their otherwise-available-for-buffer-cache space on page tables in common cases currently. -- fdr
On Fri, Jun 29, 2012 at 04:03:40PM -0700, Daniel Farina wrote: > On Fri, Jun 29, 2012 at 1:00 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > > On Fri, Jun 29, 2012 at 2:52 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> Hi All, > >> > >> In a *very* quick patch I tested using huge pages/MAP_HUGETLB for the mmap'ed > >> memory. > >> That gives around 9.5% performance benefit in a read-only pgbench run (-n -S - > >> j 64 -c 64 -T 10 -M prepared, scale 200, 6GB s_b, 8 cores, 24GB mem). > >> > >> It also saves a bunch of memory per process due to the smaller page table > >> (shared_buffers 6GB): > >> cat /proc/$pid_of_pg_backend/status |grep VmPTE > >> VmPTE: 6252 kB > >> vs > >> VmPTE: 60 kB > > ... those results are just spectacular (IMO). nice! > > That is super awesome. Smallish databases with a high number of > connections actually spend a considerable fraction of their > otherwise-available-for-buffer-cache space on page tables in common > cases currently. I thought newer Linux kernels did huge pages automatically? What Linux kernel is this? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Fri, Jun 29, 2012 at 2:31 PM, Josh Berkus <josh@agliodbs.com> wrote: >> My idea of "not dedicated" is "I can launch a dozen postmasters on this >> machine, and other services too, and it'll be okay as long as they're >> not doing too much". > > Oh, 128MB then? Proposed patch attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 28, 2012 at 11:26 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Assuming things go well, there are a number of follow-on things that > we need to do to finish this up: > > 1. Update the documentation. I skipped this for now, because I think > that what we write there is going to be heavily dependent on how > portable this turns out to be, which we don't know yet. Also, it's > not exactly clear to me what the documentation should say if this does > turn out to work everywhere. Much of section 17.4 will become > irrelevant to most users, but I doubt we'd just want to remove it; it > could still matter for people running EXEC_BACKEND or running many > postmasters on the same machine or, of course, people running on > platforms where this just doesn't work, if there are any. Here's a patch that attempts to begin the work of adjusting the documentation for this brave new world. I am guessing that there may be other places in the documentation that also require updating, and this page probably needs more work, but it's a start. > 2. Update the HINT messages when shared memory allocation fails. > Maybe the new most-common-failure mode there will be too many > postmasters running on the same machine? We might need to wait for > some field reports before adjusting this. I think this is mostly a matter of removing the text that says "fix this by reducing shm-related parameters" from the relevant hint messages. > 3. Consider adjusting the logic inside initdb. If this works > everywhere, the code for determining how to set shared_buffers should > become pretty much irrelevant. Even if it only works some places, we > could add 64MB or 128MB or whatever to the list of values we probe, so > that people won't get quite such a sucky configuration out of the box. > Of course there's no number here that will be good for everyone. I posted a patch for this one last night. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Wednesday, June 27, 2012 05:28:14 AM Robert Haas wrote:
> On Tue, Jun 26, 2012 at 6:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Josh Berkus <josh@agliodbs.com> writes:
> >> So let's fix the 80% case with something we feel confident in, and then
> >> revisit the no-sysv interlock as a separate patch. That way if we can't
> >> fix the interlock issues, we still have a reduced-shmem version of
> >> Postgres.
> >
> > Yes. Insisting that we have the whole change in one patch is a good way
> > to prevent any forward progress from happening. As Alvaro noted, there
> > are plenty of issues to resolve without trying to change the interlock
> > mechanism at the same time.
>
> So, here's a patch. Instead of using POSIX shmem, I just took the
> expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
> memory. The sysv shm is still allocated, but it's just a copy of
> PGShmemHeader; the "real" shared memory is the anonymous block. This
> won't work if EXEC_BACKEND is defined, so it just falls back on
> straight sysv shm in that case.
>
> There are obviously some portability issues here - this is documented
> not to work on Linux <= 2.4, but it's not clear whether it fails with
> some suitable error code or just pretends to work and does the wrong
> thing. I tested that it does compile and work on both Linux 3.2.6 and
> MacOS X 10.6.8. And the comments probably need work and... who knows
> what else is wrong. But, thoughts?

Btw, RhodiumToad/Andrew Gierth on IRC mentioned a reason why sysv shared memory might be advantageous on some platforms. E.g. on FreeBSD there is the kern.ipc.shm_use_phys setting, which prevents paging out shared memory and also seems to make TLB translation cheaper. There does not seem to be an alternative for anonymous mmap. So maybe we should make that a config option?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> Btw, RhodiumToad/Andrew Gierth on irc talked about a reason why sysv shared
> memory might be advantageous on some platforms. E.g. on freebsd there is the
> kern.ipc.shm_use_phys setting which prevents paging out shared memory and also
> seems to make tlb translation cheaper. There does not seem to exist an
> alternative for anonymous mmap.

Isn't that mlock()?

> So maybe we should make that a config option?

I'd really rather not. If we're going to go in this direction, we should just go there.

			regards, tom lane
On Tue, Jul 3, 2012 at 11:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Btw, RhodiumToad/Andrew Gierth on irc talked about a reason why sysv shared
> memory might be advantageous on some platforms. E.g. on freebsd there is the
> kern.ipc.shm_use_phys setting which prevents paging out shared memory and also
> seems to make tlb translation cheaper. There does not seem to exist an
> alternative for anonymous mmap.
> So maybe we should make that a config option?

Yeah, I was noticing some notes to that effect in the documentation this morning. I think the alternative for anonymous mmap is mlock(). However, that can hit kernel limits of its own. I'm not sure what the best thing to do about this is. I think most users will want mlock... but maybe not all? So we end up with one option for whether to use mlock and another for whether to use more or less System V shm? Sounds confusing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 3, 2012 at 5:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On Wednesday, June 27, 2012 05:28:14 AM Robert Haas wrote:
>> On Tue, Jun 26, 2012 at 6:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Josh Berkus <josh@agliodbs.com> writes:
>> >> So let's fix the 80% case with something we feel confident in, and then
>> >> revisit the no-sysv interlock as a separate patch. That way if we can't
>> >> fix the interlock issues, we still have a reduced-shmem version of
>> >> Postgres.
>> >
>> > Yes. Insisting that we have the whole change in one patch is a good way
>> > to prevent any forward progress from happening. As Alvaro noted, there
>> > are plenty of issues to resolve without trying to change the interlock
>> > mechanism at the same time.
>>
>> So, here's a patch. Instead of using POSIX shmem, I just took the
>> expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
>> memory. The sysv shm is still allocated, but it's just a copy of
>> PGShmemHeader; the "real" shared memory is the anonymous block. This
>> won't work if EXEC_BACKEND is defined so it just falls back on
>> straight sysv shm in that case.
>>
>> There are obviously some portability issues here - this is documented
>> not to work on Linux <= 2.4, but it's not clear whether it fails with
>> some suitable error code or just pretends to work and does the wrong
>> thing. I tested that it does compile and work on both Linux 3.2.6 and
>> MacOS X 10.6.8. And the comments probably need work and... who knows
>> what else is wrong. But, thoughts?
> Btw, RhodiumToad/Andrew Gierth on irc talked about a reason why sysv shared
> memory might be advantageous on some platforms. E.g. on freebsd there is the
> kern.ipc.shm_use_phys setting which prevents paging out shared memory and also
> seems to make tlb translation cheaper. There does not seem to exist an
> alternative for anonymous mmap.
> So maybe we should make that a config option?
Interesting to see that FreeBSD does this - while at the same time refusing to fix the use of sysv shared memory under their own jails system (afaik, at least). They seem to be quite undecided on whether it's a feature to remove or a feature to expand on :O

Not sure I'd trust that to stick around...

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Tuesday, July 03, 2012 05:41:09 PM Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > Btw, RhodiumToad/Andrew Gierth on irc talked about a reason why sysv
> > shared memory might be advantageous on some platforms. E.g. on freebsd
> > there is the kern.ipc.shm_use_phys setting which prevents paging out
> > shared memory and also seems to make tlb translation cheaper. There does
> > not seem to exist an alternative for anonymous mmap.
> Isn't that mlock()?

Similar at least, yes. I think it might also make the virtual/physical translation more direct, but that is just the impression from a very short search.

> > So maybe we should make that a config option?
> I'd really rather not. If we're going to go in this direction, we
> should just go there.

I don't really care, just wanted to bring up that at least one experienced user would be disappointed ;). As the old implementation needs to stay around for EXEC_BACKEND anyway, the price doesn't seem to be too high.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On Tuesday, July 03, 2012 05:41:09 PM Tom Lane wrote:
>> I'd really rather not. If we're going to go in this direction, we
>> should just go there.

> I don't really care, just wanted to bring up that at least one experienced
> user would be disappointed ;). As the old implementation needs to stay around
> for EXEC_BACKEND anyway, the price doesn't seem to be too high.

Well, my feeling is that sooner or later, perhaps sooner, we are going to want to be out from under SysV shmem (and semaphores) entirely. The Linux kernel guys keep threatening to drop support for the feature. So I think that exposing any knobs about this, or encouraging people to rely on corner-case properties of SysV on their platform, is just going to create more pain when we have to pull the plug.

			regards, tom lane
On Tue, Jul 3, 2012 at 6:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Here's a patch that attempts to begin the work of adjusting the
> documentation for this brave new world. I am guessing that there may
> be other places in the documentation that also require updating, and
> this page probably needs more work, but it's a start.

I think the boilerplate warnings in config.sgml about needing to raise the SysV parameters can go away; patch attached.

Josh
Attachment
On Tue, Jul 3, 2012 at 1:46 PM, Josh Kupershmidt <schmiddy@gmail.com> wrote:
> On Tue, Jul 3, 2012 at 6:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Here's a patch that attempts to begin the work of adjusting the
>> documentation for this brave new world. I am guessing that there may
>> be other places in the documentation that also require updating, and
>> this page probably needs more work, but it's a start.
>
> I think the boilerplate warnings in config.sgml about needing to raise
> the SysV parameters can go away; patch attached.

Thanks, committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company