Thread: postgresql in FreeBSD jails: proposal
Here (@sophos.com) we run machine cluster tests using FreeBSD jails. A jail is halfway between a chroot and a VM. Jails blow a number of assumptions about a unix environment: sysv ipc's are global to all jails; but a process can only "see" other processes also running in the jail. In fact, the quickest way to tell whether you're running in a jail is to test for process 1.

PGSharedMemoryCreate chooses/reuses an ipc key in a reasonable way to cover previous postmasters crashing and leaving a shm seg behind, possibly with some backends still running.

Unfortunately, with multiple jails running PG servers and (due to app limitations) all servers having same PGPORT, you get the situation that when jail#2 (,jail#3,...) server comes up, it:
- detects that there is a shm seg with ipc key 5432001
- checks whether the associated postmaster process exists (with kill -0)
- overwrites the segment created and being used by jail #1

There's a workaround (there always is) other than this patch, involving NAT translation so that the postmasters listen on different ports, but the outside world sees them each listening on 5432. But that seems somewhat circuitous.

I've hacked sysv_shmem.c (in PG 8.0.9) to handle this problem. Given the trouble that postmaster goes to, to stop shm seg leakage, I'd like to solicit any opinions on the wisdom of this edge case. If this patch IS useful, what would be the right level of compile-time restriction ("#ifdef __FreeBSD__" ???)

@@ -319,7 +319,8 @@

         if (makePrivate)    /* a standalone backend shouldn't do this */
             continue;
-
+        /* In a FreeBSD jail, you can't "kill -0" a postmaster
+         * running in a different jail, so the shm seg might
+         * still be in use. Safer to test nattch ?
+         */
+        if (kill(1,0) && errno == ESRCH && !PGSharedMemoryIsInUse(0,NextShmSegID))
+            continue;
         if ((memAddress = PGSharedMemoryAttach(NextShmemSegID, &shmid)) == NULL)
             continue;       /* can't attach, not one of mine */

End of Patch.
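For illustration, the "test for process 1" idea from the post above can be sketched in C as follows. This is a minimal sketch, not PostgreSQL code: process_visible() and looks_like_jail() are hypothetical names, and the heuristic assumes an environment that hides pid 1 from the caller, which may not hold on every platform or container system.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper: can the calling process "see" the given pid?
 * kill(pid, 0) delivers no signal; it only runs the existence and
 * permission checks.  EPERM means the process exists but we are not
 * allowed to signal it, so it still counts as visible.
 */
bool
process_visible(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return true;
    return errno == EPERM;
}

/*
 * Hypothetical heuristic from the post: inside a FreeBSD jail,
 * process 1 is assumed not to be visible (kill fails with ESRCH).
 */
bool
looks_like_jail(void)
{
    return !process_visible((pid_t) 1);
}
```

Treating EPERM as "visible" matters because an unprivileged postmaster normally cannot signal init, yet init clearly exists in that case.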
Mischa Sandberg <mischa_sandberg@telus.net> writes: > + /* In a FreeBSD jail, you can't "kill -0" a postmaster > + * running in a different jail, so the shm seg might > + * still be in use. Safer to test nattch ? > + */ > + if (kill(1,0) && errno == ESRCH && !PGSharedMemoryIsInUse(0,NextShmSegID)) > + continue; Isn't the last part of that test backward? If it isn't, I don't understand what it's for at all. regards, tom lane
Quoting Tom Lane <tgl@sss.pgh.pa.us>:

> Mischa Sandberg <mischa_sandberg@telus.net> writes:
> > + /* In a FreeBSD jail, you can't "kill -0" a postmaster
> > +  * running in a different jail, so the shm seg might
> > +  * still be in use. Safer to test nattch ?
> > +  */
> > + if (kill(1,0) && errno == ESRCH && PGSharedMemoryIsInUse(0,NextShmemSegID))
> > +     continue;
>
> Isn't the last part of that test backward? If it isn't, I don't
> understand what it's for at all.

Serious blush here. Yes.
Mischa Sandberg <mischa_sandberg@telus.net> writes: > Quoting Tom Lane <tgl@sss.pgh.pa.us>: >> Mischa Sandberg <mischa_sandberg@telus.net> writes: >>> + if (kill(1,0) && errno == ESRCH && PGSharedMemoryIsInUse(0,NextShmemSegID)) >>> + continue; >> >> Isn't the last part of that test backward? If it isn't, I don't >> understand what it's for at all. > Serious blush here. Yes. Actually, after re-reading what PGSharedMemoryIsInUse does, I don't think you want to use it: it goes to considerable lengths to avoid returning a false positive, whereas in this context I believe we *do* need to avoid segments that belong to other data directories. So you probably need a separate chunk of code that only does the nattch test. regards, tom lane
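A "separate chunk of code that only does the nattch test" might look roughly like the sketch below. It is an assumption-laden illustration, not the code that went into sysv_shmem.c: segment_nattch() is a hypothetical name, and it relies only on the standard SysV calls shmget() and shmctl(IPC_STAT).

```c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>

/*
 * Hypothetical helper: report how many processes are attached to the
 * SysV shared memory segment registered under `key`.
 *
 * Returns the attach count (0 or more), or -1 if no segment exists for
 * the key or it cannot be inspected (for example, no read permission).
 * A caller following the advice in this thread would treat anything
 * other than 0 as "keep your hands off".
 */
long
segment_nattch(key_t key)
{
    struct shmid_ds ds;
    int         shmid;

    shmid = shmget(key, 0, 0);      /* look up only; never creates */
    if (shmid < 0)
        return -1;
    if (shmctl(shmid, IPC_STAT, &ds) < 0)
        return -1;
    return (long) ds.shm_nattch;
}
```

Unlike the kill(pid, 0) probe, this does not depend on the other postmaster being visible in the caller's process table, as long as the kernel counts attachments from every jail.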
* Mischa Sandberg (mischa_sandberg@telus.net) wrote:
> Here (@sophos.com) we run machine cluster tests using FreeBSD jails. A
> jail is halfway between a chroot and a VM. Jails blow a number of
> assumptions about a unix environment: sysv ipc's are global to all
> jails; but a process can only "see" other processes also running in the
> jail. In fact, the quickest way to tell whether you're running in a jail
> is to test for process 1.

I've got a couple of concerns about this-

#1: Having the shared memory be global is a rather large problem when it comes to something like PG which can have a fair bit of data going through that area that could be sensitive.

#2: Isn't there already a uid check that's done? Wouldn't this make more sense anyway (and hopefully minimize the impact of a bad person getting control of the PG database/user in a given jail)?

#3: At least in the linux-equivalent to jails (linux-vservers, imv anyway), they started w/o an init process and eventually decided it made sense to have one, so I'm not sure that this test will always work and the result might catch someone by surprise at some later date. Is there a better/more explicit test?

Thanks,

Stephen
Stephen Frost <sfrost@snowman.net> writes:
> I've got a couple of concerns about this-
> #1: Having the shared memory be global is a rather large problem when it
> comes to something like PG which can have a fair bit of data going
> through that area that could be sensitive.

Well, you'd have to talk to the FreeBSD kernel hackers about changing that, but I imagine it's still true that userid permissions checking applies. Whether to run the postmasters that are in different jails under different userids is a separate question.

> #3: At least in the linux-equivalent to jails (linux-vservers, imv
> anyway), they started w/o an init process and eventually decided it
> made sense to have one, so I'm not sure that this test will always
> work and the result might catch someone by surprise at some later
> date. Is there a better/more explicit test?

We could just leave out the kill(1,0) part. In fact I wonder whether we shouldn't do something like this on all platforms not only FreeBSD. Quite aside from any considerations of jails, it seems like a pretty bad idea to try to zap a shmem segment that has any attached processes.

Consider a system that normally runs multiple postmasters, in which one postmaster has died but left orphaned backends behind, and we are trying to start an unrelated postmaster. The current code seems capable of deciding to zap the segment with those orphaned backends attached. This'd require a shmem key collision which seems pretty improbable given our key assignments, but not quite impossible. If it did happen then the net effect would be to clear the segment's ID (since it can't actually go away till the connected processes do). The bad thing about that is that if the dead postmaster were then restarted, it wouldn't recognize the segment as being its own, and would happily start up despite the orphaned backends. Result: exactly the kind of conflicts and data corruption that all these interlocks are trying to prevent.
So unless I'm missing something here, adding a check for nattch = 0 is a good idea, quite aside from making FreeBSD jails safer. I think the worrisome question that follows on from Stephen's is really whether FreeBSD will ever decide to lie about nattch (ie, exclude processes in other jails from that count). regards, tom lane
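The "clear the segment's ID" effect described above can be observed with a small experiment. The sketch below is illustrative only: key_resolves_after_rmid() is a hypothetical name, and the behaviour it probes (the key stops resolving while the mapping stays usable until the last detach) is what Linux and FreeBSD kernels are understood to do; other systems may differ.

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>

/*
 * Demonstration only.  Creates a segment under `key`, attaches to it,
 * then removes it with IPC_RMID while still attached.
 *
 * Returns 1 if the key still resolves to a segment afterwards, 0 if it
 * no longer does, and -1 if the segment could not be set up.
 */
int
key_resolves_after_rmid(key_t key)
{
    int         shmid = shmget(key, 4096, IPC_CREAT | IPC_EXCL | 0600);
    char       *mem;
    int         still_found;

    if (shmid < 0)
        return -1;
    mem = shmat(shmid, NULL, 0);
    if (mem == (char *) -1)
    {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }

    shmctl(shmid, IPC_RMID, NULL);          /* "zap" while attached */
    still_found = (shmget(key, 0, 0) >= 0); /* look the key up again */
    strcpy(mem, "still mapped");            /* our mapping is still valid */

    shmdt(mem);                             /* last detach frees the memory */
    return still_found;
}
```

If the key no longer resolves while the attached process keeps running, a restarted postmaster probing that key would see nothing to warn it about the orphaned backends, which is the hazard being described.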
mischa_sandberg@telus.net (Mischa Sandberg) writes: >Unfortunately, with multiple jails running PG servers and (due to app >limitations) all servers having same PGPORT, you get the situation that >when jail#2 (,jail#3,...) server comes up, it: >- detects that there is a shm seg with ipc key 5432001 >- checks whether the associated postmaster process exists (with kill -0) >- overwrites the segment created and being used by jail #1 Easiest fix: change the UID of the user running the postmaster (ie. pgsql) so that each runs as a distinct UID (instead of distinct PGPORT) ... been doing this since moving to FreeBSD 6.x ... no patches required ... -- ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email . scrappy@hub.org MSN . scrappy@hub.org Yahoo . yscrappy Skype: hub.org ICQ . 7615664
"Marc G. Fournier" <scrappy@hub.org> writes: > mischa_sandberg@telus.net (Mischa Sandberg) writes: >> Unfortunately, with multiple jails running PG servers and (due to app >> limitations) all servers having same PGPORT, you get the situation that >> when jail#2 (,jail#3,...) server comes up, it: >> - detects that there is a shm seg with ipc key 5432001 >> - checks whether the associated postmaster process exists (with kill -0) >> - overwrites the segment created and being used by jail #1 > Easiest fix: change the UID of the user running the postmaster (ie. pgsql) so > that each runs as a distinct UID (instead of distinct PGPORT) ... been doing > this since moving to FreeBSD 6.x ... no patches required ... Sure, but in the spirit of "belt and suspenders too", I'd think that doing that *and* something like Mischa's proposal wouldn't be bad. (BTW, as far as I saw the original post only went to -hackers ... there's something messed up about your reply.) regards, tom lane
--On Thursday, January 17, 2008 01:12:54 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Marc G. Fournier" <scrappy@hub.org> writes:
>> mischa_sandberg@telus.net (Mischa Sandberg) writes:
>>> Unfortunately, with multiple jails running PG servers and (due to app
>>> limitations) all servers having same PGPORT, you get the situation that
>>> when jail#2 (,jail#3,...) server comes up, it:
>>> - detects that there is a shm seg with ipc key 5432001
>>> - checks whether the associated postmaster process exists (with kill -0)
>>> - overwrites the segment created and being used by jail #1
>
>> Easiest fix: change the UID of the user running the postmaster (ie. pgsql) so
>> that each runs as a distinct UID (instead of distinct PGPORT) ... been doing
>> this since moving to FreeBSD 6.x ... no patches required ...
>
> Sure, but in the spirit of "belt and suspenders too", I'd think that
> doing that *and* something like Mischa's proposal wouldn't be bad.

No arguments here, just pointing out that changing PGPORT isn't/wasn't the only way of addressing this problem ... if we can do something more 'internal', it would definitely make life a lot easier ...

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email . scrappy@hub.org MSN . scrappy@hub.org
Yahoo . yscrappy Skype: hub.org ICQ . 7615664
Cc: "pgadmin-devteam@postgresql.org.pgadmin-hackers@postgresql.org.pgadmin-support@postgresql.org.pgsql-admin@postgresql.org.pgsql-advocacy@postgresql.org.pgsql-announce@postgresql.org.pgsql-benchmarks@postgresql.org.pgsql-bugs@postgresql.org.pgsql-chat"@post Hey, this is exactly the sort of weird "Cc:" line I saw in the recent spam surge. Since I suspect you are using the news server to post, I suggest you take a long and careful look at the gateway's configuration. It seems there's something very broken here. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 17/01/2008, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> > Cc: "pgadmin-devteam@postgresql.org.pgadmin-hackers@postgresql.org.pgadmin-support@postgresql.org.pgsql-admin@postgresql.org.pgsql-advocacy@postgresql.org.pgsql-announce@postgresql.org.pgsql-benchmarks@postgresql.org.pgsql-bugs@postgresql.org.pgsql-chat"@post
>
> Hey, this is exactly the sort of weird "Cc:" line I saw in the recent
> spam surge. Since I suspect you are using the news server to post, I
> suggest you take a long and careful look at the gateway's configuration.
> It seems there's something very broken here.

I sure hope that pgadmin-devteam isn't going out on the news server - that's the pgAdmin equivalent to pgsql-core :-O

/D
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> "Marc G. Fournier" <scrappy@hub.org> writes:
> > Easiest fix: change the UID of the user running the postmaster (ie. pgsql) so
> > that each runs as a distinct UID (instead of distinct PGPORT) ... been doing
> > this since moving to FreeBSD 6.x ... no patches required ...
>
> Sure, but in the spirit of "belt and suspenders too", I'd think that
> doing that *and* something like Mischa's proposal wouldn't be bad.

I agree that we should try to be careful about stepping on segments that might still be in use, but I would also discourage jail users from using the same uid for multiple PG clusters since the jail doesn't protect the shmem segment. We use separate uids even w/ linux-vservers where shmem and everything *is* separate, following the same 'belt and suspenders too' spirit for security.

Thanks,

Stephen
--On Thursday, January 17, 2008 13:58:36 +0000 Dave Page <dpage@postgresql.org> wrote:

> On 17/01/2008, Alvaro Herrera <alvherre@commandprompt.com> wrote:
>> Cc: "pgadmin-devteam@postgresql.org.pgadmin-hackers@postgresql.org.pgadmin-support@postgresql.org.pgsql-admin@postgresql.org.pgsql-advocacy@postgresql.org.pgsql-announce@postgresql.org.pgsql-benchmarks@postgresql.org.pgsql-bugs@postgresql.org.pgsql-chat"@post
>>
>> Hey, this is exactly the sort of weird "Cc:" line I saw in the recent
>> spam surge. Since I suspect you are using the news server to post, I
>> suggest you take a long and careful look at the gateway's configuration.
>> It seems there's something very broken here.
>
> I sure hope that pgadmin-devteam isn't going out on the newserver -
> thats the pgAdmin equivalent to pgsql-core :-O

Just checked the subscriber list for that list, and I don't see news listed on it ... and no such newsgroup either:

%grep pgadmin db/active
pgsql.interfaces.pgadmin.hackers 0000004826 0000000001 y
pgsql.interfaces.pgadmin.support 0000002946 0000000001 y

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email . scrappy@hub.org MSN . scrappy@hub.org
Yahoo . yscrappy Skype: hub.org ICQ . 7615664
Quoting Stephen Frost <sfrost@snowman.net>:

> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > "Marc G. Fournier" <scrappy@hub.org> writes:
> > > Easiest fix: change the UID of the user running the postmaster (ie. pgsql) so
> > > that each runs as a distinct UID (instead of distinct PGPORT) ... been doing
> > > this since moving to FreeBSD 6.x ... no patches required ...
> >
> > Sure, but in the spirit of "belt and suspenders too", I'd think that
> > doing that *and* something like Mischa's proposal wouldn't be bad.
>
> I agree that we should try to be careful about stepping on
> segments that might still be in use, but I would also discourage
> jail users from using the same uid for multiple PG clusters
> since the jail doesn't protect the shmem segment.
> We use separate uids even w/ linux-vservers where shmem
> and everything *is* separate, following the same
> 'belt and suspenders too' spirit for security.

Thanks for all the input. Fixing FreeBSD might get answered on a different channel :-)

Unfortunately, different uids are not even an option here; but serious security in this situation is not relevant, either. We have a FreeBSD core guy here, and he says that there's no pressing incentive for jails to handle sysv ipc, given mmap and file locking :-( And given his other comments, I wouldn't consider jails a "secure" environment, just a modest and convenient way to emulate multiple machines with caveats.

.........................................................

So, given Tom's comment, that it's antisocial to zap a shm seg that other processes have attached ... I'm going to skip the kill(1,0) test and depend on nattch only, with a function that PGSharedMemoryIsInUse() can also use. (For a healthy server, nattch is never less than 2, right?)

If no unpleasant edge cases come out of this in our test framework, I'd like to submit that as a patch. Talked with our Linux guys about vserver, and they see no issues. Mr. Solaris here is currently a long way ooto ... opinions?
Afaics the change in behaviour is, if a degraded server exited with some backend hanging, the second server will create a new segment after bumping the ipc key; if system shm limits do not allow for two such shm segments, the second server will bail. For production systems, ensuring no orphan shm segs is not left to heuristic clean-up by server re-start. Hope that makes sense for the generic Postgres world. If anyone is interested in creating hung backends, you can create a named pipe, and tell the server to COPY from it. --- Engineers think that equations approximate reality. Physicists think that reality approximates the equations. Mathematicians never make the connection.
Mischa Sandberg <mischa_sandberg@telus.net> writes: > If anyone is interested in creating hung backends, you can > create a named pipe, and tell the server to COPY from it. That won't create a problematic situation though, until/unless you SIGQUIT the parent postmaster. Personally I think of this as "what happens after someone kill -9's a postmaster that has live children". regards, tom lane
Mischa Sandberg <mischa_sandberg@telus.net> writes: > I'm going to skip the kill(1,0) test and depend on nattch only, > with a function that PGSharedMemoryIsInUse() can also use. > (For a healthy server, nattch is never less than 2, right?) Oh, forgot to mention: healthy servers are not the point here. You should make the code keep its hands off any segment with nonzero nattch, because even one orphaned backend is enough to cause trouble. regards, tom lane
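The "keep its hands off any segment with nonzero nattch" rule could be folded into a key-selection loop along the lines below. This is a sketch under stated assumptions, not the PostgreSQL implementation: create_segment_avoiding_attached() is a hypothetical name, the warning text is invented, and the real code would also verify that an unattached segment belongs to the same data directory before recycling it, a check that is only stood in for here by a creator-uid comparison.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical sketch.  Starting at first_key, try up to max_tries
 * consecutive keys.  A key that is already taken is recycled only if
 * nothing is attached (nattch == 0) AND we created the segment;
 * otherwise the key is skipped.  Returns the new shmid and stores the
 * chosen key in *key_out, or returns -1 on failure.
 */
int
create_segment_avoiding_attached(key_t first_key, size_t size,
                                 int max_tries, key_t *key_out)
{
    for (int i = 0; i < max_tries; i++)
    {
        key_t       key = first_key + i;
        struct shmid_ds ds;
        int         old;
        int         shmid = shmget(key, size, IPC_CREAT | IPC_EXCL | 0600);

        if (shmid < 0 && errno != EEXIST)
            return -1;          /* hard failure, e.g. system limits */

        if (shmid < 0)
        {
            /* Key is taken: inspect the existing segment. */
            old = shmget(key, 0, 0);
            if (old < 0 || shmctl(old, IPC_STAT, &ds) < 0)
                continue;       /* cannot inspect: not ours, skip */
            if (ds.shm_nattch != 0)
            {
                fprintf(stderr,
                        "WARNING: shm key %ld has %lu attached process(es); skipping\n",
                        (long) key, (unsigned long) ds.shm_nattch);
                continue;       /* never zap a segment that is in use */
            }
            if (ds.shm_perm.cuid != geteuid())
                continue;       /* stand-in for the real ownership checks */
            if (shmctl(old, IPC_RMID, NULL) < 0)
                continue;
            shmid = shmget(key, size, IPC_CREAT | IPC_EXCL | 0600);
            if (shmid < 0)
                continue;       /* lost a race for the key; try the next */
        }

        *key_out = key;
        return shmid;
    }
    return -1;
}
```

With this shape, a second server whose preferred key is in use simply ends up on the next key, so the cost of a false alarm is one extra segment rather than a clobbered one.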
Quoting Tom Lane <tgl@sss.pgh.pa.us>: > Mischa Sandberg <mischa_sandberg@telus.net> writes: > > I'm going to skip the kill(1,0) test and depend on nattch only, > > with a function that PGSharedMemoryIsInUse() can also use. > > (For a healthy server, nattch is never less than 2, right?) > > Oh, forgot to mention: healthy servers are not the point here. > You should make the code keep its hands off any segment with > nonzero nattch, because even one orphaned backend is enough > to cause trouble. Note taken. Worth putting a warning in the log, too? Engineers think that equations approximate reality. Physicists think that reality approximates the equations. Mathematicians never make the connection.
Added to TODO:

* Improve detection of shared memory segments being used by other FreeBSD jails

  http://archives.postgresql.org/pgsql-hackers/2008-01/msg00656.php

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > > Added to TODO: > > * Improve detection of shared memory segments being used by other > FreeBSD jails > > http://archives.postgresql.org/pgsql-hackers/2008-01/msg00656.php There's a bit more than that to it -- see http://archives.postgresql.org/pgsql-hackers/2008-01/msg00673.php In short, it's not just a FreeBSD issue, but something a bit more general. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > Bruce Momjian wrote: > > > > Added to TODO: > > > > * Improve detection of shared memory segments being used by other > > FreeBSD jails > > > > http://archives.postgresql.org/pgsql-hackers/2008-01/msg00656.php > > There's a bit more than that to it -- see > http://archives.postgresql.org/pgsql-hackers/2008-01/msg00673.php > > In short, it's not just a FreeBSD issue, but something a bit more > general. Added to TODO: * Improve detection of shared memory segments being used by others by checking the SysV shared memory field 'nattch' http://archives.postgresql.org/pgsql-hackers/2008-01/msg00656.php http://archives.postgresql.org/pgsql-hackers/2008-01/msg00673.php -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +