Thread: POSIX shared memory support
Recapitulating and addressing some of the issues with my previous attempt at this feature: PostgreSQL currently uses the System V shared memory APIs to access shared memory. On Mac OS X and other BSDs, the default System V shared memory limits are often very low and require adjustment for acceptable performance. Par- ticularly, when Postgres is included as part of larger end-user friendly software products, these kernel settings are often difficult to change programatically. The System V shared memory facilities provide a method to determine who is attached to a shared memory segment. This is used to prevent backends that were orphaned by crashed or killed database processes from corrupting the data- base as it is restarted. The same effect can be achieved with using the POSIX APIs, but since the POSIX library does not have a way to check who is attached to a segment, atomic segment creation must be used to ensure exclusive access to the database. In order for this to work, the key name used to open and create the shared memory segment must be unique for each data directory. This is done by using a strong hash of the canonical form of the data directory’s pathname. This also re- moves any risk of other applications, or other databases’ memory segments colliding with the current shared memory segment, which conveniently simplifies the logic. The algorithm changes are described by the before and after flowchart PDF included in the tarball below. There is also a Windows version of this patch included, which can replace the current SysV-to-Win32 shared memory layer as it currently does not check for orphaned backends in the database. If this is used, src/backend/port/win32/shmem.c and its makefile reference can be removed. Included is a patch for the configure.in file to opt-in all of the supported platforms to use this POSIX shared memory layer except those that are explicitly excluded (of course, this could be done in reverse if desired). Notably, NetBSD and OpenBSD do not support the POSIX shared memory calls and their template patches are included. Other platforms should be tested and excluded as needed. In addition, the documentation section 16.4.1 ‘Shared Memory and Sema- phores’ could be largely pruned and updated to reflect use of the POSIX calls. The patches are available here (the postings were being silently dropped when I attached this large of a file): http://homepage.mac.com/cmarcellino/postgres-posix-shmem.tar Please let me know if there is a better way to post this. Thanks for your feedback, Chris Marcellino
Chris Marcellino <cmarcellino@apple.com> writes: > The System V shared memory facilities provide a method to determine > who is attached to a shared memory segment. > This is used to prevent backends that were orphaned by crashed or > killed database processes from corrupting the data- > base as it is restarted. The same effect can be achieved with using > the POSIX APIs, ... except that it can't ... > but since the POSIX library does not > have a way to check who is attached to a segment, atomic segment > creation must be used to ensure exclusive access to > the database. How does that fix the problem? If you can't actually tell whether someone is attached to an existing segment, then you're still up against the basic rock-and-a-hard-place issue: either you assume there is no one there (and corrupt your database if you're wrong) or you assume there is someone there (and force manual intervention by the DBA to recover after postmaster crashes). Neither of these alternatives is really acceptable. > In order for this to work, the key name used to open and create the > shared memory segment must be unique for each > data directory. This is done by using a strong hash of the canonical > form of the data directory�s pathname. "Strong hash" is not a guarantee, even if you could promise that you could get a unique canonical path, which I doubt you can. In any case this fails if the DBA decides to rename the directory on the fly (don't laugh; not only are there instances of that in our archives, there are people opining that we need to allow it --- even with the postmaster still running). > This also re- > moves any risk of other applications, or other databases� memory > segments colliding with the current shared memory > segment, which conveniently simplifies the logic. How exactly does it remove that risk? I think you're wishfully-thinking that if you are creating an unreadable hash value then there will never be any collisions against someone else with the same touching faith that *his* unreadable hash values will never collide with anyone else's. Doesn't give me a lot of comfort. Not that it matters, since the approach is broken even if this specific assumption were sustainable. regards, tom lane
On Feb 26, 2007, at 10:43 PM, Tom Lane wrote: > Chris Marcellino <cmarcellino@apple.com> writes: >> The System V shared memory facilities provide a method to determine >> who is attached to a shared memory segment. >> This is used to prevent backends that were orphaned by crashed or >> killed database processes from corrupting the data- >> base as it is restarted. The same effect can be achieved with using >> the POSIX APIs, > > ... except that it can't ... > >> but since the POSIX library does not >> have a way to check who is attached to a segment, atomic segment >> creation must be used to ensure exclusive access to >> the database. > > How does that fix the problem? If you can't actually tell whether > someone is attached to an existing segment, then you're still up > against > the basic rock-and-a-hard-place issue: either you assume there is > no one > there (and corrupt your database if you're wrong) or you assume > there is > someone there (and force manual intervention by the DBA to recover > after > postmaster crashes). Neither of these alternatives is really > acceptable. Ignoring the case where backends are still alive in the database, since they would require intervention or patience either way, there are two options: 1) There is a postmaster/backend still running and you try to start another postmaster: the unique segment cannot be closed and atomically recreated and will fail as it does in the current implementation. 2) There are no errant processes still in the database: the segment can be closed and atomically recreated. Try making a build with the patch, then start a postmaster for a given folder, delete the lock file and start another postmaster (on a different port) in that folder. Please let me know if I am overlooking something. > >> In order for this to work, the key name used to open and create the >> shared memory segment must be unique for each >> data directory. This is done by using a strong hash of the canonical >> form of the data directory’s pathname. > > "Strong hash" is not a guarantee, even if you could promise that you > could get a unique canonical path, which I doubt you can. In any case > this fails if the DBA decides to rename the directory on the fly > (don't > laugh; not only are there instances of that in our archives, there are > people opining that we need to allow it --- even with the postmaster > still running). Strong hash is an effective guarantee that many computing paradigms are based upon. The collision rate is astronomically small, and can be made astronomically smaller with longer hashes. (For MD5 there would need to be 10^15 postmasters on a server before a collision is likely, and they all would need to have crashed and left backends in the database, etc. ) True, renaming is a problem that I had had not anticipated at all. Now that you mention it, hard links might be an issue on some machines that don't canonicalize them to a unique path, since that isn't required by the POSIX docs. Oh, the horrible degenerate cases. Good point though. Perhaps there is some other unique identifying feature of a given database. A per-database persistent UUID would fit nicely here. It could just be the shmem key. > >> This also re- >> moves any risk of other applications, or other databases’ memory >> segments colliding with the current shared memory >> segment, which conveniently simplifies the logic. > > How exactly does it remove that risk? This is fruitless due to the renaming issue, but the hash isn't an issue. I'm not sure that a hex string beginning with \pg_xxxxx is any less readable than the shmem id integers that are generated ad-hoc by the current implementation. > I think you're wishfully-thinking > that if you are creating an unreadable hash value then there will > never > be any collisions against someone else with the same touching faith > that > *his* unreadable hash values will never collide with anyone else's. I'm flattered that you hold my coding abilities with such devout conviction, but I assure you that cryptography, even in this limited use, is based in rational thought :). In addition, the astronomically unlikely collision isn't a risk as the database can't be damaged. The admin would then need to clear the lockfile, after he won the lottery twice and was stuck by lightning in his overturned car. > Doesn't give me a lot of comfort. > Not that it matters, since the > approach is broken even if this specific assumption were sustainable. Postmasters failing to load don't give me much comfort either, and that isn't a pipe dream. I suppose that the renaming issue relegates this patch to situations where the database cannot be renamed or hard linked to and started more than once, yet require this to start up databases without restarting and needing to control how many other databases are consuming shmem on the same box. Thanks for the reply, Chris Marcellino > > regards, tom lane
On Mon, Feb 26, 2007 at 09:00:09PM -0800, Chris Marcellino wrote: > > There is also a Windows version of this patch included, which can > replace the current SysV-to-Win32 shared memory > layer as it currently does not check for orphaned backends in the > database. If this is used, > src/backend/port/win32/shmem.c and its makefile reference can be > removed. This code is pretty close to the one I've been working on. Didn't reliase you would be working on a win32 specific version, thought you were doing POSIX only :-O Anyway. My version does not use hashing of the name, it just puts the name all the way in there. What are peoples feeling about that - win32 supports 32768 characters (though those are UTF16, so maybe the real value is 16384 - still, if someone has his data directory that deep down in a path, he deserves not to have it work). Should we be using hashing or just plaintext there? I can see the argument for plaintext being that you can then see in process explorer what data directory a backend is connected to. But I'm open to arguments as to why a hash would be better :-) While looking at this, I noticed that Windows 2003 SP1 adds a parameter SEC_LARGE_PAGES that enables "large pages to be used when mapping images or backing from the pagefile". I would assume this is for performance ;-) Does anybody know anything more about this? Worth testing out to see if it increases performance? > The patches are available here (the postings were being silently > dropped when I attached this large of a file): Actually, I don't think it's becauseo f the size of the file. Last time I checked we still had a configuration which had the lists silently drop certain file extensions, including .tar, .zip and .tar.gz. I've had several patches dropped that way. I've had no confirmation it has been fixed, so I'm assuming it's still a problem. //Magnus
The Win32 version didn't materialize until very recently. The Win32 calls are similar semantically to the POSIX ones, so it was somewhat straightforward. Plaintext is nice if you can fit it, since Windows permits you to have slashes and all sorts of other non-filename characters in them, unlike POSIX shmem. Also for POSIX, certain platforms (ahem Darwin has 30 chars) have very small segment name limits. As Tom said, the renaming-while-running issue might be a deal breaker, but I'm not sure that is a problem on Windows. It sounds like you are on the right track. I just mentioned this to Tom, but a solution for POSIX might be to use the device and inode combination for the data directory, since that is constant across renames and is more certain to be unique. Like replacing the GenerateIPCName() function with something this: static void GenerateIPCName(char destIPCName[IPCSegNameLength]) { struct stat statbuf; /* Get the data directory's device and inode */ if (stat(DataDir, &statbuf) < 0) ereport(FATAL, (errcode_for_file_access(), errmsg("could not stat data directory \"%s\": %m", DataDir))); /* * POSIX requires that shared memory names begin with a single slash. * They should not have any others slashes or any non-alphanumerics to * maintain the broadest assumption of what is permitted in a filename. * Also, case sensitivity should not be presumed. */ snprintf(destIPCName, IPCSegNameLength, "/PostgreSQL.%jx.%jx", (intmax_t) statbuf.st_dev, (intmax_t) statbuf.st_ino); } Does Windows have a method to get a unique ID number for a given data directory, or a token file in that directory? It would need to be constant while the database is open. Perhaps GetFileInformationByHandle? It returns a struct with a nFileIndex value that seems to be that, although I'm not certain. This might make it easier to avoid the complexity of fitting the filename in the segment name, and avoid the rename problem, Thanks, Chris Marcellino On Feb 27, 2007, at 12:56 AM, Magnus Hagander wrote: > On Mon, Feb 26, 2007 at 09:00:09PM -0800, Chris Marcellino wrote: > >> >> There is also a Windows version of this patch included, which can >> replace the current SysV-to-Win32 shared memory >> layer as it currently does not check for orphaned backends in the >> database. If this is used, >> src/backend/port/win32/shmem.c and its makefile reference can be >> removed. > > This code is pretty close to the one I've been working on. Didn't > reliase you would be working on a win32 specific version, thought you > were doing POSIX only :-O > > Anyway. My version does not use hashing of the name, it just puts the > name all the way in there. What are peoples feeling about that - win32 > supports 32768 characters (though those are UTF16, so maybe the real > value is 16384 - still, if someone has his data directory that deep > down > in a path, he deserves not to have it work). Should we be using > hashing > or just plaintext there? I can see the argument for plaintext being > that > you can then see in process explorer what data directory a backend is > connected to. But I'm open to arguments as to why a hash would be > better > :-) > > > While looking at this, I noticed that Windows 2003 SP1 adds a > parameter > SEC_LARGE_PAGES that enables "large pages to be used when mapping > images > or backing from the pagefile". I would assume this is for performance > ;-) Does anybody know anything more about this? Worth testing out > to see > if it increases performance? > > >> The patches are available here (the postings were being silently >> dropped when I attached this large of a file): > > Actually, I don't think it's becauseo f the size of the file. Last > time > I checked we still had a configuration which had the lists silently > drop > certain file extensions, including .tar, .zip and .tar.gz. I've had > several patches dropped that way. I've had no confirmation it has been > fixed, so I'm assuming it's still a problem. > > //Magnus
On Tue, Feb 27, 2007 at 01:09:46AM -0800, Chris Marcellino wrote: > The Win32 version didn't materialize until very recently. The Win32 > calls are similar semantically to the POSIX ones, so it was somewhat > straightforward. > > Plaintext is nice if you can fit it, since Windows permits you to > have slashes and all sorts of other non-filename characters in them, > unlike POSIX shmem. Also for POSIX, certain platforms (ahem Darwin > has 30 chars) have very small segment name limits. As Tom said, the > renaming-while-running issue might be a deal breaker, but I'm not > sure that is a problem on Windows. It sounds like you are on the > right track. This cannot happen on windows - you cannot rename a directory which someone has open files in. The infamous sharing violation error. > Does Windows have a method to get a unique ID number for a given data > directory, or a token file in that directory? It would need to be > constant while the database is open. Perhaps > GetFileInformationByHandle? It returns a struct with a nFileIndex > value that seems to be that, although I'm not certain. > This might make it easier to avoid the complexity of fitting the > filename in the segment name, and avoid the rename problem, Yes, you could use the fileindex value. You need that one and the volume serial number, total of 64+32 bits of data. So yeah, we cuold use that instead of the full path name if we want to. The advantage of this one is that it's shorter, the advantage of the full path name is that you can see where the backend is from. However, in most cases you will be able to see where the backend is from anyway, because it is likely to have some other file open in the data directory, so maybe that isn't such a big point after all? //Magnus
The comment on that method vexes me: "This value is useful only while the file is open by at least one process. If no processes have it open, the index may change the next time the file is opened." I wonder how this applies to directories. I.e. is a directory open if a file in it is open? You could use the path of an arbitrary file in that directory (like postgresql.conf) if this wasn't the case. We would need to be sure it was already opened by postgres somehow (which it should be). Chris On Feb 27, 2007, at 1:30 AM, Magnus Hagander wrote: > On Tue, Feb 27, 2007 at 01:09:46AM -0800, Chris Marcellino wrote: >> The Win32 version didn't materialize until very recently. The Win32 >> calls are similar semantically to the POSIX ones, so it was somewhat >> straightforward. >> >> Plaintext is nice if you can fit it, since Windows permits you to >> have slashes and all sorts of other non-filename characters in them, >> unlike POSIX shmem. Also for POSIX, certain platforms (ahem Darwin >> has 30 chars) have very small segment name limits. As Tom said, the >> renaming-while-running issue might be a deal breaker, but I'm not >> sure that is a problem on Windows. It sounds like you are on the >> right track. > > This cannot happen on windows - you cannot rename a directory which > someone has open files in. The infamous sharing violation error. > >> Does Windows have a method to get a unique ID number for a given data >> directory, or a token file in that directory? It would need to be >> constant while the database is open. Perhaps >> GetFileInformationByHandle? It returns a struct with a nFileIndex >> value that seems to be that, although I'm not certain. >> This might make it easier to avoid the complexity of fitting the >> filename in the segment name, and avoid the rename problem, > > Yes, you could use the fileindex value. You need that one and the > volume > serial number, total of 64+32 bits of data. > > So yeah, we cuold use that instead of the full path name if we want > to. > The advantage of this one is that it's shorter, the advantage of the > full path name is that you can see where the backend is from. > > However, in most cases you will be able to see where the backend is > from > anyway, because it is likely to have some other file open in the data > directory, so maybe that isn't such a big point after all? > > //Magnus
On Tue, Feb 27, 2007 at 10:30:15AM +0100, Magnus Hagander wrote: > > Does Windows have a method to get a unique ID number for a given data > > directory, or a token file in that directory? It would need to be > > constant while the database is open. Perhaps > > GetFileInformationByHandle? It returns a struct with a nFileIndex > > value that seems to be that, although I'm not certain. > > This might make it easier to avoid the complexity of fitting the > > filename in the segment name, and avoid the rename problem, > > Yes, you could use the fileindex value. You need that one and the volume > serial number, total of 64+32 bits of data. > > So yeah, we cuold use that instead of the full path name if we want to. > The advantage of this one is that it's shorter, the advantage of the > full path name is that you can see where the backend is from. > > However, in most cases you will be able to see where the backend is from > anyway, because it is likely to have some other file open in the data > directory, so maybe that isn't such a big point after all? Actually, I'm not sure we can. It's only stable as long as someone has the file open. It will change if it's closed and re-opened later. Given that we don't actually open the directory, and only files inside it, I don't know how that works. //Magnus
I believe that all we need is the ID to be constant and unique while the postmaster or its associated backends are running. If anything from a given generation has the database open, it will remain constant before any new process can connect to it successfully. Would it be feasible to lookup the ID of an important file in the DataDir? As far as the POSIX version goes, here is an updated patch. I changed it to use the inode/device ID instead of the filename to avoid the renaming cases. It also no longer needs the more clunky than necessary hash stuff. (The NetBSD/OpenBSD patches are to force those to build with the SysV shmem calls, since they are notably without POSIX shmem support.) Thanks, Chris Marcellino On Feb 27, 2007, at 1:40 AM, Magnus Hagander wrote: > On Tue, Feb 27, 2007 at 10:30:15AM +0100, Magnus Hagander wrote: >>> Does Windows have a method to get a unique ID number for a given >>> data >>> directory, or a token file in that directory? It would need to be >>> constant while the database is open. Perhaps >>> GetFileInformationByHandle? It returns a struct with a nFileIndex >>> value that seems to be that, although I'm not certain. >>> This might make it easier to avoid the complexity of fitting the >>> filename in the segment name, and avoid the rename problem, >> >> Yes, you could use the fileindex value. You need that one and the >> volume >> serial number, total of 64+32 bits of data. >> >> So yeah, we cuold use that instead of the full path name if we >> want to. >> The advantage of this one is that it's shorter, the advantage of the >> full path name is that you can see where the backend is from. >> >> However, in most cases you will be able to see where the backend >> is from >> anyway, because it is likely to have some other file open in the data >> directory, so maybe that isn't such a big point after all? > > Actually, I'm not sure we can. It's only stable as long as someone has > the file open. It will change if it's closed and re-opened later. > > Given that we don't actually open the directory, and only files inside > it, I don't know how that works. > > //Magnus > > ---------------------------(end of > broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match
Attachment
Chris Marcellino <cmarcellino@apple.com> writes: > Ignoring the case where backends are still alive in the database, I grow weary of explaining this, but: that is EXACTLY the case you cannot ignore. regards, tom lane
In case you haven't had enough, here is another version of the code to make Postgres use POSIX shared memory. Along with the issues that have already been addressed, this version ensures that orphaned backends are not in the database when restarting Postgres by using a single 1 byte SysV segment to see who is attached to the segment using shmctl/IPC_STAT/nattach. This effectively frees Postgres from the SHMMAX and SHMALL limits. Since this still takes one SysV segment, SHMMNI can still be reached on most platforms if a ton of databases are opened simultaneously (i.e. 32 on Mac OS X, 256 on Linux and Solaris). If you have the need to ship a product with Postgres embedded in it and are unable to change kernel settings (like myself), this might be of use to you. I have tested all of the failure situations I could think of by various combinations of deleting lockfiles while in use, changing the PID inside the lockfile and trying to restart and run more than one postmaster simultaneously. Of course, this since this requires both POSIX and SysV shared memory, this doesn't increase the portability of Postgres which might make it less appropriate for mass distribution; I thought I would put it out there for any feedback either way. Thanks again, Chris Marcellino
Attachment
> If you have the need to ship a product with Postgres embedded in it and > are unable to change kernel settings (like myself), this might be of use > to you. I have tested all of the failure situations I could think of by > various combinations of deleting lockfiles while in use, changing the > PID inside the lockfile and trying to restart and run more than one > postmaster simultaneously. > > Of course, this since this requires both POSIX and SysV shared memory, > this doesn't increase the portability of Postgres which might make it > less appropriate for mass distribution; I thought I would put it out > there for any feedback either way. Well that depends, what systems don't use (or have) POSIX shared memory? This sounds very interesting to me. Oddly enough I went to do some digging on what various differences and I came up with: http://www.nabble.com/POSIX-shared-memory-support-t3298386.html Which happens to be you ;) Sincerely, Joshua D. Drake > > Thanks again, > Chris Marcellino > > > > > ------------------------------------------------------------------------ > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
AFAIK, the only systems supported by Postgres that this patch won't work on are NetBSD and OpenBSD. The POSIX calls free the user from the SHMMAX and SHMALL limitations of the SysV shared memory calls on platforms that support it. Since this still takes one SysV segment, SHMMNI can still be reached on some platforms if a ton of databases are opened simultaneously (i.e. 256 on Linux and Solaris, 100 on SCO Unix, 512 on HP-UX, 32 on Mac OS X, unlimited on FreeBSD). This is the case without the patch anyhow. Chris Marcellino On Mar 3, 2007, at 9:09 AM, Joshua D. Drake wrote: > >> If you have the need to ship a product with Postgres embedded in >> it and >> are unable to change kernel settings (like myself), this might be >> of use >> to you. I have tested all of the failure situations I could think >> of by >> various combinations of deleting lockfiles while in use, changing the >> PID inside the lockfile and trying to restart and run more than one >> postmaster simultaneously. >> >> Of course, this since this requires both POSIX and SysV shared >> memory, >> this doesn't increase the portability of Postgres which might make it >> less appropriate for mass distribution; I thought I would put it out >> there for any feedback either way. > > Well that depends, what systems don't use (or have) POSIX shared > memory? > This sounds very interesting to me. Oddly enough I went to do some > digging on what various differences and I came up with: > > http://www.nabble.com/POSIX-shared-memory-support-t3298386.html > > Which happens to be you ;) > > Sincerely, > > Joshua D. Drake > > > >> >> Thanks again, >> Chris Marcellino >> >> >> >> >> --------------------------------------------------------------------- >> --- >> >> >> ---------------------------(end of >> broadcast)--------------------------- >> TIP 2: Don't 'kill -9' the postmaster > > > -- > > === The PostgreSQL Company: Command Prompt, Inc. === > Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 > Providing the most comprehensive PostgreSQL solutions since 1997 > http://www.commandprompt.com/ > > Donate to the PostgreSQL Project: http://www.postgresql.org/about/ > donate > PostgreSQL Replication: http://www.commandprompt.com/products/ > > > ---------------------------(end of > broadcast)--------------------------- > TIP 6: explain analyze is your friend
Newest version added: Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- Chris Marcellino wrote: > In case you haven't had enough, here is another version of the code > to make Postgres use POSIX shared memory. Along with the issues that > have already been addressed, this version ensures that orphaned > backends are not in the database when restarting Postgres by using a > single 1 byte SysV segment to see who is attached to the segment > using shmctl/IPC_STAT/nattach. > > This effectively frees Postgres from the SHMMAX and SHMALL limits. > Since this still takes one SysV segment, SHMMNI can still be reached > on most platforms if a ton of databases are opened simultaneously > (i.e. 32 on Mac OS X, 256 on Linux and Solaris). > > If you have the need to ship a product with Postgres embedded in it > and are unable to change kernel settings (like myself), this might be > of use to you. I have tested all of the failure situations I could > think of by various combinations of deleting lockfiles while in use, > changing the PID inside the lockfile and trying to restart and run > more than one postmaster simultaneously. > > Of course, this since this requires both POSIX and SysV shared > memory, this doesn't increase the portability of Postgres which might > make it less appropriate for mass distribution; I thought I would put > it out there for any feedback either way. > > Thanks again, > Chris Marcellino > [ Attachment, skipping... ] > > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +