Thread: fork/exec patch
This patch is the next step towards (re)allowing fork/exec. Bruce, I've cleaned up the parts we discussed, and, pending objections from anyone else, it is ready for application to HEAD. Cheers, Claudio --- Certain disclaimers and policies apply to all email sent from Memetrics. For the full text of these disclaimers and policies see <a href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em ailpolicy.html</a>
Attachment
Let me provide a summary of this patch because I reviewed the first version. The patch basically is a slight rearrangement of the code to allow fork/exec on Unix, with the ultimate goal of doing CreateProcess on Win32. The changes are: o Write out postmaster global variables and per-backend variables to be read by the exec'ed backend o Mark some static variables as global when exec is used so then can be dumped from postmaster.c, marked NON_EXEC_STATIC o Remove value passing with -p now that we have per-backend file o Move some pointer storage out of shared memory for easier dumping. o Modified pgsql_temp directory cleanup to handle per-database directories and the backend exec directory under datadir. Let me add that Claudio is doing a fantastic job on this. The changes are minimal and clean. I think the writing of a per-backend temp file has allowed this patch to be smaller than it might have been. --------------------------------------------------------------------------- Claudio Natoli wrote: > > This patch is the next step towards (re)allowing fork/exec. > > Bruce, I've cleaned up the parts we discussed, and, pending objections from > anyone else, it is ready for application to HEAD. > > Cheers, > Claudio > > --- > Certain disclaimers and policies apply to all email sent from Memetrics. > For the full text of these disclaimers and policies see > <a > href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em > ailpolicy.html</a> > > [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Let me add that Claudio is doing a fantastic job on this. The > changes are minimal and clean. I think the writing of a per-backend > temp file has allowed this patch to be smaller than it might have > been. Did we REALLY conclude that the best way to work around the lack of fork() on Win32 is by writing variables out to disk and reading them back in? Frankly, that seems like an enormous kludge. For example, couldn't we write this data into a particular location in shared memory, and then pass that location to the child? That is still ugly, slow, and prone to failure (shmem being statically sized), but ISTM that the proposed implementation already possesses those attributes :-) (/me goes off to re-read the archives on this issue...) -Neil
Neil Conway wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Let me add that Claudio is doing a fantastic job on this. The > > changes are minimal and clean. I think the writing of a per-backend > > temp file has allowed this patch to be smaller than it might have > > been. > > Did we REALLY conclude that the best way to work around the lack of > fork() on Win32 is by writing variables out to disk and reading them > back in? Frankly, that seems like an enormous kludge. > > For example, couldn't we write this data into a particular location in > shared memory, and then pass that location to the child? That is still > ugly, slow, and prone to failure (shmem being statically sized), but > ISTM that the proposed implementation already possesses those > attributes :-) I don't think we ever discussed it, but it seemed logical and a minimal change to the code. We already have a GUC write of non-default values for exec and no one had issues with that. Of course, this one is per-backend. Yea, we could use shared memory for this too, but I don't see a problem with using the file system. Older releases of PostgreSQL read from postgresql.conf or pg_hba.conf or other files for every connection so I don't see that using the file system is going to be that slow. Of course, we removed those file reads because they were slow, but they were also much larger and more complex in requiring parsing and stuff, while this is just a list of binary values. We also have a GUC dump file that is read by exec too. The downside of shared memory is that you would need the postmaster to write into shared memory and you have to allocate enough shared memory for all backends, but clearly it could be done. You could just pass the backend slot number to the child and the child could read from the offset. Not sure how to cleanly do the GUC dump file because it is of variable length depending on the number of GUC variables changed. I guess the big question is whether it is worth doing in shared memory. We also would have to pass the shared memory address to the child, and make sure the child knew the proper offset in shared memory to find its values. Of course, stuff that is variable length would be a problem in shared memory, and we have some of those for the GUC defaults. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I don't think we ever discussed it, but it seemed logical and a minimal > change to the code. We already have a GUC write of non-default values > for exec and no one had issues with that. For the record, I think that is ugly as well :-) Anyway, I'm not necessarily arguing that using shmem is the right way to go here -- that was merely an off-the-cuff suggestion. I'm just saying that whatever solution we end up with, ISTM we can do better than writing out + reading in a file for /every/ new connection. -Neil
On Sun, 14 Dec 2003, Bruce Momjian wrote: > change to the code. We already have a GUC write of non-default values > for exec and no one had issues with that. Of course, this one is > per-backend. > > Yea, we could use shared memory for this too, but I don't see a problem > with using the file system. Why not use an anonymous pipe to send data from the parent to the child process? That is a common way to handle this problem in win32 (and in unix by the way). The parent sets up the pipe and the child process inherits the handle, and after that the child and parent can excange information in private. -- /Dennis
Dennis Bjorklund wrote: > On Sun, 14 Dec 2003, Bruce Momjian wrote: > > > change to the code. We already have a GUC write of non-default values > > for exec and no one had issues with that. Of course, this one is > > per-backend. > > > > Yea, we could use shared memory for this too, but I don't see a problem > > with using the file system. > > Why not use an anonymous pipe to send data from the parent to the child > process? That is a common way to handle this problem in win32 (and in unix > by the way). The parent sets up the pipe and the child process inherits > the handle, and after that the child and parent can excange information in > private. Doesn't that require the postmaster to stay around to feed that information into the pipe or can the postmaster just shove the data and continue on, and how do the old pipes get cleaned up? Seems messy. Also has to work on Unix too for testing. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Sun, 14 Dec 2003, Bruce Momjian wrote: > > Why not use an anonymous pipe to send data from the parent to the child > > process? > > Doesn't that require the postmaster to stay around to feed that > information into the pipe or can the postmaster just shove the data and > continue on, and how do the old pipes get cleaned up? I think that one can just output the data and close that end of the pipe. But i've not looked at win32 the last 5 years or so, I could be wrong. > Seems messy. Maybe, but to me the solution where you write to files are much more ugly. If one does not like pipes, there are other ipc mechanisms that does not involve creating, reading and deleting a file on each connect. Does windows have a temp filesystem where the temp files are not actually written out on disk? It's still ugly but better then hitting a disk all the time. > Also has to work on Unix too for testing. Everything can not work in unix, CreateProcess() and fork() are different. However, the pipe solution can be mimiced in unix, but it will not be the same code since the api's are different. So that does not give much. -- /Dennis
Hi all, Dennis Bjorklund wrote: > > Also has to work on Unix too for testing. > > Everything can not work in unix, CreateProcess() and fork() > are different. True (but CreateProcess and "fork followed by exec" are pretty close). I think what Bruce is implying is that, ideally, we'd like to keep things as close as possible between Unix fork/exec and Windows code bases, so that: * it remains possible to advance the Windows port under a *nix dev environment and * should (when!) issues arise in the Windows implementation, it will be easier to identify and debug Neil Conway wrote: > For example, couldn't we write this data into a particular location in > shared memory, and then pass that location to the child? That is still > ugly, slow, and prone to failure (shmem being statically sized), but > ISTM that the proposed implementation already possesses those > attributes :-) I agree that this is a better implementation. Bruce, do we implement this now, or just hold it as something to revisit down the track? I'm all for leaving it as is. Moreover, in general, how do we handle things like this? IMHO, I'd rather live with a few kludges (that don't impact the *nix code) until the Windows port is actually a reality, and then reiterate (having the discussions as we go along, however, is necessary). Perhaps adding a TO_REVISIT section to your Win32 Status Report page? Or do people have strong leanings towards "fix as you go along"? Just feels like that way could see us getting bogged down making things "perfect" instead of advancing the port... Comments? Cheers, Claudio --- Certain disclaimers and policies apply to all email sent from Memetrics. For the full text of these disclaimers and policies see <a href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em ailpolicy.html</a>
On Mon, 15 Dec 2003, Claudio Natoli wrote: > Moreover, in general, how do we handle things like this? IMHO, I'd rather > live with a few kludges (that don't impact the *nix code) until the Windows > port is actually a reality As long as it does not hurt the unix code it's not a big problem as I see it. The usual open source solution is that since no one else writes the code, you can do it the way you think works the best. To change this in the future does not mean that everything else has to be rewritten which is good. It does also not mean that one can not discuss the implementation. A fair amount of discussion is always good. -- /Dennis
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I don't think we ever discussed it, but it seemed logical and a minimal > change to the code. We already have a GUC write of non-default values > for exec and no one had issues with that. You can hardly claim that "no one had issues with that". I complained about it and I think other people did too. It's a messy, ugly approach; moreover we have no field experience that says it's reliable. It may be the least messy, ugly approach available, but I concur with Neil's wish to at least look for other answers. regards, tom lane
On Sun, Dec 14, 2003 at 06:53:22PM -0500, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > I don't think we ever discussed it, but it seemed logical and a minimal > > change to the code. We already have a GUC write of non-default values > > for exec and no one had issues with that. > > You can hardly claim that "no one had issues with that". I complained > about it and I think other people did too. It's a messy, ugly approach; > moreover we have no field experience that says it's reliable. Don't the FSM and the system catalog cache use a similar mechanism? -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "Limítate a mirar... y algun día veras"
Alvaro Herrera <alvherre@dcc.uchile.cl> writes: > On Sun, Dec 14, 2003 at 06:53:22PM -0500, Tom Lane wrote: >> You can hardly claim that "no one had issues with that". > Don't the FSM and the system catalog cache use a similar mechanism? FSM uses a backing file to hold information over a database shutdown (write once during shutdown, read once during startup). That's a little different from once per backend fork. Also, there are no race conditions to worry about, and finally the system does not *require* the backing file to be either present or correct. The catalog cache uses a file that typically gets updated once per VACUUM. Again, the file does not have to be present, nor correct. There are mechanisms in place to deal with the cases (including race conditions) where it's broken or obsolete. I have not looked at the proposed fork/exec code in any detail, but IIUC it will be *necessary* that the intermediate file be present, and correct. I think a minimum requirement for accepting this solution is a sketch of how race conditions will be dealt with (ie, postmaster changes its own value of some variable immediately after making the temp file). I don't necessarily say that the first-cut patch has to get it right, but we'd better understand how we will get to where it is right. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@dcc.uchile.cl> writes: > > On Sun, Dec 14, 2003 at 06:53:22PM -0500, Tom Lane wrote: > >> You can hardly claim that "no one had issues with that". > > > Don't the FSM and the system catalog cache use a similar mechanism? > > FSM uses a backing file to hold information over a database shutdown > (write once during shutdown, read once during startup). That's a little > different from once per backend fork. Also, there are no race > conditions to worry about, and finally the system does not *require* the > backing file to be either present or correct. > > The catalog cache uses a file that typically gets updated once per > VACUUM. Again, the file does not have to be present, nor correct. > There are mechanisms in place to deal with the cases (including race > conditions) where it's broken or obsolete. > > I have not looked at the proposed fork/exec code in any detail, but > IIUC it will be *necessary* that the intermediate file be present, and > correct. I think a minimum requirement for accepting this solution is a > sketch of how race conditions will be dealt with (ie, postmaster changes > its own value of some variable immediately after making the temp file). > I don't necessarily say that the first-cut patch has to get it right, > but we'd better understand how we will get to where it is right. Agreed, added to the Win32 status page: * remove per-backend parameter file and move into shared memory -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Claudio Natoli wrote: > > For example, couldn't we write this data into a particular location in > > shared memory, and then pass that location to the child? That is still > > ugly, slow, and prone to failure (shmem being statically sized), but > > ISTM that the proposed implementation already possesses those > > attributes :-) > > I agree that this is a better implementation. > > Bruce, do we implement this now, or just hold it as something to revisit > down the track? I'm all for leaving it as is. > > Moreover, in general, how do we handle things like this? IMHO, I'd rather > live with a few kludges (that don't impact the *nix code) until the Windows > port is actually a reality, and then reiterate (having the discussions as we > go along, however, is necessary). Perhaps adding a TO_REVISIT section to > your Win32 Status Report page? > > Or do people have strong leanings towards "fix as you go along"? Just feels > like that way could see us getting bogged down making things "perfect" > instead of advancing the port... Let's get it working first. I have added an item to the Win32 status page so we will not forget it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > I don't think we ever discussed it, but it seemed logical and a minimal > > change to the code. We already have a GUC write of non-default values > > for exec and no one had issues with that. > > You can hardly claim that "no one had issues with that". I complained > about it and I think other people did too. It's a messy, ugly approach; > moreover we have no field experience that says it's reliable. > > It may be the least messy, ugly approach available, but I concur with > Neil's wish to at least look for other answers. Absolutely. I am not happy with the GUC file either, but can't see a better way right now. I have already documented your concern about the GUC race condition issue on the Win32 status page so we will not forget about it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Agreed, added to the Win32 status page: > * remove per-backend parameter file and move into shared memory [itch] I'm not sure that's an answer either; see my comments about how the postmaster shouldn't depend on the contents of shared memory being valid. We could get away with the postmaster having a write-only relationship to shared memory (put value of variable X into predetermined location Y), but I don't think that helps. It doesn't work for variable-size values --- we certainly don't want the postmaster dependent on memory allocation structures being valid within shared memory --- and what about locks? Do you want the postmaster writing shared values without taking a lock, or relying on shared-memory lock structures to be valid enough to not lock it up or crash it? My answer to either of those is "no way, Jose" ... Writing temp files may actually be a cleaner solution than writing shared memory, once we take these considerations into account. My gripe about race conditions was "I want to see how you solve this", and wasn't intended to mean "I don't think that is soluble". regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Agreed, added to the Win32 status page: > > * remove per-backend parameter file and move into shared memory > > [itch] I'm not sure that's an answer either; see my comments about how > the postmaster shouldn't depend on the contents of shared memory being > valid. > > We could get away with the postmaster having a write-only relationship > to shared memory (put value of variable X into predetermined location > Y), but I don't think that helps. It doesn't work for variable-size > values --- we certainly don't want the postmaster dependent on memory > allocation structures being valid within shared memory --- and what > about locks? Do you want the postmaster writing shared values without > taking a lock, or relying on shared-memory lock structures to be valid > enough to not lock it up or crash it? My answer to either of those is > "no way, Jose" ... > > Writing temp files may actually be a cleaner solution than writing > shared memory, once we take these considerations into account. My gripe > about race conditions was "I want to see how you solve this", and wasn't > intended to mean "I don't think that is soluble". Read my idea that shared memory for signals might be required, and a separate shared memory segment might be used for parameter passing too. I added a question mark to the win32 TODO item, so we can keep as an open item. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
> > > Why not use an anonymous pipe to send data from the parent to the > > > child process? > > > > Doesn't that require the postmaster to stay around to feed that > > information into the pipe or can the postmaster just shove the data > > and continue on, and how do the old pipes get cleaned up? > > I think that one can just output the data and close that end > of the pipe. > But i've not looked at win32 the last 5 years or so, I could be wrong. For anonymous pipes, large writes will block (pipes created with CreatePipe()). For named pipes (which can be given unique names - see CreateNamedPipe()), you can use Overlapped IO (MS speak for async IO), and then forget about it. Or rather, register a callback that will do a CloseHandle() on the pipe, so you don't leak handles. (Of course, the child process has to do CloseHandle() on the other end of the pipe). A way to do this using named pipes would be: a) Have the postmaster listen on a named pipe always (along with general connections). b) Have the clients use CallNamedPipe() to hit the postmasters known pipe name (the actual pipe name can be passed on the commandline). The postmaster then sends everything down the pipe. c) The postmaster only closes the pipe when it shuts down. The same pipe endpoint is reused all the time. > Does windows have a temp filesystem where the temp files are > not actually > written out on disk? It's still ugly but better then hitting > a disk all > the time. You can specify the FILE_ATTRIBUTE_TEMPORARY parameter to CreateFile(). This does not *guarantee* it will not go to disk, but it allows the system to store it in RAM. Small files will never hit disk. Large ones will when the memory manager figures it needs the space for something else. > > Also has to work on Unix too for testing. > > Everything can not work in unix, CreateProcess() and fork() > are different. > However, the pipe solution can be mimiced in unix, but it > will not be the > same code since the api's are different. So that does not give much. If you want to use any of the ways that "windows were made to use", they probably won't be compatible with Unix. Might be better off starting with something simple and once everything else works move to something more Windows specific (which can then be tested isolated from all the other changes). FWIW, the most common way to do this on windows is the "create anonymous memory mapped region, duplicate with INHERITABLE flag, and use it as shared memory". You then call DuplicateHandle() specifying read-only access to make sure the child process cannot write in the parent process' memory. If you want to use unique memory regions for each process (everytime you fork), do: CreateFileMapping(ALL_ACCESS, ANONYMOUS) MapViewOfFile() --> write all data to mapped region UnmapViewOfFile() DuplicateHandle(DUPLICATE_CLOSE_SOURCE, READ_ACCESS) --> exec and specify handle Child process does MapViewOfFile() to read. When done, UnmapViewOfFile() and CloseHandle(). On that CloseHandle(), the memory will be freed. If you want shared memory, just don't specify DUPLICATE_CLOSE_SOURCE, instead have the postmaster shut it down. Another option is to used named shared memory (specifying a name and security attribute, still mapping the pagefile so you don't need an actual file), in which case the child processes just use OpenFileMapping() (replaces DuplicateHandle()). DuplicateHandle() is the better way unelss you need to access it from a non-child process, though. //Magnus
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches I will try to apply it within the next 48 hours. --------------------------------------------------------------------------- Claudio Natoli wrote: > > This patch is the next step towards (re)allowing fork/exec. > > Bruce, I've cleaned up the parts we discussed, and, pending objections from > anyone else, it is ready for application to HEAD. > > Cheers, > Claudio > > --- > Certain disclaimers and policies apply to all email sent from Memetrics. > For the full text of these disclaimers and policies see > <a > href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em > ailpolicy.html</a> > > [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Claudio Natoli <claudio.natoli@memetrics.com> writes: > This patch is the next step towards (re)allowing fork/exec. I've included a few comments on the patch below. Unfortunately I only had time to briefly look over the code... Why did you change ShmemIndexLock from an LWLock to a spinlock? The number of "FIXME" or "This is ugly" comments doesn't ease my mind about this patch :-) How many of these issues do you plan to resolve? Index: src/backend/tcop/postgres.c =================================================================== RCS file: /projects/cvsroot/pgsql-server/src/backend/tcop/postgres.c,v retrieving revision 1.379 diff -c -w -r1.379 postgres.c *** src/backend/tcop/postgres.c 1 Dec 2003 22:15:37 -0000 1.379 --- src/backend/tcop/postgres.c 13 Dec 2003 06:18:44 -0000 *************** *** 2133,2139 **** case 'D': /* PGDATA directory */ if (secure) potential_DataDir = optarg; - break; case 'd': /* debug level */ { Why was this change made? If you actually intend to fall through the case here, please make it clear via a comment. + /* + * The following need to be available to the read/write_backend_variables + * functions + */ + extern XLogRecPtr RedoRecPtr; + extern XLogwrtResult LogwrtResult; + extern slock_t *ShmemLock; + extern slock_t *ShmemIndexLock; + extern void *ShmemIndexAlloc; + typedef struct LWLock LWLock; + extern LWLock *LWLockArray; + extern slock_t *ProcStructLock; + extern int pgStatSock; I wonder whether it is cleaner to make these properly public, rather than using the NON_EXEC_STATIC #ifdef... (I'm not necessarily saying it is, I'm just tossing it out there). + #define get_tmp_backend_var_file_name(buf,id) \ + Assert(DataDir); \ + sprintf((buf), \ + "%s/%s/%s.backend_var.%d", \ + DataDir, \ + PG_TEMP_FILES_DIR, \ + PG_TEMP_FILE_PREFIX, \ + (id)) This macro needs to be enclosed in a do {} while (0) block. Also, what does the "var" in the name of this macro refer to? + #define get_tmp_backend_var_dir(buf) \ + sprintf((buf),"%s/%s",DataDir,PG_TEMP_FILES_DIR) This seems a little pointless, IMHO. + static void + write_backend_variables(pid_t pid, Port *port) + { + char filename[MAXPGPATH]; + FILE *fp; + get_tmp_backend_var_file_name(filename,pid); + + /* Open file */ + fp = AllocateFile(filename, PG_BINARY_W); + if (!fp) + { + /* As per OpenTemporaryFile... */ + char dirname[MAXPGPATH]; + get_tmp_backend_var_dir(dirname); + mkdir(dirname, S_IRWXU); + + fp = AllocateFile(filename, PG_BINARY_W); + if (!fp) + { + ereport(FATAL, + (errcode_for_file_access(), + errmsg("could not write to file \"%s\": %m", filename))); + return; + } + } ereport(FATAL) seems too strong, doesn't it? + read_var(LWLockArray,fp); + read_var(ProcStructLock,fp); + read_var(pgStatSock,fp); + + /* Release file */ + FreeFile(fp); + unlink(filename); You should probably check the return value of unlink(). (circa line 335 of ipc/shmem.c:) /* * If the shmem index doesn't exist, we are bootstrapping: we must * be trying to init the shmem index itself. * ! * Notice that the ShmemLock is held until the shmem index has * been completely initialized. */ Doesn't this function still acquire ShmemIndexLock? (i.e. why was this comment changed?) -Neil
Neil Conway wrote: > + /* > + * The following need to be available to the read/write_backend_variables > + * functions > + */ > + extern XLogRecPtr RedoRecPtr; > + extern XLogwrtResult LogwrtResult; > + extern slock_t *ShmemLock; > + extern slock_t *ShmemIndexLock; > + extern void *ShmemIndexAlloc; > + typedef struct LWLock LWLock; > + extern LWLock *LWLockArray; > + extern slock_t *ProcStructLock; > + extern int pgStatSock; > > I wonder whether it is cleaner to make these properly public, rather > than using the NON_EXEC_STATIC #ifdef... (I'm not necessarily saying > it is, I'm just tossing it out there). This was my idea. Rather than making statics as globals, they are global only for exec() builds. This seemed safest. They need to be extern for exec() because we need to reference them in a function that writes all the postmaster globals to a file. We could have had a write function in every C file that needed it and call the write function from postmaster.c, but that seems like too much bloat. If we make them extern in all builds we lose the safety of a static. Your other comments are good and I will wait for Claudio to respond. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Neil Conway <neilc@samurai.com> writes: > Why did you change ShmemIndexLock from an LWLock to a spinlock? That one I can answer --- it's a bootstrapping issue: we can't use LWLocks until we have a PGProc (*MyProc), and we can't set that up without looking in the ShmemIndex for the related data structures. So ShmemIndex needs to use a more primitive lock type. This is actually the way the code used to do it; I changed the lock type when the opportunity presented itself, but if we're going to support fork/exec again then we have to go back to how it was done before. Your other comments seem pretty germane to me, except for > I wonder whether it is cleaner to make these properly public, rather > than using the NON_EXEC_STATIC #ifdef... (I'm not necessarily saying > it is, I'm just tossing it out there). I don't want to make these things public, because we don't really want any other modules accessing them. NON_EXEC_STATIC is pretty ugly, but it does guarantee that we will soon find out about any unintended cross-module references, because they won't compile on Unix systems. If you've got an idea about a cleaner way to do it, I'm all ears ... regards, tom lane
[Thanks to Tom + Bruce] For the remaining comments... > The number of "FIXME" or "This is ugly" comments doesn't ease my mind > about this patch :-) How many of these issues do you plan to resolve? All of them, as we go along. Treat this as a first step. > - break; > > case 'd': /* debug level >*/ > { > >Why was this change made? If you actually intend to fall through the > case here, please make it clear via a comment. I can't believe that got through! It is an editing mistake, pure and simple. Thank you for catching it. [bashes head against wall] > + #define get_tmp_backend_var_file_name(buf,id) \ > + Assert(DataDir); \ > + sprintf((buf), \ > + "%s/%s/%s.backend_var.%d", \ > + DataDir, \ > + PG_TEMP_FILES_DIR, \ > + PG_TEMP_FILE_PREFIX, \ > + (id)) > > This macro needs to be enclosed in a do {} while (0) block. Also, > what does the "var" in the name of this macro refer to? "var" refers to "variable". Will add a do while block in a following patch. > + #define get_tmp_backend_var_dir(buf) \ > + sprintf((buf),"%s/%s",DataDir,PG_TEMP_FILES_DIR) > > This seems a little pointless, IMHO. True. > + static void > [snip] > ereport(FATAL) seems too strong, doesn't it? Possibly. > + read_var(LWLockArray,fp); > + read_var(ProcStructLock,fp); > + read_var(pgStatSock,fp); > + > + /* Release file */ > + FreeFile(fp); > + unlink(filename); > > You should probably check the return value of unlink(). No. This isn't necessary (and what action would it take in any case?). If it doesn't unlink the file, tough. A backend crash could also leave these files on the disk. Like the other pgtmp files they'll get cleaned up on postmaster restart. > (circa line 335 of ipc/shmem.c:) > [snip] > Doesn't this function still acquire ShmemIndexLock? (i.e. why was this comment changed?) AFAICS this is just whitespace differences. With the exception of that missing "break" (Bruce, I guess it goes without saying, but could you please remove that line from the patch before applying... and again "Thank you Neil"), these are stylistic/cosmetic and effect the EXEC_BACKEND code only. Would a follow-up patch to correct these, along with the next step of the fork/exec changes, be acceptable? Cheers, Claudio --- Certain disclaimers and policies apply to all email sent from Memetrics. For the full text of these disclaimers and policies see <a href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em ailpolicy.html</a>
Tom Lane <tgl@sss.pgh.pa.us> writes: > That one I can answer --- it's a bootstrapping issue: we can't use > LWLocks until we have a PGProc (*MyProc), and we can't set that up > without looking in the ShmemIndex for the related data structures. > So ShmemIndex needs to use a more primitive lock type. Fair enough. My next question would have been to ask whether switching to a spinlock here will be a performance problem. In looking at the code, it seems we only hold the ShmemIndexLock for a long time (hundreds of instructions & multiple system calls) while bootstrapping the shmem index hash table itself. Otherwise, the lock is acquired and released quickly, and even then it is only done during backend initialization, so there shouldn't be much contention for it. Is this analysis correct? > I don't want to make these things public, because we don't really > want any other modules accessing them. I agree, both ways are non-optimal for different reasons. Can anyone else see a better way to do this? -Neil
Claudio Natoli <claudio.natoli@memetrics.com> writes: >> You should probably check the return value of unlink(). > > No. This isn't necessary (and what action would it take in any > case?). It should write a log message. I'm not sure why this /shouldn't/ be done: if an operation fails, we should log that failure. This is standard practise. >> Doesn't this function still acquire ShmemIndexLock? (i.e. why was >> this comment changed?) > > AFAICS this is just whitespace differences. Read it again. Here's the whole diff hunk: *** 320,340 **** strncpy(item.key, name, SHMEM_INDEX_KEYSIZE); item.location = BAD_LOCATION; ! LWLockAcquire(ShmemIndexLock, LW_EXCLUSIVE); if (!ShmemIndex) { /* * If the shmem index doesn't exist, we are bootstrapping: we must * be trying to init the shmem index itself. * ! * Notice that the ShmemIndexLock is held until the shmem index has * been completely initialized. */ Assert(strcmp(name, "ShmemIndex") == 0); Assert(ShmemBootstrap); *foundPtr = FALSE; ! return ShmemAlloc(size); } /* look it up in the shmem index */ --- 335,367 ---- strncpy(item.key, name, SHMEM_INDEX_KEYSIZE); item.location = BAD_LOCATION; ! SpinLockAcquire(ShmemIndexLock); if (!ShmemIndex) { + if (IsUnderPostmaster) + { + /* Must be initializing a (non-standalone) backend */ + Assert(strcmp(name, "ShmemIndex") == 0); + Assert(ShmemBootstrap); + Assert(ShmemIndexAlloc); + *foundPtr = TRUE; + } + else + { /* * If the shmem index doesn't exist, we are bootstrapping: we must * be trying to init the shmem index itself. * ! * Notice that the ShmemLock is held until the shmem index has * been completely initialized. */ Assert(strcmp(name, "ShmemIndex") == 0); Assert(ShmemBootstrap); *foundPtr = FALSE; ! ShmemIndexAlloc = ShmemAlloc(size); ! } ! return ShmemIndexAlloc; } The code acquires ShmemIndexLock, performs some computations, and then notes that "ShmemLock is held" in a comment before returning. ISTM that is plainly wrong. > [ the only other suggested changes are ] stylistic/cosmetic and > effect the EXEC_BACKEND code only. Yeah, my apologies for nitpicking... -Neil
Hi Neil, > > No. This isn't necessary (and what action would it take in any > > case?). > > It should write a log message. I'm not sure why this /shouldn't/ be > done: if an operation fails, we should log that failure. This is > standard practise. Fair enough. Will do (although, I'd point out that there are more than a few places in the existing code where unlink is called without error checking, but that isn't justification for not doing it here). > >> Doesn't this function still acquire ShmemIndexLock? (i.e. why was > >> this comment changed?) > > > > AFAICS this is just whitespace differences. > > Read it again. Here's the whole diff hunk: > [snip] > The code acquires ShmemIndexLock, performs some computations, and then > notes that "ShmemLock is held" in a comment before returning. ISTM > that is plainly wrong. [I did, again. Twice just now. And still didn't see what you were trying to point out. And then...] Ah. Yep. Typo. Will fix (I was experimenting with using the ShmemLock, instead of creating a new ShmemIndexLock, and forgot to change the comment back). > > [ the only other suggested changes are ] stylistic/cosmetic and > > effect the EXEC_BACKEND code only. > > Yeah, my apologies for nitpicking... Not at all. I want this done well. Thank you for any input. Cheers, Claudio --- Certain disclaimers and policies apply to all email sent from Memetrics. For the full text of these disclaimers and policies see <a href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em ailpolicy.html</a>
Neil Conway <neilc@samurai.com> writes: > My next question would have been to ask whether switching to a > spinlock here will be a performance problem. In looking at the code, > it seems we only hold the ShmemIndexLock for a long time (hundreds of > instructions & multiple system calls) while bootstrapping the shmem > index hash table itself. Otherwise, the lock is acquired and released > quickly, and even then it is only done during backend initialization, > so there shouldn't be much contention for it. Is this analysis > correct? Yes, at least that was the theory I was working from when I suggested Claudio do it this way ... regards, tom lane
Claudio Natoli wrote: > > (circa line 335 of ipc/shmem.c:) > > [snip] > > Doesn't this function still acquire ShmemIndexLock? (i.e. why was this > comment changed?) > > AFAICS this is just whitespace differences. > > With the exception of that missing "break" (Bruce, I guess it goes without > saying, but could you please remove that line from the patch before > applying... and again "Thank you Neil"), these are stylistic/cosmetic and > effect the EXEC_BACKEND code only. > > Would a follow-up patch to correct these, along with the next step of the > fork/exec changes, be acceptable? Claudio, let's go for a new version of the patch so everyone can see that is being applied. Thanks. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Patch withdrawn. Author will resubmit new version. --------------------------------------------------------------------------- Claudio Natoli wrote: > > This patch is the next step towards (re)allowing fork/exec. > > Bruce, I've cleaned up the parts we discussed, and, pending objections from > anyone else, it is ready for application to HEAD. > > Cheers, > Claudio > > --- > Certain disclaimers and policies apply to all email sent from Memetrics. > For the full text of these disclaimers and policies see > <a > href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em > ailpolicy.html</a> > > [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Resubmission, incorporating Neil Conway's comments and some minor corrections. > -----Original Message----- > From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] > Sent: Wednesday, 17 December 2003 11:12 AM > To: Claudio Natoli > Cc: 'pgsql-patches@postgresql.org' > Subject: Re: [PATCHES] fork/exec patch > > > > Patch withdrawn. Author will resubmit new version. > > -------------------------------------------------------------- > ------------- > > Claudio Natoli wrote: > > > > This patch is the next step towards (re)allowing fork/exec. > > > > Bruce, I've cleaned up the parts we discussed, and, pending > objections from > > anyone else, it is ready for application to HEAD. > > > > Cheers, > > Claudio > > > > --- > > Certain disclaimers and policies apply to all email sent > from Memetrics. > > For the full text of these disclaimers and policies see > > <a > > > href="http://www.memetrics.com/emailpolicy.html">http://www.me > metrics.com/em > > ailpolicy.html</a> > > > > > > [ Attachment, skipping... ] > > > > > ---------------------------(end of > broadcast)--------------------------- > > TIP 4: Don't 'kill -9' the postmaster > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, > Pennsylvania 19073 > --- Certain disclaimers and policies apply to all email sent from Memetrics. For the full text of these disclaimers and policies see <a href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em ailpolicy.html</a>
Attachment
Claudio Natoli wrote: > > Resubmission, incorporating Neil Conway's comments and some minor > corrections. I am now thinking we have to remove pgsql/data/pgsql_tmp unconditionally: *************** *** 1217,1224 **** { while ((db_de = readdir(db_dir)) != NULL) { ! if (strcmp(db_de->d_name, ".") == 0 || ! strcmp(db_de->d_name, "..") == 0) continue; snprintf(temp_path, sizeof(temp_path), --- 1212,1223 ---- { while ((db_de = readdir(db_dir)) != NULL) { ! if (strcmp(db_de->d_name, ".") == 0 ! #ifndef EXEC_BACKEND ! /* no PG_TEMP_FILES_DIR in DataDir in non EXEC_BACKEND case */ ! || strcmp(db_de->d_name, "..") == 0 ! #endif ! ) continue; The reason is that if they stop a postmaster that is fork/exec, install a non-exec postmaster, and restart, we should still clear out that directory. I guess what i am saying is that I don't want to tie the directory format to the exec() case of the binary. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
[Thought I replied to this already] > I am now thinking we have to remove pgsql/data/pgsql_tmp > unconditionally: > [snip] > The reason is that if they stop a postmaster that is > fork/exec, install > a non-exec postmaster, and restart, we should still clear out that > directory. I guess what i am saying is that I don't want to tie the > directory format to the exec() case of the binary. Could do. On the other hand, it is a directory for a small number (usually zero) of tmp files. More pertitently, is *anyone* even going to use fork/exec? Whilst there is no reason (yet) why someone couldn't, other than for development, why would anyone want to? I've only really been seeing it as a stepping stone to pushing the Win32 port out... Cheers, Claudio --- Certain disclaimers and policies apply to all email sent from Memetrics. For the full text of these disclaimers and policies see <a href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em ailpolicy.html</a>
Claudio Natoli wrote: > > [Thought I replied to this already] > > > I am now thinking we have to remove pgsql/data/pgsql_tmp > > unconditionally: > > [snip] > > The reason is that if they stop a postmaster that is > > fork/exec, install > > a non-exec postmaster, and restart, we should still clear out that > > directory. I guess what i am saying is that I don't want to tie the > > directory format to the exec() case of the binary. > > Could do. On the other hand, it is a directory for a small number (usually > zero) of tmp files. > > More pertitently, is *anyone* even going to use fork/exec? Whilst there is > no reason (yet) why someone couldn't, other than for development, why would > anyone want to? I've only really been seeing it as a stepping stone to > pushing the Win32 port out... Agreed. Forget my idea. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches I will try to apply it within the next 48 hours. --------------------------------------------------------------------------- Claudio Natoli wrote: > > Resubmission, incorporating Neil Conway's comments and some minor > corrections. > > > -----Original Message----- > > From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] > > Sent: Wednesday, 17 December 2003 11:12 AM > > To: Claudio Natoli > > Cc: 'pgsql-patches@postgresql.org' > > Subject: Re: [PATCHES] fork/exec patch > > > > > > > > Patch withdrawn. Author will resubmit new version. > > > > -------------------------------------------------------------- > > ------------- > > > > Claudio Natoli wrote: > > > > > > This patch is the next step towards (re)allowing fork/exec. > > > > > > Bruce, I've cleaned up the parts we discussed, and, pending > > objections from > > > anyone else, it is ready for application to HEAD. > > > > > > Cheers, > > > Claudio > > > > > > --- > > > Certain disclaimers and policies apply to all email sent > > from Memetrics. > > > For the full text of these disclaimers and policies see > > > <a > > > > > href="http://www.memetrics.com/emailpolicy.html">http://www.me > > metrics.com/em > > > ailpolicy.html</a> > > > > > > > > > > [ Attachment, skipping... ] > > > > > > > > ---------------------------(end of > > broadcast)--------------------------- > > > TIP 4: Don't 'kill -9' the postmaster > > > > -- > > Bruce Momjian | http://candle.pha.pa.us > > pgman@candle.pha.pa.us | (610) 359-1001 > > + If your life is a hard drive, | 13 Roberts Road > > + Christ can be your backup. | Newtown Square, > > Pennsylvania 19073 > > > > --- > Certain disclaimers and policies apply to all email sent from Memetrics. > For the full text of these disclaimers and policies see > <a > href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em > ailpolicy.html</a> > > [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Patch applied. Thanks. I had a little trouble patching shmem.c but got it fixed. --------------------------------------------------------------------------- Claudio Natoli wrote: > > Resubmission, incorporating Neil Conway's comments and some minor > corrections. > > > -----Original Message----- > > From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] > > Sent: Wednesday, 17 December 2003 11:12 AM > > To: Claudio Natoli > > Cc: 'pgsql-patches@postgresql.org' > > Subject: Re: [PATCHES] fork/exec patch > > > > > > > > Patch withdrawn. Author will resubmit new version. > > > > -------------------------------------------------------------- > > ------------- > > > > Claudio Natoli wrote: > > > > > > This patch is the next step towards (re)allowing fork/exec. > > > > > > Bruce, I've cleaned up the parts we discussed, and, pending > > objections from > > > anyone else, it is ready for application to HEAD. > > > > > > Cheers, > > > Claudio > > > > > > --- > > > Certain disclaimers and policies apply to all email sent > > from Memetrics. > > > For the full text of these disclaimers and policies see > > > <a > > > > > href="http://www.memetrics.com/emailpolicy.html">http://www.me > > metrics.com/em > > > ailpolicy.html</a> > > > > > > > > > > [ Attachment, skipping... ] > > > > > > > > ---------------------------(end of > > broadcast)--------------------------- > > > TIP 4: Don't 'kill -9' the postmaster > > > > -- > > Bruce Momjian | http://candle.pha.pa.us > > pgman@candle.pha.pa.us | (610) 359-1001 > > + If your life is a hard drive, | 13 Roberts Road > > + Christ can be your backup. | Newtown Square, > > Pennsylvania 19073 > > > > --- > Certain disclaimers and policies apply to all email sent from Memetrics. > For the full text of these disclaimers and policies see > <a > href="http://www.memetrics.com/emailpolicy.html">http://www.memetrics.com/em > ailpolicy.html</a> > > [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073