Thread: get_controlfile() can leak fds in the backend
Hi all, (CC-ing Joe as of dc7d70e) I was just looking at the offline checksum patch, and noticed some sloppy coding in controldata_utils.c. The control file gets opened in get_controlfile(), and if it generates an error then the file descriptor remains open. As basic code rule in the backend we should only use BasicOpenFile() when opening files, so I think that the issue should be fixed as attached, even if this requires including fd.h for the backend compilation which is kind of ugly. Another possibility would be to just close the file descriptor before any error, saving appropriately errno before issuing any %m portion, but that still does not respect the backend policy regarding open(). We also do not have a wait event for the read() call, maybe we should have one, but knowing that this gets called only for the SQL-level functions accessing the control file, I don't really think that's worth it. Thoughts? -- Michael
Attachment
Bonjour Michaël, > Thoughts? None. However, while at it, there is also the question of whether the control file should be locked when updated, eg with flock(2) to avoid race conditions between concurrent commands. ISTM that there is currently not such thing in the code, but that it would be desirable. -- Fabie.
Hi, On 2019-02-27 10:23:45 +0100, Fabien COELHO wrote: > However, while at it, there is also the question of whether the control file > should be locked when updated, eg with flock(2) to avoid race conditions > between concurrent commands. ISTM that there is currently not such thing in > the code, but that it would be desirable. Shouldn't be necessary - the control file fits into a single page, and writes of that size ought to always be atomic. And I also think introducing flock usage for this would be quite disproportional. Greetings, Andres Freund
>> However, while at it, there is also the question of whether the control file >> should be locked when updated, eg with flock(2) to avoid race conditions >> between concurrent commands. ISTM that there is currently not such thing in >> the code, but that it would be desirable. > > Shouldn't be necessary - the control file fits into a single page, and > writes of that size ought to always be atomic. And I also think > introducing flock usage for this would be quite disproportional. Ok, fine. Note that my concern is not about the page size, but rather that as more commands may change the cluster status by editing the control file, it would be better that a postmaster does not start while a pg_rewind or enable checksum or whatever is in progress, and currently there is a possible race condition between the read and write that can induce an issue, at least theoretically. -- Fabien.
On 2/27/19 2:47 AM, Michael Paquier wrote: > Hi all, > (CC-ing Joe as of dc7d70e) > > I was just looking at the offline checksum patch, and noticed some > sloppy coding in controldata_utils.c. The control file gets opened in > get_controlfile(), and if it generates an error then the file > descriptor remains open. As basic code rule in the backend we should > only use BasicOpenFile() when opening files, so I think that the issue > should be fixed as attached, even if this requires including fd.h for > the backend compilation which is kind of ugly. > > Another possibility would be to just close the file descriptor before > any error, saving appropriately errno before issuing any %m portion, > but that still does not respect the backend policy regarding open(). In fd.c I see: 8<-------------------- * AllocateFile, AllocateDir, OpenPipeStream and OpenTransientFile are * wrappers around fopen(3), opendir(3), popen(3) and open(2), * respectively. They behave like the corresponding native functions, * except that the handle is registered with the current subtransaction, * and will be automatically closed at abort. These are intended mainly * for short operations like reading a configuration file; there is a * limit on the number of files that can be opened using these functions * at any one time. * * Finally, BasicOpenFile is just a thin wrapper around open() that can * release file descriptors in use by the virtual file descriptors if * necessary. There is no automatic cleanup of file descriptors returned * by BasicOpenFile, it is solely the caller's responsibility to close * the file descriptor by calling close(2). 8<-------------------- According to that comment BasicOpenFile does not seem to solve the issue you are pointing out (leaking of file descriptor on ERROR). Perhaps OpenTransientFile() is more appropriate? I am on the road at the moment so have not looked very deeply at this yet though. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Attachment
On 2/27/19 10:26 AM, Joe Conway wrote: > On 2/27/19 2:47 AM, Michael Paquier wrote: >> Hi all, >> (CC-ing Joe as of dc7d70e) > According to that comment BasicOpenFile does not seem to solve the issue > you are pointing out (leaking of file descriptor on ERROR). Perhaps > OpenTransientFile() is more appropriate? I am on the road at the moment > so have not looked very deeply at this yet though. It seems to me that OpenTransientFile() is more appropriate. Patch done that way attached. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Attachment
On Wed, Feb 27, 2019 at 11:50:17AM +0100, Fabien COELHO wrote: >> Shouldn't be necessary - the control file fits into a single page, and >> writes of that size ought to always be atomic. And I also think >> introducing flock usage for this would be quite disproportional. There are static assertions to make sure that the side of control file data never gets higher than 512 bytes for this purpose. > Note that my concern is not about the page size, but rather that as more > commands may change the cluster status by editing the control file, it would > be better that a postmaster does not start while a pg_rewind or enable > checksum or whatever is in progress, and currently there is a possible race > condition between the read and write that can induce an issue, at least > theoretically. Something that I think we could live instead is a special flag in the control file to mark the postmaster as in maintenance mode. This would be useful to prevent the postmaster to start if seeing this flag in the control file, as well to find out that a host has crashed in the middle of a maintenance operation. We don't give this insurance now when running pg_rewind, which is bad. That's also separate from the checksum-related patches and pg_rewind. flock() can be something hard to live with for cross-platform compatibility like Windows (LockFileEx) or fancy platforms. And note that we don't use it yet in the tree. And flock() would help in the first case I am mentioning, but not in the second. -- Michael
Attachment
Hi, On 2019-02-27 11:50:17 +0100, Fabien COELHO wrote: > Note that my concern is not about the page size, but rather that as more > commands may change the cluster status by editing the control file, it would > be better that a postmaster does not start while a pg_rewind or enable > checksum or whatever is in progress, and currently there is a possible race > condition between the read and write that can induce an issue, at least > theoretically. Seems odd to bring this up in this thread, it really has nothing to do with the topic. If we were to want to do more here, ISTM the right approach would use the postmaster pid file, not the control file. Greetings, Andres Freund
On Wed, Feb 27, 2019 at 07:45:11PM -0500, Joe Conway wrote: > It seems to me that OpenTransientFile() is more appropriate. Patch done > that way attached. Works for me, thanks for sending a patch! While on it, could you clean up the comment on top of get_controlfile()? "char" is mentioned instead of "const char" for DataDir which is incorrect. I would remove the complete set of arguments from the description and just keep the routine name. -- Michael
Attachment
Hello Andres, >> Note that my concern is not about the page size, but rather that as more >> commands may change the cluster status by editing the control file, it would >> be better that a postmaster does not start while a pg_rewind or enable >> checksum or whatever is in progress, and currently there is a possible race >> condition between the read and write that can induce an issue, at least >> theoretically. > > Seems odd to bring this up in this thread, it really has nothing to do > with the topic. Indeed. I raised it here because it is in the same area of code and Michaël was looking at it. > If we were to want to do more here, ISTM the right approach would use > the postmaster pid file, not the control file. ISTM that this just means re-inventing a manual poor-featured race-condition-prone lock API around another file, which seems to be created more or less only by "pg_ctl", while some other commands use the control file (eg pg_rewind, AFAICS). -- Fabien.
On 2/27/19 7:54 PM, Michael Paquier wrote: > On Wed, Feb 27, 2019 at 07:45:11PM -0500, Joe Conway wrote: >> It seems to me that OpenTransientFile() is more appropriate. Patch done >> that way attached. > > Works for me, thanks for sending a patch! While on it, could you > clean up the comment on top of get_controlfile()? "char" is mentioned > instead of "const char" for DataDir which is incorrect. I would > remove the complete set of arguments from the description and just > keep the routine name. Sure, will do. What are your thoughts on backpatching? This seems unlikely to be a practical concern in the field, so my inclination is a master only fix. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Attachment
On Thu, Feb 28, 2019 at 07:11:04AM -0500, Joe Conway wrote: > Sure, will do. What are your thoughts on backpatching? This seems > unlikely to be a practical concern in the field, so my inclination is a > master only fix. I agree that this would unlikely become an issue as an error on the control file would most likely be a PANIC when updating it, so fixing only HEAD sounds fine to me. Thanks for asking! -- Michael
Attachment
Hi, On 2019-02-28 09:54:48 +0100, Fabien COELHO wrote: > > If we were to want to do more here, ISTM the right approach would use > > the postmaster pid file, not the control file. > > ISTM that this just means re-inventing a manual poor-featured > race-condition-prone lock API around another file, which seems to be created > more or less only by "pg_ctl", while some other commands use the control > file (eg pg_rewind, AFAICS). Huh? Postmaster.pid is written by the backend, pg_ctl just checks it to see if the backend has finished starting up etc. It's precisely what the backend uses to prevent two postmasters to start etc. It's also what say pg_resetwal checks to protect against a concurrently running lcuster (albeit in a racy way). If we want to make things more bulletproof, that's the place. The control file is constantly written to, sometimes by different processes, it'd just not be a good file for such lockout mechanisms. Greetings, Andres Freund
On 2/28/19 7:20 AM, Michael Paquier wrote: > On Thu, Feb 28, 2019 at 07:11:04AM -0500, Joe Conway wrote: >> Sure, will do. What are your thoughts on backpatching? This seems >> unlikely to be a practical concern in the field, so my inclination is a >> master only fix. > > I agree that this would unlikely become an issue as an error on the > control file would most likely be a PANIC when updating it, so fixing > only HEAD sounds fine to me. Thanks for asking! Committed and push that way. By the way, while looking at this, I noted at least a couple of places where OpenTransientFile() is being passed O_RDWR when the usage is pretty clearly intended to be read-only. For example at least two instances in slru.c -- SlruPhysicalReadPage() and SimpleLruDoesPhysicalPageExist(). Is it worth while searching for and fixing those instances? Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Attachment
On Thu, Feb 28, 2019 at 04:09:32PM -0500, Joe Conway wrote: > Committed and push that way. Thanks for committing a fix. > By the way, while looking at this, I noted at least a couple of places > where OpenTransientFile() is being passed O_RDWR when the usage is > pretty clearly intended to be read-only. For example at least two > instances in slru.c -- SlruPhysicalReadPage() and > SimpleLruDoesPhysicalPageExist(). Is it worth while searching for and > fixing those instances? There are roughly 40~42 callers of OpenTransientFile(). Looking at them I can see that RestoreSlotFromDisk() could also switch to RDONLY instead of RDWR. I am also a bit tired of the lack error handling around CloseTransientFile(). While in some code paths the file descriptors are closed for an error, in some others we should report something. I am going to send a patch after a lookup. Let's see all that on a separate thread. -- Michael
Attachment
On Thu, Feb 28, 2019 at 01:07:23PM -0800, Andres Freund wrote: > Huh? Postmaster.pid is written by the backend, pg_ctl just checks it to > see if the backend has finished starting up etc. It's precisely what the > backend uses to prevent two postmasters to start etc. It's also what say > pg_resetwal checks to protect against a concurrently running lcuster > (albeit in a racy way). If we want to make things more bulletproof, > that's the place. The control file is constantly written to, sometimes > by different processes, it'd just not be a good file for such lockout > mechanisms. Hijacking more my own thread... I can do that, right? Or not. One thing is that we don't protect a data folder to be started when it is in the process of being treated by an external tool, like pg_rewind, or pg_checksums. So having an extra flag in the control file, which can be used by external tools to tell that the data folder is being treated for something does not sound that crazy to me. Having a tool write a fake postmaster.pid for this kind of task does not sound right from a correctness point of view, because the instance is not started. And a lock API won't protect much if the host is unplugged as well if its state is made durable... Let's keep the discussions where they are by the way. Joe has just closed the report of this thread with 4598a99, so I am moving on to the correct places. My apologies for the digressions. -- Michael
Attachment
Hi, On 2019-03-01 10:11:53 +0900, Michael Paquier wrote: > One thing is that we don't protect a data folder to be started when it > is in the process of being treated by an external tool, like > pg_rewind, or pg_checksums. So having an extra flag in the control > file, which can be used by external tools to tell that the data folder > is being treated for something does not sound that crazy to me. > Having a tool write a fake postmaster.pid for this kind of task does > not sound right from a correctness point of view, because the instance > is not started. I think putting this into the control file is a seriously bad idea. Postmaster interlocks against other postmasters running via postmaster.pid. Having a second interlock mechanism, in a different file, doesn't make any sort of sense. Nor does it seem sane to have external tool write over data as INTENSELY critical as the control file, when they then have to understand CRCs etc. > Let's keep the discussions where they are by the way. Joe has just > closed the report of this thread with 4598a99, so I am moving on to > the correct places. I don't know what that means, given you replied to the above in this thread? Greetings, Andres Freund
Hello Andres, > I think putting this into the control file is a seriously bad > idea. Postmaster interlocks against other postmasters running via > postmaster.pid. > Having a second interlock mechanism, in a different file, doesn't make > any sort of sense. Nor does it seem sane to have external tool write > over data as INTENSELY critical as the control file, when they then have > to understand CRCs etc. On this point, there are functions for that, get/update_controlfile, so this should be factored out. The initial insentive for raising the issue, probably in the wrong thread and without a clear understanding of the matter, is that I've been reviewing a patch to enable/disable checksums on a stopped cluster. The patch updates all the checksums in all the files, and changes the control file to tell that there are checksums. Currently it checks and creates a fake "posmaster.pid" file to attempt to prevent another tool to run concurrently to this operation, with ISTM a procedure prone to race conditions thus does not warrant that it would be the only tool running on the cluster. This looked to me as a bad hack. Given that other command that take on a cluster seemed to use the controlfile to signal that they are doing something, I'd thought that it would be the way to go, but then I noticed that the control file read/write procedure looks as bad as the postmaster.pid hack to ensure that only one command is active. Nevertheless, I'm ranting in the wrong thread and it seems that these is no real problem to solve, so I'll say fine to the to-me-unconvincing "postmaster.pid" hack proposed in the patch. -- Fabien.