Thread: Possible race condition in pg_mkdir_p()?
Hi Hackers,
We found a race condition in pg_mkdir_p(), here is a simple reproducer:
#!/bin/sh
basedir=pgdata
datadir=$basedir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z
logdir=$basedir/logs
n=2
rm -rf $basedir
mkdir -p $logdir
# init databases concurrently, they will all try to create the parent dirs
for i in `seq $n`; do
initdb -D $datadir/$i >$logdir/$i.log 2>&1 &
done
wait
# there is a chance one of the initdb commands failed to create the datadir
grep 'could not create directory' $logdir/*
basedir=pgdata
datadir=$basedir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z
logdir=$basedir/logs
n=2
rm -rf $basedir
mkdir -p $logdir
# init databases concurrently, they will all try to create the parent dirs
for i in `seq $n`; do
initdb -D $datadir/$i >$logdir/$i.log 2>&1 &
done
wait
# there is a chance one of the initdb commands failed to create the datadir
grep 'could not create directory' $logdir/*
The logic in pg_mkdir_p() is as below:
/* check for pre-existing directory */
if (stat(path, &sb) == 0)
{
if (!S_ISDIR(sb.st_mode))
{
if (last)
errno = EEXIST;
else
errno = ENOTDIR;
retval = -1;
break;
}
}
else if (mkdir(path, last ? omode : S_IRWXU | S_IRWXG | S_IRWXO) < 0)
{
retval = -1;
break;
}
if (stat(path, &sb) == 0)
{
if (!S_ISDIR(sb.st_mode))
{
if (last)
errno = EEXIST;
else
errno = ENOTDIR;
retval = -1;
break;
}
}
else if (mkdir(path, last ? omode : S_IRWXU | S_IRWXG | S_IRWXO) < 0)
{
retval = -1;
break;
}
This seems buggy as it first checks the existence of the dir and makes the dir if it does not exist yet, however when executing concurrently a possible race condition can be as below:
A: does a/ exists? no
B: does a/ exists? no
A: try to create a/, succeed
B: try to create a/, failed as it already exists
B: does a/ exists? no
A: try to create a/, succeed
B: try to create a/, failed as it already exists
To prevent the race condition we could mkdir() directly, if it returns -1 and errno is EEXIST then check whether it's really a dir with stat(). In fact this is what is done in the `mkdir -p` command: https://github.com/coreutils/gnulib/blob/b5a9fa677847081c9b4f26908272f122b15df8b9/lib/mkdir-p.c#L130-L164
By the way, some callers of pg_mkdir_p() checks for EEXIST explicitly, such as in pg_basebackup.c:
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
{
pg_log_error("could not create directory \"%s\": %m", statusdir);
exit(1);
}
{
pg_log_error("could not create directory \"%s\": %m", statusdir);
exit(1);
}
This is still wrong with current code logic, because when the statusdir is a file the errno is also EEXIST, but it can pass the check here. Even if we fix pg_mkdir_p() by following the `mkdir -p` way the errno check here is still wrong.
Best Regards
Ning
On Thu, Jul 18, 2019 at 04:17:22PM +0800, Ning Yu wrote: > This is still wrong with current code logic, because when the statusdir is > a file the errno is also EEXIST, but it can pass the check here. Even if > we fix pg_mkdir_p() by following the `mkdir -p` way the errno check here is > still wrong. Would you like to send a patch? -- Michael
Attachment
On Thu, Jul 18, 2019 at 4:57 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Jul 18, 2019 at 04:17:22PM +0800, Ning Yu wrote:
> This is still wrong with current code logic, because when the statusdir is
> a file the errno is also EEXIST, but it can pass the check here. Even if
> we fix pg_mkdir_p() by following the `mkdir -p` way the errno check here is
> still wrong.
Would you like to send a patch?
Michael, we'll send out the patch later. Checked code, it seems that there is another related mkdir() issue.
MakePGDirectory() is actually a syscall mkdir(), and manpage says the errno meaning of EEXIST,
EEXIST pathname already exists (not necessarily as a directory). This includes the case where pathname is a symbolic link, dangling or not.
However it looks like some callers do not use that correctly, e.g.
{
if (errno == EEXIST)
return;
OR
if (MakePGDirectory(parentdir) < 0 && errno != EEXIST)
i.e. we should better use stat(path) && S_ISDIR(buf) && errno == EEXIST to replace errno == EEXIST.
One possible fix is to add an argument like ignore_created (in case some callers want to fail if the path has been created) in MakePGDirectory() and then add that code logic into it.
Hi Michael,
The patches are attached. To make reviewing easier we spilt them into small
pieces:
- v1-0001-Fix-race-condition-in-pg_mkdir_p.patch: the fix to pg_mkdir_p()
itself, basically we are following the `mkdir -p` logic;
- v1-0002-Test-concurrent-call-to-pg_mkdir_p.patch: the tests for pg_mkdir_p(),
we could see how it fails by reverting the first patch, and a reproducer with
initdb is also provided in the README; as we do not know how to create a unit
test in postgresql we have to employ a test module to do the job, not sure if
this is a proper solution;
- v1-0003-Fix-callers-of-pg_mkdir_p.patch &
v1-0004-Fix-callers-of-MakePGDirectory.patch: fix callers of pg_mkdir_p() and
MakePGDirectory(), tests are not provided for these changes;
MakePGDirectory() is also called in TablespaceCreateDbspace(), EEXIST is
considered as non-error for parent directories, however as it considers EEXIST
as a failure for the last level of the path so the logic is still correct, we
do not add a final stat() check for it.
Best Regards
Ning
The patches are attached. To make reviewing easier we spilt them into small
pieces:
- v1-0001-Fix-race-condition-in-pg_mkdir_p.patch: the fix to pg_mkdir_p()
itself, basically we are following the `mkdir -p` logic;
- v1-0002-Test-concurrent-call-to-pg_mkdir_p.patch: the tests for pg_mkdir_p(),
we could see how it fails by reverting the first patch, and a reproducer with
initdb is also provided in the README; as we do not know how to create a unit
test in postgresql we have to employ a test module to do the job, not sure if
this is a proper solution;
- v1-0003-Fix-callers-of-pg_mkdir_p.patch &
v1-0004-Fix-callers-of-MakePGDirectory.patch: fix callers of pg_mkdir_p() and
MakePGDirectory(), tests are not provided for these changes;
MakePGDirectory() is also called in TablespaceCreateDbspace(), EEXIST is
considered as non-error for parent directories, however as it considers EEXIST
as a failure for the last level of the path so the logic is still correct, we
do not add a final stat() check for it.
Best Regards
Ning
On Thu, Jul 18, 2019 at 4:57 PM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Jul 18, 2019 at 04:17:22PM +0800, Ning Yu wrote:
> This is still wrong with current code logic, because when the statusdir is
> a file the errno is also EEXIST, but it can pass the check here. Even if
> we fix pg_mkdir_p() by following the `mkdir -p` way the errno check here is
> still wrong.
Would you like to send a patch?
--
Michael
Attachment
On Tue, Jul 23, 2019 at 02:54:20PM +0800, Ning Yu wrote: > MakePGDirectory() is also called in TablespaceCreateDbspace(), EEXIST is > considered as non-error for parent directories, however as it considers > EEXIST as a failure for the last level of the path so the logic is > still correct, So the complains here are about two things: - In some code paths calling mkdir, we don't care about the fact that EEXIST can happen for something else than a folder. This could be a problem if we have conflicts in the backend related to the naming of the files/folders created. I find a bit surprising to not perform the sanity checks in MakePGDirectory() in your patch. What of all the existing callers of this routine? - pg_mkdir_p is pretty bad at detecting problems with concurrent creation of parent directories, leading to random failures where these should not happen. I may be missing something, but your patch does not actually fix problem 2, no? Trying to do an initdb with a set of N folders using the same parent folders not created still results in random failures. -- Michael
Attachment
On Tue, Jul 30, 2019 at 4:04 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Jul 23, 2019 at 02:54:20PM +0800, Ning Yu wrote: > > MakePGDirectory() is also called in TablespaceCreateDbspace(), EEXIST is > > considered as non-error for parent directories, however as it considers > > EEXIST as a failure for the last level of the path so the logic is > > still correct, > > So the complains here are about two things: > - In some code paths calling mkdir, we don't care about the fact that > EEXIST can happen for something else than a folder. This could be a > problem if we have conflicts in the backend related to the naming of > the files/folders created. I find a bit surprising to not perform > the sanity checks in MakePGDirectory() in your patch. What of all the > existing callers of this routine? Thanks for the reply. There are several callers of MakePGDirectory(), most of them already treats EEXIST as an error; TablespaceCreateDbspace() already has a proper checking for the target dir, it has the chance to fail on a concurrently created dir, but at least it will not be confused by a file with the same name; PathNameCreateTemporaryDir() is fixed by our patch 0004, we will call stat() after mkdir() to double check the result. In fact personally I'm thinking that whether could we replace all uses of MakePGDirectory() with pg_mkdir_p(), so we could simplify TablespaceCreateDbspace() and PathNameCreateTemporaryDir() and other callers significantly. > - pg_mkdir_p is pretty bad at detecting problems with concurrent > creation of parent directories, leading to random failures where these > should not happen. > > I may be missing something, but your patch does not actually fix > problem 2, no? Trying to do an initdb with a set of N folders using > the same parent folders not created still results in random failures. Well, we should have fixed problem 2, this is our major purpose of the patch 0001, it performs sanity check with stat() after mkdir() at each part of the path. The initdb test was just the one used by us to verify our fix, here is our script: n=4 testdir=testdir datadir=$testdir/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z logdir=$testdir/logs rm -rf $testdir mkdir -p $logdir for i in `seq $n`; do initdb -D $datadir/$i >$logdir/$i.log 2>&1 & done wait # check for failures grep 'could not create directory' $logdir/* We have provided a test module in patch 0002 to perform a similar test, it calls pg_mkdir_p() concurrently to trigger the issue, which has higher fail rate than initdb. With the patch 0001 both the initdb test and the test module will always succeed in our local environment. Thanks Ning > -- > Michael
On Tue, Jul 30, 2019 at 06:22:59PM +0800, Ning Yu wrote: > In fact personally I'm thinking that whether could we replace all uses of > MakePGDirectory() with pg_mkdir_p(), so we could simplify > TablespaceCreateDbspace() and PathNameCreateTemporaryDir() and other callers > significantly. I would still keep the wrapper, but I think that as a final result we should be able to get the code in PathNameCreateTemporaryDir() shaped in such a way that there are no multiple attempts at calling MakePGDirectory() on EEXIST. This has been introduced by dc6c4c9d to allow sharing temporary files between backends, which is rather recent but a fixed set of two retries is not a deterministic method of resolution. > Well, we should have fixed problem 2, this is our major purpose of the patch > 0001, it performs sanity check with stat() after mkdir() at each part of the > path. I just reuse the script presented at the top of the thread with n=2, and I get that: pgdata/logs/1.log:creating directory pgdata/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z/1 ... initdb: error: could not create directory "pgdata/a": File exists But the result expected is that all the paths should be created with no complains about the parents existing, no? This reproduces on my Debian box 100% of the time, for different sub-paths. So something looks wrong in your solution. The code comes originally from FreeBSD, how do things happen there. Do we get failures if doing something like that? I would expect this sequence to not fail: for i in `seq 1 100`; do mkdir -p b/c/d/f/g/h/j/$i; done -- Michael
Attachment
Hi, On 2019-07-18 16:17:22 +0800, Ning Yu wrote: > This seems buggy as it first checks the existence of the dir and makes the > dir if it does not exist yet, however when executing concurrently a > possible race condition can be as below: > > A: does a/ exists? no > B: does a/ exists? no > A: try to create a/, succeed > B: try to create a/, failed as it already exists Hm. I'm not really seing much of a point in making mkdir_p safe against all of this. What's the scenario for pg where this matters? I assume you're using it for somewhat different purposes, and that's why it is problematic for you? Greetings, Andres Freund
On Wed, Jul 31, 2019 at 12:04 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Jul 30, 2019 at 06:22:59PM +0800, Ning Yu wrote: > > In fact personally I'm thinking that whether could we replace all uses of > > MakePGDirectory() with pg_mkdir_p(), so we could simplify > > TablespaceCreateDbspace() and PathNameCreateTemporaryDir() and other callers > > significantly. > > I would still keep the wrapper, but I think that as a final result we > should be able to get the code in PathNameCreateTemporaryDir() shaped > in such a way that there are no multiple attempts at calling > MakePGDirectory() on EEXIST. This has been introduced by dc6c4c9d to > allow sharing temporary files between backends, which is rather recent > but a fixed set of two retries is not a deterministic method of > resolution. > > > Well, we should have fixed problem 2, this is our major purpose of the patch > > 0001, it performs sanity check with stat() after mkdir() at each part of the > > path. > > I just reuse the script presented at the top of the thread with n=2, > and I get that: > pgdata/logs/1.log:creating directory > pgdata/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z/1 > ... initdb: error: could not create directory "pgdata/a": File exists Could I double confirm with you that you made a clean rebuild after applying the patches? pg_mkdir_p() is compiled as part of libpgport.a, and the postgres makefile will not relink the initdb binary automatically, for myself I must 'make clean' and 'make' to ensure initdb gets relinked. > > But the result expected is that all the paths should be created with > no complains about the parents existing, no? This reproduces on my > Debian box 100% of the time, for different sub-paths. So something > looks wrong in your solution. The code comes originally from FreeBSD, > how do things happen there. Do we get failures if doing something > like that? I would expect this sequence to not fail: > for i in `seq 1 100`; do mkdir -p b/c/d/f/g/h/j/$i; done > -- > Michael
On Wed, Jul 31, 2019 at 12:11 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-07-18 16:17:22 +0800, Ning Yu wrote: > > This seems buggy as it first checks the existence of the dir and makes the > > dir if it does not exist yet, however when executing concurrently a > > possible race condition can be as below: > > > > A: does a/ exists? no > > B: does a/ exists? no > > A: try to create a/, succeed > > B: try to create a/, failed as it already exists > > Hm. I'm not really seing much of a point in making mkdir_p safe against > all of this. What's the scenario for pg where this matters? I assume > you're using it for somewhat different purposes, and that's why it is > problematic for you? Yes, you are right, postgres itself might not run into such kind of race condition issue. The problem we encountered was on a downstream product of postgres, where multiple postgres clusters are deployed on the same machine with common parent dirs. Best Regards Ning > > Greetings, > > Andres Freund
On Wed, Jul 31, 2019 at 12:26:30PM +0800, Ning Yu wrote: > Could I double confirm with you that you made a clean rebuild after > applying the patches? pg_mkdir_p() is compiled as part of libpgport.a, > and the postgres makefile will not relink the initdb binary > automatically, for myself I must 'make clean' and 'make' to ensure > initdb gets relinked. For any patch I test, I just do a "git clean -d -x -f" before building as I switch a lot across stable branches as well. It looks that you are right on this one though, I have just rebuilt from scratch and I don't see the failures anymore. -- Michael
Attachment
On Tue, Jul 30, 2019 at 09:11:44PM -0700, Andres Freund wrote: > Hm. I'm not really seing much of a point in making mkdir_p safe against > all of this. What's the scenario for pg where this matters? I assume > you're using it for somewhat different purposes, and that's why it is > problematic for you? I don't see why it is a problem to make our APIs more stable if we have ways to do so. I actually fixed one recently as of 754b90f for a problem that involved a tool linking to our version of readdir() that we ship. Even with that, the retries for mkdir() on the base directory in PathNameCreateTemporaryDir() are basically caused by that same limitation with the parent paths from this report, no? So we could actually remove the dependency to the base directory in this code path and just rely on pg_mkdir_p() to do the right thing for all the parent paths. That's also a point raised by Ning upthread. -- Michael
Attachment
On Wed, Jul 31, 2019 at 12:41 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Jul 31, 2019 at 12:26:30PM +0800, Ning Yu wrote: > > Could I double confirm with you that you made a clean rebuild after > > applying the patches? pg_mkdir_p() is compiled as part of libpgport.a, > > and the postgres makefile will not relink the initdb binary > > automatically, for myself I must 'make clean' and 'make' to ensure > > initdb gets relinked. > > For any patch I test, I just do a "git clean -d -x -f" before building > as I switch a lot across stable branches as well. It looks that you > are right on this one though, I have just rebuilt from scratch and I > don't see the failures anymore. Cool, glad to know that it works. Best Regards Ning > -- > Michael
Hi, On 2019-07-31 13:48:23 +0900, Michael Paquier wrote: > On Tue, Jul 30, 2019 at 09:11:44PM -0700, Andres Freund wrote: > > Hm. I'm not really seing much of a point in making mkdir_p safe against > > all of this. What's the scenario for pg where this matters? I assume > > you're using it for somewhat different purposes, and that's why it is > > problematic for you? > > I don't see why it is a problem to make our APIs more stable if we > have ways to do so. Well, wanting to support additional use-cases often isn't free. There's additional code for the new usecase, there's review & commit time, there's additional test time, there's bug fixes for the new code etc. We're not developing a general application support library... I don't really have a problem fixing this case if we think it's useful. But I'm a bit bothered by all the "fixes" being submitted that don't matter for PG itself. They do eat up resources. And sorry, adding in-backend threading to test testing mkdir_p doesn't make me inclined to believe that this is all well considered. There's minor issues like us not supporting threads in the backend, pthread not being portable, and also it being entirely out of proportion to the issue. Greetings, Andres Freund
On Tue, Jul 30, 2019 at 10:19:45PM -0700, Andres Freund wrote: > I don't really have a problem fixing this case if we think it's > useful. But I'm a bit bothered by all the "fixes" being submitted that > don't matter for PG itself. They do eat up resources. Sure. In this particular case, we can simplify at least one code path in the backend though for temporary path creation. Such cleanup rings like a sufficient argument to me. > And sorry, adding in-backend threading to test testing mkdir_p doesn't > make me inclined to believe that this is all well considered. There's > minor issues like us not supporting threads in the backend, pthread not > being portable, and also it being entirely out of proportion to the > issue. Agreed on this one. The test case may be useful for the purpose of testing the presented patches, but it does not have enough value to be merged. -- Michael
Attachment
On Wed, Jul 31, 2019 at 1:31 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Jul 30, 2019 at 10:19:45PM -0700, Andres Freund wrote: > > I don't really have a problem fixing this case if we think it's > > useful. But I'm a bit bothered by all the "fixes" being submitted that > > don't matter for PG itself. They do eat up resources. > > Sure. In this particular case, we can simplify at least one code path > in the backend though for temporary path creation. Such cleanup rings > like a sufficient argument to me. Yes, in current postgres source code there are several wrappers of mkdir() that do similar jobs. If we could have a safe mkdir_p() implementation then we could use it directly in all these wrappers, that could save a lot of maintenance effort in the long run. I'm not saying that our patches are enough to make it safe and reliable, and I agree that any patches may introduce new bugs, but I think that having a safe and unified mkdir_p() is a good direction to go. > > > And sorry, adding in-backend threading to test testing mkdir_p doesn't > > make me inclined to believe that this is all well considered. There's > > minor issues like us not supporting threads in the backend, pthread not > > being portable, and also it being entirely out of proportion to the > > issue. > > Agreed on this one. The test case may be useful for the purpose of > testing the presented patches, but it does not have enough value to be > merged. Yes, that's why we put the testing module in a separate patch from the fix, feel free to ignore it. In fact ourselves have concerns about it ;) Best Regards Ning > -- > Michael
On Wed, Jul 31, 2019 at 02:02:03PM +0800, Ning Yu wrote: > Yes, in current postgres source code there are several wrappers of > mkdir() that do similar jobs. If we could have a safe mkdir_p() > implementation then we could use it directly in all these wrappers, that > could save a lot of maintenance effort in the long run. I'm not saying > that our patches are enough to make it safe and reliable, and I agree > that any patches may introduce new bugs, but I think that having a safe > and unified mkdir_p() is a good direction to go. That's my impression as well. Please note that this does not involve an actual bug in Postgres and that this is rather invasive, so this does not really qualify for a back-patch. No objections to simplify the backend on HEAD though. It would be good if you could actually register a patch to the commit fest app [1] and also rework the patch set so as at least PathNameCreateTemporaryDir wins its simplifications for the first problem (double-checking the other code paths would be nice as well). The EEXIST handling, and the confusion about EEXIST showing for both a path and a file need some separate handling (not sure what to do on these parts yet). > Yes, that's why we put the testing module in a separate patch from the > fix, feel free to ignore it. In fact ourselves have concerns about it ;) I think that it is nice that you took the time to do so as you get yourself more familiar with the TAP infrastructure in the tree and prove your point. For this case, I would not have gone to do this much though ;p [1]: https://commitfest.postgresql.org/24/ -- Michael
Attachment
On Wed, Jul 31, 2019 at 2:21 PM Michael Paquier <michael@paquier.xyz> wrote: > That's my impression as well. Please note that this does not involve > an actual bug in Postgres and that this is rather invasive, so this > does not really qualify for a back-patch. No objections to simplify > the backend on HEAD though. It would be good if you could actually > register a patch to the commit fest app [1] and also rework the patch > set so as at least PathNameCreateTemporaryDir wins its simplifications > for the first problem (double-checking the other code paths would be > nice as well). The EEXIST handling, and the confusion about EEXIST > showing for both a path and a file need some separate handling (not > sure what to do on these parts yet). Thanks for the suggestion and information, we will rework the patch and register it. The planned changes are: 1) remove the tests (cool!), 2) simplify PathNameCreateTemporaryDir() and other callers. The EEXIST handling will be put in a separate patch so depends on the reviews we could accept or drop it easily. Best Regards Ning