Thread: pgsql: Improve error handling in RemovePgTempFiles().
Improve error handling in RemovePgTempFiles(). Modify this function and its subsidiaries so that syscall failures are reported via ereport(LOG), rather than silently ignored as before. We don't want to throw a hard ERROR, as that would prevent database startup, and getting rid of leftover temporary files is not important enough for that. On the other hand, not reporting trouble at all seems like an odd choice not in line with current project norms, especially since any failure here is quite unexpected. On the same reasoning, adjust these functions' AllocateDir/ReadDir calls so that failure to scan a directory results in LOG not ERROR. I also removed the previous practice of silently ignoring ENOENT failures during directory opens --- there are some corner cases where that could happen given a previous database crash, but that seems like a bad excuse for ignoring a condition that isn't expected in most cases. A LOG message during postmaster start seems OK in such situations, and better than no output at all. In passing, make RemovePgTempRelationFiles' test for "is the file name all digits" look more like the way it's done elsewhere. Discussion: https://postgr.es/m/19907.1512402254@sss.pgh.pa.us Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/561885db05d3296082ce8750805b8ec322cf9aa1 Modified Files -------------- src/backend/storage/file/fd.c | 70 +++++++++++++++++++++---------------------- 1 file changed, 35 insertions(+), 35 deletions(-)
On Tue, Dec 5, 2017 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Improve error handling in RemovePgTempFiles(). > > [...] I also > removed the previous practice of silently ignoring ENOENT failures during > directory opens --- there are some corner cases where that could happen > given a previous database crash, but that seems like a bad excuse for > ignoring a condition that isn't expected in most cases. A LOG message > during postmaster start seems OK in such situations, and better than > no output at all. Was it intentional that a newly created cluster produces a LOG error on startup until you eventually do something to cause base/pgsql_tmp to exist? Same for each tablespace you create. 2018-01-07 16:34:20.352 NZDT [77400] LOG: could not open directory "base/pgsql_tmp": No such file or directory 2018-01-07 16:34:20.354 NZDT [77400] LOG: could not open directory "pg_tblspc/16384/PG_11_201712251/pgsql_tmp": No such file or directory Perhaps RemovePgTempFiles() could check if each one exists before calling RemovePgTempFilesInDir(), like in the attached? Alternatively we could make RemovePgTempFilesInDir() return early if temp_dir == NULL (going against your commit message above), or I suppose we could arrange for temporary directories always to exist in base and each tablespace. -- Thomas Munro http://www.enterprisedb.com
Attachment
Thomas Munro <thomas.munro@enterprisedb.com> writes: > On Tue, Dec 5, 2017 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Improve error handling in RemovePgTempFiles(). > Was it intentional that a newly created cluster produces a LOG error > on startup until you eventually do something to cause base/pgsql_tmp > to exist? Same for each tablespace you create. > 2018-01-07 16:34:20.352 NZDT [77400] LOG: could not open directory > "base/pgsql_tmp": No such file or directory Ooops, nope. > Perhaps RemovePgTempFiles() could check if each one exists before > calling RemovePgTempFilesInDir(), like in the attached? Alternatively > we could make RemovePgTempFilesInDir() return early if temp_dir == > NULL (going against your commit message above), or I suppose we could > arrange for temporary directories always to exist in base and each > tablespace. Seems like the first of those alternatives is the best, but it's pretty late here and my brain is fried. Will think about this mañana. regards, tom lane
Thomas Munro <thomas.munro@enterprisedb.com> writes: > Perhaps RemovePgTempFiles() could check if each one exists before > calling RemovePgTempFilesInDir(), like in the attached? Alternatively > we could make RemovePgTempFilesInDir() return early if temp_dir == > NULL (going against your commit message above), or I suppose we could > arrange for temporary directories always to exist in base and each > tablespace. After further thought I don't especially like your solution as written: it has a race condition if the directory is deleted between the pre-check and the AllocateDir proper. What seems better to me is a tighter version of your second alternative: let's add a "missing_ok" parameter to RemovePgTempFilesInDir, which'd be passed as true only at the top level, and do something like temp_dir = AllocateDir(tmpdirname); + if (temp_dir == NULL && errno == ENOENT && missing_ok) + return; + while ((temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL) This might be problematic if RemovePgTempFilesInDir were a globally exposed API, but it isn't, so adding a parameter seems fine. Hmmm ... actually, in the recursive call case, it wouldn't be that awful to ignore ENOENT either; if a directory goes away between being stat'd and being opened, you'd still get a log message about rmdir failure at the caller level. So maybe we should just do your second alternative as-is (ie, code as above but without missing_ok). Having an explicit control parameter seems marginally clearer but I'm not sure it's buying anything meaningful. Thoughts? regards, tom lane
I wrote: > Hmmm ... actually, in the recursive call case, it wouldn't be that > awful to ignore ENOENT either; if a directory goes away between being > stat'd and being opened, you'd still get a log message about rmdir > failure at the caller level. So maybe we should just do your > second alternative as-is (ie, code as above but without missing_ok). > Having an explicit control parameter seems marginally clearer but > I'm not sure it's buying anything meaningful. Thoughts? Hearing no comments, I did it the first way. regards, tom lane
On Mon, Jan 8, 2018 at 2:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> Hmmm ... actually, in the recursive call case, it wouldn't be that >> awful to ignore ENOENT either; if a directory goes away between being >> stat'd and being opened, you'd still get a log message about rmdir >> failure at the caller level. So maybe we should just do your >> second alternative as-is (ie, code as above but without missing_ok). >> Having an explicit control parameter seems marginally clearer but >> I'm not sure it's buying anything meaningful. Thoughts? > > Hearing no comments, I did it the first way. It's funny that the two boolean arguments are always opposite. They're essentially both saying "top level?". I was also a bit confused about who else would delete the file in between the check and the attempt to open it with my proposal, considering this is code running in the postmaster at startup, so I figured I must be missing something and hadn't got around to figure out what and replying. Thanks for fixing this. -- Thomas Munro http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes: > On Mon, Jan 8, 2018 at 2:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hearing no comments, I did it the first way. > It's funny that the two boolean arguments are always opposite. > They're essentially both saying "top level?". Yeah. I thought about using a single "bool top_level" argument instead, but concluded that this way is clearer and potentially more flexible. There are other places where we have similar cases of flag arguments that are redundant in current usage, but we leave it like that for clarity as to what the function's API spec is. regards, tom lane