Thread: pgsql: Improve error handling in RemovePgTempFiles().

pgsql: Improve error handling in RemovePgTempFiles().

From
Tom Lane
Date:
Improve error handling in RemovePgTempFiles().

Modify this function and its subsidiaries so that syscall failures are
reported via ereport(LOG), rather than silently ignored as before.
We don't want to throw a hard ERROR, as that would prevent database
startup, and getting rid of leftover temporary files is not important
enough for that.  On the other hand, not reporting trouble at all
seems like an odd choice not in line with current project norms,
especially since any failure here is quite unexpected.

On the same reasoning, adjust these functions' AllocateDir/ReadDir calls
so that failure to scan a directory results in LOG not ERROR.  I also
removed the previous practice of silently ignoring ENOENT failures during
directory opens --- there are some corner cases where that could happen
given a previous database crash, but that seems like a bad excuse for
ignoring a condition that isn't expected in most cases.  A LOG message
during postmaster start seems OK in such situations, and better than
no output at all.

In passing, make RemovePgTempRelationFiles' test for "is the file name
all digits" look more like the way it's done elsewhere.

Discussion: https://postgr.es/m/19907.1512402254@sss.pgh.pa.us

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/561885db05d3296082ce8750805b8ec322cf9aa1

Modified Files
--------------
src/backend/storage/file/fd.c | 70 +++++++++++++++++++++----------------------
1 file changed, 35 insertions(+), 35 deletions(-)


Re: pgsql: Improve error handling in RemovePgTempFiles().

From
Thomas Munro
Date:
On Tue, Dec 5, 2017 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Improve error handling in RemovePgTempFiles().
>
> [...] I also
> removed the previous practice of silently ignoring ENOENT failures during
> directory opens --- there are some corner cases where that could happen
> given a previous database crash, but that seems like a bad excuse for
> ignoring a condition that isn't expected in most cases.  A LOG message
> during postmaster start seems OK in such situations, and better than
> no output at all.

Was it intentional that a newly created cluster produces a LOG error
on startup until you eventually do something to cause base/pgsql_tmp
to exist?  Same for each tablespace you create.

2018-01-07 16:34:20.352 NZDT [77400] LOG:  could not open directory
"base/pgsql_tmp": No such file or directory
2018-01-07 16:34:20.354 NZDT [77400] LOG:  could not open directory
"pg_tblspc/16384/PG_11_201712251/pgsql_tmp": No such file or directory

Perhaps RemovePgTempFiles() could check if each one exists before
calling RemovePgTempFilesInDir(), like in the attached?  Alternatively
we could make RemovePgTempFilesInDir() return early if temp_dir ==
NULL (going against your commit message above), or I suppose we could
arrange for temporary directories always to exist in base and each
tablespace.

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: pgsql: Improve error handling in RemovePgTempFiles().

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Tue, Dec 5, 2017 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Improve error handling in RemovePgTempFiles().

> Was it intentional that a newly created cluster produces a LOG error
> on startup until you eventually do something to cause base/pgsql_tmp
> to exist?  Same for each tablespace you create.

> 2018-01-07 16:34:20.352 NZDT [77400] LOG:  could not open directory
> "base/pgsql_tmp": No such file or directory

Ooops, nope.

> Perhaps RemovePgTempFiles() could check if each one exists before
> calling RemovePgTempFilesInDir(), like in the attached?  Alternatively
> we could make RemovePgTempFilesInDir() return early if temp_dir ==
> NULL (going against your commit message above), or I suppose we could
> arrange for temporary directories always to exist in base and each
> tablespace.

Seems like the first of those alternatives is the best, but it's pretty
late here and my brain is fried.  Will think about this mañana.

            regards, tom lane


Re: pgsql: Improve error handling in RemovePgTempFiles().

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Perhaps RemovePgTempFiles() could check if each one exists before
> calling RemovePgTempFilesInDir(), like in the attached?  Alternatively
> we could make RemovePgTempFilesInDir() return early if temp_dir ==
> NULL (going against your commit message above), or I suppose we could
> arrange for temporary directories always to exist in base and each
> tablespace.

After further thought I don't especially like your solution as written:
it has a race condition if the directory is deleted between the pre-check
and the AllocateDir proper.  What seems better to me is a tighter version
of your second alternative: let's add a "missing_ok" parameter to
RemovePgTempFilesInDir, which'd be passed as true only at the top level,
and do something like

    temp_dir = AllocateDir(tmpdirname);

+    if (temp_dir == NULL && errno == ENOENT && missing_ok)
+        return;
+
    while ((temp_de = ReadDirExtended(temp_dir, tmpdirname, LOG)) != NULL)

This might be problematic if RemovePgTempFilesInDir were a globally
exposed API, but it isn't, so adding a parameter seems fine.

Hmmm ... actually, in the recursive call case, it wouldn't be that
awful to ignore ENOENT either; if a directory goes away between being
stat'd and being opened, you'd still get a log message about rmdir
failure at the caller level.  So maybe we should just do your
second alternative as-is (ie, code as above but without missing_ok).
Having an explicit control parameter seems marginally clearer but
I'm not sure it's buying anything meaningful.  Thoughts?

            regards, tom lane


Re: pgsql: Improve error handling in RemovePgTempFiles().

From
Tom Lane
Date:
I wrote:
> Hmmm ... actually, in the recursive call case, it wouldn't be that
> awful to ignore ENOENT either; if a directory goes away between being
> stat'd and being opened, you'd still get a log message about rmdir
> failure at the caller level.  So maybe we should just do your
> second alternative as-is (ie, code as above but without missing_ok).
> Having an explicit control parameter seems marginally clearer but
> I'm not sure it's buying anything meaningful.  Thoughts?

Hearing no comments, I did it the first way.

            regards, tom lane


Re: pgsql: Improve error handling in RemovePgTempFiles().

From
Thomas Munro
Date:
On Mon, Jan 8, 2018 at 2:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Hmmm ... actually, in the recursive call case, it wouldn't be that
>> awful to ignore ENOENT either; if a directory goes away between being
>> stat'd and being opened, you'd still get a log message about rmdir
>> failure at the caller level.  So maybe we should just do your
>> second alternative as-is (ie, code as above but without missing_ok).
>> Having an explicit control parameter seems marginally clearer but
>> I'm not sure it's buying anything meaningful.  Thoughts?
>
> Hearing no comments, I did it the first way.

It's funny that the two boolean arguments are always opposite.
They're essentially both saying "top level?".  I was also a bit
confused about who else would delete the file in between the check and
the attempt to open it with my proposal, considering this is code
running in the postmaster at startup, so I figured I must be missing
something and hadn't got around to figure out what and replying.
Thanks for fixing this.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: pgsql: Improve error handling in RemovePgTempFiles().

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Mon, Jan 8, 2018 at 2:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hearing no comments, I did it the first way.

> It's funny that the two boolean arguments are always opposite.
> They're essentially both saying "top level?".

Yeah.  I thought about using a single "bool top_level" argument instead,
but concluded that this way is clearer and potentially more flexible.
There are other places where we have similar cases of flag arguments
that are redundant in current usage, but we leave it like that for
clarity as to what the function's API spec is.

            regards, tom lane