Thread: Failure with 004_logrotate in prairiedog

Failure with 004_logrotate in prairiedog

From
Michael Paquier
Date:
Hi all,

prairiedog has failed in a way that seems a bit obscure to me:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-07-18%2000%3A23%3A29

Here are the details of the failure:
server signaled to rotate log file
could not read

"/Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pg_ctl/tmp_check/t_004_logrotate_primary_data/pgdata/current_logfiles":
No such file or directory at t/004_logrotate.pl line 78
### Stopping node "primary" using mode immediate

update_metainfo_datafile() creates a temporary file renamed to
current_logfiles with rename().  It should be atomic, though this
error points out that this is not the case?  The previous steps of
this test ensure that current_logfiles should exist.

We could use some eval blocks in this area, but a non-atomic rename()
would cause problems in more areas.  Thoughts?
--
Michael

Attachment

Re: Failure with 004_logrotate in prairiedog

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> prairiedog has failed in a way that seems a bit obscure to me:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-07-18%2000%3A23%3A29
> ...
> We could use some eval blocks in this area, but a non-atomic rename()
> would cause problems in more areas.  Thoughts?

Awhile back we discovered that old macOS versions have non-atomic rename
[1].  I eventually shut down dromedary because that was causing failures
often enough to be annoying.  I'd not seen such a failure before on
prairiedog, but it sure seems plausible that this is one.

            regards, tom lane

[1] https://www.postgresql.org/message-id/2636.1569016167@sss.pgh.pa.us



Re: Failure with 004_logrotate in prairiedog

From
Michael Paquier
Date:
On Sun, Jul 18, 2021 at 01:42:18AM -0400, Tom Lane wrote:
> Awhile back we discovered that old macOS versions have non-atomic rename
> [1].  I eventually shut down dromedary because that was causing failures
> often enough to be annoying.  I'd not seen such a failure before on
> prairiedog, but it sure seems plausible that this is one.

Thanks for the pointer.  This indeed looks like the same problem.
--
Michael

Attachment

Re: Failure with 004_logrotate in prairiedog

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Sun, Jul 18, 2021 at 01:42:18AM -0400, Tom Lane wrote:
>> Awhile back we discovered that old macOS versions have non-atomic rename
>> [1].  I eventually shut down dromedary because that was causing failures
>> often enough to be annoying.  I'd not seen such a failure before on
>> prairiedog, but it sure seems plausible that this is one.

> Thanks for the pointer.  This indeed looks like the same problem.

For context, dromedary (now florican) is a dual-CPU machine while
prairiedog has but one CPU.  I'd thought that maybe not being
multi-CPU insulated prairiedog from the non-atomic-rename problem.
But now it looks like it's merely a lot less probable on that hardware.

            regards, tom lane



Re: Failure with 004_logrotate in prairiedog

From
Kyotaro Horiguchi
Date:
At Sun, 18 Jul 2021 12:32:20 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> Hi all,
> 
> prairiedog has failed in a way that seems a bit obscure to me:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-07-18%2000%3A23%3A29
> 
> Here are the details of the failure:
> server signaled to rotate log file
> could not read
>
"/Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pg_ctl/tmp_check/t_004_logrotate_primary_data/pgdata/current_logfiles":
> No such file or directory at t/004_logrotate.pl line 78
> ### Stopping node "primary" using mode immediate
> 
> update_metainfo_datafile() creates a temporary file renamed to
> current_logfiles with rename().  It should be atomic, though this
> error points out that this is not the case?  The previous steps of
> this test ensure that current_logfiles should exist.
> 
> We could use some eval blocks in this area, but a non-atomic rename()
> would cause problems in more areas.  Thoughts?

PostgresNode.logrotate() just invokes pg_ctl logrotate, which ends
with triggering log rotation by a signal.

When rotation happens, the metainfo file is once removed then
created. If slurp_file in the metafile-checking loop hits the gap, the
slurp_file fails with ENOENT.

For non-win32 platforms, the error is identifiable by #!{ENOENT} but
I'm not sure how we can identify the error for createFile(). $!
doesn't work, and $^E returns a human-readable string in the platform
language..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Failure with 004_logrotate in prairiedog

From
Michael Paquier
Date:
On Mon, Jul 19, 2021 at 04:15:36PM +0900, Kyotaro Horiguchi wrote:
> When rotation happens, the metainfo file is once removed then
> created. If slurp_file in the metafile-checking loop hits the gap, the
> slurp_file fails with ENOENT.

I can read the following code, as of update_metainfo_datafile():
if (rename(LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE) != 0)
    ereport(LOG,
            (errcode_for_file_access(),
             errmsg("could not rename file \"%s\" to \"%s\": %m",
                    LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE)));

This creates a temporary file that gets renamed to current_logfiles.
--
Michael

Attachment

Re: Failure with 004_logrotate in prairiedog

From
Tom Lane
Date:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> When rotation happens, the metainfo file is once removed then
> created. If slurp_file in the metafile-checking loop hits the gap, the
> slurp_file fails with ENOENT.

Oh!  Yeah, that's dumb, we should fix it to use rename().  Can't blame
platform's rename() if it's not being used.

            regards, tom lane



Re: Failure with 004_logrotate in prairiedog

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Jul 19, 2021 at 04:15:36PM +0900, Kyotaro Horiguchi wrote:
>> When rotation happens, the metainfo file is once removed then
>> created. If slurp_file in the metafile-checking loop hits the gap, the
>> slurp_file fails with ENOENT.

> I can read the following code, as of update_metainfo_datafile():
> if (rename(LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE) != 0)

Yeah, ignore my previous message.  There is an unlink up at the top
of the function, which fooled me in my caffeine-deprived state.
But that path is only taken when logging was just turned off, so
we must remove the now-irrelevant metafile.

            regards, tom lane



Re: Failure with 004_logrotate in prairiedog

From
Kyotaro Horiguchi
Date:
At Mon, 19 Jul 2021 10:23:46 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Michael Paquier <michael@paquier.xyz> writes:
> > On Mon, Jul 19, 2021 at 04:15:36PM +0900, Kyotaro Horiguchi wrote:
> >> When rotation happens, the metainfo file is once removed then
> >> created. If slurp_file in the metafile-checking loop hits the gap, the
> >> slurp_file fails with ENOENT.
> 
> > I can read the following code, as of update_metainfo_datafile():
> > if (rename(LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE) != 0)
> 
> Yeah, ignore my previous message.  There is an unlink up at the top
> of the function, which fooled me in my caffeine-deprived state.

Yeah, sorry for the stupidity.

> But that path is only taken when logging was just turned off, so
> we must remove the now-irrelevant metafile.

I'm not sure this is relevant, I found the following article. (as a
token of my apology:p)

http://www.weirdnet.nl/apple/rename.html

> There is an easy way to empirically prove that rename() is not
> atomic on Leopard 10.5.2. All you have to do is create a link to a
> directory, replace that link with a

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center