Thread: Failure with 004_logrotate in prairiedog
Hi all, prairiedog has failed in a way that seems a bit obscure to me: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-07-18%2000%3A23%3A29 Here are the details of the failure: server signaled to rotate log file could not read "/Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pg_ctl/tmp_check/t_004_logrotate_primary_data/pgdata/current_logfiles": No such file or directory at t/004_logrotate.pl line 78 ### Stopping node "primary" using mode immediate update_metainfo_datafile() creates a temporary file renamed to current_logfiles with rename(). It should be atomic, though this error points out that this is not the case? The previous steps of this test ensure that current_logfiles should exist. We could use some eval blocks in this area, but a non-atomic rename() would cause problems in more areas. Thoughts? -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > prairiedog has failed in a way that seems a bit obscure to me: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-07-18%2000%3A23%3A29 > ... > We could use some eval blocks in this area, but a non-atomic rename() > would cause problems in more areas. Thoughts? Awhile back we discovered that old macOS versions have non-atomic rename [1]. I eventually shut down dromedary because that was causing failures often enough to be annoying. I'd not seen such a failure before on prairiedog, but it sure seems plausible that this is one. regards, tom lane [1] https://www.postgresql.org/message-id/2636.1569016167@sss.pgh.pa.us
On Sun, Jul 18, 2021 at 01:42:18AM -0400, Tom Lane wrote: > Awhile back we discovered that old macOS versions have non-atomic rename > [1]. I eventually shut down dromedary because that was causing failures > often enough to be annoying. I'd not seen such a failure before on > prairiedog, but it sure seems plausible that this is one. Thanks for the pointer. This indeed looks like the same problem. -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > On Sun, Jul 18, 2021 at 01:42:18AM -0400, Tom Lane wrote: >> Awhile back we discovered that old macOS versions have non-atomic rename >> [1]. I eventually shut down dromedary because that was causing failures >> often enough to be annoying. I'd not seen such a failure before on >> prairiedog, but it sure seems plausible that this is one. > Thanks for the pointer. This indeed looks like the same problem. For context, dromedary (now florican) is a dual-CPU machine while prairiedog has but one CPU. I'd thought that maybe not being multi-CPU insulated prairiedog from the non-atomic-rename problem. But now it looks like it's merely a lot less probable on that hardware. regards, tom lane
At Sun, 18 Jul 2021 12:32:20 +0900, Michael Paquier <michael@paquier.xyz> wrote in > Hi all, > > prairiedog has failed in a way that seems a bit obscure to me: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-07-18%2000%3A23%3A29 > > Here are the details of the failure: > server signaled to rotate log file > could not read > "/Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pg_ctl/tmp_check/t_004_logrotate_primary_data/pgdata/current_logfiles": > No such file or directory at t/004_logrotate.pl line 78 > ### Stopping node "primary" using mode immediate > > update_metainfo_datafile() creates a temporary file renamed to > current_logfiles with rename(). It should be atomic, though this > error points out that this is not the case? The previous steps of > this test ensure that current_logfiles should exist. > > We could use some eval blocks in this area, but a non-atomic rename() > would cause problems in more areas. Thoughts? PostgresNode.logrotate() just invokes pg_ctl logrotate, which ends with triggering log rotation by a signal. When rotation happens, the metainfo file is once removed then created. If slurp_file in the metafile-checking loop hits the gap, the slurp_file fails with ENOENT. For non-win32 platforms, the error is identifiable by #!{ENOENT} but I'm not sure how we can identify the error for createFile(). $! doesn't work, and $^E returns a human-readable string in the platform language.. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Jul 19, 2021 at 04:15:36PM +0900, Kyotaro Horiguchi wrote: > When rotation happens, the metainfo file is once removed then > created. If slurp_file in the metafile-checking loop hits the gap, the > slurp_file fails with ENOENT. I can read the following code, as of update_metainfo_datafile(): if (rename(LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE) != 0) ereport(LOG, (errcode_for_file_access(), errmsg("could not rename file \"%s\" to \"%s\": %m", LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE))); This creates a temporary file that gets renamed to current_logfiles. -- Michael
Attachment
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes: > When rotation happens, the metainfo file is once removed then > created. If slurp_file in the metafile-checking loop hits the gap, the > slurp_file fails with ENOENT. Oh! Yeah, that's dumb, we should fix it to use rename(). Can't blame platform's rename() if it's not being used. regards, tom lane
Michael Paquier <michael@paquier.xyz> writes: > On Mon, Jul 19, 2021 at 04:15:36PM +0900, Kyotaro Horiguchi wrote: >> When rotation happens, the metainfo file is once removed then >> created. If slurp_file in the metafile-checking loop hits the gap, the >> slurp_file fails with ENOENT. > I can read the following code, as of update_metainfo_datafile(): > if (rename(LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE) != 0) Yeah, ignore my previous message. There is an unlink up at the top of the function, which fooled me in my caffeine-deprived state. But that path is only taken when logging was just turned off, so we must remove the now-irrelevant metafile. regards, tom lane
At Mon, 19 Jul 2021 10:23:46 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in > Michael Paquier <michael@paquier.xyz> writes: > > On Mon, Jul 19, 2021 at 04:15:36PM +0900, Kyotaro Horiguchi wrote: > >> When rotation happens, the metainfo file is once removed then > >> created. If slurp_file in the metafile-checking loop hits the gap, the > >> slurp_file fails with ENOENT. > > > I can read the following code, as of update_metainfo_datafile(): > > if (rename(LOG_METAINFO_DATAFILE_TMP, LOG_METAINFO_DATAFILE) != 0) > > Yeah, ignore my previous message. There is an unlink up at the top > of the function, which fooled me in my caffeine-deprived state. Yeah, sorry for the stupidity. > But that path is only taken when logging was just turned off, so > we must remove the now-irrelevant metafile. I'm not sure this is relevant, I found the following article. (as a token of my apology:p) http://www.weirdnet.nl/apple/rename.html > There is an easy way to empirically prove that rename() is not > atomic on Leopard 10.5.2. All you have to do is create a link to a > directory, replace that link with a regards. -- Kyotaro Horiguchi NTT Open Source Software Center