Thread: thinko in basic_archive.c

thinko in basic_archive.c

From
Nathan Bossart
Date:
Hi hackers,

I intended for the temporary file name generated by basic_archive.c to
include the current timestamp so that the name was "sufficiently unique."
Of course, this could also be used to determine the creation time, but you
can just as easily use stat(1) for that.  In any case, I forgot to divide
the microseconds field by 1000 to obtain the current timestamp in
milliseconds, so while the value is unique, it's also basically garbage.
I've attached a small patch that fixes this so that the temporary file name
includes the timestamp in milliseconds for when it was created.
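
For clarity, the intended computation is roughly the following (a sketch, not
the literal patch; variable names roughly mirror the ones basic_archive.c
already uses):

    struct timeval tv;
    uint64      epoch;      /* milliseconds since the Unix epoch */

    gettimeofday(&tv, NULL);
    /* seconds -> milliseconds, plus the microseconds field scaled down */
    epoch = (uint64) tv.tv_sec * 1000 + (uint64) (tv.tv_usec / 1000);

    snprintf(temp, sizeof(temp), "%s/archtemp.%s.%d.%llu",
             archive_directory, file, MyProcPid,
             (unsigned long long) epoch);

The bug was simply that tv.tv_usec was added without the division by 1000.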

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: thinko in basic_archive.c

From
Bharath Rupireddy
Date:
On Fri, Oct 14, 2022 at 10:11 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> I intended for the temporary file name generated by basic_archive.c to

I'm trying to understand this a bit:

    /*
     * Pick a sufficiently unique name for the temporary file so that a
     * collision is unlikely.  This helps avoid problems in case a temporary
     * file was left around after a crash or another server happens to be
     * archiving to the same directory.
     */

Given that the temp file name includes the WAL file name, the epoch at
millisecond scale, and MyProcPid, can there be name collisions after a
server crash, or even when multiple servers with different PIDs are
archiving/copying the same WAL file to the same directory?

What happens to the left-over temp files after a server crash? Will
they be lying around in the archive directory? I understand that we
can't remove such files because we can't distinguish left-over files
from a crash and the temp files that another server is in the process
of copying.

If the goal is to copy files atomically, why can't we name the temp
file 'wal_file_name.pid.temp', assuming no PID wraparound and get rid
of appending time? Since basic_archive is a test module illustrating
archive_library implementation, do we really need to worry about name
collisions?

> I've attached a small patch that fixes this so that the temporary file name
> includes the timestamp in milliseconds for when it was created.

The patch LGTM.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: thinko in basic_archive.c

From
Nathan Bossart
Date:
On Fri, Oct 14, 2022 at 02:15:19PM +0530, Bharath Rupireddy wrote:
> Given that the temp file name includes the WAL file name, the epoch at
> millisecond scale, and MyProcPid, can there be name collisions after a
> server crash, or even when multiple servers with different PIDs are
> archiving/copying the same WAL file to the same directory?

While unlikely, I think it's theoretically possible.  If there is a
collision, archiving should fail and retry later with a different temporary
file name.

> What happens to the left-over temp files after a server crash? Will
> they be lying around in the archive directory? I understand that we
> can't remove such files because we can't distinguish left-over files
> from a crash and the temp files that another server is in the process
> of copying.

The temporary files are not automatically removed after a crash.  The
documentation for basic_archive has a note about this [0].

> If the goal is to copy files atomically, why can't we name the temp
> file 'wal_file_name.pid.temp', assuming no PID wraparound and get rid
> of appending time? Since basic_archive is a test module illustrating
> archive_library implementation, do we really need to worry about name
> collisions?

Yeah, it's debatable how much we care about this for basic_archive.  We
previously decided that we at least care a little [1], so that's why we
have such elaborate temporary file names.  If anything, I hope that the
presence of this logic causes archive module authors to think about these
problems.

> The patch LGTM.

Thanks!

[0] https://www.postgresql.org/docs/devel/basic-archive.html#id-1.11.7.15.6
[1] https://postgr.es/m/CA%2BTgmoaSkSmo22SwJaV%2BycNPoGpxe0JV%3DTcTbh4ip8Cwjr0ULQ%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: thinko in basic_archive.c

From
Bharath Rupireddy
Date:
On Sat, Oct 15, 2022 at 12:03 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Fri, Oct 14, 2022 at 02:15:19PM +0530, Bharath Rupireddy wrote:
> > Given that the temp file name includes the WAL file name, the epoch at
> > millisecond scale, and MyProcPid, can there be name collisions after a
> > server crash, or even when multiple servers with different PIDs are
> > archiving/copying the same WAL file to the same directory?
>
> While unlikely, I think it's theoretically possible.

Can you please help me understand how name collisions can happen with
temp file names including the WAL file name, a timestamp at millisecond
scale, and the PID? Having the timestamp is enough to provide a unique
temp file name even when PID wraparound occurs, right? Am I missing
something here?

> > What happens to the left-over temp files after a server crash? Will
> > they be lying around in the archive directory? I understand that we
> > can't remove such files because we can't distinguish left-over files
> > from a crash and the temp files that another server is in the process
> > of copying.
>
> The temporary files are not automatically removed after a crash.  The
> documentation for basic_archive has a note about this [0].

Hm, we cannot remove the temp file for all sorts of crashes, but
having on_shmem_exit() or before_shmem_exit() or atexit() or any such
callback remove it would help us cover at least some crash scenarios
(those that exit with proc_exit() or exit()). I think the basic_archive
module currently leaves temp files around even when the server is
restarted legitimately while copying to or renaming the temp file, no?

I can quickly find these exit callbacks deleting the files:
atexit(cleanup_directories_atexit);
atexit(remove_temp);
before_shmem_exit(ReplicationSlotShmemExit, 0);
before_shmem_exit(logicalrep_worker_onexit, (Datum) 0);
before_shmem_exit(BeforeShmemExit_Files, 0);

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: thinko in basic_archive.c

From
Nathan Bossart
Date:
On Sat, Oct 15, 2022 at 10:19:05AM +0530, Bharath Rupireddy wrote:
> Can you please help me understand how name collisions can happen with
> temp file names including the WAL file name, a timestamp at millisecond
> scale, and the PID? Having the timestamp is enough to provide a unique
> temp file name even when PID wraparound occurs, right? Am I missing
> something here?

Outside of contrived cases involving multiple servers, inaccurate clocks,
PID reuse, etc., it seems unlikely.

> Hm, we cannot remove the temp file for all sorts of crashes, but
> having on_shmem_exit() or before_shmem_exit() or atexit() or any such
> callback remove it would help us cover at least some crash scenarios
> (those that exit with proc_exit() or exit()). I think the basic_archive
> module currently leaves temp files around even when the server is
> restarted legitimately while copying to or renaming the temp file, no?

I think the right way to do this would be to add handling for leftover
files in the sigsetjmp() block and a shutdown callback (which just sets up
a before_shmem_exit callback).  While this should ensure those files are
cleaned up after an ERROR or FATAL, crashes and unlink() failures could
still leave files behind.  We'd probably also need to avoid cleaning up the
temp file if copy_file() fails because it already exists, as we won't know
if it's actually ours.  Overall, I suspect it'd be more trouble than it's
worth.
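
To sketch what I mean (hypothetical names, and assuming a single static
buffer is enough to track the file we might have created):

    static char tempfilepath[MAXPGPATH];    /* set after copy_file() creates it */

    static void
    basic_archive_cleanup(int code, Datum arg)
    {
        /* Remove the temp file only if we created it and never renamed it. */
        if (tempfilepath[0] != '\0')
        {
            (void) unlink(tempfilepath);
            tempfilepath[0] = '\0';
        }
    }

    /* in the module's shutdown callback: */
    before_shmem_exit(basic_archive_cleanup, (Datum) 0);

The sigsetjmp() block would call the same routine, and the "is it actually
ours?" question is exactly the part this sketch doesn't solve.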

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: thinko in basic_archive.c

From
Michael Paquier
Date:
On Sat, Oct 15, 2022 at 02:10:26PM -0700, Nathan Bossart wrote:
> On Sat, Oct 15, 2022 at 10:19:05AM +0530, Bharath Rupireddy wrote:
>> Can you please help me understand how name collisions can happen with
>> temp file names including the WAL file name, a timestamp at millisecond
>> scale, and the PID? Having the timestamp is enough to provide a unique
>> temp file name even when PID wraparound occurs, right? Am I missing
>> something here?
>
> Outside of contrived cases involving multiple servers, inaccurate clocks,
> PID reuse, etc., it seems unlikely.

With a name based on a PID (in a world where pid_max can be larger than
the default) and a timestamp, I would say it is even more unlikely than
what you are implying with "unlikely" ;p

> I think the right way to do this would be to add handling for leftover
> files in the sigsetjmp() block and a shutdown callback (which just sets up
> a before_shmem_exit callback).  While this should ensure those files are
> cleaned up after an ERROR or FATAL, crashes and unlink() failures could
> still leave files behind.  We'd probably also need to avoid cleaning up the
> temp file if copy_file() fails because it already exists, as we won't know
> if it's actually ours.  Overall, I suspect it'd be more trouble than it's
> worth.

Agreed.  My opinion is that we should keep basic_archive as
minimalistic as we can: short but still useful.  It does not have to be
perfect, just fit with what we want it to show, as a reference.

Anyway, the math was wrong, so I have applied the patch from upthread,
with an extra pair of parentheses, a comment where epoch is declared
to say that it is in milliseconds, and a comment in basic_archive's
Makefile to mention the reason why we have NO_INSTALLCHECK.
--
Michael

Attachment

Re: thinko in basic_archive.c

From
Robert Haas
Date:
On Fri, Oct 14, 2022 at 4:45 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> What happens to the left-over temp files after a server crash? Will
> they be lying around in the archive directory? I understand that we
> can't remove such files because we can't distinguish left-over files
> from a crash and the temp files that another server is in the process
> of copying.

Yeah, leaving a potentially unbounded number of files around after
system crashes seems pretty unfriendly. I'm not sure how to fix that,
exactly. We could use a name based on the database system identifier
if we thought that we might be archiving from multiple unrelated
clusters to the same directory, but presumably the real hazard is a
bunch of machines that are doing physical replication among
themselves, and will therefore share a system identifier. There might
be no better answer than to suggest that temporary files that are
"old" should be removed by means external to the database, but that's
not an entirely satisfying answer.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: thinko in basic_archive.c

From
Bharath Rupireddy
Date:
On Mon, Oct 17, 2022 at 6:45 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Oct 14, 2022 at 4:45 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > What happens to the left-over temp files after a server crash? Will
> > they be lying around in the archive directory? I understand that we
> > can't remove such files because we can't distinguish left-over files
> > from a crash and the temp files that another server is in the process
> > of copying.
>
> Yeah, leaving a potentially unbounded number of files around after
> system crashes seems pretty unfriendly. I'm not sure how to fix that,
> exactly.

A simple server restart while the basic_archive module is copying
to/from the temp file would leave the file behind, see [1]. I think we
can fix this by defining shutdown_cb for the basic_archive module, like
the attached patch. While this helps in most of the crashes, it does
not cover all of them. However, this is better than what we have right now.

[1] ubuntu:~/postgres/contrib/basic_archive/archive_directory$ ls
000000010000000000000001
archtemp.000000010000000000000002.2493876.1666091933457
archtemp.000000010000000000000002.2495316.1666091958680

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: thinko in basic_archive.c

From
Kyotaro Horiguchi
Date:
 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> On Mon, Oct 17, 2022 at 6:45 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, Oct 14, 2022 at 4:45 AM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > > What happens to the left-over temp files after a server crash? Will
> > > they be lying around in the archive directory? I understand that we
> > > can't remove such files because we can't distinguish left-over files
> > > from a crash and the temp files that another server is in the process
> > > of copying.
> >
> > Yeah, leaving a potentially unbounded number of files around after
> > system crashes seems pretty unfriendly. I'm not sure how to fix that,
> > exactly.

Unbounded number of sequential crash-restarts itself is a more serious
problem..

An archive module could clean them up at startup or at the first call,
but that might be dangerous (since the archive directory is, I think,
regarded as an external resource).

Honestly, I'm not sure about a reasonable scenario where simultaneous
archiving of the same file is acceptable, though. I feel that we should
not allow simultaneous archiving of the same segment, by some kind of
interlocking. In other words, we perhaps should serialize duplicate
archiving of the same file.

In another direction, the current code allows duplicate simultaneous
copying to temporary files with different names, and then the latest
rename wins.  We could reach almost the same result (on Linux (or
POSIX?)) by unlinking the existing temp file first, then creating a new
one with the same name, then continuing. Even if the temp file were left
behind after a crash, that file would be unlinked at the next attempt at
archiving. But I'm not sure how this can be done on Windows..  In the
first place, I'm not sure whether the latest-unconditionally-wins policy
is appropriate, though.
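
A rough sketch of that direction (hypothetical only; one fixed per-segment
name, and the Windows question above still applies):

    char        temp[MAXPGPATH];
    int         fd;

    /* one fixed temp name per WAL segment, e.g. "archtemp.<walfile>" */
    snprintf(temp, sizeof(temp), "%s/archtemp.%s", archive_directory, file);

    /* drop any leftover from an earlier attempt; ENOENT is fine */
    if (unlink(temp) < 0 && errno != ENOENT)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not remove file \"%s\": %m", temp)));

    /* the latest attempt always wins on POSIX; unclear on Windows */
    fd = OpenTransientFile(temp, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
    if (fd < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not create file \"%s\": %m", temp)));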

> A simple server restart while the basic_archive module is copying
> to/from the temp file would leave the file behind, see [1]. I think we
> can fix this by defining shutdown_cb for the basic_archive module, like
> the attached patch. While this helps in most of the crashes, it does
> not cover all of them. However, this is better than what we have right now.

ShutdownWalRecovery() does a similar thing, but as you say this one
covers narrower cases than that, since RestoreArchiveFile()
finally overwrites the left-behind files at the next call for that
file.

# The patch seems to forget to clear the tempfilepath *after* a
# successful renaming. And I don't see how the callback is called.

> [1] ubuntu:~/postgres/contrib/basic_archive/archive_directory$ ls
> 000000010000000000000001
> archtemp.000000010000000000000002.2493876.1666091933457
> archtemp.000000010000000000000002.2495316.1666091958680

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: thinko in basic_archive.c

From
Bharath Rupireddy
Date:
On Wed, Oct 19, 2022 at 8:58 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Unbounded number of sequential crash-restarts itself is a more serious
> problem..

Agree. The basic_archive module currently leaves temp files around
even for normal restarts of the cluster, which is bad IMO.

> An archive module could clean them up at startup or at the first call,
> but that might be dangerous (since the archive directory is, I think,
> regarded as an external resource).

The archive module must be responsible for cleaning up the temp file
that it creates. One way to do it is in the archive module's shutdown
callback; this covers most of the cases, but not all.

> Honestly, I'm not sure about a reasonable scenario where simultaneous
> archiving of the same file is acceptable, though. I feel that we should
> not allow simultaneous archiving of the same segment, by some kind of
> interlocking. In other words, we perhaps should serialize duplicate
> archiving of the same file.

In a typical production environment, there's some kind of locking (for
instance, lease files) that allows/disallows file
creation/deletion/writes/reads and guarantees that the same file
isn't put into a directory (which can be the archive location) many
times. And, as you rightly said, the archive_directory is something
external to postgres, and we really can't deal with concurrent writers
writing/creating the same files. Even if we somehow try to do it, it
makes things complicated. This is true for any PGDATA directories.
However, the archive module implementers can choose to define such a
locking strategy.

> In another direction, the current code allows duplicate simultaneous
> copying to temporary files with different names, and then the latest
> rename wins.  We could reach almost the same result (on Linux (or
> POSIX?)) by unlinking the existing temp file first, then creating a new
> one with the same name, then continuing. Even if the temp file were left
> behind after a crash, that file would be unlinked at the next attempt at
> archiving. But I'm not sure how this can be done on Windows..  In the
> first place, I'm not sure whether the latest-unconditionally-wins policy
> is appropriate, though.

We really can't just unlink the temp file because it has pid and
timestamp in the filename and it's hard to determine the temp file
that we created earlier.

As far as the basic_archive module is concerned, we ought to keep it
simple. I still think the simplest we can do is to use the
basic_archive's shutdown_cb to delete (in most of the cases, but not
all) the left-over temp file that the module is dealing with
as-of-the-moment and add a note about the users dealing with
concurrent writers to the basic_archive.archive_directory like the
attached v2 patch.

> > A simple server restart while the basic_archive module is copying
> > to/from the temp file would leave the file behind, see [1]. I think we
> > can fix this by defining shutdown_cb for the basic_archive module, like
> > the attached patch. While this helps in most of the crashes, it does
> > not cover all of them. However, this is better than what we have right now.
>
> ShutdownWalRecovery() does a similar thing, but as you say this one
> covers narrower cases than that, since RestoreArchiveFile()
> finally overwrites the left-behind files at the next call for that
> file.

We're using unique temp file names in the basic_archive module, so we
can't overwrite the same file upon restart.

> > # The patch seems to forget to clear the tempfilepath *after* a
> > # successful renaming.

It does so at the beginning of basic_archive_file() which is sufficient.

> And I don't see how the callback is called.

call_archive_module_shutdown_callback()->basic_archive_shutdown().

Please see the attached v2 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: thinko in basic_archive.c

From
Robert Haas
Date:
On Tue, Oct 18, 2022 at 11:28 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> > > Yeah, leaving a potentially unbounded number of files around after
> > > system crashes seems pretty unfriendly. I'm not sure how to fix that,
> > > exactly.
>
> Unbounded number of sequential crash-restarts itself is a more serious
> problem..

They don't have to be sequential. Garbage could accumulate over weeks,
months, or years.

Anyway, I agree we should hope that the system doesn't crash often,
but we cannot prevent the system administrator from removing the power
whenever they like. We can however try to reduce the number of
database-related things that go wrong if this happens, and I think we
should. Bharath's patch seems like it's probably a good idea, and if
we can do better, we should.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: thinko in basic_archive.c

From
Kyotaro Horiguchi
Date:
At Wed, 19 Oct 2022 10:48:03 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Tue, Oct 18, 2022 at 11:28 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > > > Yeah, leaving a potentially unbounded number of files around after
> > > > system crashes seems pretty unfriendly. I'm not sure how to fix that,
> > > > exactly.
> >
> > Unbounded number of sequential crash-restarts itself is a more serious
> > problem..

(Sorry, this was just a joke.)

> They don't have to be sequential. Garbage could accumulate over weeks,
> months, or years.

Sure. Users' archive cleanup facilities won't work if they only
handle files with legitimate WAL file names.

> Anyway, I agree we should hope that the system doesn't crash often,
> but we cannot prevent the system administrator from removing the power
> whenever they like. We can however try to reduce the number of
> database-related things that go wrong if this happens, and I think we
> should. Bharath's patch seems like it's probably a good idea, and if
> we can do better, we should.

Yeah, I don't deny this; rather, I agree.  So, we should name temporary
files so that they are unconditionally identifiable as garbage at the
next startup. (Otherwise they could actually still be active.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: thinko in basic_archive.c

From
Kyotaro Horiguchi
Date:
At Wed, 19 Oct 2022 10:21:12 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> On Wed, Oct 19, 2022 at 8:58 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > An archive module could clean them up at startup or at the first call,
> > but that might be dangerous (since the archive directory is, I think,
> > regarded as an external resource).
> 
> The archive module must be responsible for cleaning up the temp file
> that it creates. One way to do it is in the archive module's shutdown
> callback; this covers most of the cases, but not all.

True. But I agree with Robert that such temporary files should be
cleanup-able without needing temporarily-valid knowledge (current file
name, in this case). A common strategy for this is to name those files
by names that can be identified as garbage.

> > Honestly, I'm not sure about a reasonable scenario where simultaneous
> > archiving of the same file is acceptable, though. I feel that we should
> > not allow simultaneous archiving of the same segment, by some kind of
> > interlocking. In other words, we perhaps should serialize duplicate
> > archiving of the same file.
> 
> In a typical production environment, there's some kind of locking (for
> instance, lease files) that allows/disallows file
> creation/deletion/writes/reads and guarantees that the same file
> isn't put into a directory (which can be the archive location) many
> times. And, as you rightly said, the archive_directory is something
> external to postgres, and we really can't deal with concurrent writers
> writing/creating the same files. Even if we somehow try to do it, it
> makes things complicated. This is true for any PGDATA directories.
> However, the archive module implementers can choose to define such a
> locking strategy.

flock() on nfs..

> > In another direction, the current code allows duplicate simultaneous
> > copying to temporary files with different names, and then the latest
> > rename wins.  We could reach almost the same result (on Linux (or
> > POSIX?)) by unlinking the existing temp file first, then creating a new
> > one with the same name, then continuing. Even if the temp file were left
> > behind after a crash, that file would be unlinked at the next attempt at
> > archiving. But I'm not sure how this can be done on Windows..  In the
> > first place, I'm not sure whether the latest-unconditionally-wins policy
> > is appropriate, though.
> 
> We really can't just unlink the temp file because it has pid and
> timestamp in the filename and it's hard to determine the temp file
> that we created earlier.

But since power cut is a typical crash source, we need to identify
ruined temporary files and the current naming convention is incomplete
in this regard.

The worst case I can come up with, regardless of feasibility, is a
multi-standby physical replication set where all hosts share one
archive directory. Indeed, live and dead temporary files can coexist
there.  However, I think we can identify truly rotten temp files by
inserting the host name or cluster name (meaning cluster_name in
postgresql.conf) even in that case.  This presumes that the DBA names every
cluster differently, but I think DBAs that are going to configure such
a system are required to be very cautious about that kind of aspect.

> As far as the basic_archive module is concerned, we ought to keep it
> simple. I still think the simplest we can do is to use the
> basic_archive's shutdown_cb to delete (in most of the cases, but not

(Sorry, my memory was confused at the time. That callback feature
already existed.)

> all) the left-over temp file that the module is dealing with
> as-of-the-moment and add a note about the users dealing with
> concurrent writers to the basic_archive.archive_directory like the
> attached v2 patch.
> 
> > > A simple server restart while the basic_archive module is copying
> > > to/from the temp file would leave the file behind, see [1]. I think we
> > > can fix this by defining shutdown_cb for the basic_archive module, like
> > > the attached patch. While this helps in most of the crashes, it does
> > > not cover all of them. However, this is better than what we have right now.
> >
> > ShutdownWalRecovery() does a similar thing, but as you say this one
> > covers narrower cases than that, since RestoreArchiveFile()
> > finally overwrites the left-behind files at the next call for that
> > file.
> 
> > We're using unique temp file names in the basic_archive module, so we
> > can't overwrite the same file upon restart.

Of course, this presumes that a cluster uses the same name for a
segment. If we insert cluster_name into the temporary name, a starting
cluster can identify garbage files to clean up. For example, if we
name them as follows:

ARCHTEMP_cluster1_pid_time_<lsn>

A starting cluster can clean up all files starting with
"archtemp_cluster1_*". (We need to choose the delimiter carefully,
though..)
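
A sketch of that, assuming the module can read the cluster_name GUC value
and is allowed to scan its own archive_directory once at startup (all names
hypothetical):

    /* temp name: archtemp_<cluster_name>_<walfile>_<pid>_<epoch-ms> */
    snprintf(temp, sizeof(temp), "%s/archtemp_%s_%s_%d_%llu",
             archive_directory, cluster_name, file, MyProcPid,
             (unsigned long long) epoch);

    /* at startup: everything carrying our own prefix is garbage */
    char        prefix[MAXPGPATH];
    DIR        *dir;
    struct dirent *de;

    snprintf(prefix, sizeof(prefix), "archtemp_%s_", cluster_name);
    dir = AllocateDir(archive_directory);
    while ((de = ReadDir(dir, archive_directory)) != NULL)
    {
        if (strncmp(de->d_name, prefix, strlen(prefix)) == 0)
        {
            char        path[MAXPGPATH * 2];

            snprintf(path, sizeof(path), "%s/%s", archive_directory, de->d_name);
            (void) unlink(path);
        }
    }
    FreeDir(dir);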

> > # The patch seems to forget to clear the tempfilepath *after* a
> > # successful renaming.
> 
> It does so at the beginning of basic_archive_file() which is sufficient.

No. I didn't mean that. If the server stops after a successful
durable_rename but before the next call to
basic_archive_file_internal, that callback makes a false complaint since
that temporary file is actually gone.

> > And I don't see how the callback is called.
> 
> call_archive_module_shutdown_callback()->basic_archive_shutdown().

Yeah, sorry for the noise.

> Please see the attached v2 patch.

+static char    tempfilepath[MAXPGPATH + 256];

MAXPGPATH is the maximum length of a file name that PG assumes to be
able to handle. Thus extending that length seems wrong.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: thinko in basic_archive.c

From
Bharath Rupireddy
Date:
On Thu, Oct 20, 2022 at 6:57 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> > The archive module must be responsible for cleaning up the temp file
> > that it creates. One way to do it is in the archive module's shutdown
> > callback; this covers most of the cases, but not all.
>
> True. But I agree with Robert that such temporary files should be
> cleanup-able without needing temporarily-valid knowledge (current file
> name, in this case). A common strategy for this is to name those files
> by names that can be identified as garbage.

I'm not sure how we can distinguish temp files as garbage based on
name. As Robert pointed out upthread, using system identifier may not
help as the standbys share the same system identifier and it's
possible that they might archive to the same directory. Is there any
other way?

> But since power cut is a typical crash source, we need to identify
> ruined temporary files and the current naming convention is incomplete
> in this regard.

Please note that the basic_archive module creates one temp file at a time
to make file copying/moving atomic, and it can keep track of the temp
file name and delete it using the shutdown callback, which helps in most
of the scenarios. As said upthread, repeated crashes while the
basic_archive module is atomically copying files around are a big
problem in themselves, and the basic_archive module need not worry about
them much.

> flock() on nfs..
>
> The worst case I can come up with, regardless of feasibility, is a
> multi-standby physical replication set where all hosts share one
> archive directory. Indeed, live and dead temporary files can coexist
> there.  However, I think we can identify truly rotten temp files by
> inserting the host name or cluster name (meaning cluster_name in
> postgresql.conf) even in that case.  This presumes that the DBA names every
> cluster differently, but I think DBAs that are going to configure such
> a system are required to be very cautious about that kind of aspect.

Well, these ideas are great! However, we can leave defining such
strategies to archive module implementors. IMO, the basic_archive
module ought to be as simple and elegant as possible while still
showing the usability of the archive modules feature.

> Of course, this presumes that a cluster uses the same name for a
> segment. If we insert cluster_name into the temporary name, a starting
> cluster can identify garbage files to clean up. For example, if we
> name them as follows:
>
> ARCHTEMP_cluster1_pid_time_<lsn>
>
> A starting cluster can clean up all files starting with
> "archtemp_cluster1_*". (We need to choose the delimiter carefully,
> though..)

Postgres cleaning up the basic_archive module's temp files at startup
isn't a great idea IMO, because these files are not related to server
functionality in any way, unlike the temp files removed in
RemovePgTempFiles(). IMO, we ought to keep the basic_archive module
simple.
> No. I didn't mean that. If the server stops after a successful
> durable_rename but before the next call to
> basic_archive_file_internal, that callback makes a false complaint since
> that temporary file is actually gone.

Right. Fixed it.

> > Please see the attached v2 patch.
>
> +static char    tempfilepath[MAXPGPATH + 256];
>
> MAXPGPATH is the maximum length of a file name that PG assumes to be
> able to handle. Thus extending that length seems wrong.

I think it was to accommodate the temp file name - "archtemp", file,
MyProcPid, epoch, but I agree that it can just be MAXPGPATH. However,
most of the places the core defines the path name to be MAXPGPATH +
some bytes.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: thinko in basic_archive.c

From
Kyotaro Horiguchi
Date:
Hi.

Anyway, on second thought, a larger picture than just adding the
post-process-end callback would be out of the scope of this patch. So I'll
write some comments on the patch first, then discuss the rest.


Thu, 20 Oct 2022 13:29:12 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> > No. I didn't mean that. If the server stops after a successful
> > durable_rename but before the next call to
> > basic_archive_file_internal, that callback makes a false complaint since
> > that temporary file is actually gone.
> 
> Right. Fixed it.

Thanks, but we don't need to wipe out the all bytes. Just putting \0
at the beginning of the buffer is sufficient. And the Memset() at the
beginning of basic_archive_file_internal is not needed since that
static variables are initially initialized to zeros.

This is not necessarily needed, but it might be better we empty
tempfilepath after unlinking the file.


> > +static char    tempfilepath[MAXPGPATH + 256];
> >
> > MAXPGPATH is the maximum length of a file name that PG assumes to be
> > able to handle. Thus extending that length seems wrong.
> 
> I think it was to accommodate the temp file name - "archtemp", file,
> MyProcPid, epoch, but I agree that it can just be MAXPGPATH. However,
> most of the places the core defines the path name to be MAXPGPATH +
> some bytes.

Mmm. I found that basic_archive already does the same thing. So let's
follow that in this patch.


+      expectation that a value will soon be provided. Care must be taken when
+      multiple servers are archiving to the same
+      <varname>basic_archive.archive_library</varname> directory as they all
+      might try to archive the same WAL file.

I don't understand what kind of care should be taken by reading this..

Anyway, the PID + millisecond-resolution timestamp works in almost
all cases, but it's not perfect. So.. I can't come up with what to
think about this..

Traditionally we told people that "archiving should not overwrite a
file unconditionally. Generally it is safe only when the contents are
identical, and it should error out otherwise.".. Ah, here it is.

https://www.postgresql.org/docs/devel/continuous-archiving.html

> Archive commands and libraries should generally be designed to
> refuse to overwrite any pre-existing archive file. This is an
> important safety feature to preserve the integrity of your archive
> in case of administrator error (such as sending the output of two
> different servers to the same archive directory). It is advisable to
> test your proposed archive library to ensure that it does not
> overwrite an existing file.
...
> file again after restarting (provided archiving is still
> enabled). When an archive command or library encounters a
> pre-existing file, it should return a zero status or true,
> respectively, if the WAL file has identical contents to the
> pre-existing archive and the pre-existing archive is fully persisted
> to storage. If a pre-existing file contains different contents than
> the WAL file being archived, the archive command or library must
> return a nonzero status or false, respectively.

On the other hand, basic_archive seems to overwrite existing files
unconditionally.  I think this is not great, in that we offer a tool
that betrays our own written suggestion...



The following is out-of-scope discussion.

==================
At Thu, 20 Oct 2022 13:29:12 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> On Thu, Oct 20, 2022 at 6:57 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > > The archive module must be responsible for cleaning up the temp file
> > > that it creates. One way to do it is in the archive module's shutdown
> > > callback; this covers most of the cases, but not all.
> >
> > True. But I agree with Robert that such temporary files should be
> > cleanup-able without needing temporarily-valid knowledge (current file
> > name, in this case). A common strategy for this is to name those files
> > by names that can be identified as garbage.
> 
> I'm not sure how we can distinguish temp files as garbage based on
> name. As Robert pointed out upthread, using system identifier may not
> help as the standbys share the same system identifier and it's
> possible that they might archive to the same directory. Is there any
> other way?

A correct naming scheme would lead to a resolution.

> > But since power cut is a typical crash source, we need to identify
> > ruined temporary files and the current naming convention is incomplete
> > in this regard.
> 
> Please note that the basic_archive module creates one temp file at a time
> to make file copying/moving atomic, and it can keep track of the temp
> file name and delete it using the shutdown callback, which helps in most
> of the scenarios. As said upthread, repeated crashes while the
> basic_archive module is atomically copying files around are a big
> problem in themselves, and the basic_archive module need not worry about
> them much.

I'm not sure. It's a "basic_archiver", not an "example_archiver".  I
read the name as "it is not highly configurable but practically
usable".  By this criterion, a cleanup feature is not too much.

> > there.  However, I think we can identify truly rotten temp files by
> > inserting the host name or cluster name (meaning cluster_name in
> > postgresql.conf) even in that case.  This presumes that the DBA names every
> 
> Well, these ideas are great! However, we can leave defining such
> strategies to archive module implementors. IMO, the basic_archive
> module ought to be as simple and elegant as possible while still
> showing the usability of the archive modules feature.

I don't understand why garbage-cleanup is not elegant.

On the other hand, if we pursue minimalism for this tool, we don't
need the timestamp part, since this tool cannot write two or more files
simultaneously by the same process. (I don't mean we should remove that
part.)

> > Of course, this presumes that a cluster uses the same name for a
> > segment. If we insert cluster_name into the temporary name, a starting
> > cluster can identify garbage files to clean up. For example, if we
> > name them as follows:
> >
> > ARCHTEMP_cluster1_pid_time_<lsn>
> >
> > A starting cluster can clean up all files starting with
> > "archtemp_cluster1_*". (We need to choose the delimiter carefully,
> > though..)
> 
> Postgres cleaning up the basic_archive module's temp files at startup
> isn't a great idea IMO, because these files are not related to server
> functionality in any way, unlike the temp files removed in
> RemovePgTempFiles(). IMO, we ought to keep the basic_archive module
> simple.

I think "init" feature is mandatory. But it would be another project.

> > > Please see the attached v2 patch.
> >
> > +static char    tempfilepath[MAXPGPATH + 256];
> >
> > MAXPGPATH is the maximum length of a file name that PG assumes to be
> > able to handle. Thus extending that length seems wrong.
> 
> I think it was to accommodate the temp file name - "archtemp", file,
> MyProcPid, epoch, but I agree that it can just be MAXPGPATH. However,
> most of the places the core defines the path name to be MAXPGPATH +
> some bytes.

Oooh.  I wouldn't say "most", but some instances are found. (Almost all
usages of MAXPGPATH are not accompanied by an additional length.) But
that would be another issue.

Anyway, even if the use of over-sized path buffers is widespread in our
tree, I don't think it can be a reason for this new code to do the same
thing. If we follow that direction, the following buffer in basic_archive
should be MAXPGPATH*2 wide, since it stores a
"<directory>/<filename>" construct. (That usage is found in, e.g.,
dbsize.c, which of course I think is stupid..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: thinko in basic_archive.c

From
Kyotaro Horiguchi
Date:
Sorry, the previous mail was sent inadvertently..

At Fri, 21 Oct 2022 14:13:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> +      expectation that a value will soon be provided. Care must be taken when
> +      multiple servers are archiving to the same
> +      <varname>basic_archive.archive_library</varname> directory as they all
> +      might try to archive the same WAL file.
> 
> I don't understand what kind of care should be taken by reading this..
> 
> Anyway, the PID + millisecond-resolution timestamp works in almost
> all cases, but it's not perfect. So.. I can't come up with what to
> think about this..
> 
> Traditionally we told people that "archiving should not overwrite a
> file unconditionally. Generally it is safe only when the contents are
> identical, and it should error out otherwise.".. Ah, here it is.

basic_archive follows the suggestion if the same file exists before it
starts to write a file.  So please forget this.

>     * Sync the temporary file to disk and move it to its final destination.
>     * Note that this will overwrite any existing file, but this is only
>     * possible if someone else created the file since the stat() above.

I'm not sure why we allow this behavior.. But that also
would be another issue, if anyone cares. Thus I feel that we need not
touch this description in this patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: thinko in basic_archive.c

From
Bharath Rupireddy
Date:
On Fri, Oct 21, 2022 at 10:43 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Thanks, but we don't need to wipe out the all bytes. Just putting \0
> at the beginning of the buffer is sufficient.

Nah, that's not a clean way IMO.

> And the Memset() at the
> beginning of basic_archive_file_internal is not needed since that
> static variables are initially initialized to zeros.

Removed. MemSet() after durable_rename() would be sufficient.

> This is not necessarily needed, but it might be better we empty
> tempfilepath after unlinking the file.

I think it's not necessary, as the archiver itself is shutting down and
I don't think the server calls the shutdown callback twice. However,
if we want basic_archive_shutdown() to be more protective against
multiple calls (for any reason that we're missing), we can have a
static local variable to quickly exit if the callback has already been
called, instead of MemSet(), but that's not needed I guess.

> +      expectation that a value will soon be provided. Care must be taken when
> +      multiple servers are archiving to the same
> +      <varname>basic_archive.archive_library</varname> directory as they all
> +      might try to archive the same WAL file.
>
> I don't understand what kind of care should be taken by reading this..

It's just a notice, however I agree with you that it may be confusing.
I've removed it.

Please review the attached v4 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: thinko in basic_archive.c

From
Nathan Bossart
Date:
On Fri, Oct 21, 2022 at 09:30:16PM +0530, Bharath Rupireddy wrote:
> On Fri, Oct 21, 2022 at 10:43 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
>> Thanks, but we don't need to wipe out the all bytes. Just putting \0
>> at the beginning of the buffer is sufficient.
> 
> Nah, that's not a clean way IMO.

Why not?  This is a commonly-used technique.  I see over 80 existing uses
in PostgreSQL.  Plus, your shutdown callback only checks for the first
byte, anyway.

+    if (tempfilepath[0] == '\0')
+        return;

As noted upthread [0], I think we should be cautious to only remove the
temporary file if we know we created it.  I still feel that trying to add
this cleanup logic to basic_archive is probably more trouble than it's
worth, but if there is a safe and effective way to do so, I won't object.

[0] https://postgr.es/m/20221015211026.GA1821022%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: thinko in basic_archive.c

From
Bharath Rupireddy
Date:
On Sun, Nov 6, 2022 at 3:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Fri, Oct 21, 2022 at 09:30:16PM +0530, Bharath Rupireddy wrote:
> > On Fri, Oct 21, 2022 at 10:43 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> >> Thanks, but we don't need to wipe out the all bytes. Just putting \0
> >> at the beginning of the buffer is sufficient.
> >
> > Nah, that's not a clean way IMO.
>
> Why not?  This is a commonly-used technique.  I see over 80 existing uses
> in PostgreSQL.  Plus, your shutdown callback only checks for the first
> byte, anyway.
>
> +       if (tempfilepath[0] == '\0')
> +               return;

The tempfile name can vary in size for the simple reason that a pid
can have a varying number of digits - for instance, the tempfile name is
'foo1234' (pid being 1234) and it becomes '\'\0\'oo1234' if we just reset
the first char to '\0'; say PID wraparound occurred, and now it becomes
'bar5674' (pid being 567).

BTW, I couldn't find the 80 existing instances, can you please let me
know your search keyword?

> As noted upthread [0], I think we should be cautious to only remove the
> temporary file if we know we created it.  I still feel that trying to add
> this cleanup logic to basic_archive is probably more trouble than it's
> worth, but if there is a safe and effective way to do so, I won't object.
>
> [0] https://postgr.es/m/20221015211026.GA1821022%40nathanxps13

So, IIUC, your point here is what if the copy_file fails to create the
temp file when it already exists. With the name collision being a rare
scenario, given the pid and timestamp variables, I'm not sure if
copy_file can ever fail because the temp file already exists (with
errno EEXIST). However, if we want to be extra-cautious, checking if
temp file exists with file_exists() before calling copy_file() might
help avoid such cases. If we don't want to have extra system call (via
file_exists()) to check the temp file existence, we can think of
sending a flag to copy_file(src, dst, &is_dst_file_created) and use
is_dst_file_created in the shutdown callback. Thoughts?
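
The first option could look like this (a sketch only, using a plain stat()
rather than any existing file_exists() helper):

    struct stat st;

    /* refuse to reuse a temp name that already exists */
    if (stat(temp, &st) == 0)
        ereport(ERROR,
                (errmsg("archive temporary file \"%s\" already exists", temp)));
    else if (errno != ENOENT)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not stat file \"%s\": %m", temp)));

    copy_file(unconstify(char *, path), temp);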

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: thinko in basic_archive.c

From
Nathan Bossart
Date:
On Mon, Nov 07, 2022 at 04:53:35PM +0530, Bharath Rupireddy wrote:
> The tempfile name can vary in size for the simple reason that a pid
> can have a varying number of digits - for instance, the tempfile name is
> 'foo1234' (pid being 1234) and it becomes '\'\0\'oo1234' if we just reset
> the first char to '\0'; say PID wraparound occurred, and now it becomes
> 'bar5674' (pid being 567).

The call to snprintf() should take care of adding a terminating null byte
in the right place.

> BTW, I couldn't find the 80 existing instances, can you please let me
> know your search keyword?

grep "\[0\] = '\\\0'" src -r

> So, IIUC, your point here is what if the copy_file fails to create the
> temp file when it already exists. With the name collision being a rare
> scenario, given the pid and timestamp variables, I'm not sure if
> copy_file can ever fail because the temp file already exists (with
> errno EEXIST). However, if we want to be extra-cautious, checking if
> temp file exists with file_exists() before calling copy_file() might
> help avoid such cases. If we don't want to have extra system call (via
> file_exists()) to check the temp file existence, we can think of
> sending a flag to copy_file(src, dst, &is_dst_file_created) and use
> is_dst_file_created in the shutdown callback. Thoughts?

Presently, if copy_file() encounters a pre-existing file, it should ERROR,
which will be caught in the sigsetjmp() block in basic_archive_file().  The
shutdown callback shouldn't run in this scenario.

I think this cleanup logic should run in both the shutdown callback and the
sigsetjmp() block, but it should only take action (i.e., deleting the
leftover temporary file) if the ERROR or shutdown occurs after creating the
file in copy_file() and before renaming the temporary file to its final
destination.
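
Roughly like this (a sketch; temppath is a hypothetical static tracking
variable):

    /* after copy_file() succeeds, the temp file exists and is ours */
    copy_file(unconstify(char *, path), temp);
    strlcpy(temppath, temp, sizeof(temppath));

    (void) durable_rename(temp, destination, ERROR);
    temppath[0] = '\0';         /* renamed; nothing left to clean up */

    /* in both the sigsetjmp() block and the shutdown callback: */
    if (temppath[0] != '\0')
    {
        (void) unlink(temppath);
        temppath[0] = '\0';
    }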

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: thinko in basic_archive.c

From
Bharath Rupireddy
Date:
On Tue, Nov 8, 2022 at 3:18 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> The call to snprintf() should take care of adding a terminating null byte
> in the right place.

Ah, my bad. MemSet() is avoided in the v5 patch by setting only the first byte.

> > So, IIUC, your point here is what if the copy_file fails to create the
> > temp file when it already exists. With the name collision being a rare
> > scenario, given the pid and timestamp variables, I'm not sure if
> > copy_file can ever fail because the temp file already exists (with
> > errno EEXIST). However, if we want to be extra-cautious, checking if
> > temp file exists with file_exists() before calling copy_file() might
> > help avoid such cases. If we don't want to have extra system call (via
> > file_exists()) to check the temp file existence, we can think of
> > sending a flag to copy_file(src, dst, &is_dst_file_created) and use
> > is_dst_file_created in the shutdown callback. Thoughts?
>
> Presently, if copy_file() encounters a pre-existing file, it should ERROR,
> which will be caught in the sigsetjmp() block in basic_archive_file().  The
> shutdown callback shouldn't run in this scenario.

Determining the "file already exists" error/EEXIST case from a bunch
of other errors in copy_file() is tricky. However, I quickly hacked up
copy_file() by adding an elevel parameter; please see the attached
v5-0001.

> I think this cleanup logic should run in both the shutdown callback and the
> sigsetjmp() block, but it should only take action (i.e., deleting the
> leftover temporary file) if the ERROR or shutdown occurs after creating the
> file in copy_file() and before renaming the temporary file to its final
> destination.

Please see attached v5 patch set.

If the direction seems okay, I'll add a CF entry.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment