Re: thinko in basic_archive.c - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: thinko in basic_archive.c
Date
Msg-id 20221020.102647.746666963360821887.horikyota.ntt@gmail.com
Whole thread Raw
In response to thinko in basic_archive.c  (Nathan Bossart <nathandbossart@gmail.com>)
Responses Re: thinko in basic_archive.c  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
At Wed, 19 Oct 2022 10:21:12 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> On Wed, Oct 19, 2022 at 8:58 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > An archive module could clean up them at startup or at the first call
> > but that might be dangerous (since archive directory is I think
> > thought as external resource).
> 
> The archive module must be responsible for cleaning up the temp file
> that it creates. One way to do it is in the archive module's shutdown
> callback, this covers most of the cases, but not all.

True. But I agree to Robert that such temporary files should be
cleanup-able without needing temporarily-valid knowledge (current file
name, in this case). A common strategy for this is to name those files
by names that can be identifed as garbage.

> > Honestly, I'm not sure about a reasonable scenario where simultaneous
> > archivings of a same file is acceptable, though. I feel that we should
> > not allow simultaneous archiving of the same segment by some kind of
> > interlocking. In other words, we might should serialize duplicate
> > archiving of asame file.
> 
> In a typical production environment, there's some kind of locking (for
> instance lease files) that allow/disallow file
> creation/deletion/writes/reads which guarantees that the same file
> isn't put into a directory (can be archive location) many times. And
> as you rightly said archive_directory is something external to
> postgres and we really can't deal with concurrent writers
> writing/creating the same files. Even if we somehow try to do it, it
> makes things complicated. This is true for any PGDATA directories.
> However, the archive module implementers can choose to define such a
> locking strategy.

flock() on nfs..

> > In another direction, the current code allows duplicate simultaneous
> > copying to temporary files with different names then the latest
> > renaming wins.  We reach the almost same result (on Linuxen (or
> > POSIX?))  by unlinking the existing tempfile first then create a new
> > one with the same name then continue. Even if the tempfile were left
> > alone after a crash, that file would be unlinked at the next trial of
> > archiving. But I'm not sure how this can be done on Windows..  In the
> > first place I'm not sure that the latest-unconditionally-wins policy
> > is appropriate or not, though.
> 
> We really can't just unlink the temp file because it has pid and
> timestamp in the filename and it's hard to determine the temp file
> that we created earlier.

But since power cut is a typical crash source, we need to identify
ruined temporary files and the current naming convention is incomplete
in this regard.

The worst case I can come up with regardless of feasibility is a
multi-standby physical replication set where all hosts share one
archive directory. Indeed live and dead temprary files can coexist
there.  However, I think we can identify truly rotten temp files by
inserting host name or cluster name (means cluster_name in
postgresql.conf) even in that case.  This premise that DBA names every
cluster differently, but I think DBAs that is going to configure such
a system are required to be very cautious about that kind of aspect.

> As far as the basic_archive module is concerned, we ought to keep it
> simple. I still think the simplest we can do is to use the
> basic_archive's shutdown_cb to delete (in most of the cases, but not

(Sorry, my memory was confused at the time. That callback feature
already existed.)

> all) the left-over temp file that the module is dealing with
> as-of-the-moment and add a note about the users dealing with
> concurrent writers to the basic_archive.archive_directory like the
> attached v2 patch.
> 
> > > A simple server restart while the basic_archive module is copying
> > > to/from temp file would leave the file behind, see[1]. I think we can
> > > fix this by defining shutdown_cb for the basic_archive module, like
> > > the attached patch. While this helps in most of the crashes, but not
> > > all. However, this is better than what we have right now.
> >
> > ShutdownWalRecovery() does the similar thing, but as you say this one
> > covers rather narrow cases than that since RestoreArchiveFile()
> > finally overwrites the left-alone files at the next call for that
> > file.
> 
> We're using unique temp file names in the basic_archive module so we
> can't overwrite the same upon restart.

Of course, it premised that a cluster uses the same name for a
segment. If we insert cluseter_name into the temprary name, a starting
cluster can indentify garbage files to clean up. For example if we
name them as follows.

ARCHTEMP_cluster1_pid_time_<lsn>

A starting cluster can clean up all files starts with
"archtemp_cluster1_*". (We need to choose the delimiter carefully,
though..)

> > # The patch seems forgetting to clear the tmepfilepath *after* a
> > # successful renaming.
> 
> It does so at the beginning of basic_archive_file() which is sufficient.

No. I didn't mean that, If server stops after a successfull
durable_rename but before the next call to
basic_archive_file_internal, that call back makes false comlaint since
that temprary file is actually gone.

> > And I don't see how the callback is called.
> 
> call_archive_module_shutdown_callback()->basic_archive_shutdown().

Yeah, sorry for the noise.

> Please see the attached v2 patch.

+static char    tempfilepath[MAXPGPATH + 256];

MAXPGPATH is the maximum length of a file name that PG assumes to be
able to handle. Thus extending that length seems wrong.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: GUC values - recommended way to declare the C variables?
Next
From: "shiy.fnst@fujitsu.com"
Date:
Subject: RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher