Re: .ready and .done files considered harmful - Mailing list pgsql-hackers

From Bossart, Nathan
Subject Re: .ready and .done files considered harmful
Msg-id D68A2BF6-3BCE-4122-9CA8-40486812AFAE@amazon.com
In response to Re: .ready and .done files considered harmful  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: .ready and .done files considered harmful  (Dipesh Pandit <dipesh.pandit@gmail.com>)
Re: .ready and .done files considered harmful  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 8/17/21, 12:11 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> On 8/17/21, 11:28 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
>> I can't actually see that there's any kind of hard synchronization
>> requirement here at all. What we're trying to do is guarantee that if
>> the timeline changes, we'll pick up the timeline history for the new
>> timeline next, and that if files are archived out of order, we'll
>> switch to archiving the oldest file that is now present rather than
>> continuing with consecutive files. But suppose we just use an
>> unsynchronized bool. The worst case is that we'll archive one extra
>> file proceeding in order before we jump to the file that we were
>> supposed to archive next. It's not evident to me that this is all that
>> bad. The same thing would have happened if the previous file had been
>> archived slightly faster than it actually was, so that we began
>> archiving the next file just before, rather than just after, the
>> notification was sent. And if it is bad, wrapping an LWLock around the
>> accesses to the flag variable, or using an atomic, does nothing to
>> stop it.
>
> I am inclined to agree.  The archiver only ever reads the flag and
> sets it to false (if we are doing a directory scan).  Others only ever
> set the flag to true.  The only case I can think of where we might
> miss the timeline switch or out-of-order .ready file is when the
> archiver sets the flag to false and then ReadDir() fails.  However,
> that seems to cause the archiver process to restart, and we always
> start with a directory scan at first.

Thinking further, I think the most important thing to ensure is that
the flag is reset before we begin the directory scan, not after it.
If the reset instead happened after the scan, a timeline history file
could be missed in the following scenario:

        1. Archiver completes directory scan.
        2. A timeline history file is created and the flag is set.
        3. Archiver resets the flag.
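
To make the ordering concern concrete, here is a small single-threaded
sketch of the two orderings.  Everything in it (ArchState,
force_dir_scan, the race hook, the function names) is a hypothetical
stand-in for illustration and is not taken from the patch or from
pgarch.c.  The hook simulates another process creating a timeline
history file and setting the flag right after the scan finishes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the archiver state; not PostgreSQL code. */
typedef struct ArchState
{
    bool force_dir_scan;  /* set by other processes, cleared by the archiver */
    int  history_files;   /* timeline history files waiting in the directory */
    int  archived;        /* history files the archiver has picked up */
} ArchState;

typedef void (*RaceHook) (ArchState *);

/* Simulates a process creating a history file and notifying the archiver. */
static void
create_history_file(ArchState *s)
{
    s->history_files++;
    s->force_dir_scan = true;
}

/* A scan only sees files that exist at the moment it runs. */
static void
directory_scan(ArchState *s)
{
    assert(s->history_files >= 0);
    s->archived += s->history_files;
    s->history_files = 0;
}

/* Reset first, then scan: a notification arriving after the scan survives. */
static void
cycle_reset_before_scan(ArchState *s, RaceHook after_scan)
{
    if (!s->force_dir_scan)
        return;
    s->force_dir_scan = false;
    directory_scan(s);
    if (after_scan)
        after_scan(s);          /* steps 1 and 2 of the scenario */
}

/* Scan first, then reset: step 3 wipes out the notification from step 2. */
static void
cycle_scan_before_reset(ArchState *s, RaceHook after_scan)
{
    if (!s->force_dir_scan)
        return;
    directory_scan(s);
    if (after_scan)
        after_scan(s);
    s->force_dir_scan = false;
}
```

With the reset done first, the flag is still set after the cycle, so
the next cycle rescans and finds the new file.  With the reset done
last, the flag ends up false while the file is still sitting in the
directory, and nothing prompts another scan.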

I don't think there's any problem with the archiver reading a stale
value for the flag.  The archiver should eventually see the updated
value and take the directory scan code path.

I'd also note that we're depending on the directory scan logic for
picking up all timeline history files and out-of-order .ready files
that may have been created each time the flag is set.  AFAICT that is
safe since we prioritize timeline history files and reset the archiver
state anytime we do a directory scan.  We'll first discover timeline
history files via directory scans, and then we'll move on to .ready
files, starting at the one with the lowest segment number.  If a new
timeline history file or out-of-order .ready file is created, the
archiver is notified, and we start over.
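
As a rough illustration of the selection order described above, the
sketch below picks the next file from a list of status file names:
any timeline history file wins over a regular segment, and otherwise
the lowest name wins.  The function names and the ".history" substring
test are assumptions made for this example, and comparing names with
strcmp() stands in for "lowest segment number" on the assumption that
segment file names are fixed-width hex.  It is not the actual
directory scan code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: treat any name containing ".history" as a
 * timeline history file. */
static bool
is_history_file(const char *name)
{
    return strstr(name, ".history") != NULL;
}

/*
 * Return the entry to archive next, or NULL if the list is empty.
 * History files take priority; ties are broken by the lowest name.
 */
static const char *
pick_next_ready(const char *const *names, size_t n)
{
    const char *best = NULL;
    bool        best_is_history = false;

    for (size_t i = 0; i < n; i++)
    {
        bool hist;

        assert(names[i] != NULL);
        hist = is_history_file(names[i]);

        if (best == NULL ||
            (hist && !best_is_history) ||
            (hist == best_is_history && strcmp(names[i], best) < 0))
        {
            best = names[i];
            best_is_history = hist;
        }
    }
    return best;
}
```

Calling this again from scratch whenever the flag is set corresponds
to the "start over" behavior: a newly created history file or an
older segment is chosen ahead of whatever would have come next in
sequence.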

Nathan

