Thread: .ready and .done files considered harmful

.ready and .done files considered harmful

From
Robert Haas
Date:
I and various colleagues of mine have from time to time encountered
systems that got a bit behind on WAL archiving, because the
archive_command started failing and nobody noticed right away.
Ideally, people should have monitoring for this and put it to rights
immediately, but some people don't. If those people happen to have a
relatively small pg_wal partition, they will likely become aware of
the issue when it fills up and takes down the server, but some users
provision disk space pretty generously and therefore nothing compels
them to notice the issue until they fill it up. In at least one case,
on a system that was actually generating a reasonable amount of WAL,
this took in excess of six months.

As you might imagine, pg_wal can get fairly large in such scenarios,
but the user is generally less concerned with solving that problem
than they are with getting the system back up. It is doubtless true
that the user would prefer to shrink the disk usage down to something
more reasonable over time, but on the facts as presented, it can't
really be an urgent issue for them. What they really need is just free
up a little disk space somehow or other and then get archiving running
fast enough to keep up with future WAL generation. Regrettably, the
archiver cannot do this, not even if you set archive_command =
/bin/true, because the archiver will barely ever actually run the
archive_command. Instead, it will spend virtually all of its time
calling readdir(), because for some reason it feels a need to make a
complete scan of the archive_status directory before archiving a WAL
file, and then it has to make another scan before archiving the next
one.

Someone - and it's probably for the best that the identity of that
person remains unknown to me - came up with a clever solution to this
problem, which is now used almost as a matter of routine whenever this
comes up. You just run pg_archivecleanup on your pg_wal directory, and
then remove all the corresponding .ready files and call it a day. I
haven't scrutinized the code for pg_archivecleanup, but evidently it
avoids needing O(n^2) time for this and therefore can clean up the
whole directory in something like the amount of time the archiver
would take to deal with a single file. While this seems to be quite an
effective procedure and I have not yet heard any user complaints, it
seems disturbingly error-prone, and honestly shouldn't ever be
necessary. The issue here is only that pgarch.c acts as though after
archiving 000000010000000000000001, 000000010000000000000002, and then
000000010000000000000003, we have no idea what file we might need to
archive next. Could it, perhaps, be 000000010000000000000004? Only a
full directory scan will tell us the answer!

I have two possible ideas for addressing this; perhaps other people
will have further suggestions. A relatively non-invasive fix would be
to teach pgarch.c how to increment a WAL file name. After archiving
segment N, check using stat() whether there's an .ready file for
segment N+1. If so, do that one next. If not, then fall back to
performing a full directory scan. As far as I can see, this is just
cheap insurance. If archiving is keeping up, the extra stat() won't
matter much. If it's not, this will save more system calls than it
costs. Since during normal operation it shouldn't really be possible
for files to show up in pg_wal out of order, I don't really see a
scenario where this changes the behavior, either. If there are gaps in
the sequence at startup time, this will cope with it exactly the same
as we do now, except with a better chance of finishing before I
retire.

However, that's still pretty wasteful. Every time we have to wait for
the next file to be ready for archiving, we'll basically fall back to
repeatedly scanning the whole directory, waiting for it to show up.
And I think that we can't get around that by just using stat() to look
for the appearance of the file we expect to see, because it's possible
that we might be doing all of this on a standby which then gets
promoted, or some upstream primary gets promoted, and WAL files start
appearing on a different timeline, making our prediction of what the
next filename will be incorrect. But perhaps we could work around this
by allowing pgarch.c to access shared memory, in which case it could
examine the current timeline whenever it wants, and probably also
whatever LSNs it needs to know what's safe to archive. If we did that,
could we just get rid of the .ready and .done files altogether? Are
they just a really expensive IPC mechanism to avoid a shared memory
connection, or is there some more fundamental reason why we need them?
And is there any good reason why the archiver shouldn't be connected
to shared memory? It is certainly nice to avoid having more processes
connected to shared memory than necessary, but the current scheme is
so inefficient that I think we end up worse off.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Andres Freund
Date:
Hi,

On 2021-05-03 16:49:16 -0400, Robert Haas wrote:
> I have two possible ideas for addressing this; perhaps other people
> will have further suggestions. A relatively non-invasive fix would be
> to teach pgarch.c how to increment a WAL file name. After archiving
> segment N, check using stat() whether there's an .ready file for
> segment N+1. If so, do that one next. If not, then fall back to
> performing a full directory scan.

Hm. I wonder if it'd not be better to determine multiple files to be
archived in one readdir() pass?


> As far as I can see, this is just cheap insurance. If archiving is
> keeping up, the extra stat() won't matter much. If it's not, this will
> save more system calls than it costs. Since during normal operation it
> shouldn't really be possible for files to show up in pg_wal out of
> order, I don't really see a scenario where this changes the behavior,
> either. If there are gaps in the sequence at startup time, this will
> cope with it exactly the same as we do now, except with a better
> chance of finishing before I retire.

There's definitely gaps in practice :(. Due to the massive performance
issues with archiving there are several tools that archive multiple
files as part of one archive command invocation (and mark the additional
archived files as .done immediately).


> However, that's still pretty wasteful. Every time we have to wait for
> the next file to be ready for archiving, we'll basically fall back to
> repeatedly scanning the whole directory, waiting for it to show up.

Hm. That seems like it's only an issue because .done and .ready are in
the same directory? Otherwise the directory would be empty while we're
waiting for the next file to be ready to be archived. I hate that that's
a thing but given teh serial nature of archiving, with high per-call
overhead, I don't think it'd be ok to just break that without a
replacement :(.


> But perhaps we could work around this by allowing pgarch.c to access
> shared memory, in which case it could examine the current timeline
> whenever it wants, and probably also whatever LSNs it needs to know
> what's safe to archive.

FWIW, the shared memory stats patch implies doing that, since the
archiver reports stats.


> If we did that, could we just get rid of the .ready and .done files
> altogether? Are they just a really expensive IPC mechanism to avoid a
> shared memory connection, or is there some more fundamental reason why
> we need them?

What kind of shared memory mechanism are you thinking of? Due to
timelines and history files I don't think simple position counters would
be quite enough.

I think the aforementioned "batching" archive commands are part of the
problem :(.



> And is there any good reason why the archiver shouldn't be connected
> to shared memory? It is certainly nice to avoid having more processes
> connected to shared memory than necessary, but the current scheme is
> so inefficient that I think we end up worse off.

I think there is no fundamental for avoiding shared memory in the
archiver. I guess there's a minor robustness advantage, because the
forked shell to start the archvive command won't be attached to shared
memory. But that's only until the child exec()s to the archive command.

There is some minor performance advantage as well, not having to process
the often large and contended memory mapping for shared_buffers is
probably measurable - but swamped by the cost of needing to actually
archive the segment.


My only "concern" with doing anything around this is that I think the
whole approach of archive_command is just hopelessly broken, with even
just halfway busy servers only able to keep up archiving if they muck
around with postgres internal data during archive command execution. Add
to that how hard it is to write a robust archive command (e.g. the one
in our docs still suggests test ! -f && cp, which means that copy
failing in the middle yields an incomplete archive)...

While I don't think it's all that hard to design a replacement, it's
however likely still more work than addressing the O(n^2) issue, so ...

Greetings,

Andres Freund



Re: .ready and .done files considered harmful

From
Andrey Borodin
Date:

> 4 мая 2021 г., в 09:27, Andres Freund <andres@anarazel.de> написал(а):
>
> Hi,
>
> On 2021-05-03 16:49:16 -0400, Robert Haas wrote:
>> I have two possible ideas for addressing this; perhaps other people
>> will have further suggestions. A relatively non-invasive fix would be
>> to teach pgarch.c how to increment a WAL file name. After archiving
>> segment N, check using stat() whether there's an .ready file for
>> segment N+1. If so, do that one next. If not, then fall back to
>> performing a full directory scan.
>
> Hm. I wonder if it'd not be better to determine multiple files to be
> archived in one readdir() pass?

FWIW we use both methods [0]. WAL-G has a pipe with WAL-push candidates.
We add there some predictions, and if it does not fill upload concurrency - list archive_status contents (concurrently
tobackground uploads). 

>
>
>> As far as I can see, this is just cheap insurance. If archiving is
>> keeping up, the extra stat() won't matter much. If it's not, this will
>> save more system calls than it costs. Since during normal operation it
>> shouldn't really be possible for files to show up in pg_wal out of
>> order, I don't really see a scenario where this changes the behavior,
>> either. If there are gaps in the sequence at startup time, this will
>> cope with it exactly the same as we do now, except with a better
>> chance of finishing before I retire.
>
> There's definitely gaps in practice :(. Due to the massive performance
> issues with archiving there are several tools that archive multiple
> files as part of one archive command invocation (and mark the additional
> archived files as .done immediately).
Interestingly, we used to rename .ready->.done some years ago. But pgBackRest developers convinced me that it's not a
goodidea to mess with data dir [1]. Then pg_probackup developers convinced me that renaming .ready->.done on our own
scalesbetter and implemented this functionality for us [2]. 

>> If we did that, could we just get rid of the .ready and .done files
>> altogether? Are they just a really expensive IPC mechanism to avoid a
>> shared memory connection, or is there some more fundamental reason why
>> we need them?
>
> What kind of shared memory mechanism are you thinking of? Due to
> timelines and history files I don't think simple position counters would
> be quite enough.
>
> I think the aforementioned "batching" archive commands are part of the
> problem :(.archiv
I'd be happy if we had a table with files that need to be archived, a table with registered archivers and a function to
say"archiver number X has done its job on file Y". Archiver could listen to some archiver channel while sleeping or
somethinglike that. 

Thanks!

Best regards, Andrey Borodin.


[0] https://github.com/x4m/wal-g/blob/c8a785217fe1123197280fd24254e51492bf5a68/internal/bguploader.go#L119-L137
[1]
https://www.postgresql.org/message-id/flat/20180828200754.GI3326%40tamriel.snowman.net#0b07304710b9ce5244438b7199447ee7
[2] https://github.com/wal-g/wal-g/pull/950


Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Tue, May 4, 2021 at 12:27 AM Andres Freund <andres@anarazel.de> wrote:
> On 2021-05-03 16:49:16 -0400, Robert Haas wrote:
> > I have two possible ideas for addressing this; perhaps other people
> > will have further suggestions. A relatively non-invasive fix would be
> > to teach pgarch.c how to increment a WAL file name. After archiving
> > segment N, check using stat() whether there's an .ready file for
> > segment N+1. If so, do that one next. If not, then fall back to
> > performing a full directory scan.
>
> Hm. I wonder if it'd not be better to determine multiple files to be
> archived in one readdir() pass?

I think both methods have some merit. If we had a way to pass a range
of files to archive_command instead of just one, then your way is
distinctly better, and perhaps we should just go ahead and invent such
a thing. If not, your way doesn't entirely solve the O(n^2) problem,
since you have to choose some upper bound on the number of file names
you're willing to buffer in memory, but it may lower it enough that it
makes no practical difference. I am somewhat inclined to think that it
would be good to start with the method I'm proposing, since it is a
clear-cut improvement over what we have today and can be done with a
relatively limited amount of code change and no redesign, and then
perhaps do something more ambitious afterward.

> There's definitely gaps in practice :(. Due to the massive performance
> issues with archiving there are several tools that archive multiple
> files as part of one archive command invocation (and mark the additional
> archived files as .done immediately).

Good to know.

> > However, that's still pretty wasteful. Every time we have to wait for
> > the next file to be ready for archiving, we'll basically fall back to
> > repeatedly scanning the whole directory, waiting for it to show up.
>
> Hm. That seems like it's only an issue because .done and .ready are in
> the same directory? Otherwise the directory would be empty while we're
> waiting for the next file to be ready to be archived.

I think that's right.

> I hate that that's
> a thing but given teh serial nature of archiving, with high per-call
> overhead, I don't think it'd be ok to just break that without a
> replacement :(.

I don't know quite what you mean by this. Moving .done files to a
separate directory from .ready files could certainly be done and I
don't think it even would be that hard. It does seem like a bit of a
half measure though. If we're going to redesign this I think we ought
to be more ambitious than that.

> > But perhaps we could work around this by allowing pgarch.c to access
> > shared memory, in which case it could examine the current timeline
> > whenever it wants, and probably also whatever LSNs it needs to know
> > what's safe to archive.
>
> FWIW, the shared memory stats patch implies doing that, since the
> archiver reports stats.

Are you planning to commit that for v15? If so, will it be early in
the cycle, do you think?

> What kind of shared memory mechanism are you thinking of? Due to
> timelines and history files I don't think simple position counters would
> be quite enough.

I was thinking of simple position counters, but we could do something
more sophisticated. I don't even care if we stick with .ready/.done
for low-frequency stuff like timeline and history files. But I think
we'd be better off avoiding it for WAL files, because there are just
too many of them, and it's too hard to create a system that actually
scales. Or else we need a way for a single .ready file to cover many
WAL files in need of being archived, rather than just one.

> I think there is no fundamental for avoiding shared memory in the
> archiver. I guess there's a minor robustness advantage, because the
> forked shell to start the archvive command won't be attached to shared
> memory. But that's only until the child exec()s to the archive command.

That doesn't seem like a real issue because we're not running
user-defined code between fork() and exec().

> There is some minor performance advantage as well, not having to process
> the often large and contended memory mapping for shared_buffers is
> probably measurable - but swamped by the cost of needing to actually
> archive the segment.

Process it how?

Another option would be to have two processes. You could have one that
stayed connected to shared memory and another that JUST ran the
archive_command, and they could talk over a socket or something. But
that would add a bunch of extra complexity, so I don't want to do it
unless we actually need to do it.

> My only "concern" with doing anything around this is that I think the
> whole approach of archive_command is just hopelessly broken, with even
> just halfway busy servers only able to keep up archiving if they muck
> around with postgres internal data during archive command execution. Add
> to that how hard it is to write a robust archive command (e.g. the one
> in our docs still suggests test ! -f && cp, which means that copy
> failing in the middle yields an incomplete archive)...
>
> While I don't think it's all that hard to design a replacement, it's
> however likely still more work than addressing the O(n^2) issue, so ...

I think it is probably a good idea to fix the O(n^2) issue first, and
then as a separate step try to redefine things so that a decent
archive command doesn't have to poke around as much at internal stuff.
Part of that should probably involve having a way to pass a range of
files to archive_command instead of a single file. I was also
wondering whether we should go further and allow for the archiving to
be performed by C code running inside the backend rather than shelling
out to an external command.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Dilip Kumar
Date:
  iOn Tue, May 4, 2021 at 7:38 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, May 4, 2021 at 12:27 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2021-05-03 16:49:16 -0400, Robert Haas wrote:
> > > I have two possible ideas for addressing this; perhaps other people
> > > will have further suggestions. A relatively non-invasive fix would be
> > > to teach pgarch.c how to increment a WAL file name. After archiving
> > > segment N, check using stat() whether there's an .ready file for
> > > segment N+1. If so, do that one next. If not, then fall back to
> > > performing a full directory scan.
> >
> > Hm. I wonder if it'd not be better to determine multiple files to be
> > archived in one readdir() pass?
>
> I think both methods have some merit. If we had a way to pass a range
> of files to archive_command instead of just one, then your way is
> distinctly better, and perhaps we should just go ahead and invent such
> a thing. If not, your way doesn't entirely solve the O(n^2) problem,
> since you have to choose some upper bound on the number of file names
> you're willing to buffer in memory, but it may lower it enough that it
> makes no practical difference. I am somewhat inclined to think that it
> would be good to start with the method I'm proposing, since it is a
> clear-cut improvement over what we have today and can be done with a
> relatively limited amount of code change and no redesign, and then
> perhaps do something more ambitious afterward.

I agree that if we continue to archive one file using the archive
command then Robert's solution of checking the existence of the next
WAL segment (N+1) has an advantage.  But, currently, if you notice
pgarch_readyXlog always consider any history file as the oldest file
but that will not be true if we try to predict the next WAL segment
name.  For example, if we have archived 000000010000000000000004 then
next we will look for 000000010000000000000005 but after generating
segment 000000010000000000000005, if there is a timeline switch then
we will have the below files in the archive status
(000000010000000000000005.ready, 00000002.history file).  Now, the
existing archiver will archive 00000002.history first whereas our code
will archive 000000010000000000000005 first.  Said that I don't see
any problem with that because before archiving any segment file from
TL 2 we will definitely archive the 00000002.history file because we
will not find the 000000010000000000000006.ready and we will scan the
full directory and now we will find 00000002.history as oldest file.

>
> > > However, that's still pretty wasteful. Every time we have to wait for
> > > the next file to be ready for archiving, we'll basically fall back to
> > > repeatedly scanning the whole directory, waiting for it to show up.

Is this true? that only when we have to wait for the next file to be
ready we got for scanning?  If I read the code in
"pgarch_ArchiverCopyLoop", for every single file to achieve it is
calling "pgarch_readyXlog", wherein it scans the directory every time.
So I did not understand your point that only when it needs to wait for
the next .ready file it need to scan the full directory.  It appeared
it always scans the full directory after archiving each WAL segment.
What am I missing?

> > Hm. That seems like it's only an issue because .done and .ready are in
> > the same directory? Otherwise the directory would be empty while we're
> > waiting for the next file to be ready to be archived.
>
> I think that's right.

If we agree with your above point that it only needs to scan the full
directory when it has to wait for the next file to be ready then
making a separate directory for .done file can improve a lot because
the directory will be empty so scanning will not be very costly.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Tue, May 4, 2021 at 11:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I agree that if we continue to archive one file using the archive
> command then Robert's solution of checking the existence of the next
> WAL segment (N+1) has an advantage.  But, currently, if you notice
> pgarch_readyXlog always consider any history file as the oldest file
> but that will not be true if we try to predict the next WAL segment
> name.  For example, if we have archived 000000010000000000000004 then
> next we will look for 000000010000000000000005 but after generating
> segment 000000010000000000000005, if there is a timeline switch then
> we will have the below files in the archive status
> (000000010000000000000005.ready, 00000002.history file).  Now, the
> existing archiver will archive 00000002.history first whereas our code
> will archive 000000010000000000000005 first.  Said that I don't see
> any problem with that because before archiving any segment file from
> TL 2 we will definitely archive the 00000002.history file because we
> will not find the 000000010000000000000006.ready and we will scan the
> full directory and now we will find 00000002.history as oldest file.

OK, that makes sense and is good to know.

> > > > However, that's still pretty wasteful. Every time we have to wait for
> > > > the next file to be ready for archiving, we'll basically fall back to
> > > > repeatedly scanning the whole directory, waiting for it to show up.
>
> Is this true? that only when we have to wait for the next file to be
> ready we got for scanning?  If I read the code in
> "pgarch_ArchiverCopyLoop", for every single file to achieve it is
> calling "pgarch_readyXlog", wherein it scans the directory every time.
> So I did not understand your point that only when it needs to wait for
> the next .ready file it need to scan the full directory.  It appeared
> it always scans the full directory after archiving each WAL segment.
> What am I missing?

It's not true now, but my proposal would make it true.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Dilip Kumar
Date:
On Tue, May 4, 2021 at 10:12 PM Robert Haas <robertmhaas@gmail.com> wrote:

> > Is this true? that only when we have to wait for the next file to be
> > ready we got for scanning?  If I read the code in
> > "pgarch_ArchiverCopyLoop", for every single file to achieve it is
> > calling "pgarch_readyXlog", wherein it scans the directory every time.
> > So I did not understand your point that only when it needs to wait for
> > the next .ready file it need to scan the full directory.  It appeared
> > it always scans the full directory after archiving each WAL segment.
> > What am I missing?
>
> It's not true now, but my proposal would make it true.

Okay, got it.  Thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, May 4, 2021 at 11:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I agree that if we continue to archive one file using the archive
> > command then Robert's solution of checking the existence of the next
> > WAL segment (N+1) has an advantage.  But, currently, if you notice
> > pgarch_readyXlog always consider any history file as the oldest file
> > but that will not be true if we try to predict the next WAL segment
> > name.  For example, if we have archived 000000010000000000000004 then
> > next we will look for 000000010000000000000005 but after generating
> > segment 000000010000000000000005, if there is a timeline switch then
> > we will have the below files in the archive status
> > (000000010000000000000005.ready, 00000002.history file).  Now, the
> > existing archiver will archive 00000002.history first whereas our code
> > will archive 000000010000000000000005 first.  Said that I don't see
> > any problem with that because before archiving any segment file from
> > TL 2 we will definitely archive the 00000002.history file because we
> > will not find the 000000010000000000000006.ready and we will scan the
> > full directory and now we will find 00000002.history as oldest file.
>
> OK, that makes sense and is good to know.

I expect David will chime in on this thread too, but I did want to point
out that when it coming to archiving history files you'd *really* like
that to be done just about as quickly as absolutely possible, to avoid
the case that we saw before that code was added, to wit: two promotions
done too quickly that ended up with conflicting history and possibly
conflicting WAL files trying to be archived, and ensuing madness.

It's not just about making sure that we archive the history file for a
timeline before archiving WAL segments along that timeline but also
about making sure we get that history file into the archive as fast as
we can, and archiving a 16MB WAL first would certainly delay that.

Thanks,

Stephen

Attachment

Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Wed, May 5, 2021 at 1:06 PM Stephen Frost <sfrost@snowman.net> wrote:
> It's not just about making sure that we archive the history file for a
> timeline before archiving WAL segments along that timeline but also
> about making sure we get that history file into the archive as fast as
> we can, and archiving a 16MB WAL first would certainly delay that.

Ooph. That's a rather tough constraint. Could we get around it by
introducing some kind of signalling mechanism, perhaps? Like if
there's a new history file, that must mean the server has switched
timelines -- I think, anyway -- so if we notified the archiver every
time there was a timeline switch it could react accordingly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, May 5, 2021 at 1:06 PM Stephen Frost <sfrost@snowman.net> wrote:
> > It's not just about making sure that we archive the history file for a
> > timeline before archiving WAL segments along that timeline but also
> > about making sure we get that history file into the archive as fast as
> > we can, and archiving a 16MB WAL first would certainly delay that.
>
> Ooph. That's a rather tough constraint. Could we get around it by
> introducing some kind of signalling mechanism, perhaps? Like if
> there's a new history file, that must mean the server has switched
> timelines -- I think, anyway -- so if we notified the archiver every
> time there was a timeline switch it could react accordingly.

I would think something like that would be alright and not worse than
what we've got now.

That said, in an ideal world, we'd have a way to get the new timeline to
switch to in a way that doesn't leave open race conditions, so as long
we're talking about big changes to the way archiving and archive_command
work (or about throwing out the horrible idea that is archive_command in
the first place and replacing it with appropriate hooks such that
someone could install an extension which would handle archiving...), I
would hope we'd have a way of saying "please, atomically, go get me a new
timeline."

Just as a reminder for those following along at home, as I'm sure you're
already aware, the way we figure out what timeline to switch to when a
replica is getting promoted is that we go run the restore command asking
for history files until we get back "nope, there is no file named
0000123.history", and then we switch to that timeline and then try to
push such a history file into the repo and hope that it works.

Thanks,

Stephen

Attachment

Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Wed, May 5, 2021 at 4:13 PM Stephen Frost <sfrost@snowman.net> wrote:
> I would think something like that would be alright and not worse than
> what we've got now.

OK.

> That said, in an ideal world, we'd have a way to get the new timeline to
> switch to in a way that doesn't leave open race conditions, so as long
> we're talking about big changes to the way archiving and archive_command
> work (or about throwing out the horrible idea that is archive_command in
> the first place and replacing it with appropriate hooks such that
> someone could install an extension which would handle archiving...), I
> would hope we'd have a way of saying "please, atomically, go get me a new
> timeline."
>
> Just as a reminder for those following along at home, as I'm sure you're
> already aware, the way we figure out what timeline to switch to when a
> replica is getting promoted is that we go run the restore command asking
> for history files until we get back "nope, there is no file named
> 0000123.history", and then we switch to that timeline and then try to
> push such a history file into the repo and hope that it works.

Huh, I had not thought about that problem. So, at the risk of getting
sidetracked, what exactly are you asking for here? Let the extension
pick the timeline using an algorithm of its own devising, rather than
having core do it? Or what?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Andres Freund
Date:
Hi,

On 2021-05-05 16:13:08 -0400, Stephen Frost wrote:
> Just as a reminder for those following along at home, as I'm sure you're
> already aware, the way we figure out what timeline to switch to when a
> replica is getting promoted is that we go run the restore command asking
> for history files until we get back "nope, there is no file named
> 0000123.history", and then we switch to that timeline and then try to
> push such a history file into the repo and hope that it works.

Which is why the whole concept of timelines as we have them right now is
pretty much useless. It is fundamentally impossible to guarantee unique
timeline ids in all cases if they are assigned sequentially at timeline
creation - consider needing to promote a node on both ends of a split
network.  I'm quite doubtful that pretending to tackle this problem via
archiving order is a good idea, given the fundamentally racy nature.

Greetings,

Andres Freund



Re: .ready and .done files considered harmful

From
Andres Freund
Date:
Hi,

On 2021-05-05 16:22:21 -0400, Robert Haas wrote:
> Huh, I had not thought about that problem. So, at the risk of getting
> sidetracked, what exactly are you asking for here? Let the extension
> pick the timeline using an algorithm of its own devising, rather than
> having core do it? Or what?

Not Stephen, but to me the most reasonable way to address this is to
make timeline identifier wider and randomly allocated. The sequential
looking natures of timelines imo is actively unhelpful.

Greetings,

Andres Freund



Re: .ready and .done files considered harmful

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, May 5, 2021 at 4:13 PM Stephen Frost <sfrost@snowman.net> wrote:
> > That said, in an ideal world, we'd have a way to get the new timeline to
> > switch to in a way that doesn't leave open race conditions, so as long
> > we're talking about big changes to the way archiving and archive_command
> > work (or about throwing out the horrible idea that is archive_command in
> > the first place and replacing it with appropriate hooks such that
> > someone could install an extension which would handle archiving...), I
> > would hope we'd have a way of saying "please, atomically, go get me a new
> > timeline."
> >
> > Just as a reminder for those following along at home, as I'm sure you're
> > already aware, the way we figure out what timeline to switch to when a
> > replica is getting promoted is that we go run the restore command asking
> > for history files until we get back "nope, there is no file named
> > 0000123.history", and then we switch to that timeline and then try to
> > push such a history file into the repo and hope that it works.
>
> Huh, I had not thought about that problem. So, at the risk of getting
> sidetracked, what exactly are you asking for here? Let the extension
> pick the timeline using an algorithm of its own devising, rather than
> having core do it? Or what?

Having the extension do it somehow is an interesting idea and one which
might be kind of cool.

The first thought I had was to make it archive_command's job to "pick"
the timeline by just re-trying to push the .history file (the actual
contents of it don't change, as the information in the file is about the
timeline we are switching *from* and at what LSN).  That requires an
archive command which will fail if that file already exists though and,
ideally, would perform the file archival in an atomic fashion (though
this last bit isn't stricly necessary- anything along these lines would
certainly be better than the current state).

Having an entirely independent command/hook that's explicitly for this
case would be another approach, of course, either in a manner that
allows the extension to pick the destination timeline or is defined to
be "return success only if the file is successfully archived, but do
*not* overwrite any existing file of the same name and return an error
instead." and then the same approach as outlined above.

Thanks,

Stephen

Attachment

Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Wed, May 5, 2021 at 4:31 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-05-05 16:22:21 -0400, Robert Haas wrote:
> > Huh, I had not thought about that problem. So, at the risk of getting
> > sidetracked, what exactly are you asking for here? Let the extension
> > pick the timeline using an algorithm of its own devising, rather than
> > having core do it? Or what?
>
> Not Stephen, but to me the most reasonable way to address this is to
> make timeline identifier wider and randomly allocated. The sequential
> looking natures of timelines imo is actively unhelpful.

Yeah, I always wondered why we didn't assign them randomly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, May 5, 2021 at 4:31 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2021-05-05 16:22:21 -0400, Robert Haas wrote:
> > > Huh, I had not thought about that problem. So, at the risk of getting
> > > sidetracked, what exactly are you asking for here? Let the extension
> > > pick the timeline using an algorithm of its own devising, rather than
> > > having core do it? Or what?
> >
> > Not Stephen, but to me the most reasonable way to address this is to
> > make timeline identifier wider and randomly allocated. The sequential
> > looking natures of timelines imo is actively unhelpful.
>
> Yeah, I always wondered why we didn't assign them randomly.

Based on what we do today regarding the info we put into .history files,
trying to figure out which is the "latest" timeline might be a bit
tricky with randomly selected timelines.  Maybe we could find a way to
solve that though.

I do note that this comment is timeline.c is, ahem, perhaps over-stating
things a bit:

 * Note: while this is somewhat heuristic, it does positively guarantee
 * that (result + 1) is not a known timeline, and therefore it should
 * be safe to assign that ID to a new timeline.

Thanks,

Stephen

Attachment

Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Wed, May 5, 2021 at 4:53 PM Stephen Frost <sfrost@snowman.net> wrote:
> I do note that this comment is timeline.c is, ahem, perhaps over-stating
> things a bit:
>
>  * Note: while this is somewhat heuristic, it does positively guarantee
>  * that (result + 1) is not a known timeline, and therefore it should
>  * be safe to assign that ID to a new timeline.

OK, that made me laugh out loud.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Kyotaro Horiguchi
Date:
At Tue, 4 May 2021 10:07:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Tue, May 4, 2021 at 12:27 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2021-05-03 16:49:16 -0400, Robert Haas wrote:
> > > But perhaps we could work around this by allowing pgarch.c to access
> > > shared memory, in which case it could examine the current timeline
> > > whenever it wants, and probably also whatever LSNs it needs to know
> > > what's safe to archive.
> >
> > FWIW, the shared memory stats patch implies doing that, since the
> > archiver reports stats.
> 
> Are you planning to commit that for v15? If so, will it be early in
> the cycle, do you think?

FWIW It's already done for v14 individually.

Author: Fujii Masao <fujii@postgresql.org>
Date:   Mon Mar 15 13:13:14 2021 +0900

    Make archiver process an auxiliary process.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Thu, May 6, 2021 at 3:23 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> FWIW It's already done for v14 individually.
>
> Author: Fujii Masao <fujii@postgresql.org>
> Date:   Mon Mar 15 13:13:14 2021 +0900
>
>     Make archiver process an auxiliary process.

Oh, I hadn't noticed. Thanks.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Hannu Krosing
Date:
How are you envisioning the shared-memory signaling should work in the
original sample case, where the archiver had been failing for half a
year ?

Or should we perhaps have a system table for ready-to-archive WAL
files to get around limitation sof file system to return just the
needed files with ORDER BY ... LIMIT as we already know how to make
lookups in database fast ?

Cheers
Hannu


On Thu, May 6, 2021 at 12:24 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, May 6, 2021 at 3:23 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > FWIW It's already done for v14 individually.
> >
> > Author: Fujii Masao <fujii@postgresql.org>
> > Date:   Mon Mar 15 13:13:14 2021 +0900
> >
> >     Make archiver process an auxiliary process.
>
> Oh, I hadn't noticed. Thanks.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>
>



Re: .ready and .done files considered harmful

From
Andres Freund
Date:
Hi,

On 2021-05-06 21:23:36 +0200, Hannu Krosing wrote:
> How are you envisioning the shared-memory signaling should work in the
> original sample case, where the archiver had been failing for half a
> year ?

If we leave history files and gaps in the .ready sequence aside for a
second, we really only need an LSN or segment number describing the
current "archive position". Then we can iterate over the segments
between the "archive position" and the flush position (which we already
know). Even if we needed to keep statting .ready/.done files (to handle
gaps due to archive command mucking around with .ready/done), it'd still
be a lot cheaper than what we do today.  It probably would even still be
cheaper if we just statted all potentially relevant timeline history
files all the time to send them first.


> Or should we perhaps have a system table for ready-to-archive WAL
> files to get around limitation sof file system to return just the
> needed files with ORDER BY ... LIMIT as we already know how to make
> lookups in database fast ?

Archiving needs to work on a standby so that doesn't seem like an
option.

Regards,

Andres Freund



Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Hi,

We have addressed the O(n^2) problem which involves directory scan for
archiving individual WAL files by maintaining a WAL counter to identify
the next WAL file in a sequence.

WAL archiver scans the status directory to identify the next WAL file
which needs to be archived. This directory scan can be minimized by
maintaining the log segment number of the current file which is being archived
and incrementing it by '1' to get the next WAL file in a sequence. Archiver
can check the availability of the next file in status directory and in case if the
file is not available then it should fall-back to directory scan to get the oldest
WAL file.

Please find attached patch v1.

Thanks,
Dipesh

On Fri, May 7, 2021 at 1:31 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2021-05-06 21:23:36 +0200, Hannu Krosing wrote:
> How are you envisioning the shared-memory signaling should work in the
> original sample case, where the archiver had been failing for half a
> year ?

If we leave history files and gaps in the .ready sequence aside for a
second, we really only need an LSN or segment number describing the
current "archive position". Then we can iterate over the segments
between the "archive position" and the flush position (which we already
know). Even if we needed to keep statting .ready/.done files (to handle
gaps due to archive command mucking around with .ready/done), it'd still
be a lot cheaper than what we do today.  It probably would even still be
cheaper if we just statted all potentially relevant timeline history
files all the time to send them first.


> Or should we perhaps have a system table for ready-to-archive WAL
> files to get around limitation sof file system to return just the
> needed files with ORDER BY ... LIMIT as we already know how to make
> lookups in database fast ?

Archiving needs to work on a standby so that doesn't seem like an
option.

Regards,

Andres Freund


Attachment

Re: .ready and .done files considered harmful

From
Dilip Kumar
Date:
On Tue, Jul 6, 2021 at 11:36 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
>
> Hi,
>
> We have addressed the O(n^2) problem which involves directory scan for
> archiving individual WAL files by maintaining a WAL counter to identify
> the next WAL file in a sequence.
>
> WAL archiver scans the status directory to identify the next WAL file
> which needs to be archived. This directory scan can be minimized by
> maintaining the log segment number of the current file which is being archived
> and incrementing it by '1' to get the next WAL file in a sequence. Archiver
> can check the availability of the next file in status directory and in case if the
> file is not available then it should fall-back to directory scan to get the oldest
> WAL file.
>
> Please find attached patch v1.
>

I have a few suggestions on the patch
1.
+
+ /*
+ * Found the oldest WAL, reset timeline ID and log segment number to generate
+ * the next WAL file in the sequence.
+ */
+ if (found && !historyFound)
+ {
+ XLogFromFileName(xlog, &curFileTLI, &nextLogSegNo, wal_segment_size);
+ ereport(LOG,
+ (errmsg("directory scan to archive write-ahead log file \"%s\"",
+ xlog)));
+ }

If a history file is found we are not updating curFileTLI and
nextLogSegNo, so it will attempt the previously found segment.  This
is fine because it will not find that segment and it will rescan the
directory.  But I think we can do better, instead of searching the
same old segment in the previous timeline we can search that old
segment in the new TL so that if the TL switch happened within the
segment then we will find the segment and we will avoid the directory
search.


 /*
+ * Log segment number and timeline ID to get next WAL file in a sequence.
+ */
+static XLogSegNo nextLogSegNo = 0;
+static TimeLineID curFileTLI = 0;
+

So everytime archiver will start with searching segno=0 in timeline=0.
Instead of doing this can't we first scan the directory and once we
get the first segment to archive then only we can start predicting the
next wal segment?  I think there is nothing wrong even if we try to
look for seg 0 in timeline 0, everytime we start the archivar but that
will be true only once in the history of the cluster so why not skip
this until we scan the directory once?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Stephen Frost
Date:
Greetings,

* Dipesh Pandit (dipesh.pandit@gmail.com) wrote:
> We have addressed the O(n^2) problem which involves directory scan for
> archiving individual WAL files by maintaining a WAL counter to identify
> the next WAL file in a sequence.

This seems to have missed the concerns raised in
https://postgr.es/m/20210505170601.GF20766@tamriel.snowman.net ..?

And also the comments immediately above the ones being added here:

> @@ -596,29 +606,55 @@ pgarch_archiveXlog(char *xlog)
>   * larger ID; the net result being that past timelines are given higher
>   * priority for archiving.  This seems okay, or at least not obviously worth
>   * changing.
> + *
> + * WAL files are generated in a specific order of log segment number. The
> + * directory scan for each WAL file can be minimized by identifying the next
> + * WAL file in the sequence. This can be achieved by maintaining log segment
> + * number and timeline ID corresponding to WAL file currently being archived.
> + * The log segment number of current WAL file can be incremented by '1' upon
> + * successful archival to point to the next WAL file.

specifically about history files being given higher priority for
archiving.  If we go with this change then we'd at least want to rewrite
or remove those comments, but I don't actually agree that we should
remove that preference to archive history files ahead of WAL, for the
reasons brought up previously.

As was suggested on that subthread, it seems like it should be possible
to just track the current timeline and adjust what we're doing if the
timeline changes, and we should even know what the .history file is at
that point and likely don't even need to scan the directory for it, as
it'll be the old timeline ID.

Thanks,

Stephen

Attachment

Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
> specifically about history files being given higher priority for
> archiving.  If we go with this change then we'd at least want to rewrite
> or remove those comments, but I don't actually agree that we should
> remove that preference to archive history files ahead of WAL, for the
> reasons brought up previously.

> As was suggested on that subthread, it seems like it should be possible
> to just track the current timeline and adjust what we're doing if the
> timeline changes, and we should even know what the .history file is at
> that point and likely don't even need to scan the directory for it, as
> it'll be the old timeline ID.

I agree, I missed this part. The .history file should be given higher preference.
I will take care of it in the next patch.

Thanks,
Dipesh

Re: .ready and .done files considered harmful

From
Jeevan Ladhe
Date:
 
I have a few suggestions on the patch
1.
+
+ /*
+ * Found the oldest WAL, reset timeline ID and log segment number to generate
+ * the next WAL file in the sequence.
+ */
+ if (found && !historyFound)
+ {
+ XLogFromFileName(xlog, &curFileTLI, &nextLogSegNo, wal_segment_size);
+ ereport(LOG,
+ (errmsg("directory scan to archive write-ahead log file \"%s\"",
+ xlog)));
+ }

If a history file is found we are not updating curFileTLI and
nextLogSegNo, so it will attempt the previously found segment.  This
is fine because it will not find that segment and it will rescan the
directory.  But I think we can do better, instead of searching the
same old segment in the previous timeline we can search that old
segment in the new TL so that if the TL switch happened within the
segment then we will find the segment and we will avoid the directory
search.


 /*
+ * Log segment number and timeline ID to get next WAL file in a sequence.
+ */
+static XLogSegNo nextLogSegNo = 0;
+static TimeLineID curFileTLI = 0;
+

So everytime archiver will start with searching segno=0 in timeline=0.
Instead of doing this can't we first scan the directory and once we
get the first segment to archive then only we can start predicting the
next wal segment?  I think there is nothing wrong even if we try to
look for seg 0 in timeline 0, everytime we start the archivar but that
will be true only once in the history of the cluster so why not skip
this until we scan the directory once?

+1, I like Dilip's ideas here to optimize further.

Also, one minor comment:

+   /*
+    * Log segment number already points to the next file in the sequence            
+    * (as part of successful archival of the previous file). Generate the path      
+    * for status file.                                                              
+    */

This comment is a bit confusing with the name of the variable nextLogSegNo.
I think the name of the variable is appropriate here, but maybe we can reword
the comment something like:

+       /*
+        * We already have the next anticipated log segment number and the
+        * timeline, check if this WAL file is ready to be archived. If yes, skip
+        * the directory scan.
+        */

Regards,
Jeevan Ladhe

Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Hi,

> I agree, I missed this part. The .history file should be given higher preference.
> I will take care of it in the next patch.

Archiver does not have access to shared memory and the current timeline ID
is not available at archiver. In order to keep track of timeline switch we have
to push a notification from backend to archiver.  Backend can send a signal
to notify archiver about the timeline change. Archiver can register this
notification and perform a full directory scan to make sure that archiving
history files take precedence over archiving WAL files.

> If a history file is found we are not updating curFileTLI and
> nextLogSegNo, so it will attempt the previously found segment.  This
> is fine because it will not find that segment and it will rescan the
> directory.  But I think we can do better, instead of searching the
> same old segment in the previous timeline we can search that old
> segment in the new TL so that if the TL switch happened within the
> segment then we will find the segment and we will avoid the directory
> search.

This could have been done with the approach mentioned in patch v1 but now
considering archiving history file takes precedence over WAL files we cannot
update the "curFileTLI" whenever a history file is found.

> So everytime archiver will start with searching segno=0 in timeline=0.
> Instead of doing this can't we first scan the directory and once we
> get the first segment to archive then only we can start predicting the
> next wal segment?

Done.

> This comment is a bit confusing with the name of the variable nextLogSegNo.
> I think the name of the variable is appropriate here, but maybe we can reword
> the comment something like:

Done.

I have incorporated these changes and updated a new patch. PFA, patch v2.

Thanks,
Dipesh
Attachment

Re: .ready and .done files considered harmful

From
Dilip Kumar
Date:
On Mon, Jul 19, 2021 at 5:43 PM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
>
> Hi,
>
> > I agree, I missed this part. The .history file should be given higher preference.
> > I will take care of it in the next patch.
>
> Archiver does not have access to shared memory and the current timeline ID
> is not available at archiver. In order to keep track of timeline switch we have
> to push a notification from backend to archiver.  Backend can send a signal
> to notify archiver about the timeline change. Archiver can register this
> notification and perform a full directory scan to make sure that archiving
> history files take precedence over archiving WAL files.

Yeah, that makes sense, some comments on v2.

1.
+pgarch_timeline_switch(SIGNAL_ARGS)
+{
+    int            save_errno = errno;
+
+    /* Set the flag to register a timeline switch */
+    timeline_switch = true;
+    SetLatch(MyLatch);
+

On the timeline switch, setting a flag should be enough, I don't think
that we need to wake up the archiver.  Because it will just waste the
scan cycle.  We have set the flag and that should be enough and let
the XLogArchiveNotify() wake this up when something is ready to be
archived and that time we will scan the directory first based on the
flag.


2.
+     */
+    if (XLogArchivingActive() && ArchiveRecoveryRequested)
+        XLogArchiveNotifyTLISwitch();
+
+
.....

 /*
+ * Signal archiver to notify timeline switch
+ */
+void
+XLogArchiveNotifyTLISwitch(void)
+{
+    if (IsUnderPostmaster)
+        PgArchNotifyTLISwitch();
+}

Why do we need multi level interfaces? I mean instead of calling first
XLogArchiveNotifyTLISwitch and then calling PgArchNotifyTLISwitch,
can't we directly call PgArchNotifyTLISwitch()?

3.
+        if (timeline_switch)
+        {
+            /* Perform a full directory scan in next cycle */
+            dirScan = true;
+            timeline_switch = false;
+        }

I suggest you can add some comments atop this check.

4.
+PgArchNotifyTLISwitch(void)
+{
+    int            arch_pgprocno = PgArch->pgprocno;
+
+    if (arch_pgprocno != INVALID_PGPROCNO)
+    {
+        int        archiver_pid = ProcGlobal->allProcs[arch_pgprocno].pid;
+
+        if (kill(archiver_pid, SIGINT) < 0)
+            elog(ERROR, "could not notify timeline change to archiver");


I think you should use %m in the error message so that it also prints
the OS error code.

5.
+/* Flag to specify a full directory scan to find next log file */
+static bool dirScan = true;

Why is this a global variable?  I mean whenever you enter the function
pgarch_ArchiverCopyLoop(), this can be set to true and after that you
can pass this as inout parameter to pgarch_readyXlog() there in it can
be conditionally set to false once we get some segment and whenever
the timeline switch we can set it back to the true.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Tue, Jul 6, 2021 at 9:34 AM Stephen Frost <sfrost@snowman.net> wrote:
> As was suggested on that subthread, it seems like it should be possible
> to just track the current timeline and adjust what we're doing if the
> timeline changes, and we should even know what the .history file is at
> that point and likely don't even need to scan the directory for it, as
> it'll be the old timeline ID.

I'm a little concerned that this might turn out to be more complicated
than it's worth. It's not a case that should happen often, and if you
handle it then you have to be careful to handle cases like two
timeline switches in very rapid succession, which seems like it could
be tricky.

Maybe it's fine, though. I'm not really sure.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Hi,

> some comments on v2.
Thanks for your comments. I have incorporated the changes
and updated a new patch. Please find the details below.

> On the timeline switch, setting a flag should be enough, I don't think
> that we need to wake up the archiver.  Because it will just waste the
> scan cycle.
Yes, I modified it.

> Why do we need multi level interfaces? I mean instead of calling first
> XLogArchiveNotifyTLISwitch and then calling PgArchNotifyTLISwitch,
> can't we directly call PgArchNotifyTLISwitch()?
Yes, multilevel interfaces are not required. Removed extra interface.

> +        if (timeline_switch)
> +        {
> +            /* Perform a full directory scan in next cycle */
> +            dirScan = true;
> +            timeline_switch = false;
> +        }

> I suggest you can add some comments atop this check.
Added comment to specify the action required in case of a
timeline switch.

> I think you should use %m in the error message so that it also prints
> the OS error code.
Done.

> Why is this a global variable?  I mean whenever you enter the function
> pgarch_ArchiverCopyLoop(), this can be set to true and after that you
> can pass this as inout parameter to pgarch_readyXlog() there in it can
> be conditionally set to false once we get some segment and whenever
> the timeline switch we can set it back to the true.
Yes, It is not necessary to have global scope for "dirScan". Changed
the scope to local for "dirScan" and "nextLogSegNo".

PFA patch v3.

Thanks,
Dipesh
Attachment

Re: .ready and .done files considered harmful

From
Jeevan Ladhe
Date:
Thanks, Dipesh. The patch LGTM.

Some minor suggestions:

+ *

+ * "nextLogSegNo" identifies the next log file to be archived in a log

+ * sequence and the flag "dirScan" specifies a full directory scan to find

+ * the next log file.


IMHO, this comment should go atop of pgarch_readyXlog() as a description

of its parameters, and not in pgarch_ArchiverCopyLoop().


 /*

+ * Interrupt handler for archiver

+ *

+ * There is a timeline switch and we have been notified by backend.

+ */


Instead, I would suggest having something like this:


+/*

+ * Interrupt handler for handling the timeline switch.

+ *

+ * A timeline switch has been notified, mark this event so that the next iteration

+ * of pgarch_ArchiverCopyLoop() archives the history file, and we set the

+ * timeline to the new one for the next anticipated log segment.

+ */


Regards,

Jeevan Ladhe


On Thu, Jul 22, 2021 at 12:46 PM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
Hi,

> some comments on v2.
Thanks for your comments. I have incorporated the changes
and updated a new patch. Please find the details below.

> On the timeline switch, setting a flag should be enough, I don't think
> that we need to wake up the archiver.  Because it will just waste the
> scan cycle.
Yes, I modified it.

> Why do we need multi level interfaces? I mean instead of calling first
> XLogArchiveNotifyTLISwitch and then calling PgArchNotifyTLISwitch,
> can't we directly call PgArchNotifyTLISwitch()?
Yes, multilevel interfaces are not required. Removed extra interface.

> +        if (timeline_switch)
> +        {
> +            /* Perform a full directory scan in next cycle */
> +            dirScan = true;
> +            timeline_switch = false;
> +        }

> I suggest you can add some comments atop this check.
Added comment to specify the action required in case of a
timeline switch.

> I think you should use %m in the error message so that it also prints
> the OS error code.
Done.

> Why is this a global variable?  I mean whenever you enter the function
> pgarch_ArchiverCopyLoop(), this can be set to true and after that you
> can pass this as inout parameter to pgarch_readyXlog() there in it can
> be conditionally set to false once we get some segment and whenever
> the timeline switch we can set it back to the true.
Yes, It is not necessary to have global scope for "dirScan". Changed
the scope to local for "dirScan" and "nextLogSegNo".

PFA patch v3.

Thanks,
Dipesh

Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 5/6/21, 1:01 PM, "Andres Freund" <andres@anarazel.de> wrote:
> If we leave history files and gaps in the .ready sequence aside for a
> second, we really only need an LSN or segment number describing the
> current "archive position". Then we can iterate over the segments
> between the "archive position" and the flush position (which we already
> know). Even if we needed to keep statting .ready/.done files (to handle
> gaps due to archive command mucking around with .ready/done), it'd still
> be a lot cheaper than what we do today.  It probably would even still be
> cheaper if we just statted all potentially relevant timeline history
> files all the time to send them first.

My apologies for chiming in so late to this thread, but a similar idea
crossed my mind while working on a bug where .ready files get created
too early [0].  Specifically, instead of maintaining a status file per
WAL segment, I was thinking we could narrow it down to a couple of
files to keep track of the boundaries we care about:

    1. earliest_done: the oldest segment that has been archived and
       can be recycled/removed
    2. latest_done: the newest segment that has been archived
    3. latest_ready: the newest segment that is ready for archival

This might complicate matters for backup utilities that currently
modify the .ready/.done files, but it would simplify this archive
status stuff quite a bit and eliminate the need to worry about the
directory scans in the first place.

Nathan

[0] https://www.postgresql.org/message-id/flat/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com


Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Fri, Jul 23, 2021 at 5:46 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> My apologies for chiming in so late to this thread, but a similar idea
> crossed my mind while working on a bug where .ready files get created
> too early [0].  Specifically, instead of maintaining a status file per
> WAL segment, I was thinking we could narrow it down to a couple of
> files to keep track of the boundaries we care about:
>
>     1. earliest_done: the oldest segment that has been archived and
>        can be recycled/removed
>     2. latest_done: the newest segment that has been archived
>     3. latest_ready: the newest segment that is ready for archival
>
> This might complicate matters for backup utilities that currently
> modify the .ready/.done files, but it would simplify this archive
> status stuff quite a bit and eliminate the need to worry about the
> directory scans in the first place.

In terms of immediate next steps, I think we should focus on
eliminating the O(n^2) problem and not get sucked into a bigger
redesign. The patch on the table aims to do just that much and I think
that's a good thing.

But in the longer term I agree that we want to redesign the signalling
somehow. I am not convinced that using a file is the right way to go.
If we had to rewrite that file for every change, and especially if we
had to fsync it, it would be almost as bad as what we're doing right
now in terms of the amount of traffic to the filesystem. Atomicity is
a problem too, because if we simply create a file then after a crash
it will either exist or not, but a file might end up garbled with a
mix of old and new contents unless we always write a temporary file
and automatically rename that over the existing one. As I said in my
original post, I'm kind of wondering about keeping the information in
shared memory instead of using the filesystem. I think we would still
need to persist it to disk at least occasionally but perhaps there is
a way to avoid having to do that as frequently as what we do now. I
haven't thought too deeply about what the requirements are here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 7/26/21, 6:31 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> In terms of immediate next steps, I think we should focus on
> eliminating the O(n^2) problem and not get sucked into a bigger
> redesign. The patch on the table aims to do just that much and I think
> that's a good thing.

I agree.  I'll leave further discussion about a redesign for another
thread.

Nathan


Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
> Some minor suggestions:
Thanks for your comments. I have incorporated the changes
and updated a new patch. Please find the attached patch v4.

Thanks,
Dipesh

On Mon, Jul 26, 2021 at 9:44 PM Bossart, Nathan <bossartn@amazon.com> wrote:
On 7/26/21, 6:31 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> In terms of immediate next steps, I think we should focus on
> eliminating the O(n^2) problem and not get sucked into a bigger
> redesign. The patch on the table aims to do just that much and I think
> that's a good thing.

I agree.  I'll leave further discussion about a redesign for another
thread.

Nathan

Attachment

Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Tue, Jul 27, 2021 at 3:43 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
> and updated a new patch. Please find the attached patch v4.

Some review:

        /*
+        * If archiver is active, send notification that timeline has switched.
+        */
+       if (XLogArchivingActive() && ArchiveRecoveryRequested &&
+               IsUnderPostmaster)
+               PgArchNotifyTLISwitch();

There are a few other places in xlog.c that are conditional on
XLogArchivingActive(), but none of them test ArchiveRecoveryRequested
or IsUnderPostmaster. It appears to me that PgArchStartupAllowed()
controls whether the archiver runs, and that's not contingent on
ArchiveRecoveryRequested and indeed couldn't be, since it's running in
the postmaster where that variable wouldn't be initialized. So why do
we care about ArchiveRecoveryRequested here? This is not entirely a
rhetorical question; maybe there's some reason we should care. If so,
the comment ought to mention it. If not, the test should go away.

IsUnderPostmaster does make a difference, but I think that test could
be placed inside PgArchNotifyTLISwitch() rather than putting it here
in StartupXLOG(). In fact, I think the test could be removed entirely,
since if PgArchNotifyTLISwitch() is called in single-user mode, it
will presumably just discover that arch_pgprocno == INVALID_PGPROCNO,
so it will simply do nothing even without the special-case code.

+       pqsignal(SIGINT, pgarch_timeline_switch);

I don't think it's great that we're using up SIGINT for this purpose.
There aren't that many signals available at the O/S level that we can
use for our purposes, and we generally try to multiplex them at the
application layer, e.g. by setting a latch or a flag in shared memory,
rather than using a separate signal. Can we do something of that sort
here? Or maybe we don't even need a signal. ThisTimeLineID is already
visible in shared memory, so why not just have the archiver just check
and see whether it's changed, say via a new accessor function
GetCurrentTimeLineID()? I guess there could be a concern about the
expensive of that, because we'd probably be taking a spinlock or an
lwlock for every cycle, but I don't think it's probably that bad,
because I doubt we can archive much more than a double-digit number of
files per second even with a very fast archive_command, and contention
on a lock generally requires a five digit number of acquisitions per
second. It would be worth testing to see if we can see a problem here,
but I'm fairly hopeful that it's not an issue. If we do feel that it's
important to avoid repeatedly taking a lock, let's see if we can find
a way to do it without dedicating a signal to this purpose.

+        *
+        * "nextLogSegNo" identifies the next log file to be archived in a log
+        * sequence and the flag "dirScan" specifies a full directory
scan to find
+        * the next log file.
         */
-       while (pgarch_readyXlog(xlog))
+       while (pgarch_readyXlog(xlog, &dirScan, &nextLogSegNo))

I do not like this very much. dirScan and nextLogSegNo aren't clearly
owned either by pgarch_ArchiverCopyLoop() or by pgarch_readyXlog(),
since both functions modify both variables, in each case
conditionally, while also relying on the way that the other function
manipulates them. Essentially these are global variables in disguise.
There's a third, related variable too, which is handled differently:

+       static TimeLineID curFileTLI = 0;

This is really the same kind of thing as the other two, but because
pgarch_readyXlog() happens not to need this one, you just made it
static inside pgarch_readyXlog() instead of passing it back and forth.

The problem with all this is that you can't understand either function
in isolation. Unless you read them both together and look at all of
the ways these three variables are manipulated, you can't really
understand the logic. And there's really no reason why that needs to
be true. The job of cleaning timeline_switch and setting dirScan could
be done entirely within pgarch_readyXlog(), and so could the job of
incrementing nextLogSegNo, because we're not going to again call
pgarch_readyXlog() unless archiving succeeded.

Also note that the TLI which is stored in curFileTLI corresponds to
the segment number stored in nextLogSegNo, yet one of them has "cur"
for "current" in the name and the other has "next". It would be easier
to read the code if the names were chosen more consistently.

My tentative idea as to how to clean this up is: declare a new struct
with a name like readyXlogState and members lastTLI and lastSegNo.
Have pgarch_ArchiverCopyLoop() declare a variable of this type, zero
it, pass it as a parameter to pgarch_readyXlog(), and otherwise leave
it alone. Then let pgarch_readyXlog() do all of the manipulation of
the values stored therein.

+       /*
+        * Fall-back to directory scan
+        *
+        * open xlog status directory and read through list of xlogs
that have the
+        * .ready suffix, looking for earliest file. It is possible to optimise
+        * this code, though only a single file is expected on the vast majority
+        * of calls, so....
+        */

You've moved this comment from its original location, but the trouble
is that the comment is 100% false. In fact, the whole reason why you
wrote this patch is *because* this comment is 100% false. In fact it
is not difficult to create cases where each scan finds many files, and
the purpose of the patch is precisely to optimize the code that the
person who wrote this thought didn't need optimizing. Now it may take
some work to figure out what we want to say here exactly, but
preserving the comment as it's written here is certainly misleading.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Hi,

> I don't think it's great that we're using up SIGINT for this purpose.
> There aren't that many signals available at the O/S level that we can
> use for our purposes, and we generally try to multiplex them at the
> application layer, e.g. by setting a latch or a flag in shared memory,
> rather than using a separate signal. Can we do something of that sort
> here? Or maybe we don't even need a signal. ThisTimeLineID is already
> visible in shared memory, so why not just have the archiver just check
> and see whether it's changed, say via a new accessor function
> GetCurrentTimeLineID()?

As of now shared memory is not attached to the archiver. Archiver cannot
access ThisTimeLineID or a flag available in shared memory.

    if (strcmp(argv[1], "--forkbackend") == 0 ||                                                        
        strcmp(argv[1], "--forkavlauncher") == 0 ||                                                      
        strcmp(argv[1], "--forkavworker") == 0 ||                                                        
        strcmp(argv[1], "--forkboot") == 0 ||                                                            
        strncmp(argv[1], "--forkbgworker=", 15) == 0)                                                    
        PGSharedMemoryReAttach();                                                                        
    else                                                                                                
        PGSharedMemoryNoReAttach();

This is the reason we have thought of sending a notification to the archiver if
there is a timeline switch. Should we consider attaching shared memory to
archiver process or explore more on notification mechanism to avoid
using SIGINT?

Thanks,
Dipesh

Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Wed, Jul 28, 2021 at 6:48 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
> As of now shared memory is not attached to the archiver. Archiver cannot
> access ThisTimeLineID or a flag available in shared memory.

If that is true, why are there functions PgArchShmemSize() and
PgArchShmemInit(), and how does this statement in PgArchiverMain()
manage not to core dump?

    /*
     * Advertise our pgprocno so that backends can use our latch to wake us up
     * while we're sleeping.
     */
    PgArch->pgprocno = MyProc->pgprocno;

I think what you are saying is true before v14, but not in v14 and master.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Hi,

> I think what you are saying is true before v14, but not in v14 and master.
Yes, we can use archiver specific shared memory. Thanks.

> I don't think it's great that we're using up SIGINT for this purpose.
> There aren't that many signals available at the O/S level that we can
> use for our purposes, and we generally try to multiplex them at the
> application layer, e.g. by setting a latch or a flag in shared memory,
> rather than using a separate signal. Can we do something of that sort
> here? Or maybe we don't even need a signal. ThisTimeLineID is already
> visible in shared memory, so why not just have the archiver just check
> and see whether it's changed, say via a new accessor function
> GetCurrentTimeLineID()? I guess there could be a concern about the
> expensive of that, because we'd probably be taking a spinlock or an
> lwlock for every cycle, but I don't think it's probably that bad,
> because I doubt we can archive much more than a double-digit number of
> files per second even with a very fast archive_command, and contention
> on a lock generally requires a five digit number of acquisitions per
> second. It would be worth testing to see if we can see a problem here,
> but I'm fairly hopeful that it's not an issue. If we do feel that it's
> important to avoid repeatedly taking a lock, let's see if we can find
> a way to do it without dedicating a signal to this purpose.

We can maintain the current timeline ID in archiver specific shared memory.
If we switch to a new timeline then the backend process can update the new
timeline ID in shared memory. Archiver can keep a track of current timeline ID
and if it finds that there is a timeline switch then it can perform a full directory
scan to make sure that archiving history files takes precedence over WAL files.
Access to the shared memory area can be protected by adding a WALArchiverLock.
If we take this approach then it doesn't require to use a dedicated signal to notify
a timeline switch.

> The problem with all this is that you can't understand either function
> in isolation. Unless you read them both together and look at all of
> the ways these three variables are manipulated, you can't really
> understand the logic. And there's really no reason why that needs to
> be true. The job of cleaning timeline_switch and setting dirScan could
> be done entirely within pgarch_readyXlog(), and so could the job of
> incrementing nextLogSegNo, because we're not going to again call
> pgarch_readyXlog() unless archiving succeeded.

> Also note that the TLI which is stored in curFileTLI corresponds to
> the segment number stored in nextLogSegNo, yet one of them has "cur"
> for "current" in the name and the other has "next". It would be easier
> to read the code if the names were chosen more consistently.

> My tentative idea as to how to clean this up is: declare a new struct
> with a name like readyXlogState and members lastTLI and lastSegNo.
> Have pgarch_ArchiverCopyLoop() declare a variable of this type, zero
> it, pass it as a parameter to pgarch_readyXlog(), and otherwise leave
> it alone. Then let pgarch_readyXlog() do all of the manipulation of
> the values stored therein.

Make sense, we can move the entire logic to a single function pgarch_readyXlog()
and declare a new struct readyXLogState.

I think we cannot declare a variable of this type in pgarch_ArchiverCopyLoop()
due to the fact that this function will be called every time the archiver wakes up.
Initializing readyXLogState here will reset the next anticipated log segment number
when the archiver wakes up from a wait state. We can declare and initialize it in
pgarch_MainLoop() to avoid resetting the next anticipated log segment number
when the archiver wakes up.

> You've moved this comment from its original location, but the trouble
> is that the comment is 100% false. In fact, the whole reason why you
> wrote this patch is *because* this comment is 100% false. In fact it
> is not difficult to create cases where each scan finds many files, and
> the purpose of the patch is precisely to optimize the code that the
> person who wrote this thought didn't need optimizing. Now it may take
> some work to figure out what we want to say here exactly, but
> preserving the comment as it's written here is certainly misleading.

Yes, I agree. We can update the comments here to list the scenarios
where we may need to perform a full directory scan.

I have incorporated these changes and updated a new patch. Please find
the attached patch v5.

Thanks,
Dipesh
Attachment

Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Mon, Aug 2, 2021 at 9:06 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
> We can maintain the current timeline ID in archiver specific shared memory.
> If we switch to a new timeline then the backend process can update the new
> timeline ID in shared memory. Archiver can keep a track of current timeline ID
> and if it finds that there is a timeline switch then it can perform a full directory
> scan to make sure that archiving history files takes precedence over WAL files.
> Access to the shared memory area can be protected by adding a WALArchiverLock.
> If we take this approach then it doesn't require to use a dedicated signal to notify
> a timeline switch.

Hi,

I don't really understand why you are storing something in shared
memory specifically for the archiver. Can't we use XLogCtl's
ThisTimeLineID instead of storing another copy of the information?

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
+    /*
+     * Perform a full directory scan to identify the next log segment. There
+     * may be one of the following scenarios which may require us to perform a
+     * full directory scan.
+     *
+     * 1. This is the first cycle since archiver has started and there is no
+     * idea about the next anticipated log segment.
+     *
+     * 2. There is a timeline switch, i.e. the timeline ID tracked at archiver
+     * does not match with current timeline ID. Archive history file as part of
+     * this timeline switch.
+     *
+     * 3. The next anticipated log segment is not available.

One benefit of the current implementation of pgarch_readyXlog() is
that .ready files created out of order will be prioritized before
segments with greater LSNs.  IIUC, with this patch, as long as there
is a "next anticipated" segment available, the archiver won't go back
and archive segments it missed.  I don't think the archive status
files are regularly created out of order, but XLogArchiveCheckDone()
has handling for that case, and the work to avoid creating .ready
files too early [0] seems to make it more likely.  Perhaps we should
also force a directory scan when we detect that we are creating a
.ready file for a segment that is older than the "next anticipated"
segment.

Nathan

[0] https://postgr.es/m/DA71434B-7340-4984-9B91-F085BC47A778%40amazon.com


Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Hi,

> I don't really understand why you are storing something in shared
> memory specifically for the archiver. Can't we use XLogCtl's
> ThisTimeLineID instead of storing another copy of the information?

Yes, we can avoid storing another copy of information. We can
use XLogCtl's ThisTimeLineID on Primary. However,
XLogCtl's ThisTimeLineID is not set to the current timeline ID on
Standby server. It's value is set to '0'. Can we use XLogCtl's
replayEndTLI on the Standby server to get the current timeline ID?

Thanks,
Dipesh

Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Thu, Aug 5, 2021 at 7:39 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
> Yes, we can avoid storing another copy of information. We can
> use XLogCtl's ThisTimeLineID on Primary. However,
> XLogCtl's ThisTimeLineID is not set to the current timeline ID on
> Standby server. It's value is set to '0'. Can we use XLogCtl's
> replayEndTLI on the Standby server to get the current timeline ID?

I'm not sure. I think we need the value to be accurate during
recovery, so I'm not sure whether replayEndTLI would get us there.
Another approach might be to set ThisTimeLineID on standbys also.
Actually just taking a fast look at the code I'm not quite sure why
that isn't happening already. Do you have any understanding of that?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
> I'm not sure. I think we need the value to be accurate during
> recovery, so I'm not sure whether replayEndTLI would get us there.
> Another approach might be to set ThisTimeLineID on standbys also.
> Actually just taking a fast look at the code I'm not quite sure why
> that isn't happening already. Do you have any understanding of that?

During investigation I found that the current timeline ID (ThisTimeLineID)
gets updated in XLogCtl’s ThisTimeLineID once it gets finalised as part
of archive recovery.
 
        /*
         * Write the timeline history file, and have it archived. After this
         * point (or rather, as soon as the file is archived), the timeline
         * will appear as "taken" in the WAL archive and to any standby
         * servers.  If we crash before actually switching to the new
         * timeline, standby servers will nevertheless think that we switched
         * to the new timeline, and will try to connect to the new timeline.
         * To minimize the window for that, try to do as little as possible
         * between here and writing the end-of-recovery record.
         */

In case of Standby this happens only when it gets promoted.

If Standby is in recovery mode then replayEndTLI points to the most
recent TLI corresponding to the replayed records. Also, if replying a
record causes timeline switch then replayEndTLI gets updated with
the new timeline. As long as it is in recovery mode replayEndTLI should
point to the current timeline ID on Standby. Thoughts?

Thanks,
Dipesh

Re: .ready and .done files considered harmful

From
Kyotaro Horiguchi
Date:
At Tue, 3 Aug 2021 20:46:57 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> +    /*
> +     * Perform a full directory scan to identify the next log segment. There
> +     * may be one of the following scenarios which may require us to perform a
> +     * full directory scan.
> +     *
> +     * 1. This is the first cycle since archiver has started and there is no
> +     * idea about the next anticipated log segment.
> +     *
> +     * 2. There is a timeline switch, i.e. the timeline ID tracked at archiver
> +     * does not match with current timeline ID. Archive history file as part of
> +     * this timeline switch.
> +     *
> +     * 3. The next anticipated log segment is not available.
> 
> One benefit of the current implementation of pgarch_readyXlog() is
> that .ready files created out of order will be prioritized before
> segments with greater LSNs.  IIUC, with this patch, as long as there
> is a "next anticipated" segment available, the archiver won't go back
> and archive segments it missed.  I don't think the archive status
> files are regularly created out of order, but XLogArchiveCheckDone()
> has handling for that case, and the work to avoid creating .ready
> files too early [0] seems to make it more likely.  Perhaps we should
> also force a directory scan when we detect that we are creating a
> .ready file for a segment that is older than the "next anticipated"
> segment.
> 
> Nathan
> 
> [0] https://postgr.es/m/DA71434B-7340-4984-9B91-F085BC47A778%40amazon.com

It works the current way always at the first iteration of
pgarch_ArchiveCopyLoop() becuse in the last iteration of
pgarch_ArchiveCopyLoop(), pgarch_readyXlog() erases the last
anticipated segment.  The shortcut works only when
pgarch_ArchiveCopyLoop archives more than once successive segments at
once.  If the anticipated next segment found to be missing a .ready
file while archiving multiple files, pgarch_readyXLog falls back to
the regular way.

So I don't see the danger to happen perhaps you are considering.

In the first place, .ready are added while holding WALWriteLock in
XLogWrite, and while removing old segments after a checkpoint (which
happens while recovery). Assuming that no one manually remove .ready
files on an active server, the former is the sole place doing that. So
I don't see a chance that .ready files are created out-of-order way.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 8/5/21, 6:26 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> It works the current way always at the first iteration of
> pgarch_ArchiveCopyLoop() becuse in the last iteration of
> pgarch_ArchiveCopyLoop(), pgarch_readyXlog() erases the last
> anticipated segment.  The shortcut works only when
> pgarch_ArchiveCopyLoop archives more than once successive segments at
> once.  If the anticipated next segment found to be missing a .ready
> file while archiving multiple files, pgarch_readyXLog falls back to
> the regular way.
>
> So I don't see the danger to happen perhaps you are considering.

I think my concern is that there's no guarantee that we will ever do
another directory scan.  A server that's generating a lot of WAL could
theoretically keep us in the next-anticipated-log code path
indefinitely.

> In the first place, .ready are added while holding WALWriteLock in
> XLogWrite, and while removing old segments after a checkpoint (which
> happens while recovery). Assuming that no one manually remove .ready
> files on an active server, the former is the sole place doing that. So
> I don't see a chance that .ready files are created out-of-order way.

Perhaps a more convincing example is when XLogArchiveNotify() fails.
AFAICT this can fail without ERROR-ing, in which case the server can
continue writing WAL and creating .ready files for later segments.  At
some point, the checkpointer process will call RemoveOldXlogFiles()
and try to create the missing .ready file.

Nathan


Re: .ready and .done files considered harmful

From
Kyotaro Horiguchi
Date:
At Thu, 5 Aug 2021 21:53:30 +0530, Dipesh Pandit <dipesh.pandit@gmail.com> wrote in 
> > I'm not sure. I think we need the value to be accurate during
> > recovery, so I'm not sure whether replayEndTLI would get us there.
> > Another approach might be to set ThisTimeLineID on standbys also.
> > Actually just taking a fast look at the code I'm not quite sure why
> > that isn't happening already. Do you have any understanding of that?
> 
> During investigation I found that the current timeline ID (ThisTimeLineID)
> gets updated in XLogCtl’s ThisTimeLineID once it gets finalised as part
> of archive recovery.
> 
>         /*
>          * Write the timeline history file, and have it archived. After this
>          * point (or rather, as soon as the file is archived), the timeline
>          * will appear as "taken" in the WAL archive and to any standby
>          * servers.  If we crash before actually switching to the new
>          * timeline, standby servers will nevertheless think that we
> switched
>          * to the new timeline, and will try to connect to the new timeline.
>          * To minimize the window for that, try to do as little as possible
>          * between here and writing the end-of-recovery record.
>          */
> 
> In case of Standby this happens only when it gets promoted.
> 
> If Standby is in recovery mode then replayEndTLI points to the most
> recent TLI corresponding to the replayed records. Also, if replying a
> record causes timeline switch then replayEndTLI gets updated with
> the new timeline. As long as it is in recovery mode replayEndTLI should
> point to the current timeline ID on Standby. Thoughts?

As I mentioned in another branch of this thread, pgarch_readyXlog()
always goes into the fall back path at the first iteration of
pgarch_ArchiverCopyLoop() and the current (or expected) TLI is
informed there.  So no need of shared timeline ID at that time.

When pgarch_ArchiverCopyLoop meets a timeline switch, the short cut
path fails to find the next anticipated .ready file then goes into the
fallback path, which should find the history file for the next TLI
(unless any timing misalignment I'm not aware of happens).

So the shared timeline id works only to let the fast path give way to
the fall back path to find the just created history file as earlier as
possible.  Notifying the timeline ID that the startup process
recognizes to archiver makes thing more complex than requied.
Currently archiver doesn't use SIGINT, so I think we can use sigint
for the purpose.

Furthermore, it seems to me that we can make the TLI and the next
anticipated segment number function-local static variables.  It would
be workable assuming that the only caller pgarch_ArchiverCopyLoop
obeys the contract that it must call pgarch_readyXlog() until it
returns false.  However, there seems to be no reason for it not to
work even otherwise, unless I'm missing something (that's likely),
though.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: .ready and .done files considered harmful

From
Kyotaro Horiguchi
Date:
At Fri, 6 Aug 2021 02:34:24 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 8/5/21, 6:26 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > It works the current way always at the first iteration of
> > pgarch_ArchiveCopyLoop() becuse in the last iteration of
> > pgarch_ArchiveCopyLoop(), pgarch_readyXlog() erases the last
> > anticipated segment.  The shortcut works only when
> > pgarch_ArchiveCopyLoop archives more than once successive segments at
> > once.  If the anticipated next segment found to be missing a .ready
> > file while archiving multiple files, pgarch_readyXLog falls back to
> > the regular way.
> >
> > So I don't see the danger to happen perhaps you are considering.
> 
> I think my concern is that there's no guarantee that we will ever do
> another directory scan.  A server that's generating a lot of WAL could
> theoretically keep us in the next-anticipated-log code path
> indefinitely.

Theoretically possible. Supposing that .ready may be created
out-of-order (for the following reason, as a possibility), when once
the fast path bailed out then the fallback path finds that the second
oldest file has .ready, the succeeding fast path continues running
leaving the oldest file.

> > In the first place, .ready are added while holding WALWriteLock in
> > XLogWrite, and while removing old segments after a checkpoint (which
> > happens while recovery). Assuming that no one manually remove .ready
> > files on an active server, the former is the sole place doing that. So
> > I don't see a chance that .ready files are created out-of-order way.
> 
> Perhaps a more convincing example is when XLogArchiveNotify() fails.
> AFAICT this can fail without ERROR-ing, in which case the server can
> continue writing WAL and creating .ready files for later segments.  At
> some point, the checkpointer process will call RemoveOldXlogFiles()
> and try to create the missing .ready file.

Mmm. Assuming that could happen, a history file gets cursed to lose a
chance to be archived forever once that disaster falls onto it.  Apart
from this patch, maybe we need a measure to notify the history files
that are once missed a chance.

Assuming that all such forgotten files would be finally re-marked as
.ready anywhere, they can be re-found by archiver by explicitly
triggering the fallback path.  Currently the trigger fires implicitly
by checking shared timeline movement, but by causing the trigger by,
for example by a signal as mentioned in a nearby message, that
behavior would be easily to implement.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Hi,

Thanks for the feedback.

The possible path that archiver can take for each cycle is either a fast
path or a fall-back patch. The fast path involves checking availability of
next anticipated log segment and decide the next target for archival or
a fall-back path which involves full directory scan to get the next log segment.
We need a mechanism that enables the archiver to select the desired path
for each cycle.

This can be achieved by maintaining a shared memory flag. If this flag is set
then archiver should take the fall-back path otherwise it should continue with
the fast path.

This flag can be set by backend in case if an action like timeline switch,
.ready files created out of order,...  requires archiver to perform a full
directory scan.

I have incorporated these changes and updated a new patch. PFA patch v6.

Thanks,
Dipesh
Attachment

Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
+     * This .ready file is created out of order, notify archiver to perform
+     * a full directory scan to archive corresponding WAL file.
+     */
+    StatusFilePath(archiveStatusPath, xlog, ".ready");
+    if (stat(archiveStatusPath, &stat_buf) == 0)
+        PgArchEnableDirScan();

We may want to call PgArchWakeup() after setting the flag.

+     * Perform a full directory scan to identify the next log segment. There
+     * may be one of the following scenarios which may require us to perform a
+     * full directory scan.
...
+     * - The next anticipated log segment is not available.

I wonder if we really need to perform a directory scan in this case.
Unless there are other cases where the .ready files are created out of
order, I think this just causes an unnecessary directory scan every
time the archiver catches up.

+     * Flag to enable/disable directory scan. If this flag is set then it
+     * forces archiver to perform a full directory scan to get the next log
+     * segment.
+     */
+    pg_atomic_flag dirScan;

I personally don't think it's necessary to use an atomic here.  A
spinlock or LWLock would probably work just fine, as contention seems
unlikely.  If we use a lock, we also don't have to worry about memory
barriers.

Nathan


Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 8/15/21, 9:52 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> +     * Perform a full directory scan to identify the next log segment. There
> +     * may be one of the following scenarios which may require us to perform a
> +     * full directory scan.
> ...
> +     * - The next anticipated log segment is not available.
>
> I wonder if we really need to perform a directory scan in this case.
> Unless there are other cases where the .ready files are created out of
> order, I think this just causes an unnecessary directory scan every
> time the archiver catches up.

Thinking further, I suppose this is necessary for when lastSegNo gets
reset after processing an out-of-order .ready file.

Nathan


Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Thanks for the feedback.

> +       StatusFilePath(archiveStatusPath, xlog, ".ready");
> +       if (stat(archiveStatusPath, &stat_buf) == 0)
> +               PgArchEnableDirScan();

> We may want to call PgArchWakeup() after setting the flag.

Yes, added a call to wake up archiver.

> > +      * - The next anticipated log segment is not available.
> >
> > I wonder if we really need to perform a directory scan in this case.
> > Unless there are other cases where the .ready files are created out of
> > order, I think this just causes an unnecessary directory scan every
> > time the archiver catches up.

> Thinking further, I suppose this is necessary for when lastSegNo gets
> reset after processing an out-of-order .ready file.

Also, this is necessary when lastTLI gets reset after switching to a new
timeline.

> +       pg_atomic_flag dirScan;

> I personally don't think it's necessary to use an atomic here.  A
> spinlock or LWLock would probably work just fine, as contention seems
> unlikely.  If we use a lock, we also don't have to worry about memory
> barriers.

History file should be archived as soon as it gets created. The atomic flag
here will make sure that there is no reordering of read/write instructions while
accessing the flag in shared memory. Archiver needs to read this flag at the
beginning of each cycle. Write to atomic flag is synchronized and it provides
a lockless read. I think an atomic flag here is an efficient choice unless I am
missing something.

Please find the attached patch v7.

Thanks,
Dipesh
Attachment

Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 8/17/21, 5:53 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote:
>> I personally don't think it's necessary to use an atomic here.  A
>> spinlock or LWLock would probably work just fine, as contention seems
>> unlikely.  If we use a lock, we also don't have to worry about memory
>> barriers.
>
> History file should be archived as soon as it gets created. The atomic flag
> here will make sure that there is no reordering of read/write instructions while
> accessing the flag in shared memory. Archiver needs to read this flag at the 
> beginning of each cycle. Write to atomic flag is synchronized and it provides 
> a lockless read. I think an atomic flag here is an efficient choice unless I am 
> missing something.

Sorry, I think my note was not very clear.  I agree that a flag should
be used for this purpose, but I think we should just use a regular
bool protected by a spinlock or LWLock instead of an atomic.  The file
atomics.h has the following note:

 * Use higher level functionality (lwlocks, spinlocks, heavyweight locks)
 * whenever possible. Writing correct code using these facilities is hard.

IOW I don't think the extra complexity is necessary.  From a
performance standpoint, contention seems unlikely.  We only need to
read the flag roughly once per WAL segment, and we only ever set it in
uncommon scenarios such as a timeline switch or the creation of an
out-of-order .ready file.

Nathan


Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Tue, Aug 17, 2021 at 12:33 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> Sorry, I think my note was not very clear.  I agree that a flag should
> be used for this purpose, but I think we should just use a regular
> bool protected by a spinlock or LWLock instead of an atomic.  The file
> atomics.h has the following note:
>
>  * Use higher level functionality (lwlocks, spinlocks, heavyweight locks)
>  * whenever possible. Writing correct code using these facilities is hard.
>
> IOW I don't think the extra complexity is necessary.  From a
> performance standpoint, contention seems unlikely.  We only need to
> read the flag roughly once per WAL segment, and we only ever set it in
> uncommon scenarios such as a timeline switch or the creation of an
> out-of-order .ready file.

In the interest of full disclosure, I think that I was probably the
one who suggested to Dipesh that he should look into using atomics,
although I can't quite remember right now why I thought we might want
to do that.

I do not on general principle very much like code that does
LWLockAcquire(whatever);
exactly-one-assignment-statement-that-modifies-a-1-2-or-4-byte-quantity;
LWLockRelease(whatever). If you had two assignments in there, then you
know why you have a lock: it's to make those behave as an atomic,
indivisible unit. But when you only have one, what are you protecting
against? You're certainly not making anything atomic that would not
have been anyway, so you must be using the LWLock as a memory barrier.
But then you really kind of have to think about memory barriers
anyway: why do you need one at all, and what things need to be
separated? It's not clear that spelling pg_memory_barrier() as
LWLockAcquire() and/or LWLockRelease() is actually saving you anything
in terms of notional complexity.

In this patch, it appears to me that the atomic flag is only ever
being read unlocked, so I think that we're actually getting no benefit
at all from the use of pg_atomic_flag here. We're not making anything
atomic, because there's only one bit of shared state, and we're not
getting any memory barrier semantics, because it looks to me like the
flag is only ever tested using pg_atomic_unlocked_test_flag, which is
documented not to have barrier semantics. So as far as I can see,
there's no point in using either an LWLock or atomics here. We could
just use bool with no lock and the code would do exactly what it does
now. So I guess the question is whether that's correct or whether we
need some kind of synchronization and, if so, of what sort.

I can't actually see that there's any kind of hard synchronization
requirement here at all. What we're trying to do is guarantee that if
the timeline changes, we'll pick up the timeline history for the new
timeline next, and that if files are archived out of order, we'll
switch to archiving the oldest file that is now present rather than
continuing with consecutive files. But suppose we just use an
unsynchronized bool. The worst case is that we'll archive one extra
file proceeding in order before we jump to the file that we were
supposed to archive next. It's not evident to me that this is all that
bad. The same thing would have happened if the previous file had been
archived slightly faster than it actually was, so that we began
archiving the next file just before, rather than just after, the
notification was sent. And if it is bad, wrapping an LWLock around the
accesses to the flag variable, or using an atomic, does nothing to
stop it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 8/17/21, 11:28 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> I can't actually see that there's any kind of hard synchronization
> requirement here at all. What we're trying to do is guarantee that if
> the timeline changes, we'll pick up the timeline history for the new
> timeline next, and that if files are archived out of order, we'll
> switch to archiving the oldest file that is now present rather than
> continuing with consecutive files. But suppose we just use an
> unsynchronized bool. The worst case is that we'll archive one extra
> file proceeding in order before we jump to the file that we were
> supposed to archive next. It's not evident to me that this is all that
> bad. The same thing would have happened if the previous file had been
> archived slightly faster than it actually was, so that we began
> archiving the next file just before, rather than just after, the
> notification was sent. And if it is bad, wrapping an LWLock around the
> accesses to the flag variable, or using an atomic, does nothing to
> stop it.

I am inclined to agree.  The archiver only ever reads the flag and
sets it to false (if we are doing a directory scan).  Others only ever
set the flag to true.  The only case I can think of where we might
miss the timeline switch or out-of-order .ready file is when the
archiver sets the flag to false and then ReadDir() fails.  However,
that seems to cause the archiver process to restart, and we always
start with a directory scan at first.

Nathan


Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 8/17/21, 12:11 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> On 8/17/21, 11:28 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
>> I can't actually see that there's any kind of hard synchronization
>> requirement here at all. What we're trying to do is guarantee that if
>> the timeline changes, we'll pick up the timeline history for the new
>> timeline next, and that if files are archived out of order, we'll
>> switch to archiving the oldest file that is now present rather than
>> continuing with consecutive files. But suppose we just use an
>> unsynchronized bool. The worst case is that we'll archive one extra
>> file proceeding in order before we jump to the file that we were
>> supposed to archive next. It's not evident to me that this is all that
>> bad. The same thing would have happened if the previous file had been
>> archived slightly faster than it actually was, so that we began
>> archiving the next file just before, rather than just after, the
>> notification was sent. And if it is bad, wrapping an LWLock around the
>> accesses to the flag variable, or using an atomic, does nothing to
>> stop it.
>
> I am inclined to agree.  The archiver only ever reads the flag and
> sets it to false (if we are doing a directory scan).  Others only ever
> set the flag to true.  The only case I can think of where we might
> miss the timeline switch or out-of-order .ready file is when the
> archiver sets the flag to false and then ReadDir() fails.  However,
> that seems to cause the archiver process to restart, and we always
> start with a directory scan at first.

Thinking further, I think the most important thing to ensure is that
resetting the flag happens before we begin the directory scan.
Consider the following scenario in which a timeline history file would
potentially be lost:

        1. Archiver completes directory scan.
        2. A timeline history file is created and the flag is set.
        3. Archiver resets the flag.

I don't think there's any problem with the archiver reading a stale
value for the flag.  It should eventually be updated and route us to
the directory scan code path.

I'd also note that we're depending on the directory scan logic for
picking up all timeline history files and out-of-order .ready files
that may have been created each time the flag is set.  AFAICT that is
safe since we prioritize timeline history files and reset the archiver
state anytime we do a directory scan.  We'll first discover timeline
history files via directory scans, and then we'll move on to .ready
files, starting at the one with the lowest segment number.  If a new
timeline history file or out-of-order .ready file is created, the
archiver is notified, and we start over.

Nathan


Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Hi,

Thanks for the feedback. I have incorporated the suggestion
to use an unsynchronized boolean flag to force directory scan.
This flag is being set if there is a timeline switch or .ready file
is created out of order. Archiver resets this flag in case if it is
being set before it begins directory scan.

PFA patch v8.

Thanks,
Dipesh
Attachment

Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Tue, Aug 17, 2021 at 4:19 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> Thinking further, I think the most important thing to ensure is that
> resetting the flag happens before we begin the directory scan.
> Consider the following scenario in which a timeline history file would
> potentially be lost:
>
>         1. Archiver completes directory scan.
>         2. A timeline history file is created and the flag is set.
>         3. Archiver resets the flag.

Dipesh says in his latest email that the archiver resets the flag just
before it begins a directory scan. If that's accurate, then I think
this sequence of events can't occur.

If there is a race condition here with setting the flag, then an
alternative design would be to use a counter - either a plain old
uint64 or perhaps pg_atomic_uint64 - and have the startup process
increment the counter when it wants to trigger a scan. In this design,
the archiver would never modify the counter itself, but just remember
the last value that it saw. If it later sees a different value it
knows that a full scan is required. I think this kind of system is
extremely robust against the general class of problems that you're
talking about here, but I'm not sure whether we need it, because I'm
not sure whether there is a race with just the bool.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
Thanks for the new version of the patch.  Overall, I think it is on
the right track.

+    /*
+     * This .ready file is created out of order, notify archiver to perform
+     * a full directory scan to archive corresponding WAL file.
+     */
+    StatusFilePath(archiveStatusPath, xlog, ".ready");
+    if (stat(archiveStatusPath, &stat_buf) == 0)
+    {
+        PgArchEnableDirScan();
+        PgArchWakeup();
+    }

Should we have XLogArchiveNotify(), writeTimeLineHistory(), and
writeTimeLineHistoryFile() enable the directory scan instead?  Else,
we have to exhaustively cover all such code paths, which may be
difficult to maintain.  Another reason I am bringing this up is that
my patch for adjusting .ready file creation [0] introduces more
opportunities for .ready files to be created out-of-order.

+    /*
+     * This is a fall-back path, check if we are here due to the unavailability
+     * of next anticipated log segment or the archiver is being forced to
+     * perform a full directory scan. Reset the flag in shared memory only if
+     * it has been enabled to force a full directory scan and then proceed with
+     * directory scan.
+     */
+    if (PgArch->dirScan)
+        PgArch->dirScan = false;

Why do we need to check that the flag is set before we reset it?  I
think we could just always reset it since we are about to do a
directory scan anyway.

On 8/18/21, 7:25 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Tue, Aug 17, 2021 at 4:19 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>> Thinking further, I think the most important thing to ensure is that
>> resetting the flag happens before we begin the directory scan.
>> Consider the following scenario in which a timeline history file would
>> potentially be lost:
>>
>>         1. Archiver completes directory scan.
>>         2. A timeline history file is created and the flag is set.
>>         3. Archiver resets the flag.
>
> Dipesh says in his latest email that the archiver resets the flag just
> before it begins a directory scan. If that's accurate, then I think
> this sequence of events can't occur.
>
> If there is a race condition here with setting the flag, then an
> alternative design would be to use a counter - either a plain old
> uint64 or perhaps pg_atomic_uint64 - and have the startup process
> increment the counter when it wants to trigger a scan. In this design,
> the archiver would never modify the counter itself, but just remember
> the last value that it saw. If it later sees a different value it
> knows that a full scan is required. I think this kind of system is
> extremely robust against the general class of problems that you're
> talking about here, but I'm not sure whether we need it, because I'm
> not sure whether there is a race with just the bool.

I'm not sure, either.  Perhaps it would at least be worth adding a
pg_memory_barrier() after setting dirScan to false to avoid the
scenario I mentioned (which may or may not be possible).  IMO this
stuff would be much easier to reason about if we used a lock instead,
even if the synchronization was not strictly necessary.  However, I
don't want to hold this patch up too much on this point.

Nathan

[0] https://postgr.es/m/05AD5FE2-9A53-4D11-A3F8-3A83EBB0EB93%40amazon.com


Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Hi,

Thanks for the feedback.

> Should we have XLogArchiveNotify(), writeTimeLineHistory(), and
> writeTimeLineHistoryFile() enable the directory scan instead?  Else,
> we have to exhaustively cover all such code paths, which may be
> difficult to maintain.  Another reason I am bringing this up is that
> my patch for adjusting .ready file creation [0] introduces more
> opportunities for .ready files to be created out-of-order.

XLogArchiveNotify() notifies Archiver when a log segment is ready for
archival by creating a .ready file. This function is being called for each
log segment and placing a call to enable directory scan here will result
in directory scan for each log segment.

We can have writeTimeLineHistory() and writeTimeLineHistoryFile() to
enable directory scan to handle the scenarios related to timeline switch.

However, in other scenarios, I think we have to explicitly call PgArchEnableDirScan()
to enable directory scan. PgArchEnableDirScan() takes care of waking up
archiver so that the caller of this function need not have to nudge the archiver.

> +    /*
> +     * This is a fall-back path, check if we are here due to the unavailability
> +     * of next anticipated log segment or the archiver is being forced to
> +     * perform a full directory scan. Reset the flag in shared memory only if
> +     * it has been enabled to force a full directory scan and then proceed with
> +     * directory scan.
> +     */
> +    if (PgArch->dirScan)
> +        PgArch->dirScan = false;

> Why do we need to check that the flag is set before we reset it?  I
> think we could just always reset it since we are about to do a
> directory scan anyway

Yes, I agree.

> > If there is a race condition here with setting the flag, then an
> > alternative design would be to use a counter - either a plain old
> > uint64 or perhaps pg_atomic_uint64 - and have the startup process
> > increment the counter when it wants to trigger a scan. In this design,
> > the archiver would never modify the counter itself, but just remember
> > the last value that it saw. If it later sees a different value it
> > knows that a full scan is required. I think this kind of system is
> > extremely robust against the general class of problems that you're
> > talking about here, but I'm not sure whether we need it, because I'm
> > not sure whether there is a race with just the bool.

> I'm not sure, either.  Perhaps it would at least be worth adding a
> pg_memory_barrier() after setting dirScan to false to avoid the
> scenario I mentioned (which may or may not be possible).  IMO this
> stuff would be much easier to reason about if we used a lock instead,
> even if the synchronization was not strictly necessary.  However, I
> don't want to hold this patch up too much on this point.

There is one possible scenario where it may run into a race condition. If
archiver has just finished archiving all .ready files and the next anticipated
log segment is not available then in this case archiver takes the fall-back
path to scan directory. It resets the flag before it begins directory scan.
Now, if a directory scan is enabled by a timeline switch or .ready file created
out of order in parallel to the event that the archiver resets the flag then this
might result in a race condition. But in this case also archiver is eventually
going to perform a directory scan and the desired file will be archived as part
of directory scan. Apart of this I can't think of any other scenario which may
result into a race condition unless I am missing something.

I have incorporated the suggestions and updated a new patch. PFA patch v9.

Thanks,
Dipesh
Attachment

Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 8/19/21, 5:42 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote:
>> Should we have XLogArchiveNotify(), writeTimeLineHistory(), and
>> writeTimeLineHistoryFile() enable the directory scan instead?  Else,
>> we have to exhaustively cover all such code paths, which may be
>> difficult to maintain.  Another reason I am bringing this up is that
>> my patch for adjusting .ready file creation [0] introduces more
>> opportunities for .ready files to be created out-of-order.
>
> XLogArchiveNotify() notifies Archiver when a log segment is ready for
> archival by creating a .ready file. This function is being called for each 
> log segment and placing a call to enable directory scan here will result
> in directory scan for each log segment. 

Could we have XLogArchiveNotify() check the archiver state and only
trigger a directory scan if we detect that we are creating an out-of-
order .ready file?

> There is one possible scenario where it may run into a race condition. If
> archiver has just finished archiving all .ready files and the next anticipated
> log segment is not available then in this case archiver takes the fall-back 
> path to scan directory. It resets the flag before it begins directory scan. 
> Now, if a directory scan is enabled by a timeline switch or .ready file created
> out of order in parallel to the event that the archiver resets the flag then this
> might result in a race condition. But in this case also archiver is eventually 
> going to perform a directory scan and the desired file will be archived as part
> of directory scan. Apart of this I can't think of any other scenario which may 
> result into a race condition unless I am missing something.

What do you think about adding an upper limit to the number of files
we can archive before doing a directory scan?  The more I think about
the directory scan flag, the more I believe it is a best-effort tool
that will remain prone to race conditions.  If we have a guarantee
that a directory scan will happen within the next N files, there's
probably less pressure to make sure that it's 100% correct.

On an unrelated note, do we need to add some extra handling for backup
history files and partial WAL files?

Nathan


Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 5/4/21, 7:07 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Tue, May 4, 2021 at 12:27 AM Andres Freund <andres@anarazel.de> wrote:
>> On 2021-05-03 16:49:16 -0400, Robert Haas wrote:
>> > I have two possible ideas for addressing this; perhaps other people
>> > will have further suggestions. A relatively non-invasive fix would be
>> > to teach pgarch.c how to increment a WAL file name. After archiving
>> > segment N, check using stat() whether there's an .ready file for
>> > segment N+1. If so, do that one next. If not, then fall back to
>> > performing a full directory scan.
>>
>> Hm. I wonder if it'd not be better to determine multiple files to be
>> archived in one readdir() pass?
>
> I think both methods have some merit. If we had a way to pass a range
> of files to archive_command instead of just one, then your way is
> distinctly better, and perhaps we should just go ahead and invent such
> a thing. If not, your way doesn't entirely solve the O(n^2) problem,
> since you have to choose some upper bound on the number of file names
> you're willing to buffer in memory, but it may lower it enough that it
> makes no practical difference. I am somewhat inclined to think that it
> would be good to start with the method I'm proposing, since it is a
> clear-cut improvement over what we have today and can be done with a
> relatively limited amount of code change and no redesign, and then
> perhaps do something more ambitious afterward.

I was curious about this, so I wrote a patch (attached) to store
multiple files per directory scan and tested it against the latest
patch in this thread (v9) [0].  Specifically, I set archive_command to
'false', created ~20K WAL segments, then restarted the server with
archive_command set to 'true'.  Both the v9 patch and the attached
patch completed archiving all segments in just under a minute.  (I
tested the attached patch with NUM_FILES_PER_DIRECTORY_SCAN set to 64,
128, and 256 and didn't observe any significant difference.)  The
existing logic took over 4 minutes to complete.

I'm hoping to do this test again with many more (100K+) status files,
as I believe that the v9 patch will be faster at that scale, but I'm
not sure how much faster it will be.

Nathan

[0] https://www.postgresql.org/message-id/attachment/125543/v9-0001-mitigate-directory-scan-for-WAL-archiver.patch


Attachment

Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 8/21/21, 9:29 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> I was curious about this, so I wrote a patch (attached) to store
> multiple files per directory scan and tested it against the latest
> patch in this thread (v9) [0].  Specifically, I set archive_command to
> 'false', created ~20K WAL segments, then restarted the server with
> archive_command set to 'true'.  Both the v9 patch and the attached
> patch completed archiving all segments in just under a minute.  (I
> tested the attached patch with NUM_FILES_PER_DIRECTORY_SCAN set to 64,
> 128, and 256 and didn't observe any significant difference.)  The
> existing logic took over 4 minutes to complete.
>
> I'm hoping to do this test again with many more (100K+) status files,
> as I believe that the v9 patch will be faster at that scale, but I'm
> not sure how much faster it will be.

I ran this again on a bigger machine with 200K WAL files pending
archive.  The v9 patch took ~5.5 minutes, the patch I sent took ~8
minutes, and the existing logic took just under 3 hours.

Nathan


Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Sun, Aug 22, 2021 at 10:31 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> I ran this again on a bigger machine with 200K WAL files pending
> archive.  The v9 patch took ~5.5 minutes, the patch I sent took ~8
> minutes, and the existing logic took just under 3 hours.

Hmm. On the one hand, 8 minutes > 5.5 minutes, and presumably the gap
would only get wider if the number of files were larger or if reading
the directory were slower. I am pretty sure that reading the directory
must be much slower in some real deployments where this problem has
come up. On the other hand, 8.8 minutes << 3 hours, and your patch
would win if somehow we had a ton of gaps in the sequence of files.
I'm not sure how likely that is to be the cause - probably not very
likely at all if you aren't using an archive command that cheats, but
maybe really common if you are. Hmm, but I think if the
archive_command cheats by marking a bunch of files done when it is
tasked with archiving just one, your patch will break, because, unless
I'm missing something, it doesn't re-evaluate whether things have
changed on every pass through the loop as Dipesh's patch does. So I
guess I'm not quite sure I understand why you think this might be the
way to go?

Maintaining the binary heap in lowest-priority-first order is very
clever, and the patch does look quite elegant. I'm just not sure I
understand the point.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:

On 8/23/21, 6:42 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Sun, Aug 22, 2021 at 10:31 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>> I ran this again on a bigger machine with 200K WAL files pending
>> archive.  The v9 patch took ~5.5 minutes, the patch I sent took ~8
>> minutes, and the existing logic took just under 3 hours.
>
> Hmm. On the one hand, 8 minutes > 5.5 minutes, and presumably the gap
> would only get wider if the number of files were larger or if reading
> the directory were slower. I am pretty sure that reading the directory
> must be much slower in some real deployments where this problem has
> come up. On the other hand, 8.8 minutes << 3 hours, and your patch
> would win if somehow we had a ton of gaps in the sequence of files.
> I'm not sure how likely that is to be the cause - probably not very
> likely at all if you aren't using an archive command that cheats, but
> maybe really common if you are. Hmm, but I think if the
> archive_command cheats by marking a bunch of files done when it is
> tasked with archiving just one, your patch will break, because, unless
> I'm missing something, it doesn't re-evaluate whether things have
> changed on every pass through the loop as Dipesh's patch does. So I
> guess I'm not quite sure I understand why you think this might be the
> way to go?

To handle a "cheating" archive command, I'd probably need to add a
stat() for every time pgarch_readyXLog() returned something from
arch_files.  I suspect something similar might be needed in Dipesh's
patch to handle backup history files and partial WAL files.

In any case, I think Dipesh's patch is the way to go.  It obviously
will perform better in the extreme cases discussed in this thread.  I
think it's important to make sure the patch doesn't potentially leave
files behind to be picked up by a directory scan that might not
happen, but there are likely ways to handle that.  In the worst case,
perhaps we need to force a directory scan every N files to make sure
nothing gets left behind.  But maybe we can do better.

> Maintaining the binary heap in lowest-priority-first order is very
> clever, and the patch does look quite elegant. I'm just not sure I
> understand the point.

This was mostly an exploratory exercise to get some numbers for the
different approaches discussed in this thread.

Nathan


Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Mon, Aug 23, 2021 at 11:50 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> To handle a "cheating" archive command, I'd probably need to add a
> stat() for every time pgarch_readyXLog() returned something from
> arch_files.  I suspect something similar might be needed in Dipesh's
> patch to handle backup history files and partial WAL files.

I think he's effectively got that already, although it's probably
inside of pgarch_readyXLog(). The idea there is that instead of having
a cache of files to be returned (as in your case) he just checks
whether the next file in sequence happens to be present and if so
returns that file name. To see whether it's present, he uses stat().

> In any case, I think Dipesh's patch is the way to go.  It obviously
> will perform better in the extreme cases discussed in this thread.  I
> think it's important to make sure the patch doesn't potentially leave
> files behind to be picked up by a directory scan that might not
> happen, but there are likely ways to handle that.  In the worst case,
> perhaps we need to force a directory scan every N files to make sure
> nothing gets left behind.  But maybe we can do better.

It seems to me that we can handle that by just having the startup
process notify the archiver every time some file is ready for
archiving that's not the next one in the sequence. We have to make
sure we cover all the relevant code paths, but that seems like it
should be doable, and we have to decide on the synchronization
details, but that also seems pretty manageable, even if we haven't
totally got it sorted yet. The thing is, as soon as you go back to
forcing a directory scan every N files, you've made it formally O(N^2)
again, which might not matter in practice if the constant factor is
low enough, but I don't think it will be. Either you force the scans
every, say, 1000 files, in which case it's going to make the whole
mechanism a lot less effective in terms of getting out from under
problem cases -- or you force scans every, say, 1000000 files, in
which case it's not really going to cause any missed files to get
archived soon enough to make anyone happy. I doubt there is really a
happy medium in there.

I suppose the two approaches could be combined, too - remember the
first N files you think you'll encounter and then after that try
successive filenames until one is missing. That would be more
resilient against O(N^2) behavior in the face of frequent gaps. But it
might also be more engineering than is strictly required.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 8/23/21, 10:49 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Mon, Aug 23, 2021 at 11:50 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>> To handle a "cheating" archive command, I'd probably need to add a
>> stat() for every time pgarch_readyXLog() returned something from
>> arch_files.  I suspect something similar might be needed in Dipesh's
>> patch to handle backup history files and partial WAL files.
>
> I think he's effectively got that already, although it's probably
> inside of pgarch_readyXLog(). The idea there is that instead of having
> a cache of files to be returned (as in your case) he just checks
> whether the next file in sequence happens to be present and if so
> returns that file name. To see whether it's present, he uses stat().

IIUC partial WAL files are handled because the next file in the
sequence with the given TimeLineID won't be there, so we will fall
back to a directory scan and pick it up.  Timeline history files are
handled by forcing a directory scan, which should work because they
always have the highest priority.  Backup history files, however, do
not seem to be handled.  I think one approach to fixing that is to
also treat backup history files similarly to timeline history files.
If one is created, we force a directory scan, and the directory scan
logic will consider backup history files as higher priority than
everything but timeline history files.

I've been looking at the v9 patch with fresh eyes, and I still think
we should be able to force the directory scan as needed in
XLogArchiveNotify().  Unless the file to archive is a regular WAL file
that is > our stored location in archiver memory, we should force a
directory scan.  I think it needs to be > instead of >= because we
don't know if the archiver has just completed a directory scan and
found a later segment to use to update the archiver state (but hasn't
yet updated the state in shared memory).

Also, I think we need to make sure to set PgArch->dirScan back to true
at the end of pgarch_readyXlog() unless we've found a new regular WAL
file that we can use to reset the archiver's stored location.  This
ensures that we'll keep doing directory scans as long as there are
timeline/backup history files to process.  

Nathan


Re: .ready and .done files considered harmful

From
Kyotaro Horiguchi
Date:
At Tue, 24 Aug 2021 00:03:37 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 8/23/21, 10:49 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> > On Mon, Aug 23, 2021 at 11:50 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> >> To handle a "cheating" archive command, I'd probably need to add a
> >> stat() for every time pgarch_readyXLog() returned something from
> >> arch_files.  I suspect something similar might be needed in Dipesh's
> >> patch to handle backup history files and partial WAL files.
> >
> > I think he's effectively got that already, although it's probably
> > inside of pgarch_readyXLog(). The idea there is that instead of having
> > a cache of files to be returned (as in your case) he just checks
> > whether the next file in sequence happens to be present and if so
> > returns that file name. To see whether it's present, he uses stat().
> 
> IIUC partial WAL files are handled because the next file in the
> sequence with the given TimeLineID won't be there, so we will fall
> back to a directory scan and pick it up.  Timeline history files are
> handled by forcing a directory scan, which should work because they
> always have the highest priority.  Backup history files, however, do
> not seem to be handled.  I think one approach to fixing that is to
> also treat backup history files similarly to timeline history files.
> If one is created, we force a directory scan, and the directory scan
> logic will consider backup history files as higher priority than
> everything but timeline history files.

Backup history files are (currently) just informational and they are
finally processed at the end of a bulk-archiving performed by the fast
path.  However, I feel that it is cleaner to trigger a directory scan
every time we add an other-than-a-regular-WAL-file, as base-backup or
promotion are not supposed happen so infrequently.

> I've been looking at the v9 patch with fresh eyes, and I still think
> we should be able to force the directory scan as needed in
> XLogArchiveNotify().  Unless the file to archive is a regular WAL file
> that is > our stored location in archiver memory, we should force a
> directory scan.  I think it needs to be > instead of >= because we
> don't know if the archiver has just completed a directory scan and
> found a later segment to use to update the archiver state (but hasn't
> yet updated the state in shared memory).

I'm afraid that it can be seen as a violation of modularity. I feel
that wal-emitter side should not be aware of that datail of
archiving. Instead, I would prefer to keep directory scan as far as it
found an smaller segment id than the next-expected segment id ever
archived by the fast-path (if possible).  This would be
less-performant in the case out-of-order segments are frequent but I
think the overall objective of the original patch will be kept.

> Also, I think we need to make sure to set PgArch->dirScan back to true
> at the end of pgarch_readyXlog() unless we've found a new regular WAL
> file that we can use to reset the archiver's stored location.  This
> ensures that we'll keep doing directory scans as long as there are
> timeline/backup history files to process.  

Right.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: .ready and .done files considered harmful

From
Kyotaro Horiguchi
Date:
(sigh..)

At Tue, 24 Aug 2021 11:35:06 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > IIUC partial WAL files are handled because the next file in the
> > sequence with the given TimeLineID won't be there, so we will fall
> > back to a directory scan and pick it up.  Timeline history files are
> > handled by forcing a directory scan, which should work because they
> > always have the highest priority.  Backup history files, however, do
> > not seem to be handled.  I think one approach to fixing that is to
> > also treat backup history files similarly to timeline history files.
> > If one is created, we force a directory scan, and the directory scan
> > logic will consider backup history files as higher priority than
> > everything but timeline history files.
> 
> Backup history files are (currently) just informational and they are
> finally processed at the end of a bulk-archiving performed by the fast
> path.  However, I feel that it is cleaner to trigger a directory scan
> every time we add an other-than-a-regular-WAL-file, as base-backup or
- promotion are not supposed happen so infrequently.
+ promotion are not supposed happen so frequently.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Thanks for the feedback.

> > > IIUC partial WAL files are handled because the next file in the
> > > sequence with the given TimeLineID won't be there, so we will fall
> > > back to a directory scan and pick it up.  Timeline history files are
> > > handled by forcing a directory scan, which should work because they
> > > always have the highest priority.  Backup history files, however, do
> > > not seem to be handled.  I think one approach to fixing that is to
> > > also treat backup history files similarly to timeline history files.
> > > If one is created, we force a directory scan, and the directory scan
> > > logic will consider backup history files as higher priority than
> > > everything but timeline history files.
> >
> > Backup history files are (currently) just informational and they are
> > finally processed at the end of a bulk-archiving performed by the fast
> > path.  However, I feel that it is cleaner to trigger a directory scan
> > every time we add an other-than-a-regular-WAL-file, as base-backup or
> - promotion are not supposed happen so infrequently.
> + promotion are not supposed happen so frequently.

I have incorporated the changes to trigger a directory scan in case of a
backup history file. Also, updated archiver to prioritize archiving a backup
history file over regular WAL files during directory scan to make sure that
backup history file gets archived before the directory scan gets disabled
as part of archiving a regular WAL file.

> > I've been looking at the v9 patch with fresh eyes, and I still think
> > we should be able to force the directory scan as needed in
> > XLogArchiveNotify().  Unless the file to archive is a regular WAL file
> > that is > our stored location in archiver memory, we should force a
> > directory scan.  I think it needs to be > instead of >= because we
> > don't know if the archiver has just completed a directory scan and
> > found a later segment to use to update the archiver state (but hasn't
> > yet updated the state in shared memory).
>
> I'm afraid that it can be seen as a violation of modularity. I feel
> that wal-emitter side should not be aware of that datail of
> archiving. Instead, I would prefer to keep directory scan as far as it
> found an smaller segment id than the next-expected segment id ever
> archived by the fast-path (if possible).  This would be
> less-performant in the case out-of-order segments are frequent but I
> think the overall objective of the original patch will be kept.

Archiver selects the file with lowest segment number as part of directory
scan and the next segment number gets resets based on this file. It starts
a new sequence from here and check the availability of the next file. If
there are holes then it will continue to fall back to directory scan. This will
continue until it finds the next sequence in order. I think this is already
handled unless I am missing something.

> Also, I think we need to make sure to set PgArch->dirScan back to true
> > at the end of pgarch_readyXlog() unless we've found a new regular WAL
> > file that we can use to reset the archiver's stored location.  This
> > ensures that we'll keep doing directory scans as long as there are
> > timeline/backup history files to process. 
>
> Right.

Done.

Please find the attached patch v10.

Thanks,
Dipesh
Attachment

Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 8/24/21, 5:31 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote:
>> > I've been looking at the v9 patch with fresh eyes, and I still think
>> > we should be able to force the directory scan as needed in
>> > XLogArchiveNotify().  Unless the file to archive is a regular WAL file
>> > that is > our stored location in archiver memory, we should force a
>> > directory scan.  I think it needs to be > instead of >= because we
>> > don't know if the archiver has just completed a directory scan and
>> > found a later segment to use to update the archiver state (but hasn't
>> > yet updated the state in shared memory).
>> 
>> I'm afraid that it can be seen as a violation of modularity. I feel
>> that wal-emitter side should not be aware of that datail of
>> archiving. Instead, I would prefer to keep directory scan as far as it
>> found an smaller segment id than the next-expected segment id ever
>> archived by the fast-path (if possible).  This would be
>> less-performant in the case out-of-order segments are frequent but I
>> think the overall objective of the original patch will be kept.
>
> Archiver selects the file with lowest segment number as part of directory 
> scan and the next segment number gets resets based on this file. It starts
> a new sequence from here and check the availability of the next file. If 
> there are holes then it will continue to fall back to directory scan. This will 
> continue until it finds the next sequence in order. I think this is already 
> handled unless I am missing something.

I'm thinking of the following scenario:
        1. Status file 2.ready is created.
        2. Archiver finds 2.ready and uses it to update its state.
        3. Status file 1.ready is created.

At this point, the archiver will look for 3.ready next.  If it finds
3.ready, it'll look for 4.ready.  Let's say it keeps finding status
files up until 1000000.ready.  In this case, the archiver won't go
back and archive segment 1 until we've archived ~1M files.  I'll admit
this is a contrived example, but I think it demonstrates how certain
assumptions could fail with this approach.

I think Horiguchi-san made a good point that the .ready file creators
should ideally not need to understand archiving details.  However, I
think this approach requires them to be inextricably linked.  In the
happy case, the archiver will follow the simple path of processing
each consecutive WAL file without incurring a directory scan.  Any
time there is something other than a regular WAL file to archive, we
need to take special action to make sure it is picked up.

This sort of problem doesn't really show up in the always-use-
directory-scan approaches.  If you imagine the .ready file creators as
throwing status files over a fence at random times and in no
particular order, directory scans are ideal because you are
essentially starting with a clean slate each time.  The logic to
prioritize timeline history files is nice to have, but even if it
wasn't there, the archiver would still pick it up eventually.  IOW
there's no situation (except perhaps infinite timeline history file
generation) that puts us in danger of skipping files indefinitely.
Even if we started creating a completely new type of status file, the
directory scan approaches would probably work without any changes.

Nathan


Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Tue, Aug 24, 2021 at 1:26 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> I think Horiguchi-san made a good point that the .ready file creators
> should ideally not need to understand archiving details.  However, I
> think this approach requires them to be inextricably linked.  In the
> happy case, the archiver will follow the simple path of processing
> each consecutive WAL file without incurring a directory scan.  Any
> time there is something other than a regular WAL file to archive, we
> need to take special action to make sure it is picked up.

I think they should be inextricably linked, really. If we know
something - like that there's a file ready to be archived - then it
seems like we should not throw that information away and force
somebody else to rediscover it through an expensive process. The whole
problem here comes from the fact that we're using the filesystem as an
IPC mechanism, and it's sometimes a very inefficient one.

I can't quite decide whether the problems we're worrying about here
are real issues or just kind of hypothetical. I mean, today, it seems
to be possible that we fail to mark some file ready for archiving,
emit a log message, and then a huge amount of time could go by before
we try again to mark it ready for archiving. Are the problems we're
talking about here objectively worse than that, or just different? Is
it a problem in practice, or just in theory?

I really want to avoid getting backed into a corner where we decide
that the status quo is the best we can do, because I'm pretty sure
that has to be the wrong conclusion. If we think that
get-a-bunch-of-files-per-readdir approach is better than the
keep-trying-the-next-file approach, I mean that's OK with me; I just
want to do something about this. I am not sure whether or not that's
the right course of action.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 8/24/21, 12:09 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> I can't quite decide whether the problems we're worrying about here
> are real issues or just kind of hypothetical. I mean, today, it seems
> to be possible that we fail to mark some file ready for archiving,
> emit a log message, and then a huge amount of time could go by before
> we try again to mark it ready for archiving. Are the problems we're
> talking about here objectively worse than that, or just different? Is
> it a problem in practice, or just in theory?

If a .ready file is created out of order, the directory scan logic
will pick it up about as soon as possible based on its priority.  If
the archiver is keeping up relatively well, there's a good chance such
a file will have the highest archival priority and will be picked up
the next time the archiver looks for a file to archive.  With the
patch proposed in this thread, an out-of-order .ready file has no such
guarantee.  As long as the archiver never has to fall back to a
directory scan, it won't be archived.  The proposed patch handles the
case where RemoveOldXlogFiles() creates missing .ready files by
forcing a directory scan, but I'm not sure this is enough.  I think we
have to check the archiver state each time we create a .ready file to
see whether we're creating one out-of-order.

While this may be an extremely rare problem in practice, archiving
something after the next checkpoint completes seems better than never
archiving it at all.  IMO this isn't an area where there is much space
to take risks.

> I really want to avoid getting backed into a corner where we decide
> that the status quo is the best we can do, because I'm pretty sure
> that has to be the wrong conclusion. If we think that
> get-a-bunch-of-files-per-readdir approach is better than the
> keep-trying-the-next-file approach, I mean that's OK with me; I just
> want to do something about this. I am not sure whether or not that's
> the right course of action.

I certainly think we can do better.  The get-a-bunch-of-files-per-
readdir approach can help us cut down on the directory scans by one or
two orders of magnitude, which is still a huge win.  Plus, such an
approach retains much of the resilience of the current implementation
(although there may be bit more delay for the special cases).

That being said, I still think the keep-trying-the-next-file approach
is worth exploring, but I think it's really important to consider that
there is no guarantee that a directory scan will happen anytime soon.

Nathan


Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
> If a .ready file is created out of order, the directory scan logic
> will pick it up about as soon as possible based on its priority.  If
> the archiver is keeping up relatively well, there's a good chance such
> a file will have the highest archival priority and will be picked up
> the next time the archiver looks for a file to archive.  With the
> patch proposed in this thread, an out-of-order .ready file has no such
> guarantee.  As long as the archiver never has to fall back to a
> directory scan, it won't be archived.  The proposed patch handles the
> case where RemoveOldXlogFiles() creates missing .ready files by
> forcing a directory scan, but I'm not sure this is enough.  I think we
> have to check the archiver state each time we create a .ready file to
> see whether we're creating one out-of-order.

We can handle the scenario where .ready file is created out of order
in XLogArchiveNotify(). This way we can avoid making an explicit call
to enable directory scan from different code paths which may result
into creating an out of order .ready file.

Archiver can store the segment number corresponding to the last or most
recent .ready file found. When a .ready file is created in XLogArchiveNotify(),
the log segment number of the current .ready file can be compared with the
segment number of the last .ready file found at archiver to detect if this file is
created out of order. A directory scan can be forced if required.

I have incorporated these changes in patch v11.

> While this may be an extremely rare problem in practice, archiving
> something after the next checkpoint completes seems better than never
> archiving it at all.  IMO this isn't an area where there is much space
> to take risks.

An alternate approach could be to force a directory scan at checkpoint to
break the infinite wait for a .ready file which is being missed due to the
fact that it is created out of order. This will make sure that the file
gets archived within the checkpoint boundaries.

Thoughts?

Please find attached patch v11.

Thanks,
Dipesh
Attachment

Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 8/25/21, 4:11 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote:
> Please find attached patch v11.

Apologies for the delay.  I still intend to review this.

Nathan


Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 8/25/21, 4:11 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote:
> An alternate approach could be to force a directory scan at checkpoint to
> break the infinite wait for a .ready file which is being missed due to the 
> fact that it is created out of order. This will make sure that the file
> gets archived within the checkpoint boundaries.

I think this is a good idea.

> Please find attached patch v11.

Thanks for the new version of the patch.

+    /*
+     * History files or a .ready file created out of order requires archiver to
+     * perform a full directory scan.
+     */
+    if (IsTLHistoryFileName(xlog) || IsBackupHistoryFileName(xlog) ||
+            fileOutOfOrder)
+        PgArchEnableDirScan();

I think we should force a directory scan for everything that isn't a
regular WAL file.  IOW we can use !IsXLogFileName(xlog) instead of
enumerating all the different kinds of files we might want to archive.

+    /*
+     * Segment number of the most recent .ready file found by archiver,
+     * protected by WALArchiveLock.
+     */
+    XLogSegNo    lastReadySegNo;
 } PgArchData;
 
+/*
+ * Segment number and timeline ID to identify the next file in a WAL sequence
+ */
+typedef struct readyXLogState
+{
+    XLogSegNo    lastSegNo;
+    TimeLineID    lastTLI;
+} readyXLogState;

lastSegNo and lastReadySegNo appear to be the same thing.  Couldn't we
just use the value in PgArchData?

+    return (curSegNo < lastSegNo) ? true : false;

I think this needs to be <=.  If the two values are equal,
pgarch_readyXlog() may have just completed a directory scan and might
be just about to set PgArch->lastSegNo to a greater value.

+    LWLockAcquire(WALArchiveLock, LW_EXCLUSIVE);
+    PgArch->lastReadySegNo = segNo;
+    LWLockRelease(WALArchiveLock);

IMO we should just use a spinlock instead of introducing a new LWLock.
It looks like you really only need the lock for a couple of simple
functions.  I still think protecting PgArch->dirScan with a spinlock
is a good idea, if for no other reason than it makes it easier to
reason about this logic.

+        if (stat(xlogready, &st) == 0)

I think we should ERROR if stat() fails for any other reason than
ENOENT.

+        ishistory = IsTLHistoryFileName(basename) ||
+            IsBackupHistoryFileName(basename);

I suspect we still want to prioritize timeline history files over
backup history files.  TBH I find the logic below this point for
prioritizing history files to be difficult to follow, and I think we
should refactor it into some kind of archive priority comparator
function.

+            /*
+             * Reset the flag only when we found a regular WAL file to make
+             * sure that we are done with processing history files.
+             */
+            PgArch->dirScan = false;

I think we really want to unset dirScan before we start the directory
scan, and then we set it to true afterwards if we didn't find a
regular WAL file.  If someone requests a directory scan in the middle
of an ongoing directory scan, we don't want to lose that request.

I attached two patches that demonstrate what I'm thinking this change
should look like.  One is my take on the keep-trying-the-next-file
approach, and the other is a new version of the multiple-files-per-
readdir approach (with handling for "cheating" archive commands).  I
personally feel that the multiple-files-per-readdir approach winds up
being a bit cleaner and more resilient than the keep-trying-the-next-
file approach.  However, the keep-trying-the-next-file approach will
certainly be more efficient (especially for the extreme cases
discussed in this thread), and I don't have any concrete concerns with
this approach that seem impossible to handle.

Nathan


Attachment

Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Hi,

Thanks for the feedback.

> I attached two patches that demonstrate what I'm thinking this change
> should look like.  One is my take on the keep-trying-the-next-file
> approach, and the other is a new version of the multiple-files-per-
> readdir approach (with handling for "cheating" archive commands).  I
> personally feel that the multiple-files-per-readdir approach winds up
> being a bit cleaner and more resilient than the keep-trying-the-next-
> file approach.  However, the keep-trying-the-next-file approach will
> certainly be more efficient (especially for the extreme cases
> discussed in this thread), and I don't have any concrete concerns with
> this approach that seem impossible to handle.

I agree that multiple-files-pre-readdir is cleaner and has the resilience of the
current implementation. However, I have a few suggestion on keep-trying-the
-next-file approach patch shared in previous thread.

+   /* force directory scan the first time we call pgarch_readyXlog() */
+   PgArchForceDirScan();
+

We should not force a directory in pgarch_ArchiverCopyLoop(). This gets called
whenever archiver wakes up from the wait state. This will result in a
situation where the archiver performs a full directory scan despite having the
accurate information about the next anticipated log segment.
Instead we can check if lastSegNo is initialized and continue directory scan
until it gets initialized in pgarch_readyXlog().

+           return lastSegNo;
We should return "true" here.

I am thinking if we can add a log message for files which are
archived as part of directory scan. This will be useful for diagnostic purpose
to check if desired files gets archived as part of directory scan in special
scenarios. I also think that we should add a few comments in pgarch_readyXlog().

I have incorporated these changes and attached a patch
v1-0001-keep-trying-the-next-file-approach.patch.

+       /*
+        * We must use <= because the archiver may have just completed a
+        * directory scan and found a later segment (but hasn't updated
+        * shared memory yet).
+        */
+       if (this_segno <= arch_segno)
+           PgArchForceDirScan();

I still think that we should use '<' operator here because
arch_segno represents the segment number of the most recent
.ready file found by the archiver. This gets updated in shared
memory only if archiver has successfully found a .ready file.
In a normal scenario this_segno will be greater than arch_segno
whereas in cases where a .ready file is created out of order
this_segno may be less than arch_segno. I am wondering
if there is a scenario where arch_segno is equal to this_segno
unless I am missing something.

Thanks,
Dipesh
Attachment

Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 9/2/21, 6:22 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote:
> I agree that multiple-files-pre-readdir is cleaner and has the resilience of the
> current implementation. However, I have a few suggestion on keep-trying-the
> -next-file approach patch shared in previous thread.

Which approach do you think we should use?  I think we have decent
patches for both approaches at this point, so perhaps we should see if
we can get some additional feedback from the community on which one we
should pursue further.

> +   /* force directory scan the first time we call pgarch_readyXlog() */
> +   PgArchForceDirScan();
> +
>
> We should not force a directory in pgarch_ArchiverCopyLoop(). This gets called
> whenever archiver wakes up from the wait state. This will result in a
> situation where the archiver performs a full directory scan despite having the
> accurate information about the next anticipated log segment. 
> Instead we can check if lastSegNo is initialized and continue directory scan 
> until it gets initialized in pgarch_readyXlog().

The problem I see with this is that pgarch_archiveXlog() might end up
failing.  If it does, we won't retry archiving the file until we do a
directory scan.  I think we could try to avoid forcing a directory
scan outside of these failure cases and archiver startup, but I'm not
sure it's really worth it.  When pgarch_readyXlog() returns false, it
most likely means that there are no .ready files present, so I'm not
sure we are gaining a whole lot by avoiding a directory scan in that
case.  I guess it might help a bit if there are a ton of .done files,
though.

> +           return lastSegNo;
> We should return "true" here.

Oops.  Good catch.

> I am thinking if we can add a log message for files which are 
> archived as part of directory scan. This will be useful for diagnostic purpose
> to check if desired files gets archived as part of directory scan in special 
> scenarios. I also think that we should add a few comments in pgarch_readyXlog().

I agree, but it should probably be something like DEBUG3 instead of
LOG.

> +       /*
> +        * We must use <= because the archiver may have just completed a
> +        * directory scan and found a later segment (but hasn't updated
> +        * shared memory yet).
> +        */
> +       if (this_segno <= arch_segno)
> +           PgArchForceDirScan();
>
> I still think that we should use '<' operator here because
> arch_segno represents the segment number of the most recent
> .ready file found by the archiver. This gets updated in shared 
> memory only if archiver has successfully found a .ready file.
> In a normal scenario this_segno will be greater than arch_segno 
> whereas in cases where a .ready file is created out of order 
> this_segno may be less than arch_segno. I am wondering
> if there is a scenario where arch_segno is equal to this_segno
> unless I am missing something.

The pg_readyXlog() logic looks a bit like this:

        1. Try to skip directory scan.  If that succeeds, we're done.
        2. Do a directory scan.
        3. If we found a regular WAL file, update PgArch and return
           what we found.

Let's say step 1 looks for WAL file 10, but 10.ready doesn't exist
yet.  The following directory scan ends up finding 11.ready.  Just
before we update the PgArch state, XLogArchiveNotify() is called and
creates 10.ready.  However, pg_readyXlog() has already decided to
return WAL segment 11 and update the state to look for 12 next.  If we
just used '<', we won't force a directory scan, and segment 10 will
not be archived until the next one happens.  If we use '<=', I don't
think we have the same problem.

I've also thought about another similar scenario.  Let's say step 1
looks for WAL file 10, but it doesn't exist yet (just like the
previous example).  The following directory scan ends up finding
12.ready, but just before we update PgArch, we create 11.ready.  In
this case, we'll still skip forcing a directory scan until 10.ready is
created later on.  I believe it all eventually works out as long as we
can safely assume that all files that should have .ready files will
eventually get them.

Nathan


Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Hi,

Thanks for the feedback.

> Which approach do you think we should use?  I think we have decent
> patches for both approaches at this point, so perhaps we should see if
> we can get some additional feedback from the community on which one we
> should pursue further.

In my opinion both the approaches have benefits over current implementation.
I think in keep-trying-the-next-file approach we have handled all rare and specific
scenarios which requires us to force a directory scan to archive the desired files.
In addition to this with the recent change to force a directory scan at checkpoint
we can avoid an infinite wait for a file which is still being missed out despite
handling the special scenarios. It is also more efficient in extreme scenarios
as discussed in this thread. However, multiple-files-per-readdir approach is
cleaner with resilience of current implementation.

I agree that we should decide on which approach to pursue further based on
additional feedback from the community.

> The problem I see with this is that pgarch_archiveXlog() might end up
> failing.  If it does, we won't retry archiving the file until we do a
> directory scan.  I think we could try to avoid forcing a directory
> scan outside of these failure cases and archiver startup, but I'm not
> sure it's really worth it.  When pgarch_readyXlog() returns false, it
> most likely means that there are no .ready files present, so I'm not
> sure we are gaining a whole lot by avoiding a directory scan in that
> case.  I guess it might help a bit if there are a ton of .done files,
> though.

Yes, I think it will be useful when we have a bunch of .done files and
the frequency of .ready files is such that the archiver goes to wait
state before the next WAL file is ready for archival.

> I agree, but it should probably be something like DEBUG3 instead of
> LOG.

I will update it in the next patch.

Thanks,
Dipesh

Re: .ready and .done files considered harmful

From
Kyotaro Horiguchi
Date:
At Fri, 3 Sep 2021 18:31:46 +0530, Dipesh Pandit <dipesh.pandit@gmail.com> wrote in 
> Hi,
> 
> Thanks for the feedback.
> 
> > Which approach do you think we should use?  I think we have decent
> > patches for both approaches at this point, so perhaps we should see if
> > we can get some additional feedback from the community on which one we
> > should pursue further.
> 
> In my opinion both the approaches have benefits over current implementation.
> I think in keep-trying-the-next-file approach we have handled all rare and
> specific
> scenarios which requires us to force a directory scan to archive the
> desired files.
> In addition to this with the recent change to force a directory scan at
> checkpoint
> we can avoid an infinite wait for a file which is still being missed out
> despite
> handling the special scenarios. It is also more efficient in extreme
> scenarios
> as discussed in this thread. However, multiple-files-per-readdir approach
> is
> cleaner with resilience of current implementation.
> 
> I agree that we should decide on which approach to pursue further based on
> additional feedback from the community.


I was thinking that the multple-files approch would work efficiently
but the the patch still runs directory scans every 64 files.  As
Robert mentioned it is still O(N^2).  I'm not sure the reason for the
limit, but if it were to lower memory consumption or the cost to sort,
we can resolve that issue by taking trying-the-next approach ignoring
the case of having many gaps (discussed below). If it were to cause
voluntary checking of out-of-order files, almost the same can be
achieved by running directory scans every 64 files in the
trying-the-next approach (and we would suffer O(N^2) again).  On the
other hand, if archiving is delayed by several segments, the
multiple-files method might reduce the cost to scan the status
directory but it won't matter since the directory contains only
several files.  (I think that it might be better that we don't go to
trying-the-next path if we found only several files by running a
directory scan.)  The multiple-files approach reduces the number of
directory scans if there were many gaps in the WAL file
sequence. Alghouth theoretically the last max_backend(+alpha?)
segemnts could be written out-of-order, but I suppose we only have
gaps only among the several latest files in reality. I'm not sure,
though..

In short, the trying-the-next approach seems to me to be the way to
go, for the reason that it is simpler but it can cover the possible
failures by almost the same measures with the muliple-files approach.

> > The problem I see with this is that pgarch_archiveXlog() might end up
> > failing.  If it does, we won't retry archiving the file until we do a
> > directory scan.  I think we could try to avoid forcing a directory
> > scan outside of these failure cases and archiver startup, but I'm not
> > sure it's really worth it.  When pgarch_readyXlog() returns false, it
> > most likely means that there are no .ready files present, so I'm not
> > sure we are gaining a whole lot by avoiding a directory scan in that
> > case.  I guess it might help a bit if there are a ton of .done files,
> > though.
> 
> Yes, I think it will be useful when we have a bunch of .done files and
> the frequency of .ready files is such that the archiver goes to wait
> state before the next WAL file is ready for archival.
> 
> > I agree, but it should probably be something like DEBUG3 instead of
> > LOG.
> 
> I will update it in the next patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 9/7/21, 1:42 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> I was thinking that the multple-files approch would work efficiently
> but the the patch still runs directory scans every 64 files.  As
> Robert mentioned it is still O(N^2).  I'm not sure the reason for the
> limit, but if it were to lower memory consumption or the cost to sort,
> we can resolve that issue by taking trying-the-next approach ignoring
> the case of having many gaps (discussed below). If it were to cause
> voluntary checking of out-of-order files, almost the same can be
> achieved by running directory scans every 64 files in the
> trying-the-next approach (and we would suffer O(N^2) again).  On the
> other hand, if archiving is delayed by several segments, the
> multiple-files method might reduce the cost to scan the status
> directory but it won't matter since the directory contains only
> several files.  (I think that it might be better that we don't go to
> trying-the-next path if we found only several files by running a
> directory scan.)  The multiple-files approach reduces the number of
> directory scans if there were many gaps in the WAL file
> sequence. Alghouth theoretically the last max_backend(+alpha?)
> segemnts could be written out-of-order, but I suppose we only have
> gaps only among the several latest files in reality. I'm not sure,
> though..
>
> In short, the trying-the-next approach seems to me to be the way to
> go, for the reason that it is simpler but it can cover the possible
> failures by almost the same measures with the muliple-files approach.

Thanks for chiming in.  The limit of 64 in the multiple-files-per-
directory-scan approach was mostly arbitrary.  My earlier testing [0]
with different limits didn't reveal any significant difference, but
using a higher limit might yield a small improvement when there are
several hundred thousand .ready files.  IMO increasing the limit isn't
really worth it for this approach.  For 500,000 .ready files,
ordinarily you'd need 500,000 directory scans.  When 64 files are
archived for each directory scan, you need ~8,000 directory scans.
With 128 files per directory scan, you need ~4,000.  With 256, you
need ~2000.  The difference between 8,000 directory scans and 500,000
is quite significant.  The difference between 2,000 and 8,000 isn't
nearly as significant in comparison.

Nathan

[0] https://www.postgresql.org/message-id/3ECC212F-88FD-4FB2-BAF1-C2DD1563E310%40amazon.com


Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Tue, Sep 7, 2021 at 1:28 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> Thanks for chiming in.  The limit of 64 in the multiple-files-per-
> directory-scan approach was mostly arbitrary.  My earlier testing [0]
> with different limits didn't reveal any significant difference, but
> using a higher limit might yield a small improvement when there are
> several hundred thousand .ready files.  IMO increasing the limit isn't
> really worth it for this approach.  For 500,000 .ready files,
> ordinarily you'd need 500,000 directory scans.  When 64 files are
> archived for each directory scan, you need ~8,000 directory scans.
> With 128 files per directory scan, you need ~4,000.  With 256, you
> need ~2000.  The difference between 8,000 directory scans and 500,000
> is quite significant.  The difference between 2,000 and 8,000 isn't
> nearly as significant in comparison.

That's certainly true.

I guess what I don't understand about the multiple-files-per-dirctory
scan implementation is what happens when something happens that would
require the keep-trying-the-next-file approach to perform a forced
scan. It seems to me that you still need to force an immediate full
scan, because if the idea is that you want to, say, prioritize
archiving of new timeline files over any others, a cached list of
files that you should archive next doesn't accomplish that, just like
keeping on trying the next file in sequence doesn't accomplish that.

So I'm wondering if in the end the two approaches converge somewhat,
so that with either patch you get (1) some kind of optimization to
scan the directory less often, plus (2) some kind of notification
mechanism to tell you when you need to avoid applying that
optimization. If you wanted to, (1) could even include both batching
and then, when the batch is exhausted, trying files in sequence. I'm
not saying that's the way to go, but you could. In the end, it seems
less important that we do any particular thing here and more important
that we do something - but if prioritizing timeline history files is
important, then we have to preserve that behavior.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 9/7/21, 10:54 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> I guess what I don't understand about the multiple-files-per-dirctory
> scan implementation is what happens when something happens that would
> require the keep-trying-the-next-file approach to perform a forced
> scan. It seems to me that you still need to force an immediate full
> scan, because if the idea is that you want to, say, prioritize
> archiving of new timeline files over any others, a cached list of
> files that you should archive next doesn't accomplish that, just like
> keeping on trying the next file in sequence doesn't accomplish that.

Right.  The latest patch for that approach [0] does just that.  In
fact, I think timeline files are the only files for which we need to
force an immediate directory scan in the multiple-files-per-scan
approach.  For the keep-trying-the-next-file approach, we have to
force a directory scan for anything but a regular WAL file that is
ahead of our archiver state.

> So I'm wondering if in the end the two approaches converge somewhat,
> so that with either patch you get (1) some kind of optimization to
> scan the directory less often, plus (2) some kind of notification
> mechanism to tell you when you need to avoid applying that
> optimization. If you wanted to, (1) could even include both batching
> and then, when the batch is exhausted, trying files in sequence. I'm
> not saying that's the way to go, but you could. In the end, it seems
> less important that we do any particular thing here and more important
> that we do something - but if prioritizing timeline history files is
> important, then we have to preserve that behavior.

Yeah, I would agree that the approaches basically converge into some
form of "do fewer directory scans."

Nathan

[0]
https://www.postgresql.org/message-id/attachment/125980/0001-Improve-performance-of-pgarch_readyXlog-with-many-st.patch


Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Tue, Sep 7, 2021 at 2:13 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> Right.  The latest patch for that approach [0] does just that.  In
> fact, I think timeline files are the only files for which we need to
> force an immediate directory scan in the multiple-files-per-scan
> approach.  For the keep-trying-the-next-file approach, we have to
> force a directory scan for anything but a regular WAL file that is
> ahead of our archiver state.

Yeah, that makes sense.

> Yeah, I would agree that the approaches basically converge into some
> form of "do fewer directory scans."

I guess we still have to pick one or the other, but I don't really
know how to do that, since both methods seem to be relatively fine,
and the scenarios where one is better than the other all feel a little
bit contrived. I guess if no clear consensus emerges in the next week
or so, I'll just pick one and commit it. Not quite sure yet how I'll
do the picking, but we seem to all agree that something is better than
nothing, so hopefully nobody will be too sad if I make an arbitrary
decision. And if some clear agreement emerges before then, even
better.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 9/7/21, 11:31 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> I guess we still have to pick one or the other, but I don't really
> know how to do that, since both methods seem to be relatively fine,
> and the scenarios where one is better than the other all feel a little
> bit contrived. I guess if no clear consensus emerges in the next week
> or so, I'll just pick one and commit it. Not quite sure yet how I'll
> do the picking, but we seem to all agree that something is better than
> nothing, so hopefully nobody will be too sad if I make an arbitrary
> decision. And if some clear agreement emerges before then, even
> better.

I will be happy to see this fixed either way.

Nathan


Re: .ready and .done files considered harmful

From
Kyotaro Horiguchi
Date:
At Tue, 7 Sep 2021 18:40:24 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 9/7/21, 11:31 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> > I guess we still have to pick one or the other, but I don't really
> > know how to do that, since both methods seem to be relatively fine,
> > and the scenarios where one is better than the other all feel a little
> > bit contrived. I guess if no clear consensus emerges in the next week
> > or so, I'll just pick one and commit it. Not quite sure yet how I'll
> > do the picking, but we seem to all agree that something is better than
> > nothing, so hopefully nobody will be too sad if I make an arbitrary
> > decision. And if some clear agreement emerges before then, even
> > better.
> 
> I will be happy to see this fixed either way.

+1.  I agree about the estimation on performance gain to Nathan.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
> > I guess we still have to pick one or the other, but I don't really
> > know how to do that, since both methods seem to be relatively fine,
> > and the scenarios where one is better than the other all feel a little
> > bit contrived. I guess if no clear consensus emerges in the next week
> > or so, I'll just pick one and commit it. Not quite sure yet how I'll
> > do the picking, but we seem to all agree that something is better than
> > nothing, so hopefully nobody will be too sad if I make an arbitrary
> > decision. And if some clear agreement emerges before then, even
> > better.
>
> I will be happy to see this fixed either way.

+1

> > I agree, but it should probably be something like DEBUG3 instead of
> > LOG.
>
> I will update it in the next patch.

Updated log level to DEBUG3 and rebased the patch. PFA patch.

Thanks,
Dipesh
Attachment

Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 9/8/21, 10:49 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote:
> Updated log level to DEBUG3 and rebased the patch. PFA patch.

Thanks for the new patch.

+ * by checking the availability of next WAL file. "xlogState" specifies the
+ * segment number and timeline ID corresponding to the next WAL file.

"xlogState" probably needs to be updated here.

As noted before [0], I think we need to force a directory scan at the
beginning of pgarch_MainLoop() and when pgarch_ArchiverCopyLoop()
returns before we exit the "while" loop.  Else, there's probably a
risk that we skip archiving a file until the next directory scan.  IMO
forcing a directory scan at the beginning of pgarch_ArchiverCopyLoop()
is a simpler way to do roughly the same thing.  I'm skeptical that
persisting the next-anticipated state between calls to
pgarch_ArchiverCopyLoop() is worth the complexity.

Nathan

[0] https://www.postgresql.org/message-id/AC78607B-9DA6-41F4-B253-840D3DD964BF%40amazon.com


Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Hi,

Thanks for the feedback.

> + * by checking the availability of next WAL file. "xlogState" specifies the
> + * segment number and timeline ID corresponding to the next WAL file.
>
> "xlogState" probably needs to be updated here.

Yes, I updated the comment.

> As noted before [0], I think we need to force a directory scan at the
> beginning of pgarch_MainLoop() and when pgarch_ArchiverCopyLoop()
> returns before we exit the "while" loop.  Else, there's probably a
> risk that we skip archiving a file until the next directory scan.  IMO
> forcing a directory scan at the beginning of pgarch_ArchiverCopyLoop()
> is a simpler way to do roughly the same thing.  I'm skeptical that
> persisting the next-anticipated state between calls to
> pgarch_ArchiverCopyLoop() is worth the complexity.

I think if we force a directory scan in pgarch_ArchiverCopyLoop() when it
returns before we exit the "while" loop or outside the loop then it may
result in directory scan for all WAL files in one of the scenarios that I can think of.

There could be two possible scenarios, first scenario in which the archiver
is always lagging and the second scenario in which archiver is in sync or
ahead with the rate at which WAL files are generated.

If we focus on the second scenario, then consider a case where the archiver has
just archived file 1.ready and is about to check the availability of 2.ready but the
file 2.ready is not available in archive status directory. Archiver performs a directory
scan as a fall-back mechanism and goes to wait state.(The current implementation
relies on notifying the archiver by creating a .ready file on disk. It may happen that
the file is ready file archival but due to slow notification mechanism there is a delay
in notification and archiver goes to wait state.) When file 2.ready is created on disk
archive is notified, it wakes up and calls pgarch_ArchiverCopyLoop(). Now if we
unconditionally force a directory scan in pgarch_ArchiverCopyLoop() then it may
result in directory scan for all WAL files in this scenario. In this case we have the
next anticipated log segment number and we can prevent an additional directory
scan. I have tested this with a small setup by creating ~2000 WAL files and it has
resulted in directory scan for each file.

I agree that the the failure scenario discussed in [0] will require a WAL file to
wait until the next directory scan. However, this can be avoided by forcing a
directory scan in pgarch_ArchiverCopyLoop() only in case of failure scenario.
This will make sure that when the archiver wakes up for the next cycle it
performs a full directory leaving out any risk of missing a file due to archive
failure. Additionally, it will also avoid additional directory scans mentioned in
above scenario.

I have incorporated the changes and updated a new patch. PFA patch.

Thanks,
Dipesh
Attachment

Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Thu, Sep 2, 2021 at 5:52 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> The pg_readyXlog() logic looks a bit like this:
>
>         1. Try to skip directory scan.  If that succeeds, we're done.
>         2. Do a directory scan.
>         3. If we found a regular WAL file, update PgArch and return
>            what we found.
>
> Let's say step 1 looks for WAL file 10, but 10.ready doesn't exist
> yet.  The following directory scan ends up finding 11.ready.  Just
> before we update the PgArch state, XLogArchiveNotify() is called and
> creates 10.ready.  However, pg_readyXlog() has already decided to
> return WAL segment 11 and update the state to look for 12 next.  If we
> just used '<', we won't force a directory scan, and segment 10 will
> not be archived until the next one happens.  If we use '<=', I don't
> think we have the same problem.

The latest post on this thread contained a link to this one, and it
made me want to rewind to this point in the discussion. Suppose we
have the following alternative scenario:

Let's say step 1 looks for WAL file 10, but 10.ready doesn't exist
yet.  The following directory scan ends up finding 12.ready.  Just
before we update the PgArch state, XLogArchiveNotify() is called and
creates 11.ready.  However, pg_readyXlog() has already decided to
return WAL segment 12 and update the state to look for 13 next.

Now, if I'm not mistaken, using <= doesn't help at all.

In my opinion, the problem here is that the natural way to ask "is
this file being archived out of order?" is to ask yourself "is the
file that I'm marking as ready for archiving now the one that
immediately follows the last one I marked as ready for archiving?" and
then invert the result. That is, if I last marked 10 as ready, and now
I'm marking 11 as ready, then it's in order, but if I'm now marking
anything else whatsoever, then it's out of order. But that's not what
this does. Instead of comparing what it's doing now to what it did
last, it compares what it did now to what the archiver did last.

And it's really not obvious that that's correct. I think that the
above argument actually demonstrates a flaw in the logic, but even if
not, or even if it's too small a flaw to be a problem in practice, it
seems a lot harder to reason about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 9/13/21, 1:14 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Thu, Sep 2, 2021 at 5:52 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>> Let's say step 1 looks for WAL file 10, but 10.ready doesn't exist
>> yet.  The following directory scan ends up finding 11.ready.  Just
>> before we update the PgArch state, XLogArchiveNotify() is called and
>> creates 10.ready.  However, pg_readyXlog() has already decided to
>> return WAL segment 11 and update the state to look for 12 next.  If we
>> just used '<', we won't force a directory scan, and segment 10 will
>> not be archived until the next one happens.  If we use '<=', I don't
>> think we have the same problem.
>
> The latest post on this thread contained a link to this one, and it
> made me want to rewind to this point in the discussion. Suppose we
> have the following alternative scenario:
>
> Let's say step 1 looks for WAL file 10, but 10.ready doesn't exist
> yet.  The following directory scan ends up finding 12.ready.  Just
> before we update the PgArch state, XLogArchiveNotify() is called and
> creates 11.ready.  However, pg_readyXlog() has already decided to
> return WAL segment 12 and update the state to look for 13 next.
>
> Now, if I'm not mistaken, using <= doesn't help at all.

I think this is the scenario I was trying to touch on in the paragraph
immediately following the one you mentioned.  My theory was that we'll
still skip forcing a directory scan until 10.ready is created, so it
would eventually work out as long as we can safely assume that all
.ready files that should be created eventually will be.  Thinking
further, I don't think that's right.  We might've already renamed
10.ready to 10.done and removed it long ago, so there's a chance that
we wouldn't go back and pick up 11.ready until one of our "fallback"
directory scans forced by the checkpointer.  So, yes, I think you are
right.

> In my opinion, the problem here is that the natural way to ask "is
> this file being archived out of order?" is to ask yourself "is the
> file that I'm marking as ready for archiving now the one that
> immediately follows the last one I marked as ready for archiving?" and
> then invert the result. That is, if I last marked 10 as ready, and now
> I'm marking 11 as ready, then it's in order, but if I'm now marking
> anything else whatsoever, then it's out of order. But that's not what
> this does. Instead of comparing what it's doing now to what it did
> last, it compares what it did now to what the archiver did last.
>
> And it's really not obvious that that's correct. I think that the
> above argument actually demonstrates a flaw in the logic, but even if
> not, or even if it's too small a flaw to be a problem in practice, it
> seems a lot harder to reason about.

I certainly agree that it's harder to reason about.  If we were to go
the keep-trying-the-next-file route, we could probably minimize a lot
of the handling for these rare cases by banking on the "fallback"
directory scans.  Provided we believe these situations are extremely
rare, some extra delay for an archive every once in a while might be
acceptable.

Nathan


Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Thanks for the feedback.

> The latest post on this thread contained a link to this one, and it
> made me want to rewind to this point in the discussion. Suppose we
> have the following alternative scenario:
>
> Let's say step 1 looks for WAL file 10, but 10.ready doesn't exist
> yet.  The following directory scan ends up finding 12.ready.  Just
> before we update the PgArch state, XLogArchiveNotify() is called and
> creates 11.ready.  However, pg_readyXlog() has already decided to
> return WAL segment 12 and update the state to look for 13 next.
>
> Now, if I'm not mistaken, using <= doesn't help at all.
>
> In my opinion, the problem here is that the natural way to ask "is
> this file being archived out of order?" is to ask yourself "is the
> file that I'm marking as ready for archiving now the one that
> immediately follows the last one I marked as ready for archiving?" and
> then invert the result. That is, if I last marked 10 as ready, and now
> I'm marking 11 as ready, then it's in order, but if I'm now marking
> anything else whatsoever, then it's out of order. But that's not what
> this does. Instead of comparing what it's doing now to what it did
> last, it compares what it did now to what the archiver did last.

I agree that when we are creating a .ready file we should compare
the current .ready file with the last .ready file to check if this file is
created out of order. We can store the state of the last .ready file
in shared memory and compare it with the current .ready file. I
believe that archiver specific shared memory area can be used
to store the state of the last .ready file unless I am missing
something and this needs to be stored in a separate shared
memory area.

With this change, we have the flexibility to move the current archiver
state out of shared memory and keep it local to archiver. I have
incorporated these changes and updated a new patch.


> > And it's really not obvious that that's correct. I think that the
> > above argument actually demonstrates a flaw in the logic, but even if
> > not, or even if it's too small a flaw to be a problem in practice, it
> > seems a lot harder to reason about.
>
> I certainly agree that it's harder to reason about.  If we were to go
> the keep-trying-the-next-file route, we could probably minimize a lot
> of the handling for these rare cases by banking on the "fallback"
> directory scans.  Provided we believe these situations are extremely
> rare, some extra delay for an archive every once in a while might be
> acceptable.

+1. We are forcing a directory scan at the checkpoint and it will make sure
that any missing file gets archived within the checkpoint boundaries.

Please find the attached patch.

Thanks,
Dipesh
Attachment

Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 9/14/21, 7:23 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote:
> I agree that when we are creating a .ready file we should compare 
> the current .ready file with the last .ready file to check if this file is 
> created out of order. We can store the state of the last .ready file 
> in shared memory and compare it with the current .ready file. I
> believe that archiver specific shared memory area can be used
> to store the state of the last .ready file unless I am missing
> something and this needs to be stored in a separate shared
> memory area.
>
> With this change, we have the flexibility to move the current archiver
> state out of shared memory and keep it local to archiver. I have 
> incorporated these changes and updated a new patch.

I wonder if this can be simplified even further.  If we don't bother
trying to catch out-of-order .ready files in XLogArchiveNotify() and
just depend on the per-checkpoint/restartpoint directory scans, we can
probably remove lastReadySegNo from archiver state completely.

+    /* Force a directory scan if we are archiving anything but a regular
+     * WAL file or if this WAL file is being created out-of-order.
+     */
+    if (!IsXLogFileName(xlog))
+        PgArchForceDirScan();
+    else
+    {
+        TimeLineID tli;
+        XLogSegNo last_segno;
+        XLogSegNo this_segno;
+
+        last_segno = PgArchGetLastReadySegNo();
+        XLogFromFileName(xlog, &tli, &this_segno, wal_segment_size);
+
+        /*
+         * Force a directory scan in case if this .ready file created out of
+         * order.
+         */
+        last_segno++;
+        if (last_segno != this_segno)
+            PgArchForceDirScan();
+
+        PgArchSetLastReadySegNo(this_segno);
+    }

This is an interesting idea, but the "else" block here seems prone to
race conditions.  I think we'd have to hold arch_lck to prevent that.
But as I mentioned above, if we are okay with depending on the
fallback directory scans, I think we can remove the "else" block
completely.

+    /* Initialize the current state of archiver */
+    xlogState.lastSegNo = MaxXLogSegNo;
+    xlogState.lastTli = MaxTimeLineID;

It looks like we have two ways to force a directory scan.  We can
either set force_dir_scan to true, or lastSegNo can be set to
MaxXLogSegNo.  Why not just set force_dir_scan to true here so that we
only have one way to force a directory scan?

+                    /*
+                     * Failed to archive, make sure that archiver performs a
+                     * full directory scan in the next cycle to avoid missing
+                     * the WAL file which could not be archived due to some
+                     * failure in current cycle.
+                     */
+                    PgArchForceDirScan();

Don't we also need to force a directory scan in the other cases we
return early from pgarch_ArchiverCopyLoop()?  We will have already
advanced the archiver state in pgarch_readyXlog(), so I think we'd end
up skipping files if we didn't.  For example, if archive_command isn't
set, we'll just return, and the next call to pgarch_readyXlog() might
return the next file.

+            /* Continue directory scan until we find a regular WAL file */
+            SpinLockAcquire(&PgArch->arch_lck);
+            PgArch->force_dir_scan = true;
+            SpinLockRelease(&PgArch->arch_lck);

nitpick: I think we should just call PgArchForceDirScan() here.

Nathan


Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 9/14/21, 9:18 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> This is an interesting idea, but the "else" block here seems prone to
> race conditions.  I think we'd have to hold arch_lck to prevent that.
> But as I mentioned above, if we are okay with depending on the
> fallback directory scans, I think we can remove the "else" block
> completely.

Thinking further, we probably need to hold a lock even when we are
creating the .ready file to avoid race conditions.

Nathan


Re: .ready and .done files considered harmful

From
Kyotaro Horiguchi
Date:
At Tue, 14 Sep 2021 18:07:31 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 9/14/21, 9:18 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> > This is an interesting idea, but the "else" block here seems prone to
> > race conditions.  I think we'd have to hold arch_lck to prevent that.
> > But as I mentioned above, if we are okay with depending on the
> > fallback directory scans, I think we can remove the "else" block
> > completely.
> 
> Thinking further, we probably need to hold a lock even when we are
> creating the .ready file to avoid race conditions.

The race condition surely happens, but even if that happens, all
competing processes except one of them detect out-of-order and will
enforce directory scan.  But I'm not sure how it behaves under more
complex situation so I'm not sure I like that behavior.

We could just use another lock for the logic there, but instead
couldn't we merge PgArchGetLastReadySegNo and PgArchSetLastReadySegNo
into one atomic test-and-(check-and-)set function?  Like this.

====
   XLogFromFileName(xlog, &tli, &this_segno, wal_segment_size);
   if (!PgArchReadySegIsInOrder(this_segno))
      PgArchForceDirScan();

bool
PgArchReadySegIsInOrder(XLogSegNo this_segno)
{
    bool in_order = true;

    SpinLockAcquire(&PgArch->arch_lck);
    if (PgArch->lastReadySegNo + 1 != this_segno)
       in_order = false;
    PgArch->lastReadySegNo = this_segno;
    SpinLockRelease(&PgArch->arch_lck);

    return in_order;
}
====

By the way, it seems to me that we only need to force directory scan
when notification seg number steps back.  If this is correct, we can
reduce the number how many times we enforce directory scans.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Hi,

Thanks for the feedback.

> I wonder if this can be simplified even further.  If we don't bother
> trying to catch out-of-order .ready files in XLogArchiveNotify() and
> just depend on the per-checkpoint/restartpoint directory scans, we can
> probably remove lastReadySegNo from archiver state completely.

If we agree that some extra delay in archiving these files is acceptable
then we don't require any special handling for this scenario otherwise
we may need to handle it separately.

> +       /* Initialize the current state of archiver */
> +       xlogState.lastSegNo = MaxXLogSegNo;
> +       xlogState.lastTli = MaxTimeLineID;
>
> It looks like we have two ways to force a directory scan.  We can
> either set force_dir_scan to true, or lastSegNo can be set to
> MaxXLogSegNo.  Why not just set force_dir_scan to true here so that we
> only have one way to force a directory scan?

make sense, I have updated it.

> Don't we also need to force a directory scan in the other cases we
> return early from pgarch_ArchiverCopyLoop()?  We will have already
> advanced the archiver state in pgarch_readyXlog(), so I think we'd end
> up skipping files if we didn't.  For example, if archive_command isn't
> set, we'll just return, and the next call to pgarch_readyXlog() might
> return the next file.

I agree, we should do it for all early return paths.

> nitpick: I think we should just call PgArchForceDirScan() here.

Yes, that's right.

> > > This is an interesting idea, but the "else" block here seems prone to
> > > race conditions.  I think we'd have to hold arch_lck to prevent that.
> > > But as I mentioned above, if we are okay with depending on the
> > > fallback directory scans, I think we can remove the "else" block
> > > completely.

Ohh I didn't realize the race condition here. The competing processes
can read the same value of lastReadySegNo.
 
> > Thinking further, we probably need to hold a lock even when we are
> > creating the .ready file to avoid race conditions.
>
> The race condition surely happens, but even if that happens, all
> competing processes except one of them detect out-of-order and will
> enforce directory scan.  But I'm not sure how it behaves under more
> complex situation so I'm not sure I like that behavior.
>
> We could just use another lock for the logic there, but instead
> couldn't we merge PgArchGetLastReadySegNo and PgArchSetLastReadySegNo
> into one atomic test-and-(check-and-)set function?  Like this.

I agree that we can merge the existing "Get" and "Set" functions into
an atomic test-and-check-and-set function to avoid a race condition.

I have incorporated these changes and updated a new patch. PFA patch.

Thanks,
Dipesh
Attachment

Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 9/15/21, 4:28 AM, "Dipesh Pandit" <dipesh.pandit@gmail.com> wrote:
> I have incorporated these changes and updated a new patch. PFA patch.

Thanks for the new patch.

I've attached my take on the latest version.  My main focus this time
was simplifying the patch for this approach as much as possible.
Specifically, I've done the following:

  1. I've removed several calls to PgArchForceDirScan() in favor of
     calling it at the top of pgarch_ArchiverCopyLoop().  I believe
     there is some disagreement about this change, but I don't think
     we gain enough to justify the complexity.  The main reason we
     exit pgarch_ArchiverCopyLoop() should ordinarily be that we've
     run out of files to archive, so incurring a directory scan the
     next time it is called doesn't seem like it would normally be too
     bad.  I'm sure there are exceptions (e.g., lots of .done files,
     archive failures), but the patch is still not making things any
     worse than they presently are for these cases.
  2. I removed all the logic that attempted to catch out-of-order
     .ready files.  Instead, XLogArchiveNotify() only forces a
     directory scan for files other than regular WAL files, and we
     depend on our periodic directory scans to pick up anything that's
     been left behind.
  3. I moved the logic that forces directory scans every once in a
     while.  We were doing that in the checkpoint/restartpoint logic,
     which, upon further thought, might not be the best idea.  The
     checkpoint interval can vary widely, and IIRC we won't bother
     creating checkpoints at all if database activity stops.  Instead,
     I've added logic in pgarch_readyXlog() that forces a directory
     scan if one hasn't happened in a few minutes.
  4. Finally, I've tried to ensure comments are clear and that the
     logic is generally easy to reason about.

What do you think?

Nathan


Attachment

Re: .ready and .done files considered harmful

From
Dipesh Pandit
Date:
Hi,

> 1. I've removed several calls to PgArchForceDirScan() in favor of
>     calling it at the top of pgarch_ArchiverCopyLoop().  I believe
>     there is some disagreement about this change, but I don't think
>     we gain enough to justify the complexity.  The main reason we
>     exit pgarch_ArchiverCopyLoop() should ordinarily be that we've
>     run out of files to archive, so incurring a directory scan the
>     next time it is called doesn't seem like it would normally be too
>     bad.  I'm sure there are exceptions (e.g., lots of .done files,
>     archive failures), but the patch is still not making things any
>     worse than they presently are for these cases.

Yes, I think when archiver is lagging behind then a call to force
directory scan at the top of pgarch_ArchiverCopyLoop() does not
have any impact. This may result into a directory scan in next cycle
only when the archiver is ahead or in sync but in that case also a
directory scan may not incur too much cost since the archiver is
ahead.I agree that we can remove the separate calls to force a
directory scan in failure scenarios with a single call at the top of
PgArchForceDirScan().

> 2. I removed all the logic that attempted to catch out-of-order
>     .ready files.  Instead, XLogArchiveNotify() only forces a
>     directory scan for files other than regular WAL files, and we
>     depend on our periodic directory scans to pick up anything that's
>     been left behind.
> 3. I moved the logic that forces directory scans every once in a
>     while.  We were doing that in the checkpoint/restartpoint logic,
>     which, upon further thought, might not be the best idea.  The
>     checkpoint interval can vary widely, and IIRC we won't bother
>     creating checkpoints at all if database activity stops.  Instead,
>     I've added logic in pgarch_readyXlog() that forces a directory
>     scan if one hasn't happened in a few minutes.
> 4. Finally, I've tried to ensure comments are clear and that the
>     logic is generally easy to reason about.
>
> What do you think?

I agree, If we force a periodic directory scan then we may not
require any special logic for handling scenarios where a .ready file
is created out of order in XLogArchiveNotify(). We need to force a
directory scan only in case of a non-regular WAL file in
XLogArchiveNotify().

Overall I think the periodic directory scan simplifies the patch and
makes sure that any missing file gets archived within a few mins.

Thanks,
Dipesh

Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Thu, Sep 16, 2021 at 7:26 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>   1. I've removed several calls to PgArchForceDirScan() in favor of
>      calling it at the top of pgarch_ArchiverCopyLoop().  I believe
>      there is some disagreement about this change, but I don't think
>      we gain enough to justify the complexity.  The main reason we
>      exit pgarch_ArchiverCopyLoop() should ordinarily be that we've
>      run out of files to archive, so incurring a directory scan the
>      next time it is called doesn't seem like it would normally be too
>      bad.  I'm sure there are exceptions (e.g., lots of .done files,
>      archive failures), but the patch is still not making things any
>      worse than they presently are for these cases.

I was thinking that this might increase the number of directory scans
by a pretty large amount when we repeatedly catch up, then 1 new file
gets added, then we catch up, etc.

But I guess your thought process is that such directory scans, even if
they happen many times per second, can't really be that expensive,
since the directory can't have much in it. Which seems like a fair
point. I wonder if there are any situations in which there's not much
to archive but the archive_status directory still contains tons of
files.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Alvaro Herrera
Date:
On 2021-Sep-20, Robert Haas wrote:

> I was thinking that this might increase the number of directory scans
> by a pretty large amount when we repeatedly catch up, then 1 new file
> gets added, then we catch up, etc.

I was going to say that perhaps we can avoid repeated scans by having a
bitmap of future files that were found by a scan; so if we need to do
one scan, we keep track of the presence of the next (say) 64 files in
our timeline, and then we only have to do another scan when we need to
archive a file that wasn't present the last time we scanned.  However:

> But I guess your thought process is that such directory scans, even if
> they happen many times per second, can't really be that expensive,
> since the directory can't have much in it. Which seems like a fair
> point. I wonder if there are any situations in which there's not much
> to archive but the archive_status directory still contains tons of
> files.

(If we take this stance, which seems reasonable to me, then we don't
need to optimize.)  But perhaps we should complain if we find extraneous
files in archive_status -- Then it'd be on the users' heads not to leave
tons of files that would slow down the scan.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Maybe there's lots of data loss but the records of data loss are also lost.
(Lincoln Yeoh)



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 9/20/21, 1:42 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Sep-20, Robert Haas wrote:
>
>> I was thinking that this might increase the number of directory scans
>> by a pretty large amount when we repeatedly catch up, then 1 new file
>> gets added, then we catch up, etc.
>
> I was going to say that perhaps we can avoid repeated scans by having a
> bitmap of future files that were found by a scan; so if we need to do
> one scan, we keep track of the presence of the next (say) 64 files in
> our timeline, and then we only have to do another scan when we need to
> archive a file that wasn't present the last time we scanned.  However:

This sounds a bit like the other approach discussed earlier in this
thread [0].

>> But I guess your thought process is that such directory scans, even if
>> they happen many times per second, can't really be that expensive,
>> since the directory can't have much in it. Which seems like a fair
>> point. I wonder if there are any situations in which there's not much
>> to archive but the archive_status directory still contains tons of
>> files.
>
> (If we take this stance, which seems reasonable to me, then we don't
> need to optimize.)  But perhaps we should complain if we find extraneous
> files in archive_status -- Then it'd be on the users' heads not to leave
> tons of files that would slow down the scan.

The simplest situation I can think of that might be a problem is when
checkpointing is stuck and the .done files are adding up.  However,
after the lengthy directory scan, you should still be able to archive
several files without a scan of archive_status.  And if you are
repeatedly catching up, the extra directory scans probably aren't
hurting anything.  At the very least, this patch doesn't make things
any worse in this area.

BTW I attached a new version of the patch with a couple of small
changes.  Specifically, I adjusted some of the comments and moved the
assignment of last_dir_scan to after the directory scan completes.
Before, we were resetting it before the directory scan, so if the
directory scan took too long, you'd still end up scanning
archive_status for every file.  I think that's still possible if your
archive_command is especially slow, but archiving isn't going to keep
up anyway in that case.

Nathan

[0]
https://www.postgresql.org/message-id/attachment/125980/0001-Improve-performance-of-pgarch_readyXlog-with-many-st.patch


Attachment

Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Mon, Sep 20, 2021 at 4:42 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I was going to say that perhaps we can avoid repeated scans by having a
> bitmap of future files that were found by a scan; so if we need to do
> one scan, we keep track of the presence of the next (say) 64 files in
> our timeline, and then we only have to do another scan when we need to
> archive a file that wasn't present the last time we scanned.

There are two different proposed patches on this thread. One of them
works exactly that way, and the other one tries to optimize by
assuming that if we just optimized WAL file N, we likely will next
want to archive WAL file N+1. It's been hard to decide which way is
better.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Thu, Sep 16, 2021 at 7:26 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> What do you think?

I think this is committable. I also went back and looked at your
previous proposal to do files in batches, and I think that's also
committable. After some reflection, I think I have a slight preference
for the batching approach.
It seems like it might lend itself to archiving multiple files in a
single invocation of the archive_command, and Alvaro just suggested it
again apparently not having realized that it had been previously
proposed by Andres, so I guess it has the further advantage of being
the thing that several committers intuitively feel like we ought to be
doing to solve this problem.

So what I am inclined to do is commit
v1-0001-Improve-performance-of-pgarch_readyXlog-with-many.patch.
However, v6-0001-Do-fewer-directory-scans-of-archive_status.patch has
perhaps evolved a bit more than the other one, so I thought I should
first ask whether any of those changes have influenced your thinking
about the batching approach and whether you want to make any updates
to that patch first. I don't really see that this is needed, but I
might be missing something.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
David Steele
Date:
On 9/24/21 12:28 PM, Robert Haas wrote:
> On Thu, Sep 16, 2021 at 7:26 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>> What do you think?
> 
> I think this is committable. I also went back and looked at your
> previous proposal to do files in batches, and I think that's also
> committable. After some reflection, I think I have a slight preference
> for the batching approach.
> It seems like it might lend itself to archiving multiple files in a
> single invocation of the archive_command, and Alvaro just suggested it
> again apparently not having realized that it had been previously
> proposed by Andres, so I guess it has the further advantage of being
> the thing that several committers intuitively feel like we ought to be
> doing to solve this problem.

I also prefer this approach. Reducing directory scans is an excellent 
optimization, but from experience I know that execution time for the 
archive_command can also be a significant bottleneck. Begin able to 
archive multiple segments per execution would be a big win in certain 
scenarios.

> So what I am inclined to do is commit
> v1-0001-Improve-performance-of-pgarch_readyXlog-with-many.patch.

I read the patch and it looks good to me.

I do wish we had a way to test that history files get archived first, 
but as I recall I was not able to figure out how to do reliably for [1] 
without writing a custom archive_command just for testing. That is 
something we might want to consider as we make this logic more complex.

Regards,
-- 
-David
david@pgmasters.net

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b981df4cc09aca978c5ce55e437a74913d09cccc



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 9/24/21, 9:29 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> So what I am inclined to do is commit
> v1-0001-Improve-performance-of-pgarch_readyXlog-with-many.patch.
> However, v6-0001-Do-fewer-directory-scans-of-archive_status.patch has
> perhaps evolved a bit more than the other one, so I thought I should
> first ask whether any of those changes have influenced your thinking
> about the batching approach and whether you want to make any updates
> to that patch first. I don't really see that this is needed, but I
> might be missing something.

Besides sprucing up the comments a bit, I don't think there is
anything that needs to be changed.  The only other thing I considered
was getting rid of the arch_files array.  Instead, I would swap the
comparator function the heap uses with a reverse one, rebuild the
heap, and then have pgarch_readyXlog() return files via
binaryheap_remove_first().  However, this seemed like a bit more
complexity than necessary.

Attached is a new version of the patch with some expanded comments.

Nathan


Attachment

Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 9/27/21, 11:06 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> On 9/24/21, 9:29 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
>> So what I am inclined to do is commit
>> v1-0001-Improve-performance-of-pgarch_readyXlog-with-many.patch.
>> However, v6-0001-Do-fewer-directory-scans-of-archive_status.patch has
>> perhaps evolved a bit more than the other one, so I thought I should
>> first ask whether any of those changes have influenced your thinking
>> about the batching approach and whether you want to make any updates
>> to that patch first. I don't really see that this is needed, but I
>> might be missing something.
>
> Besides sprucing up the comments a bit, I don't think there is
> anything that needs to be changed.  The only other thing I considered
> was getting rid of the arch_files array.  Instead, I would swap the
> comparator function the heap uses with a reverse one, rebuild the
> heap, and then have pgarch_readyXlog() return files via
> binaryheap_remove_first().  However, this seemed like a bit more
> complexity than necessary.
>
> Attached is a new version of the patch with some expanded comments.

I just wanted to gently bump this thread in case there is any
additional feedback.  I should have some time to work on it this week.
Also, it's looking more and more like this patch will nicely assist
the batching/loadable backup module stuff [0].

Nathan

[0] https://postgr.es/m/E9035E94-EC76-436E-B6C9-1C03FBD8EF54%40amazon.com


Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Fri, Sep 24, 2021 at 12:28 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 16, 2021 at 7:26 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> > What do you think?
>
> I think this is committable. I also went back and looked at your
> previous proposal to do files in batches, and I think that's also
> committable. After some reflection, I think I have a slight preference
> for the batching approach.
> It seems like it might lend itself to archiving multiple files in a
> single invocation of the archive_command, and Alvaro just suggested it
> again apparently not having realized that it had been previously
> proposed by Andres, so I guess it has the further advantage of being
> the thing that several committers intuitively feel like we ought to be
> doing to solve this problem.
>
> So what I am inclined to do is commit
> v1-0001-Improve-performance-of-pgarch_readyXlog-with-many.patch.
> However, v6-0001-Do-fewer-directory-scans-of-archive_status.patch has
> perhaps evolved a bit more than the other one, so I thought I should
> first ask whether any of those changes have influenced your thinking
> about the batching approach and whether you want to make any updates
> to that patch first. I don't really see that this is needed, but I
> might be missing something.

Nathan, I just realized we never closed the loop on this. Do you have
any thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 10/19/21, 5:59 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> Nathan, I just realized we never closed the loop on this. Do you have
> any thoughts?

IMO the patch is in decent shape.  Happy to address any feedback you
might have on the latest patch [0].

Nathan

[0]
https://www.postgresql.org/message-id/attachment/126789/v3-0001-Improve-performance-of-pgarch_readyXlog-with-many.patch


Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 10/19/21, 7:53 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> On 10/19/21, 5:59 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
>> Nathan, I just realized we never closed the loop on this. Do you have
>> any thoughts?
>
> IMO the patch is in decent shape.  Happy to address any feedback you
> might have on the latest patch [0].

This thread seems to have lost traction.  The cfbot entry for the
latest patch [0] is still all green, so I think it is still good to
go.  I'm happy to address any additional feedback, though.

Nathan

[0] https://postgr.es/m/attachment/126789/v3-0001-Improve-performance-of-pgarch_readyXlog-with-many.patch


Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Thu, Nov 11, 2021 at 10:37 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> On 10/19/21, 7:53 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> > On 10/19/21, 5:59 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> >> Nathan, I just realized we never closed the loop on this. Do you have
> >> any thoughts?
> >
> > IMO the patch is in decent shape.  Happy to address any feedback you
> > might have on the latest patch [0].
>
> This thread seems to have lost traction.  The cfbot entry for the
> latest patch [0] is still all green, so I think it is still good to
> go.  I'm happy to address any additional feedback, though.

Somehow I didn't see your October 19th response previously. The
threading in gmail seems to have gotten broken, which may have
contributed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Thu, Nov 11, 2021 at 2:49 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Somehow I didn't see your October 19th response previously. The
> threading in gmail seems to have gotten broken, which may have
> contributed.

And actually I also missed the September 27th email where you sent v3. Oops.

Committed now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
"Bossart, Nathan"
Date:
On 11/11/21, 12:23 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Thu, Nov 11, 2021 at 2:49 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> Somehow I didn't see your October 19th response previously. The
>> threading in gmail seems to have gotten broken, which may have
>> contributed.
>
> And actually I also missed the September 27th email where you sent v3. Oops.
>
> Committed now.

Thanks!  I figured it was something like that.  Sorry if I caused the
thread breakage.

Nathan


Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Thu, Nov 11, 2021 at 3:58 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> Thanks!  I figured it was something like that.  Sorry if I caused the
> thread breakage.

I think it was actually that the thread went over 100 emails ... which
usually causes Google to break it, but I don't know why it broke it
into three pieces instead of two, or why I missed the new ones.
Anyway, I don't think it was your fault, but no worries either way.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: .ready and .done files considered harmful

From
Alvaro Herrera
Date:
Hello, sorry for necro-posting here:

On 2021-May-03, Robert Haas wrote:

> I and various colleagues of mine have from time to time encountered
> systems that got a bit behind on WAL archiving, because the
> archive_command started failing and nobody noticed right away.

We've recently had a couple of cases where the archiver hasn't been able
to keep up on systems running 13 and 14 because of this problem, causing
serious production outages.  Obviously that's not a great experience.  I
understand that this has been significantly improved in branch 15 by
commit beb4e9ba1652, the fix in this thread; we hypothesize that both
these production problems wouldn't have occurred, if the users had been
running the optimized pgarch.c code.

However, that commit was not backpatched.  I think that was the correct
decision at the time, because it wasn't a trivial fix.  It was
significantly modified by 1fb17b190341 a month later, both to fix a
critical bug as well as to make some efficiency improvements.

Now that the code has been battle-tested, I think we can consider
putting it into the older branches.  I did a quick cherry-pick
experiment, and I found that it backpatches cleanly to 14.  It doesn't
to 13, for lack of d75288fb27b8, which is far too invasive to backpatch,
and I don't think we should rewrite the code so that it works on the
previous state.  Fortunately 13 only has one more year to live, so I
don't feel too bad about leaving it as is.

So, my question now is, would there be much opposition to backpatching
beb4e9ba1652 + 1fb17b190341 to REL_14_STABLE?


(On the other hand, we can always blame users for failing to implement
WAL archiving "correctly" ...  but from my perspective, this is an
embarrasing Postgres problem, and one that's relatively easy to solve
with very low risk.)

Thanks,

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Para tener más hay que desear menos"



Re: .ready and .done files considered harmful

From
Robert Haas
Date:
On Wed, Nov 13, 2024 at 11:05 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> So, my question now is, would there be much opposition to backpatching
> beb4e9ba1652 + 1fb17b190341 to REL_14_STABLE?

It seems like it's been long enough now that if the new logic had
major problems we probably would have found them by now; so I feel
like it's probably pretty safe. Perhaps it's questionable how many
people we'll help by back-patching this into one additional release,
but if you feel it's worth it I wouldn't be inclined to argue.

--
Robert Haas
EDB: http://www.enterprisedb.com