Re: .ready and .done files considered harmful - Mailing list pgsql-hackers

From Robert Haas
Subject Re: .ready and .done files considered harmful
Date
Msg-id CA+TgmobBoMqU_fwOD48e-JY69y87SKG6=POR0FODw03soaJOUA@mail.gmail.com
In response to Re: .ready and .done files considered harmful  (Andres Freund <andres@anarazel.de>)
Responses Re: .ready and .done files considered harmful
Re: .ready and .done files considered harmful
List pgsql-hackers
On Tue, May 4, 2021 at 12:27 AM Andres Freund <andres@anarazel.de> wrote:
> On 2021-05-03 16:49:16 -0400, Robert Haas wrote:
> > I have two possible ideas for addressing this; perhaps other people
> > will have further suggestions. A relatively non-invasive fix would be
> > to teach pgarch.c how to increment a WAL file name. After archiving
> > segment N, check using stat() whether there's an .ready file for
> > segment N+1. If so, do that one next. If not, then fall back to
> > performing a full directory scan.
>
> Hm. I wonder if it'd not be better to determine multiple files to be
> archived in one readdir() pass?

I think both methods have some merit. If we had a way to pass a range
of files to archive_command instead of just one, then your way is
distinctly better, and perhaps we should just go ahead and invent such
a thing. If not, your way doesn't entirely solve the O(n^2) problem,
since you have to choose some upper bound on the number of file names
you're willing to buffer in memory, but it may lower the cost enough that
it makes no practical difference. I am somewhat inclined to think that it
would be good to start with the method I'm proposing, since it is a
clear-cut improvement over what we have today and can be done with a
relatively limited amount of code change and no redesign, and then
perhaps do something more ambitious afterward.
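
To make the first idea a bit more concrete, here is a rough standalone
sketch -- not an actual pgarch.c patch; the helper names and constants are
invented and 16MB WAL segments are assumed -- of the "stat() the next
segment's .ready file before falling back to a scan" logic:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

#define SEGMENTS_PER_XLOGID 256     /* 0x100000000 / 16MB segment size */

/* Build the usual 24-character WAL file name for (tli, segno). */
static void
wal_file_name(char *buf, size_t len, uint32_t tli, uint64_t segno)
{
    snprintf(buf, len, "%08X%08X%08X",
             tli,
             (uint32_t) (segno / SEGMENTS_PER_XLOGID),
             (uint32_t) (segno % SEGMENTS_PER_XLOGID));
}

/* After archiving last_segno, check for the next segment's .ready file. */
static bool
next_segment_is_ready(uint32_t tli, uint64_t last_segno,
                      char *next, size_t len)
{
    char        path[1024];
    struct stat st;

    wal_file_name(next, len, tli, last_segno + 1);
    snprintf(path, sizeof(path), "pg_wal/archive_status/%s.ready", next);
    return stat(path, &st) == 0;
}

int
main(void)
{
    char        next[64];

    if (next_segment_is_ready(1, 0x2A, next, sizeof(next)))
        printf("archive %s next, no directory scan needed\n", next);
    else
        printf("fall back to a full archive_status scan\n");
    return 0;
}

In the real thing the file name construction would of course use the
existing XLOG file name macros rather than hand-rolled arithmetic.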

> There are definitely gaps in practice :(. Due to the massive performance
> issues with archiving, there are several tools that archive multiple
> files as part of one archive command invocation (and mark the additional
> archived files as .done immediately).

Good to know.

> > However, that's still pretty wasteful. Every time we have to wait for
> > the next file to be ready for archiving, we'll basically fall back to
> > repeatedly scanning the whole directory, waiting for it to show up.
>
> Hm. That seems like it's only an issue because .done and .ready are in
> the same directory? Otherwise the directory would be empty while we're
> waiting for the next file to be ready to be archived.

I think that's right.

> I hate that that's
> a thing but given the serial nature of archiving, with high per-call
> overhead, I don't think it'd be ok to just break that without a
> replacement :(.

I don't know quite what you mean by this. Moving .done files to a
separate directory from .ready files could certainly be done and I
don't think it even would be that hard. It does seem like a bit of a
half measure though. If we're going to redesign this I think we ought
to be more ambitious than that.

> > But perhaps we could work around this by allowing pgarch.c to access
> > shared memory, in which case it could examine the current timeline
> > whenever it wants, and probably also whatever LSNs it needs to know
> > what's safe to archive.
>
> FWIW, the shared memory stats patch implies doing that, since the
> archiver reports stats.

Are you planning to commit that for v15? If so, will it be early in
the cycle, do you think?

> What kind of shared memory mechanism are you thinking of? Due to
> timelines and history files I don't think simple position counters would
> be quite enough.

I was thinking of simple position counters, but we could do something
more sophisticated. I don't even care if we stick with .ready/.done
for low-frequency stuff like timeline and history files. But I think
we'd be better off avoiding it for WAL files, because there are just
too many of them, and it's too hard to create a system that actually
scales. Or else we need a way for a single .ready file to cover many
WAL files in need of being archived, rather than just one.
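
To be clear about what I mean by "simple position counters", something
roughly like the struct below is what I have in mind -- only an
illustration with an invented field set, and it deliberately ignores the
timeline/history-file wrinkle you point out:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct ArchiveProgress
{
    _Atomic uint64_t flushed_lsn;   /* WAL is known durable up to here */
    _Atomic uint64_t archived_lsn;  /* everything before this is archived */
    _Atomic uint32_t timeline;      /* current timeline ID */
} ArchiveProgress;

/* Writer side: advanced whenever WAL is flushed. */
static void
advance_flushed(ArchiveProgress *ap, uint64_t lsn)
{
    atomic_store(&ap->flushed_lsn, lsn);
}

/* Archiver side: is at least one whole segment ready to be archived? */
static bool
segment_ready(ArchiveProgress *ap, uint64_t segsize)
{
    uint64_t    flushed = atomic_load(&ap->flushed_lsn);
    uint64_t    archived = atomic_load(&ap->archived_lsn);

    return flushed / segsize > archived / segsize;
}

int
main(void)
{
    ArchiveProgress ap = {0};
    uint64_t    segsize = 16 * 1024 * 1024;

    atomic_store(&ap.timeline, 1);
    advance_flushed(&ap, 3 * segsize);      /* three segments flushed */
    printf("segment ready to archive: %d\n", segment_ready(&ap, segsize));
    return 0;
}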

> I think there is no fundamental reason for avoiding shared memory in the
> archiver. I guess there's a minor robustness advantage, because the
> forked shell to start the archive command won't be attached to shared
> memory. But that's only until the child exec()s to the archive command.

That doesn't seem like a real issue because we're not running
user-defined code between fork() and exec().

> There is some minor performance advantage as well: not having to process
> the often large and contended memory mapping for shared_buffers is
> probably measurable, but swamped by the cost of needing to actually
> archive the segment.

Process it how?

Another option would be to have two processes. You could have one that
stayed connected to shared memory and another that JUST ran the
archive_command, and they could talk over a socket or something. But
that would add a bunch of extra complexity, so I don't want to do it
unless we actually need to do it.

> My only "concern" with doing anything around this is that I think the
> whole approach of archive_command is just hopelessly broken, with even
> just halfway busy servers only able to keep up archiving if they muck
> around with postgres internal data during archive command execution. Add
> to that how hard it is to write a robust archive command (e.g. the one
> in our docs still suggests test ! -f && cp, which means that a copy
> failing in the middle yields an incomplete archive)...
>
> While I don't think it's all that hard to design a replacement, it's
> however likely still more work than addressing the O(n^2) issue, so ...
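
The fragility you mention is worth spelling out: if cp dies partway
through, a truncated file is left sitting under its final name, and the
test ! -f guard then prevents a retry from ever replacing it. The usual
fix is to write to a temporary name, fsync, and only then rename into
place. Here is a rough standalone sketch of that pattern -- not a proposal
for the docs, just an illustration:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
archive_copy(const char *src, const char *dst)
{
    char        tmp[1024];
    char        buf[65536];
    ssize_t     n;
    int         in;
    int         out;

    snprintf(tmp, sizeof(tmp), "%s.tmp", dst);

    if ((in = open(src, O_RDONLY)) < 0)
        return -1;
    if ((out = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600)) < 0)
    {
        close(in);
        return -1;
    }

    while ((n = read(in, buf, sizeof(buf))) > 0)
    {
        if (write(out, buf, n) != n)
        {
            n = -1;
            break;
        }
    }

    /* Make the data durable before it becomes visible under the final name. */
    if (n < 0 || fsync(out) < 0)
    {
        close(out);
        close(in);
        unlink(tmp);
        return -1;
    }
    close(out);
    close(in);

    /* Only a complete, flushed file ever appears as dst. */
    return rename(tmp, dst) == 0 ? 0 : -1;
}

int
main(int argc, char **argv)
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s <wal-file> <archive-path>\n", argv[0]);
        return 1;
    }
    return archive_copy(argv[1], argv[2]) == 0 ? 0 : 1;
}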

I think it is probably a good idea to fix the O(n^2) issue first, and
then as a separate step try to redefine things so that a decent
archive command doesn't have to poke around as much at internal stuff.
Part of that should probably involve having a way to pass a range of
files to archive_command instead of a single file. I was also
wondering whether we should go further and allow for the archiving to
be performed by C code running inside the backend rather than shelling
out to an external command.
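
To sketch the shape of that last idea -- none of this is a real API, and
the callback name and signature are invented purely for illustration -- a
loadable module could expose a per-file callback that the archiver invokes
instead of shelling out:

#include <stdbool.h>
#include <stdio.h>

/* Callback a module might implement: archive one WAL file, return success. */
typedef bool (*archive_file_cb) (const char *file_name, const char *file_path);

/* Toy implementation; a real one would copy or ship the file somewhere. */
static bool
my_archive_file(const char *file_name, const char *file_path)
{
    fprintf(stderr, "archiving %s from %s\n", file_name, file_path);
    return true;                /* false would mean "leave it, retry later" */
}

/* The archiver would invoke the registered callback instead of system(). */
int
main(void)
{
    archive_file_cb cb = my_archive_file;

    return cb("000000010000000000000001",
              "pg_wal/000000010000000000000001") ? 0 : 1;
}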

-- 
Robert Haas
EDB: http://www.enterprisedb.com


