From Nathan Bossart
Subject Re: O(n) tasks cause lengthy startups and checkpoints
Msg-id 20220217041404.GA3243546@nathanxps13
In response to Re: O(n) tasks cause lengthy startups and checkpoints  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers

Hi Andres,

I appreciate the feedback.

On Wed, Feb 16, 2022 at 05:50:52PM -0800, Andres Freund wrote:
>> +        /* Since not using PG_TRY, must reset error stack by hand */
>> +    if (sigsetjmp(local_sigjmp_buf, 1) != 0)
>> +    {
> 
> I also think it's a bad idea to introduce even more copies of the error
> handling body.  I think we need to unify this. And yes, it's unfair to stick
> you with it, but it's been a while since a new aux process has been added.

+1, I think this is useful refactoring.  I might spin this off to its own
thread.

>> +        /*
>> +         * These operations are really just a minimal subset of
>> +         * AbortTransaction().  We don't have very many resources to worry
>> +         * about.
>> +         */
> 
> Given what you're proposing this for, are you actually confident that we don't
> need more than this?

I will give this a closer look.

>> +extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
>> +                            bool unlink_all);
> 
> I don't like functions with multiple consecutive booleans, they tend to get
> swapped around. Why not just split unlink_all=true/false into different
> functions?

Will do.

>> Subject: [PATCH v5 3/8] Split pgsql_tmp cleanup into two stages.
>>
>> First, pgsql_tmp directories will be renamed to stage them for
>> removal.
> 
> What if the target name already exists?

The integer at the end of the target name is incremented until we find a
unique name.

>> Note that temporary relation files cannot be cleaned up via the
>> aforementioned strategy and will not be offloaded to the custodian.
> 
> This should be in the prior commit message, otherwise people will ask the same
> question as I did.

Will do.

>> +    /*
>> +     * Find a name for the stage directory.  We just increment an integer at the
>> +     * end of the name until we find one that doesn't exist.
>> +     */
>> +    for (int n = 0; n <= INT_MAX; n++)
>> +    {
>> +        snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
>> +                 PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
> 
> Uninterruptible loops up to INT_MAX do not seem like a good idea.

I modeled this after ChooseRelationName() in indexcmds.c.  Looking again, I
see that it loops forever until a unique name is found.  I suspect this is
unlikely to be a problem in practice.  What strategy would you recommend
for choosing a unique name?  Should we just append a couple of random
characters?
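
For example, maybe something along these lines (just a sketch -- it assumes
pg_prng is available, the retry limit is arbitrary, and stage_dir_exists()
is only an illustrative name for the stat()-based existence check you
suggest below):

    for (int attempt = 0; attempt < 100; attempt++)
    {
        /* parent_path, stage_path, and PG_TEMP_DIR_TO_REMOVE_PREFIX as in the patch */
        snprintf(stage_path, sizeof(stage_path), "%s/%s%08X", parent_path,
                 PG_TEMP_DIR_TO_REMOVE_PREFIX,
                 pg_prng_uint32(&pg_global_prng_state));

        if (!stage_dir_exists(stage_path))
            break;              /* name is free, claim it */
    }

A real version would also need to give up (and complain) if it somehow ran
out of attempts.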

>> +        dir = AllocateDir(stage_path);
>> +        if (dir == NULL)
>> +        {
> 
> Why not just use stat()? That's cheaper, and there's no
> time-to-check-time-to-use issue here, we're the only one writing.

I'm not sure why I didn't use stat().  I will update this.
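
Concretely, I'm thinking of something like this for the existence test
(stage_dir_exists() is just a name for illustration):

    static bool
    stage_dir_exists(const char *stage_path)
    {
        struct stat statbuf;

        if (stat(stage_path, &statbuf) == 0)
            return true;

        if (errno != ENOENT)
            ereport(LOG,
                    (errcode_for_file_access(),
                     errmsg("could not stat directory \"%s\": %m",
                            stage_path)));

        return false;
    }

As you say, we're the only one creating these directories, so a plain
existence check should be enough.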

>> -    while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
>> +    while (!ShutdownRequestPending &&
>> +           (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
> 
> Uh, huh? It strikes me as a supremely bad idea to have functions *silently*
> not do their jobs when ShutdownRequestPending is set, particularly without a
> huge fat comment.

The idea was to avoid delaying shutdown because we're waiting for the
custodian to finish relatively nonessential tasks.  Another option might be
to just exit immediately when the custodian receives a shutdown request.
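
If we keep the early-exit behavior, I agree it needs a prominent comment,
something like:

    /*
     * The work done here is nonessential, and anything we skip should get
     * picked up the next time this runs.  So if a shutdown has been
     * requested, stop early rather than making the shutdown wait on us.
     */
    while (!ShutdownRequestPending &&
           (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
    {
        /* ... clean up this tablespace's staged pgsql_tmp directories ... */
    }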

>> +    /*
>> +     * If we just staged some pgsql_tmp directories for removal, wake up the
>> +     * custodian process so that it deletes all the files in the staged
>> +     * directories as well as the directories themselves.
>> +     */
>> +    if (stage && ProcGlobal->custodianLatch)
>> +        SetLatch(ProcGlobal->custodianLatch);
> 
> Just signalling without letting the custodian know what it's expected to do
> strikes me as a bad idea.

Good point.  I will work on that.
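
Perhaps a small set of request flags in shared memory that callers set
before the latch, e.g. (all of the names here are just placeholders, not
part of the posted patch):

    if (stage)
    {
        SpinLockAcquire(&CustodianShmem->cs_mutex);
        CustodianShmem->cs_flags |= CUSTODIAN_REMOVE_TEMP_DIRS;
        SpinLockRelease(&CustodianShmem->cs_mutex);

        if (ProcGlobal->custodianLatch)
            SetLatch(ProcGlobal->custodianLatch);
    }

The custodian would then read and clear the flags when it wakes up and only
do the requested work.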

>> From 9c2013d53cc5c857ef8aca3df044613e66215aee Mon Sep 17 00:00:00 2001
>> From: Nathan Bossart <bossartn@amazon.com>
>> Date: Sun, 5 Dec 2021 22:02:40 -0800
>> Subject: [PATCH v5 5/8] Move removal of old serialized snapshots to custodian.
>>
>> This was only done during checkpoints because it was a convenient
>> place to put it.  However, if there are many snapshots to remove,
>> it can significantly extend checkpoint time.  To avoid this, move
>> this work to the newly-introduced custodian process.
>> ---
>>  src/backend/access/transam/xlog.c           |  2 --
>>  src/backend/postmaster/custodian.c          | 11 +++++++++++
>>  src/backend/replication/logical/snapbuild.c | 13 +++++++------
>>  src/include/replication/snapbuild.h         |  2 +-
>>  4 files changed, 19 insertions(+), 9 deletions(-)
> 
> Why does this not open us up to new xid wraparound issues? Before there was a
> hard bound on how long these files could linger around. Now there's not
> anymore.

Sorry, I'm probably missing something obvious, but I'm not sure how this
adds transaction ID wraparound risk.  These files are tied to LSNs, and
AFAIK they won't impact slots' xmins.

>> +#ifdef HAVE_SYNCFS
>> +
>> +    /*
>> +     * If we are doing a shutdown or end-of-recovery checkpoint, let's use
>> +     * syncfs() to flush the mappings to disk instead of flushing each one
>> +     * individually.  This may save us quite a bit of time when there are many
>> +     * such files to flush.
>> +     */
> 
> I am doubtful this is a good idea. This will cause all dirty files to be
> written back, even ones we don't need to be written back. At once. Very
> possibly *slowing down* the shutdown.
> 
> What is even the theory of the case here? That there's so many dirty mapping
> files that fsyncing them will take too long? That iterating would take too
> long?

Well, yes.  My idea was to model this after 61752af, which allows using
syncfs() instead of individually fsync-ing every file in the data
directory.  However, I would likely need to introduce a GUC because 1) as
you pointed out, it might be slower and 2) syncfs() doesn't report errors
on older versions of Linux.

TBH I do feel like this one is a bit of a stretch, so I am okay with
leaving it out for now.

>> 5 files changed, 317 insertions(+), 9 deletions(-)
> 
> This seems such an increase in complexity and fragility that I really doubt
> this is a good idea.

I think that's a fair point.  I'm okay with leaving this one out for now,
too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


