Re: O(n) tasks cause lengthy startups and checkpoints - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: O(n) tasks cause lengthy startups and checkpoints
Msg-id CALj2ACXYDPwhR6BkQuKTaJjO-y4i8kQCQxydTcBZy96UjN6FqA@mail.gmail.com
In response to O(n) tasks cause lengthy startups and checkpoints  ("Bossart, Nathan" <bossartn@amazon.com>)
Responses Re: O(n) tasks cause lengthy startups and checkpoints
List pgsql-hackers
On Thu, Dec 2, 2021 at 1:54 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>
> Hi hackers,
>
> Thanks to 61752af, SyncDataDirectory() can make use of syncfs() to
> avoid individually syncing all database files after a crash.  However,
> as noted earlier this year [0], there are still a number of O(n) tasks
> that affect startup and checkpointing that I'd like to improve.
> Below, I've attempted to summarize each task and to offer ideas for
> improving matters.  I'll likely split each of these into its own
> thread, given there is community interest for such changes.
>
> 1) CheckPointSnapBuild(): This function loops through
>    pg_logical/snapshots to remove all snapshots that are no longer
>    needed.  If there are many entries in this directory, this can take
>    a long time.  The note above this function indicates that this is
>    done during checkpoints simply because it is convenient.  IIUC
>    there is no requirement that this function actually completes for a
>    given checkpoint.  My current idea is to move this to a new
>    maintenance worker.
> 2) CheckPointLogicalRewriteHeap(): This function loops through
>    pg_logical/mappings to remove old mappings and flush all remaining
>    ones.  IIUC there is no requirement that the "remove old mappings"
>    part must complete for a given checkpoint, but the "flush all
>    remaining" portion allows replay after a checkpoint to only "deal
>    with the parts of a mapping that have been written out after the
>    checkpoint started."  Therefore, I think we should move the "remove
>    old mappings" part to a new maintenance worker (probably the same
>    one as for 1), and we should consider using syncfs() for the "flush
>    all remaining" part.  (I suspect the main argument against the
>    latter will be that it could cause IO spikes.)
> 3) RemovePgTempFiles(): This step can delay startup if there are many
>    temporary files to individually remove.  This step is already
>    optionally done after a crash via the remove_temp_files_after_crash
>    GUC.  I propose that we have startup move the temporary file
>    directories aside and create new ones, and then a separate worker
>    (probably the same one from 1 and 2) could clean up the old files.
> 4) StartupReorderBuffer(): This step deletes logical slot data that
>    has been spilled to disk.  This code appears to be written to avoid
>    deleting different types of files in these directories, but AFAICT
>    there shouldn't be any other files.  Therefore, I think we could do
>    something similar to 3 (i.e., move the directories aside during
>    startup and clean them up via a new maintenance worker).
>
> I realize adding a new maintenance worker might be a bit heavy-handed,
> but I think it would be nice to have somewhere to offload tasks that
> really shouldn't impact startup and checkpointing.  I imagine such a
> process would come in handy down the road, too.  WDYT?

+1 for the overall idea of making checkpoints faster. In fact, our
team has been thinking about this problem for a while. If there are a
lot of files that the checkpoint has to loop over and remove, IMO that
task can be delegated to someone else (maybe a background worker
called background cleaner or bg cleaner; of course, we can have a GUC
to enable or disable it). The checkpoint can just write some marker
files (for instance, it can write snapshot_<cutofflsn> files, with the
file name itself representing the cutoff LSN, so that the new bg
cleaner can remove the snapshot files; similarly, it can write marker
files for other file removals). Having said that, a new bg cleaner
deleting files asynchronously on behalf of the checkpoint can look
like overkill until we have some numbers showing how much time we
could save with this approach. For this purpose, I did a small
experiment to figure out how long file deletion usually takes on an
SSD [1]: about 8 seconds for 1 million files. I'm sure it will be much
more on an HDD.
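
To make the marker-file idea concrete, here is a minimal sketch in
Python (not PostgreSQL C, and not a proposed patch). The snapshot_
prefix comes from the text above; encoding the cutoff LSN as HI-LO hex,
the <HI>-<LO>.snap file names, the "remove if older than the cutoff"
rule, and all function names are assumptions made only for
illustration.

```python
import os
import re

# Assumed snapshot file naming: <HI>-<LO>.snap, both parts in upper-case hex.
SNAP_RE = re.compile(r"^([0-9A-F]+-[0-9A-F]+)\.snap$")
# Marker prefix from the mail; the LSN encoding after it is an assumption.
MARKER_PREFIX = "snapshot_"


def format_lsn(lsn: int) -> str:
    """Render a 64-bit LSN as HI-LO hex, e.g. 0-16B3748."""
    return f"{lsn >> 32:X}-{lsn & 0xFFFFFFFF:X}"


def parse_lsn(text: str) -> int:
    """Inverse of format_lsn()."""
    hi, lo = text.split("-")
    return (int(hi, 16) << 32) | int(lo, 16)


def write_cutoff_marker(snap_dir: str, cutoff_lsn: int) -> str:
    """Checkpoint side: record the cutoff in an empty file's name.

    The cost is one file creation, regardless of how many snapshots exist.
    """
    path = os.path.join(snap_dir, MARKER_PREFIX + format_lsn(cutoff_lsn))
    with open(path, "w"):
        pass
    return path


def clean_snapshots(snap_dir: str) -> int:
    """Cleaner side: remove snapshots older than the newest marker's cutoff.

    Returns the number of snapshot files removed.
    """
    names = os.listdir(snap_dir)
    markers = [n for n in names if n.startswith(MARKER_PREFIX)]
    if not markers:
        return 0  # no checkpoint has asked for cleanup yet
    cutoff = max(parse_lsn(m[len(MARKER_PREFIX):]) for m in markers)
    removed = 0
    for name in names:
        match = SNAP_RE.match(name)
        if match and parse_lsn(match.group(1)) < cutoff:
            os.unlink(os.path.join(snap_dir, name))
            removed += 1
    # Only the markers seen in this pass are consumed; a marker written
    # while the loop was running stays around for the next pass.
    for marker in markers:
        os.unlink(os.path.join(snap_dir, marker))
    return removed
```

The point of the split is that the O(n) directory scan and the unlink
calls happen only in clean_snapshots(), which a separate process could
run at its own pace.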

The bg cleaner can also be used for RemovePgTempFiles: the postmaster
could just rename pgsql_temp to something like pgsql_temp_delete and
proceed with the server startup, and the bg cleaner can then delete
the files. Also, we could do something similar for removing/recycling
old xlog files and for StartupReorderBuffer.
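
A rough sketch of the rename-then-delete idea, again in Python purely
for illustration. pgsql_temp and pgsql_temp_delete are the names used
above; the numeric suffix for leftovers from an earlier startup and
both function names are assumptions.

```python
import os
import shutil
from typing import Optional


def move_temp_dir_aside(temp_dir: str) -> Optional[str]:
    """Startup side: rename the temp directory and create an empty one.

    The work done here does not grow with the number of temporary files.
    Returns the renamed path, or None if there was nothing to move.
    """
    if not os.path.isdir(temp_dir):
        return None
    doomed = temp_dir + "_delete"
    n = 0
    # Assumption: if an earlier leftover has not been cleaned up yet,
    # pick the next free name instead of failing.
    while os.path.exists(doomed):
        n += 1
        doomed = f"{temp_dir}_delete.{n}"
    os.rename(temp_dir, doomed)
    os.mkdir(temp_dir)
    return doomed


def clean_doomed_dirs(parent: str, base: str = "pgsql_temp") -> int:
    """Cleaner side: delete every moved-aside directory under parent.

    Returns the number of files that were removed.
    """
    removed = 0
    for name in os.listdir(parent):
        if name.startswith(base + "_delete"):
            path = os.path.join(parent, name)
            removed += sum(len(files) for _, _, files in os.walk(path))
            shutil.rmtree(path)
    return removed
```

A rename within one file system is a single directory operation, which
is why startup no longer has to wait for the individual deletions. The
sketch assumes the renamed directory stays on the same file system as
the original.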

Another idea could be to parallelize the checkpoint, i.e. IIUC, the
tasks that the checkpoint does in CheckPointGuts are independent, and
if we have some counters (such as how many snapshot/mapping files the
server has generated)

[1] on SSD:
deletion of 1000000 files took 7.930380 seconds
deletion of 500000 files took 3.921676 seconds
deletion of 100000 files took 0.768772 seconds
deletion of 50000 files took 0.400623 seconds
deletion of 10000 files took 0.077565 seconds
deletion of 1000 files took 0.006232 seconds
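
The program used for the numbers above is not included in this mail.
One way to run a comparable measurement is sketched below in Python;
the empty files, the temporary directory location, and the function
name are assumptions, and the absolute timings will differ by file
system, hardware, and language.

```python
import os
import tempfile
import time


def time_deletion(n_files: int) -> float:
    """Create n_files empty files, then return the seconds spent unlinking them."""
    with tempfile.TemporaryDirectory() as directory:
        paths = [os.path.join(directory, f"f{i}") for i in range(n_files)]
        for path in paths:
            with open(path, "w"):
                pass
        # Time only the deletion loop, not the file creation above.
        start = time.perf_counter()
        for path in paths:
            os.unlink(path)
        return time.perf_counter() - start


if __name__ == "__main__":
    for n in (1000, 10000):
        print(f"deletion of {n} files took {time_deletion(n):.6f} seconds")
```

Note that freshly created empty files are likely still cached, so this
kind of test probably shows a best case compared to deleting old files
on a busy system.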

Regards,
Bharath Rupireddy.
