Thread: Two fsync related performance issues?
Hello hackers,
1. StartupXLOG() does fsync on the whole data directory early in crash recovery. I'm wondering if we could skip some directories (at least pg_log/ and the table directories), since WAL replay etc. can ensure consistency. Here is the related code.
if (ControlFile->state != DB_SHUTDOWNED &&
    ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
{
    RemoveTempXlogFiles();
    SyncDataDirectory();
}
I have this concern because I saw an issue in a real production environment where the startup process needed 10+ seconds to start WAL replay after being relaunched due to an elog(PANIC) (this was seen on the Postgres-based product Greenplum, but it is a common issue in Postgres as well). I highly suspect the delay was mostly due to this. It is also known that on public clouds fsync is much slower than on local storage, so the slowness should be more severe in the cloud. If we could at least skip fsync on the table directories we would avoid a lot of per-file fsync calls - this could save many seconds during crash recovery.
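To make the idea concrete, here is a rough sketch of a skip check that SyncDataDirectory()'s walkdir() callback could consult. The helper and the exact list entries are hypothetical and untested; as discussed below, which directories are actually safe to skip would need careful review.

/* Hypothetical sketch, for src/backend/storage/file/fd.c */
static const char *const sync_skip_dirs[] = {
    "./pg_log",         /* server text logs; no consistency impact */
    "./pg_stat_tmp",    /* rebuilt after crash anyway */
    NULL
};

static bool
skip_fsync_prefix(const char *path)
{
    for (int i = 0; sync_skip_dirs[i] != NULL; i++)
    {
        size_t      len = strlen(sync_skip_dirs[i]);

        if (strncmp(path, sync_skip_dirs[i], len) == 0)
            return true;
    }
    return false;
}

/*
 * ... and near the top of datadir_fsync_fname():
 *
 *     if (skip_fsync_prefix(fname))
 *         return;
 */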
2. CheckPointTwoPhase()
This may be a small issue.
See the code below,
for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
    RecreateTwoPhaseFile(gxact->xid, buf, len);
RecreateTwoPhaseFile() writes a state file for a prepared transaction and fsyncs it. It might be better to write all the files first and fsync them in one pass afterwards, since the kernel can start asynchronous writeback while the file contents are being written. If TwoPhaseState->numPrepXacts is large we could batch the work to stay under the fd resource limit. I have not tested this yet, but it should speed up checkpoints/restartpoints a bit.
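To illustrate, a rough sketch of the two-pass idea. RecreateTwoPhaseFileNoSync() is a hypothetical variant that skips the per-file fsync and returns the open fd; per-xact buffers, fd-limit batching and cleanup on error are elided.

int        *fds = palloc(TwoPhaseState->numPrepXacts * sizeof(int));
int         i;

/* first pass: write all state files, letting kernel writeback start */
for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
{
    GlobalTransaction gxact = TwoPhaseState->prepXacts[i];

    /* buf/len would be per-xact, as in the existing code */
    fds[i] = RecreateTwoPhaseFileNoSync(gxact->xid, buf, len);
}

/* second pass: fsync; much of the data may already be on disk */
for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
{
    if (pg_fsync(fds[i]) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not fsync two-phase state file: %m")));
    close(fds[i]);
}
pfree(fds);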
Any thoughts?
Regards.
On 2020/05/12 9:42, Paul Guo wrote:
> Hello hackers,
>
> 1. StartupXLOG() does fsync on the whole data directory early in crash recovery. I'm wondering if we could skip some directories (at least pg_log/ and the table directories), since WAL replay etc. can ensure consistency.

I agree that we can skip the log directory, but I'm not sure skipping the table directories is really safe. Also ISTM that we can skip the directories whose contents are removed or zeroed during recovery, for example pg_snapshots, pg_subtrans, etc.

> 2. CheckPointTwoPhase()
>
> RecreateTwoPhaseFile() writes a state file for a prepared transaction and fsyncs it. It might be better to write all the files first and fsync them in one pass afterwards, since the kernel can start asynchronous writeback while the file contents are being written. If TwoPhaseState->numPrepXacts is large we could batch the work to stay under the fd resource limit. I have not tested this yet, but it should speed up checkpoints/restartpoints a bit.

It seems worth making the patch and measuring the performance improvement.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Tue, May 12, 2020 at 12:55:37PM +0900, Fujii Masao wrote:
> On 2020/05/12 9:42, Paul Guo wrote:
>> 1. StartupXLOG() does fsync on the whole data directory early in
>> crash recovery. I'm wondering if we could skip some directories
>> (at least pg_log/ and the table directories), since WAL replay
>> etc. can ensure consistency.
>
> I agree that we can skip the log directory, but I'm not sure skipping
> the table directories is really safe. Also ISTM that we can skip the
> directories whose contents are removed or zeroed during recovery,
> for example pg_snapshots, pg_subtrans, etc.

Basically excludeDirContents[] as of basebackup.c.

>> RecreateTwoPhaseFile() writes a state file for a prepared
>> transaction and fsyncs it. It might be better to write all the
>> files first and fsync them in one pass afterwards, since the kernel
>> can start asynchronous writeback while the file contents are being
>> written. If TwoPhaseState->numPrepXacts is large we could batch
>> the work to stay under the fd resource limit. I have not tested
>> this yet, but it should speed up checkpoints/restartpoints a bit.
>
> It seems worth making the patch and measuring the performance improvement.

You would need to do some micro-benchmarking here, so you could plug
in some pg_rusage_init() & co within this code path with many 2PC
files present at the same time. However, I would believe that this is
not really worth the potential code complications.
--
Michael
Thanks for the replies.
On Tue, May 12, 2020 at 2:04 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, May 12, 2020 at 12:55:37PM +0900, Fujii Masao wrote:
> On 2020/05/12 9:42, Paul Guo wrote:
>> 1. StartupXLOG() does fsync on the whole data directory early in
>> the crash recovery. I'm wondering if we could skip some
>> directories (at least the pg_log/, table directories) since wal,
>> etc could ensure consistency.
>
> I agree that we can skip log directory but I'm not sure if skipping
> table directory is really safe. Also ISTM that we can skip the directories
> that those contents are removed or zeroed during recovery,
> for example, pg_snapshots, pg_substrans, etc.
Basically excludeDirContents[] as of basebackup.c.
Table directory and WAL fsyncs probably dominate the fsync time. Do we
know of any real scenario that requires fsyncing the table directories?
Paul Guo <pguo@pivotal.io> writes:
> Table directory and WAL fsyncs probably dominate the fsync time. Do we
> know of any real scenario that requires fsyncing the table directories?

Yes, there are filesystems where that's absolutely required. See past discussions that led to putting in those fsyncs (we did not always have them).

			regards, tom lane
On Mon, May 11, 2020 at 8:43 PM Paul Guo <pguo@pivotal.io> wrote:
> I have this concern because I saw an issue in a real production environment where the startup process needed 10+ seconds to start WAL replay after being relaunched due to an elog(PANIC) [...]

I've seen this problem be way worse than that. Running fsync() on all the files and performing the unlogged table cleanup steps can together take minutes or, I think, even tens of minutes. What I think sucks most in this area is that we don't even emit any log messages if the process takes a long time, so the user has no idea why things are apparently hanging. I think we really ought to try to figure out some way to give the user a periodic progress indication when this kind of thing is underway, so that they at least have some idea what's happening.

As Tom says, I don't think there's any realistic way that we can disable it altogether, but maybe there's some way we could make it quicker, like some kind of parallelism, or by overlapping it with other things. It seems to me that we have to complete the fsync pass before we can safely checkpoint, but I don't know that it needs to be done any sooner than that... not sure though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 20, 2020 at 12:51 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I've seen this problem be way worse than that. Running fsync() on all
> the files and performing the unlogged table cleanup steps can together
> take minutes or, I think, even tens of minutes. [...]
>
> As Tom says, I don't think there's any realistic way that we can
> disable it altogether, but maybe there's some way we could make it
> quicker, like some kind of parallelism, or by overlapping it with
> other things. It seems to me that we have to complete the fsync pass
> before we can safely checkpoint, but I don't know that it needs to be
> done any sooner than that... not sure though.

I suppose you could do with the whole directory tree what register_dirty_segment() does, for the pathnames that you recognise as regular md.c segment names. Then it'd be done as part of the next checkpoint, though you might want to bring the pre_sync_fname() stuff back into it somehow to get more I/O parallelism on Linux (elsewhere it does nothing). Of course that syscall could block, and the checkpointer queue can fill up and then you have to do it synchronously anyway, so you'd have to look into whether that's enough.

The whole concept of SyncDataDirectory() bothers me a bit, though, because although it's apparently trying to be safe by being conservative, it assumes a model of writeback error handling that we now know to be false on Linux. And then it thrashes the inode cache to make it more likely that error state is forgotten, just for good measure.

What would a precise version of this look like? Maybe we really only need to fsync relation files that recovery modifies (as we already do), plus those that it would have touched but didn't because of the page LSN (a new behaviour to catch segment files that were written by the last run but not yet flushed, which I guess in practice would only happen with full_page_writes=off)? (If you were paranoid enough to believe that the buffer cache were actively trying to trick you and marked dirty pages clean and lost the error state so you'll never hear about it, you might even want to rewrite such pages once.)
On Tue, May 19, 2020 at 4:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> What would a precise version of this look like? Maybe we really only
> need to fsync relation files that recovery modifies (as we already
> do), plus those that it would have touched but didn't because of the
> page LSN [...]

I suspect there was also a worry that perhaps we'd been running before with fsync=off, or that maybe we'd just created this data directory with a non-fsyncing tool like 'cp' or 'tar -xvf'. In normal cases we shouldn't need to be nearly that conservative, but it's unclear how we can know when it's needed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 12 May 2020, 08:42 Paul Guo, <pguo@pivotal.io> wrote:
> Hello hackers,
>
> 1. StartupXLOG() does fsync on the whole data directory early in crash recovery. I'm wondering if we could skip some directories (at least pg_log/ and the table directories), since WAL replay etc. can ensure consistency. Here is the related code.
>
> if (ControlFile->state != DB_SHUTDOWNED &&
>     ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
> {
>     RemoveTempXlogFiles();
>     SyncDataDirectory();
> }
This would actually be a good candidate for a thread pool. Dispatch sync requests and don't wait. Come back later when they're done.
Unsure if that's at all feasible given that pretty much all the Pg APIs aren't thread safe though. No palloc, no elog/ereport, etc. However I don't think we're ready to run bgworkers or use shm_mq etc at that stage.
Of course if OSes would provide asynchronous IO interfaces that weren't utterly vile and broken, we wouldn't have to worry...
> RecreateTwoPhaseFile() writes a state file for a prepared transaction and fsyncs it. It might be better to write all the files first and fsync them in one pass afterwards, since the kernel can start asynchronous writeback while the file contents are being written. If TwoPhaseState->numPrepXacts is large we could batch the work to stay under the fd resource limit. I have not tested this yet, but it should speed up checkpoints/restartpoints a bit.
I seem to recall some hints we can set on a FD or mmapped range that encourage dirty buffers to be written without blocking us, too. I'll have to look them up...
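(The candidates are probably sync_file_range() and posix_fadvise() on Linux; pg_flush_data() already uses them where available. A minimal sketch of the non-blocking hint, Linux-only and untested:)

#define _GNU_SOURCE
#include <fcntl.h>

/* ask the kernel to start writeback on the whole file without waiting */
(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);

/* rougher fallback hint used elsewhere: drop the pages from cache */
(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);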
On Wed, May 27, 2020 at 12:31 AM Craig Ringer <craig@2ndquadrant.com> wrote:
> This would actually be a good candidate for a thread pool. Dispatch sync requests and don't wait. Come back later when they're done.
>
> Unsure if that's at all feasible given that pretty much all the Pg APIs aren't thread safe though. No palloc, no elog/ereport, etc. However I don't think we're ready to run bgworkers or use shm_mq etc at that stage.

We could run auxiliary processes. I think it'd be very useful if we had a general purpose worker pool that could perform arbitrary tasks and could even replace current and future dedicated launcher and worker processes, but in this case I think the problem is quite closely linked to infrastructure that we already have. I think we should:

1. Run the checkpointer during crash recovery (see https://commitfest.postgresql.org/29/2706/).

2. In walkdir(), don't call stat() on all the files, so that we don't immediately fault in all 42 bazillion inodes synchronously on a cold system (see https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BFzxupGGN4GpUdbzZN%2Btn6FQPHo8w0Q%2BAPH5Wz8RG%2Bww%40mail.gmail.com).

3. In src/common/relpath.c, add a function ParseRelationPath() that does the opposite of GetRelationPath().

4. In datadir_fsync_fname(), if ParseRelationPath() is able to recognise a file as being a relation file, build a FileTag and call RegisterSyncRequest() to tell the checkpointer to sync this file as part of the end checkpoint (currently the end-of-recovery checkpoint, but that could also be relaxed). A sketch of this step follows below.

There are a couple of things I'm not sure about though, which is why I don't have a patch for 3 and 4:

1. You have to move a few things around to avoid hideous modularity violations: it'd be weird if fd.c knew how to make md.c's sync requests. So you'd need to find a new place to put the crash-recovery data-dir-sync routine, but then it can't use walkdir().

2. I don't know how to fit the pre_sync_fname() part into this scheme. Or even if you still need it: if recovery is short, it probably doesn't help much, and if it's long then the kernel is likely to have started writing back before the checkpoint anyway due to dirty writeback timer policies. On the other hand, it'd be nice to start the work of *opening* the files sooner than the start of the checkpoint, if cold inode access is slow, so perhaps a little bit more infrastructure is needed: a new way to queue a message for the checkpointer that says "hey, why don't you start presyncing stuff". On the third hand, it's arguably better to wait for more pages to get dirtied by recovery before doing any pre-sync work anyway, because the WAL will likely be redirtying the same pages again and we'd ideally not like to write our data out twice, which is one of the reasons to want to collapse the work into the next checkpoint. So I'm not sure what the best plan is here.

As I mentioned earlier, I think it's also possible to do smarter analysis based on WAL information so that we don't even need to open all 42 bazillion files, but just the ones touched since the last checkpoint, if you're prepared to ignore the previously-running-with-fsync=off scenario Robert mentioned. I'm not too sure about that. But something like the above scheme would at least get some more concurrency going, without changing the set of hazards we believe our scheme protects against (I mean, you could argue that SyncDataDirectory() is a bit like using a sledgehammer to crack an unspecified nut, and then not even quite hitting it if it's a Linux nut, but I'm trying to practise kai-zen here).
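For concreteness, step 4 might look roughly like this. ParseRelationPath() is the hypothetical function from step 3, this is untested, and the modularity caveat above applies (fd.c shouldn't really know about md.c's handler):

/* Sketch of a modified datadir_fsync_fname() in fd.c */
static void
datadir_fsync_fname(const char *fname, bool isdir, int elevel)
{
    RelFileNode rnode;
    ForkNumber  forknum;
    uint32      segno;

    /* hypothetical: recognise base/<db>/<rel>[.<seg>] style paths */
    if (!isdir && ParseRelationPath(fname, &rnode, &forknum, &segno))
    {
        FileTag     tag;

        memset(&tag, 0, sizeof(tag));
        tag.handler = SYNC_HANDLER_MD;
        tag.rnode = rnode;
        tag.forknum = forknum;
        tag.segno = segno;

        /* defer to the checkpointer; synced at the next checkpoint */
        (void) RegisterSyncRequest(&tag, SYNC_REQUEST, true);
        return;
    }

    /* everything else: synchronous fsync as before */
    fsync_fname_ext(fname, isdir, true, elevel);
}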
On Tue, May 12, 2020 at 12:43 PM Paul Guo <pguo@pivotal.io> wrote:
> 2. CheckPointTwoPhase()
>
> RecreateTwoPhaseFile() writes a state file for a prepared transaction and fsyncs it. It might be better to write all the files first and fsync them in one pass afterwards [...]

Hi Paul,

I hadn't previously focused on this second part of your email. I think the fsync() call in RecreateTwoPhaseFile() might be a candidate for processing by the checkpoint code through the new facilities in sync.c, which effectively does something like what you describe. Take a look at the code I'm proposing for slru.c in https://commitfest.postgresql.org/29/2669/ and also in md.c. You'd need a way to describe the path of these files in a FileTag struct, so you can defer the work; it will invoke your callback to sync the file as part of the next checkpoint (or panic if it can't). You also need to make sure to tell it to forget the sync request before you unlink the file.

Although it still has to call fsync one-by-one on the files at checkpoint time, by deferring it until then there is a good chance the kernel has already done the work so you don't have to go off-CPU at all. I hope that means we'd often never have to perform the fsync at all, because the file is usually gone before we reach the checkpoint. Do I have that right?
On Thu, Sep 3, 2020 at 11:30 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> 4. In datadir_fsync_fname(), if ParseRelationPath() is able to
> recognise a file as being a relation file, build a FileTag and call
> RegisterSyncRequest() to tell the checkpointer to sync this file as
> part of the end checkpoint (currently the end-of-recovery checkpoint,
> but that could also be relaxed).

For the record, Andres Freund mentioned a few problems with this off-list and suggested we consider calling Linux syncfs() for each top level directory that could potentially be on a different filesystem. That seems like a nice idea to look into.
On Thu, Sep 3, 2020 at 12:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, May 12, 2020 at 12:43 PM Paul Guo <pguo@pivotal.io> wrote:
>> RecreateTwoPhaseFile(gxact->xid, buf, len);
>
> I hadn't previously focused on this second part of your email. I
> think the fsync() call in RecreateTwoPhaseFile() might be a candidate
> for processing by the checkpoint code through the new facilities in
> sync.c, which effectively does something like what you describe.

I looked at this more closely and realised that I misunderstood; I was thinking of a problem like the one that was already solved years ago with commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71. Sorry for the noise.
On Wed, Sep 9, 2020 at 3:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> For the record, Andres Freund mentioned a few problems with this
> off-list and suggested we consider calling Linux syncfs() for each top
> level directory that could potentially be on a different filesystem.
> That seems like a nice idea to look into.

Here's an experimental patch to try that. One problem is that before Linux 5.8, syncfs() doesn't report failures[1]. I'm not sure what to think about that; in the current coding we just log them and carry on anyway, but any theoretical problems that causes for BLK_DONE should be moot anyway because of FPIs which result in more writes and syncs. Another is that it may affect other files that aren't under pgdata as collateral damage, but that seems acceptable. It also makes me a bit sad that this wouldn't help other OSes.

(Archeological note: The improved syncfs() error reporting is linked to 2018 PostgreSQL/Linux hacker discussions[2], because it was thought that syncfs() might be useful for checkpointing, though I believe since then things have moved on and the new thinking is that we'd use a new proposed interface to read per-filesystem I/O error counters while checkpointing.)

[1] https://man7.org/linux/man-pages/man2/sync.2.html
[2] https://lwn.net/Articles/752063/
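The heart of the syncfs() approach is small; roughly this shape (a sketch, not the actual patch; Linux-only, error handling simplified, and the pre-5.8 reporting caveat above applies):

#include <fcntl.h>
#include <unistd.h>     /* syncfs() is Linux-specific; needs _GNU_SOURCE */

static void
do_syncfs(const char *path)
{
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return;             /* sketch: real code would report at LOG */
    if (syncfs(fd) < 0)
        elog(LOG, "could not synchronize file system for \"%s\": %m", path);
    close(fd);
}

/* called in place of SyncDataDirectory()'s per-file walk: */
do_syncfs(".");             /* the data directory's filesystem */
do_syncfs("pg_wal");        /* pg_wal may be a symlink to another fs */
/* ... plus one call per entry under pg_tblspc/ ... */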
On Mon, Oct 5, 2020 at 2:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's an experimental patch to try that. One problem is that before
> Linux 5.8, syncfs() doesn't report failures[1]. [...]

... and for comparison/discussion, here is an alternative patch that figures out precisely which files need to be fsync'd using information in the WAL. On a system with full_page_writes=on, this effectively means that we don't have to do anything at all for relation files, as described in more detail in the commit message. You still need to fsync the WAL files to make sure you can't replay records from the log that were written but not yet fdatasync'd, addressed in the patch. I'm not yet sure which other kinds of special files might need special treatment.

Some thoughts:

1. Both patches avoid the need to open many files. With 1 million tables this can take over a minute even on a fast system with warm caches and/or fast local storage, before replay begins, and for a cold system with high latency storage it can be a serious problem.

2. The syncfs() one is comparatively simple, but it only works on Linux. If you used sync() (or sync(); sync()) instead, you'd be relying on non-POSIX behaviour, because POSIX says it doesn't wait for completion and indeed many non-Linux systems really are like that.

3. Though we know of kernel/filesystem pairs that can mark buffers clean while retaining the dirty contents (which would cause corruption with the current code in master, or either of these patches), fortunately those systems can't possibly run with full_page_writes=off, so such problems are handled the same way torn pages are fixed.

4. Perhaps you could set a flag in the buffer to say 'needs sync' as a way to avoid repeatedly requesting syncs for the same page, but it might not be worth the hassle involved.

Some other considerations that have been mentioned to me by colleagues I talked to about this:

1. The ResetUnloggedRelations() code still has to visit all relation files looking for init forks after a crash. But that turns out to be OK: it only reads directory entries in a straight line, and doesn't stat() or open() files with non-matching names, so unless you have very many unlogged tables the problem is already avoided. (It also calls fsync() on the result, which could perhaps be replaced with a deferred request too; not sure, that's for another day.)

2. There may be some more directories that need special fsync() treatment. SLRUs are already covered (either handled by checkpoint or they hold ephemeral data), and I think pg_tblspc changes will be redone. pg_logical? I am not sure.
On Wed, Oct 7, 2020 at 6:17 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... and for comparison/discussion, here is an alternative patch that
> figures out precisely which files need to be fsync'd using information
> in the WAL. [...]

Michael Banck reported[1] a system that spent 20 minutes in SyncDataDirectory(). His summary caused me to go back and read the discussions[2][3] that produced the current behaviour via commits 2ce439f3 and d8179b00, and I wanted to add a couple more observations about the two draft patches mentioned above.

About the need to sync files that were dirtied during a previous run:

1. The syncfs() patch has the same ignore-errors-and-press-on behaviour as d8179b00 gave us, though on Linux < 5.8 it would not even report them at LOG level.

2. The "precise" fsync() patch defers the work until after redo, but if you get errors while processing the queued syncs, you would not be able to complete the end-of-recovery checkpoint. This is correct behaviour in my opinion; any such checkpoint that is allowed to complete would be a lie, and would make the corruption permanent, releasing the WAL that was our only hope of recovering. If you want to run a so-damaged system for forensic purposes, you could always bring it up with fsync=off, or consider the idea from a nearby thread to allow the end-of-recovery checkpoint to be disabled (then the eventual first checkpoint will likely take down your system, but that's the case with the current ignore-errors-and-hope-for-the-best-after-crash coding for the *next* checkpoint, assuming your damaged filesystem continues to produce errors; it's just more principled IMHO). I recognise that this sounds like an absolutist argument that might attract some complaints on practical grounds, but I don't think it really makes too much difference in practice. Consider a typical Linux filesystem: individual errors aren't going to be reported more than once, and full_page_writes must be on on such a system, so we'll be writing out all affected pages again and then trying to fsync again in the end-of-recovery checkpoint. So despite our attempt at creating a please-ignore-errors-and-corrupt-my-database-and-play-on mode, you'll likely panic again if I/O errors persist, or survive and continue without corruption if the error-producing conditions were fleeting and transient (as in the thin provisioning EIO or NFS ENOSPC conditions discussed in other threads).

About the need to fsync everything in sight on a system that previously ran with fsync=off, as discussed in the earlier threads:

1. The syncfs() patch provides about the same weak guarantee as the current coding. Something like: it can convert all checkpoints that were logged in the time since the kernel started from fiction to fact, except those corrupted by (unlikely) I/O errors, which may only be reported in the kernel logs, if at all.

2. The "precise" fsync() patch provides no such weak guarantee. It takes the last checkpoint at face value, and can't help you with anything that happened before that.

The problem I have with this is that the current coding *only does it for crash scenarios*. So, if you're moving a system from fsync=off to fsync=on with a clean shutdown in between, we already don't help you. Effectively, you need to run sync(1) (but see man page for caveats and kernel logs for errors) to convert your earlier checkpoints from fiction to fact. So all we're discussing here is what we do with a system that crashed. Why is that a sane time to transition from fsync=off to fsync=on, and, supposing someone does that, why should we offer any more guarantees about that than we do when you make the same transition on a system that shut down cleanly?

[1] https://www.postgresql.org/message-id/flat/4a5d233fcd28b5f1008aec79119b02b5a9357600.camel%40credativ.de
[2] https://www.postgresql.org/message-id/flat/20150114105908.GK5245%40awork2.anarazel.de#1525fab691dbaaef35108016f0b99467
[3] https://www.postgresql.org/message-id/flat/20150523172627.GA24277%40msg.df7cb.de
Hi,

On Wednesday, 2020-10-07 at 18:17 +1300, Thomas Munro wrote:
> ... and for comparison/discussion, here is an alternative patch that
> figures out precisely which files need to be fsync'd using information
> in the WAL. [...]
>
> Some thoughts:
>
> 1. Both patches avoid the need to open many files. With 1 million
> tables this can take over a minute even on a fast system with warm
> caches and/or fast local storage, before replay begins, and for a cold
> system with high latency storage it can be a serious problem.

You mention "serious problem" and "over a minute", but I don't recall you mentioning how long it takes with those two patches (or maybe I missed it), so here goes: I created an instance with 250 databases on 250 tablespaces on SAN storage. This is on 12.4; the patches can be backpatched with minimal changes.

After pg_ctl stop -m immediate, a pg_ctl start -w (or rather the time between the two log messages "database system was interrupted; last known up at %s" and "database system was not properly shut down; automatic recovery in progress") took:

1. 12-13 seconds on vanilla
2. usually < 10 ms, sometimes 70-80 ms with the syncfs patch
3. 4 ms with the optimized sync patch

That's a dramatic improvement, but maybe also a best case scenario, as no traffic happened since the last checkpoint. I did some light pgbench before killing the server again, but couldn't get the optimized sync patch to take any longer; might try harder at some point but won't have much more time to test today.

Michael

--
Michael Banck
Projektleiter / Senior Berater
credativ GmbH
Hi,

On Wednesday, 2020-10-07 at 18:17 +1300, Thomas Munro wrote:
> ... and for comparison/discussion, here is an alternative patch that
> figures out precisely which files need to be fsync'd using information
> in the WAL.

One question about this: Did you consider the case of a basebackup being copied/restored somewhere and the restore/PITR being started? Shouldn't Postgres then sync the whole data directory first in order to assure durability, or do we consider this to be on the tool that does the copying? Or is this not needed somehow?

My understanding is that Postgres would go through the same code path during PITR:

    if (ControlFile->state != DB_SHUTDOWNED &&
        ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
    {
        RemoveTempXlogFiles();
        SyncDataDirectory();

If I didn't miss anything, that would be a point for the syncfs patch?

Michael

--
Michael Banck
Projektleiter / Senior Berater
credativ GmbH
On Wed, Oct 14, 2020 at 12:53 AM Michael Banck <michael.banck@credativ.de> wrote:
> One question about this: Did you consider the case of a basebackup being
> copied/restored somewhere and the restore/PITR being started? Shouldn't
> Postgres then sync the whole data directory first in order to assure
> durability, or do we consider this to be on the tool that does the
> copying? Or is this not needed somehow?

To go with precise fsyncs, we'd have to say that it's the job of the creator of the secondary copy. Unfortunately that's not a terribly convenient thing to do (or at least the details vary).

[The devil's advocate enters the chat] Let me ask you this: If you copy the pgdata directory of a system that has shut down cleanly, for example with cp or rsync as described in the manual, who will sync the files before you start up the cluster? Not us, anyway, because SyncDataDirectory() only runs after a crash. A checkpoint is, after all, a promise that all changes up to some LSN are durably on disk, and we leave it up to people who are copying files around underneath us while we're not running to worry about what happens if you make that untrue. Now, why is a database that crashed any different?

> My understanding is that Postgres would go through the same code path
> during PITR:
>
>     if (ControlFile->state != DB_SHUTDOWNED &&
>         ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
>     {
>         RemoveTempXlogFiles();
>         SyncDataDirectory();
>
> If I didn't miss anything, that would be a point for the syncfs patch?

Yeah.
On Wed, Oct 14, 2020 at 02:48:18PM +1300, Thomas Munro wrote:
> On Wed, Oct 14, 2020 at 12:53 AM Michael Banck
> <michael.banck@credativ.de> wrote:
>> One question about this: Did you consider the case of a basebackup being
>> copied/restored somewhere and the restore/PITR being started? Shouldn't
>> Postgres then sync the whole data directory first in order to assure
>> durability, or do we consider this to be on the tool that does the
>> copying? Or is this not needed somehow?
>
> To go with precise fsyncs, we'd have to say that it's the job of the
> creator of the secondary copy. Unfortunately that's not a terribly
> convenient thing to do (or at least the details vary).

Yeah, it is safer to assume that it is the responsibility of the backup tool to ensure that, because it could be possible that a host is unplugged just after taking a backup, and having Postgres do this work at the beginning of archive recovery would not help in most cases. IMO this comes back to the point where we usually should not care much how long a backup takes as long as it is done right. Users care much more about how long a restore takes until consistency is reached. And this is in line with things that have been done via bc34223b or 96a7128.
--
Michael
Hi,

On Wednesday, 2020-10-14 at 14:06 +0900, Michael Paquier wrote:
> Yeah, it is safer to assume that it is the responsibility of the
> backup tool to ensure that, because it could be possible that a host is
> unplugged just after taking a backup, and having Postgres do this work
> at the beginning of archive recovery would not help in most cases.
> IMO this comes back to the point where we usually should not care much
> how long a backup takes as long as it is done right. Users care much
> more about how long a restore takes until consistency is reached. And
> this is in line with things that have been done via bc34223b or
> 96a7128.

I agree that the backup tool should make sure the backup is durable and this is out of scope. I was worried more about the restore part. Right now, https://www.postgresql.org/docs/13/continuous-archiving.html#BACKUP-PITR-RECOVERY says this in point 4:

|Restore the database files from your file system backup. Be sure that
|they are restored with the right ownership (the database system user,
|not root!) and with the right permissions. If you are using
|tablespaces, you should verify that the symbolic links in pg_tblspc/
|were correctly restored.

There's no word of running sync afterwards or otherwise making sure that everything is back in a durable fashion. Currently, we run fsync on all the files on startup anyway during a PITR, but with the second patch, that will no longer happen. Maybe that is not a problem, but if that's the case, it's at least not clear to me.

Also, Tom seemed to imply up-thread in 3750.1589826415@sss.pgh.pa.us that syncing the files was necessary, but maybe I'm reading too much into his rather short mail.

Michael

--
Michael Banck
Projektleiter / Senior Berater
credativ GmbH
On Wed, 14 Oct 2020, 13:06 Michael Paquier, <michael@paquier.xyz> wrote:
> On Wed, Oct 14, 2020 at 02:48:18PM +1300, Thomas Munro wrote:
>> On Wed, Oct 14, 2020 at 12:53 AM Michael Banck
>> <michael.banck@credativ.de> wrote:
>>> One question about this: Did you consider the case of a basebackup being
>>> copied/restored somewhere and the restore/PITR being started? Shouldn't
>>> Postgres then sync the whole data directory first in order to assure
>>> durability, or do we consider this to be on the tool that does the
>>> copying? Or is this not needed somehow?
>>
>> To go with precise fsyncs, we'd have to say that it's the job of the
>> creator of the secondary copy. Unfortunately that's not a terribly
>> convenient thing to do (or at least the details vary).
>
> Yeah, it is safer to assume that it is the responsibility of the
> backup tool to ensure that, because it could be possible that a host is
> unplugged just after taking a backup, and having Postgres do this work
> at the beginning of archive recovery would not help in most cases.
Let's document that assumption in the docs for pg_basebackup and the file system copy based replica creation docs. With a reference to initdb's datadir sync option.
> IMO this comes back to the point where we usually should not care much
> how long a backup takes as long as it is done right. Users care much
> more about how long a restore takes until consistency is reached. And
> this is in line with things that have been done via bc34223b or
> 96a7128.
On Tue, Dec 01, 2020 at 07:39:30PM +0800, Craig Ringer wrote:
> On Wed, 14 Oct 2020, 13:06 Michael Paquier, <michael@paquier.xyz> wrote:
>> Yeah, it is safer to assume that it is the responsibility of the
>> backup tool to ensure that, because it could be possible that a host is
>> unplugged just after taking a backup, and having Postgres do this work
>> at the beginning of archive recovery would not help in most cases.
>
> Let's document that assumption in the docs for pg_basebackup and the file
> system copy based replica creation docs. With a reference to initdb's
> datadir sync option.

Do you have any suggestion about that, in the shape of a patch perhaps?
--
Michael
On Wed, 2 Dec 2020 at 15:41, Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Dec 01, 2020 at 07:39:30PM +0800, Craig Ringer wrote:
>> On Wed, 14 Oct 2020, 13:06 Michael Paquier, <michael@paquier.xyz> wrote:
>>> Yeah, it is safer to assume that it is the responsibility of the
>>> backup tool to ensure that, because it could be possible that a host is
>>> unplugged just after taking a backup, and having Postgres do this work
>>> at the beginning of archive recovery would not help in most cases.
>>
>> Let's document that assumption in the docs for pg_basebackup and the file
>> system copy based replica creation docs. With a reference to initdb's
>> datadir sync option.
>
> Do you have any suggestion about that, in the shape of a patch perhaps?
I'll try to get to that, but I have some other docs patches outstanding that I'd like to resolve first.