Thread: initdb and fsync

initdb and fsync

From

Jeff Davis

Date:

27 January 2012, 20:20:01

It looks like initdb doesn't fsync all the files it creates, e.g. the
PG_VERSION file.

While it's unlikely that it would cause any real data loss, it can be
inconvenient in some testing scenarios involving VMs.

Thoughts? Would a patch to add a few fsync calls to initdb be accepted?
Is a platform-independent fsync be available at initdb time?

Regards,   Jeff Davis

Re: initdb and fsync

From

Noah Misch

Date:

28 January 2012, 00:52:40

On Fri, Jan 27, 2012 at 04:19:41PM -0800, Jeff Davis wrote:
> It looks like initdb doesn't fsync all the files it creates, e.g. the
> PG_VERSION file.
> 
> While it's unlikely that it would cause any real data loss, it can be
> inconvenient in some testing scenarios involving VMs.
> 
> Thoughts? Would a patch to add a few fsync calls to initdb be accepted?

+1.  If I'm piloting "strace -f" right, initdb currently issues *no* syncs.

We'd probably, then, want a way to re-disable the fsyncs for hacker benefit.

> Is a platform-independent fsync be available at initdb time?

Not sure.

Thanks,
nm

Re: initdb and fsync

From

Andrew Dunstan

Date:

28 January 2012, 11:32:33

On 01/27/2012 11:52 PM, Noah Misch wrote:
>> Is a platform-independent fsync be available at initdb time?
> Not sure.
>

It's a macro on Windows that calls _commit(fd), so it should be portable 
enough.

I'm curious what problem we're actually solving here, though. I've run 
the buildfarm countless thousands of times on different VMs, and five of 
my seven current animals run in VMs, and I don't think I've ever seen a 
failure ascribable to inadequately synced files from initdb.

cheers

andrew

Re: initdb and fsync

From

Jeff Davis

Date:

28 January 2012, 14:16:29

On Sat, 2012-01-28 at 10:31 -0500, Andrew Dunstan wrote:
> I'm curious what problem we're actually solving here, though. I've run 
> the buildfarm countless thousands of times on different VMs, and five of 
> my seven current animals run in VMs, and I don't think I've ever seen a 
> failure ascribable to inadequately synced files from initdb.

I believe I have seen such a problem second hand in a situation where
the VM was known to be killed harshly (not sure if you do that
regularly).

It's a little difficult for me to _prove_ that this would have solved
the problem, and I think it was only observed once (though I could
probably reproduce it if I tried). The symptom was a log message
indicating that PG_VERSION was missing or corrupt on a system that was
previously started and online (albeit briefly for a test).

Regards,Jeff Davis

Re: initdb and fsync

From

Tom Lane

Date:

28 January 2012, 14:18:18

Andrew Dunstan <andrew@dunslane.net> writes:
> I'm curious what problem we're actually solving here, though. I've run 
> the buildfarm countless thousands of times on different VMs, and five of 
> my seven current animals run in VMs, and I don't think I've ever seen a 
> failure ascribable to inadequately synced files from initdb.

Yeah.  Personally I would be sad if initdb got noticeably slower, and
I've never seen or heard of a failure that this would fix.

I wonder whether it wouldn't be sufficient to call sync(2) at the end,
anyway, rather than cluttering the entire initdb codebase with fsync
calls.
        regards, tom lane

Re: initdb and fsync

From

Jeff Janes

Date:

28 January 2012, 14:27:29

On Sat, Jan 28, 2012 at 7:31 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>
> On 01/27/2012 11:52 PM, Noah Misch wrote:
>>>
>>> Is a platform-independent fsync be available at initdb time?
>>
>> Not sure.
>>
>
> It's a macro on Windows that calls _commit(fd), so it should be portable
> enough.
>
> I'm curious what problem we're actually solving here, though. I've run the
> buildfarm countless thousands of times on different VMs, and five of my
> seven current animals run in VMs, and I don't think I've ever seen a failure
> ascribable to inadequately synced files from initdb.

I wouldn't expect you to ever see that problem on the buildfarm.  If
the OS gets thunked during the middle of a regression test, when it
comes back up the code is not going to try to pick up where it left
off, it is just going to blow away the entire install and start over
from scratch.  So any crash-recoverability problems will never be
detected.  I would guess the original poster is doing a more stringent
kind of test.

Cheers,

Jeff

Re: initdb and fsync

From

Jeff Davis

Date:

28 January 2012, 14:46:27

On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > I'm curious what problem we're actually solving here, though. I've run 
> > the buildfarm countless thousands of times on different VMs, and five of 
> > my seven current animals run in VMs, and I don't think I've ever seen a 
> > failure ascribable to inadequately synced files from initdb.
> 
> Yeah.  Personally I would be sad if initdb got noticeably slower, and
> I've never seen or heard of a failure that this would fix.
> 
> I wonder whether it wouldn't be sufficient to call sync(2) at the end,
> anyway, rather than cluttering the entire initdb codebase with fsync
> calls.

I can always add a "sync" call to the test, also (rather than modifying
initdb). Or, it could be an initdb option, which might be a good
compromise. I don't have a strong opinion here.

As machines get more memory and filesystems get more lazy, I wonder if
it will be a more frequent occurrence, however. On the other hand, if
filesystems are more lazy, that also increases the cost associated with
extra "sync" calls. I think there would be a surprise factor if
sometimes initdb had a long pause at the end and caused 10GB of data to
be written out.

Regards,Jeff Davis

Re: initdb and fsync

From

Jeff Janes

Date:

28 January 2012, 14:57:21

On Sat, Jan 28, 2012 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> I'm curious what problem we're actually solving here, though. I've run
>> the buildfarm countless thousands of times on different VMs, and five of
>> my seven current animals run in VMs, and I don't think I've ever seen a
>> failure ascribable to inadequately synced files from initdb.
>
> Yeah.  Personally I would be sad if initdb got noticeably slower, and
> I've never seen or heard of a failure that this would fix.
>
> I wonder whether it wouldn't be sufficient to call sync(2) at the end,
> anyway, rather than cluttering the entire initdb codebase with fsync
> calls.
>
>                        regards, tom lane

Does sync(2) behave like sync(8) and flush the entire system cache, or
does it just flush the files opened by the process which called it?

The man page didn't enlighten me on that.

sometimes sync(8) never returns.  It doesn't just flush what was dirty
at the time it was called, it actually keeps running until there are
simultaneously no dirty pages anywhere in the system.  On busy
systems, this condition might never be reached.  And it can't be
interrupted, not even with kill -9.

Cheers,

Jeff

Re: initdb and fsync

From

Andrew Dunstan

Date:

28 January 2012, 15:00:23


On 01/28/2012 01:46 PM, Jeff Davis wrote:
> On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote:
>> Andrew Dunstan<andrew@dunslane.net>  writes:
>>> I'm curious what problem we're actually solving here, though. I've run
>>> the buildfarm countless thousands of times on different VMs, and five of
>>> my seven current animals run in VMs, and I don't think I've ever seen a
>>> failure ascribable to inadequately synced files from initdb.
>> Yeah.  Personally I would be sad if initdb got noticeably slower, and
>> I've never seen or heard of a failure that this would fix.
>>
>> I wonder whether it wouldn't be sufficient to call sync(2) at the end,
>> anyway, rather than cluttering the entire initdb codebase with fsync
>> calls.
> I can always add a "sync" call to the test, also (rather than modifying
> initdb). Or, it could be an initdb option, which might be a good
> compromise. I don't have a strong opinion here.
>
> As machines get more memory and filesystems get more lazy, I wonder if
> it will be a more frequent occurrence, however. On the other hand, if
> filesystems are more lazy, that also increases the cost associated with
> extra "sync" calls. I think there would be a surprise factor if
> sometimes initdb had a long pause at the end and caused 10GB of data to
> be written out.
>

-1 for that. A very quick look at initdb.c suggests to me that there are 
only two places where we'd need to put fsync(), right before we call 
fclose()  in write_file() and write_version_file(). If we're going to do 
anything that seems to be the least painful and most portable way to go.

cheers

andrew

Re: initdb and fsync

From

Florian Weimer

Date:

04 February 2012, 17:20:28

* Tom Lane:

> I wonder whether it wouldn't be sufficient to call sync(2) at the end,
> anyway, rather than cluttering the entire initdb codebase with fsync
> calls.

We tried to do this in the Debian package mananger.  It works as
expected on Linux systems, but it can cause a lot of data to hit the
disk, and there are kernel versions where sync(2) never completes if
the system is rather busy.

initdb is much faster with 9.1 than with 8.4.  It's so fast that you
can use it in test suites, instead of reusing an existing cluster.
I think this is a rather desirable property.

Re: initdb and fsync

From

Jeff Davis

Date:

04 February 2012, 19:41:40

On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote:
> Yeah.  Personally I would be sad if initdb got noticeably slower, and
> I've never seen or heard of a failure that this would fix.

I worked up a patch, and it looks like it does about 6 file fsync's and
a 7th for the PGDATA directory. That degrades the time from about 1.1s
to 1.4s on my workstation.

pg_test_fsync says this about my workstation (one 8kB write):
        open_datasync                     117.495 ops/sec
        fdatasync                         117.949 ops/sec
        fsync                              25.530 ops/sec
        fsync_writethrough                            n/a
        open_sync                          24.666 ops/sec

25 ops/sec means about 40ms per fsync, times 7 is about 280ms, so that
seems like about the right degradation for fsync. I tried with fdatasync
as well to see if it improved things, and I wasn't able to realize any
difference (not sure exactly why).

So, is it worth it? Should we make it an option that can be specified?

> I wonder whether it wouldn't be sufficient to call sync(2) at the end,
> anyway, rather than cluttering the entire initdb codebase with fsync
> calls.

It looks like there are only a few places, so I don't think clutter is
really the problem with the simple patch at this point (unless there is
a portability problem with just calling fsync).

Regards,
    Jeff Davis

Attachment

initdb-fsync.patch.gz

Re: initdb and fsync

From

Noah Misch

Date:

04 February 2012, 21:38:51

On Sat, Feb 04, 2012 at 03:41:27PM -0800, Jeff Davis wrote:
> On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote:
> > Yeah.  Personally I would be sad if initdb got noticeably slower, and
> > I've never seen or heard of a failure that this would fix.
> 
> I worked up a patch, and it looks like it does about 6 file fsync's and
> a 7th for the PGDATA directory. That degrades the time from about 1.1s
> to 1.4s on my workstation.

> So, is it worth it? Should we make it an option that can be specified?

If we add fsync calls to the initdb process, they should cover the entire data
directory tree.  This patch syncs files that initdb.c writes, but we ought to
also sync files that bootstrap-mode backends had written.  An optimization
like the pg_flush_data() call in copy_file() may reduce the speed penalty.

initdb should do these syncs by default and offer an option to disable them.

Re: initdb and fsync

From

Jeff Davis

Date:

05 February 2012, 14:53:30

On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
> If we add fsync calls to the initdb process, they should cover the entire data
> directory tree.  This patch syncs files that initdb.c writes, but we ought to
> also sync files that bootstrap-mode backends had written.

It doesn't make sense for initdb to take responsibility to sync files
created by the backend. If there are important files that the backend
creates, it should be the backend's responsibility to fsync them (and
their parent directory, if needed). And if they are unimportant to the
backend, then there is no reason for initdb to fsync them.

> An optimization
> like the pg_flush_data() call in copy_file() may reduce the speed penalty.

That worked pretty well. It took it down about 100ms on my machine,
which closes the gap significantly.

> initdb should do these syncs by default and offer an option to disable them.

For test frameworks that run initdb often, that makes sense.

But for developers, it doesn't make sense to spend 0.5s typing an option
that saves you 0.3s. So, we'd need some more convenient way to choose
the no-fsync option, like an environment variable that developers can
set. Or maybe developers don't care about 0.3s?

Regards,Jeff Davis

Re: initdb and fsync

From

Tom Lane

Date:

05 February 2012, 15:50:36

Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
>> If we add fsync calls to the initdb process, they should cover the entire data
>> directory tree.  This patch syncs files that initdb.c writes, but we ought to
>> also sync files that bootstrap-mode backends had written.

> It doesn't make sense for initdb to take responsibility to sync files
> created by the backend.

No, but the more interesting question is whether bootstrap mode troubles
to fsync its writes.  I'm not too sure about that either way ...
        regards, tom lane

Re: initdb and fsync

From

Noah Misch

Date:

05 February 2012, 18:57:17

On Sun, Feb 05, 2012 at 10:53:20AM -0800, Jeff Davis wrote:
> On Sat, 2012-02-04 at 20:18 -0500, Noah Misch wrote:
> > If we add fsync calls to the initdb process, they should cover the entire data
> > directory tree.  This patch syncs files that initdb.c writes, but we ought to
> > also sync files that bootstrap-mode backends had written.
> 
> It doesn't make sense for initdb to take responsibility to sync files
> created by the backend. If there are important files that the backend
> creates, it should be the backend's responsibility to fsync them (and
> their parent directory, if needed). And if they are unimportant to the
> backend, then there is no reason for initdb to fsync them.

I meant primarily to illustrate the need to be comprehensive, not comment on
which executable should fsync a particular file.  Bootstrap-mode backends do
not sync anything during an initdb run on my system.  With your patch, we'll
fsync a small handful of files and leave nearly everything else vulnerable.

That being said, having each backend fsync its own writes will mean syncing
certain files several times within a single initdb run.  If the penalty from
that proves high enough, we may do well to instead have initdb.c sync
everything just once.

> > initdb should do these syncs by default and offer an option to disable them.
> 
> For test frameworks that run initdb often, that makes sense.
> 
> But for developers, it doesn't make sense to spend 0.5s typing an option
> that saves you 0.3s. So, we'd need some more convenient way to choose
> the no-fsync option, like an environment variable that developers can
> set. Or maybe developers don't care about 0.3s?

Developers have shell aliases/functions/scripts and command history.  I
wouldn't object to having an environment variable control it, but I would not
personally find that more convenient than a command-line switch.

Thanks,
nm

Re: initdb and fsync

From

Peter Eisentraut

Date:

10 February 2012, 16:58:23

On sön, 2012-02-05 at 10:53 -0800, Jeff Davis wrote:
> > initdb should do these syncs by default and offer an option to
> disable them.
> 
> For test frameworks that run initdb often, that makes sense.
> 
> But for developers, it doesn't make sense to spend 0.5s typing an
> option
> that saves you 0.3s. So, we'd need some more convenient way to choose
> the no-fsync option, like an environment variable that developers can
> set. Or maybe developers don't care about 0.3s?
> 
You can use https://launchpad.net/libeatmydata for those cases.

Re: initdb and fsync

From

Robert Haas

Date:

13 February 2012, 09:15:09

On Fri, Feb 10, 2012 at 3:57 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On sön, 2012-02-05 at 10:53 -0800, Jeff Davis wrote:
>> > initdb should do these syncs by default and offer an option to
>> disable them.
>>
>> For test frameworks that run initdb often, that makes sense.
>>
>> But for developers, it doesn't make sense to spend 0.5s typing an
>> option
>> that saves you 0.3s. So, we'd need some more convenient way to choose
>> the no-fsync option, like an environment variable that developers can
>> set. Or maybe developers don't care about 0.3s?
>>
> You can use https://launchpad.net/libeatmydata for those cases.

That's hilarious.

But, a command-line option seems more convenient.

It also seems entirely sufficient.  The comments above suggest that it
would take too long to type the option, but any PG developers who are
worried about the speed difference surely know how to create shell
aliases, shell functions, shell scripts, ... and if anyone's really
concerned about it, we can provide a short form for the option.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: initdb and fsync

From

Jeff Davis

Date:

13 March 2012, 00:50:06

On Sun, 2012-02-05 at 17:56 -0500, Noah Misch wrote:
> I meant primarily to illustrate the need to be comprehensive, not comment on
> which executable should fsync a particular file.  Bootstrap-mode backends do
> not sync anything during an initdb run on my system.  With your patch, we'll
> fsync a small handful of files and leave nearly everything else vulnerable.

Thank you for pointing that out. With that in mind, I have a new version
of the patch which just recursively fsync's the whole directory
(attached).

I also introduced a new option --nosync (-N) to disable this behavior.

The bad news is that it introduces a lot more time to initdb -- it goes
from about 1s to about 10s on my machine. I tried fsync'ing the whole
directory twice just to make sure that the second was a no-op, and
indeed it didn't make much difference (still about 10s).

That's pretty inefficient considering that

  initdb -D data --nosync && sync

only takes a couple seconds. Clearly batching the operation is a big
help. Maybe there's some more efficient way to fsync a lot of
files/directories? Or maybe I can mitigate it by avoiding files that
don't really need to be fsync'd?

Regards,
    Jeff Davis

Attachment

initdb-fsync-20120312.patch.gz

Re: initdb and fsync

From

Andres Freund

Date:

13 March 2012, 05:45:34

On Tuesday, March 13, 2012 04:49:40 AM Jeff Davis wrote:
> On Sun, 2012-02-05 at 17:56 -0500, Noah Misch wrote:
> > I meant primarily to illustrate the need to be comprehensive, not comment
> > on which executable should fsync a particular file.  Bootstrap-mode
> > backends do not sync anything during an initdb run on my system.  With
> > your patch, we'll fsync a small handful of files and leave nearly
> > everything else vulnerable.
> 
> Thank you for pointing that out. With that in mind, I have a new version
> of the patch which just recursively fsync's the whole directory
> (attached).
> 
> I also introduced a new option --nosync (-N) to disable this behavior.
> 
> The bad news is that it introduces a lot more time to initdb -- it goes
> from about 1s to about 10s on my machine. I tried fsync'ing the whole
> directory twice just to make sure that the second was a no-op, and
> indeed it didn't make much difference (still about 10s). 
I suggest you try making it two loops:

for recursively everything in dir:  posix_fadvise(fd, POSIX_FADV_DONTNEED);

for recursively everything in dir:  fsync(fd);

In my experience that gives way much better performance due to the fact that 
it does not force its own metadata/journal commit/transaction for every file 
but can be batched. copydir() does the same since some releases...

Obviously its not that nice to use _DONTNEED but I havent found something that 
works equally well. You could try sync_file_range(fd, 0, 0, 
SYNC_FILE_RANGE_WRITE) in the first loop but my experience with that hasn't 
been that good.

Andres

Re: initdb and fsync

From

Jeff Davis

Date:

14 March 2012, 01:23:19

On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote:
> for recursively everything in dir:
>    posix_fadvise(fd, POSIX_FADV_DONTNEED);
>
> for recursively everything in dir:
>    fsync(fd);

Wow, that made a huge difference!

  no sync:      ~ 1.0s
  sync:         ~10.0s
  fadvise+sync: ~ 1.3s

Patch attached.

Now I feel much better about it. Most people will either have fadvise, a
write cache (rightly or wrongly), or actually need the sync. Those that
have none of those can use -N.

Regards,
    Jeff Davis

Attachment

initdb-fsync-20120313.patch.gz

Re: initdb and fsync

From

Andres Freund

Date:

14 March 2012, 06:26:40

On Wednesday, March 14, 2012 05:23:03 AM Jeff Davis wrote:
> On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote:
> > for recursively everything in dir:
> >    posix_fadvise(fd, POSIX_FADV_DONTNEED);
> > 
> > for recursively everything in dir:
> >    fsync(fd);
> 
> Wow, that made a huge difference!
> 
>   no sync:      ~ 1.0s
>   sync:         ~10.0s
>   fadvise+sync: ~ 1.3s
> 
> Patch attached.
> 
> Now I feel much better about it. Most people will either have fadvise, a
> write cache (rightly or wrongly), or actually need the sync. Those that
> have none of those can use -N.
Well, while the positive effect of this are rather large it also has the bad 
effect of pushing the whole new database out of the cache. Which is not so nice 
if you want to run tests on it seconds later.
How are the results with sync_file_range(fd, 0, 0, 
SYNC_FILE_RANGE_WRITE)? 

Andres

Re: initdb and fsync

From

Jeff Davis

Date:

15 March 2012, 03:39:02

On Wed, 2012-03-14 at 10:26 +0100, Andres Freund wrote:
> On Wednesday, March 14, 2012 05:23:03 AM Jeff Davis wrote:
> > On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote:
> > > for recursively everything in dir:
> > >    posix_fadvise(fd, POSIX_FADV_DONTNEED);
> > >
> > > for recursively everything in dir:
> > >    fsync(fd);
> >
> > Wow, that made a huge difference!
> >
> >   no sync:      ~ 1.0s
> >   sync:         ~10.0s
> >   fadvise+sync: ~ 1.3s

I take that back. There was something wrong with my test -- fadvise
helps, but it only takes it from ~10s to ~6.5s. Not quite as good as I
hoped.

> Well, while the positive effect of this are rather large it also has the bad
> effect of pushing the whole new database out of the cache. Which is not so nice
> if you want to run tests on it seconds later.

I was unable to see a regression when it comes to starting it up after
the fadvise+fsync. My test just started the server, created a table,
then stopped the server. It was actually a hair faster with the
directory that had been fadvise'd and then fsync'd, but I assume that
was noise. Regardless, this doesn't look like an issue.

> How are the results with sync_file_range(fd, 0, 0,
> SYNC_FILE_RANGE_WRITE)?

That is much faster than using fadvise. It goes down to ~2s.

Unfortunately, that's non-portable. Any other ideas? 6.5s a little on
the annoying side (and causes some disconcerting sounds to come from my
disk), especially when we _know_ it can be done in 2s.

Anyway, updated patch attached.

Regards,
    Jeff Davis

Attachment

initdb-fsync-20120314.patch.gz

Re: initdb and fsync

From

Andres Freund

Date:

16 March 2012, 07:25:33

On Thursday, March 15, 2012 07:38:38 AM Jeff Davis wrote:
> On Wed, 2012-03-14 at 10:26 +0100, Andres Freund wrote:
> > On Wednesday, March 14, 2012 05:23:03 AM Jeff Davis wrote:
> > > On Tue, 2012-03-13 at 09:42 +0100, Andres Freund wrote:
> > > > for recursively everything in dir:
> > > >    posix_fadvise(fd, POSIX_FADV_DONTNEED);
> > > > 
> > > > for recursively everything in dir:
> > > >    fsync(fd);
> > > 
> > > Wow, that made a huge difference!
> > > 
> > >   no sync:      ~ 1.0s
> > >   sync:         ~10.0s
> > >   fadvise+sync: ~ 1.3s
> 
> I take that back. There was something wrong with my test -- fadvise
> helps, but it only takes it from ~10s to ~6.5s. Not quite as good as I
> hoped.
Thats surprising. I wouldn't expect such a big difference between fadvise + 
sync_file_range. Rather strange.

> > Well, while the positive effect of this are rather large it also has the
> > bad effect of pushing the whole new database out of the cache. Which is
> > not so nice if you want to run tests on it seconds later.
> 
> I was unable to see a regression when it comes to starting it up after
> the fadvise+fsync. My test just started the server, created a table,
> then stopped the server. It was actually a hair faster with the
> directory that had been fadvise'd and then fsync'd, but I assume that
> was noise. Regardless, this doesn't look like an issue.
Hm. Ok.

> > How are the results with sync_file_range(fd, 0, 0,
> > SYNC_FILE_RANGE_WRITE)?
> That is much faster than using fadvise. It goes down to ~2s.

> Unfortunately, that's non-portable. Any other ideas? 6.5s a little on
> the annoying side (and causes some disconcerting sounds to come from my
> disk), especially when we _know_ it can be done in 2s.
Its not like posix_fadvise is actually portable. So I personally don't see a 
problem with that, but...

Andres

Re: initdb and fsync

From

Robert Haas

Date:

16 March 2012, 12:47:23

On Fri, Mar 16, 2012 at 6:25 AM, Andres Freund <andres@anarazel.de> wrote:
>> > How are the results with sync_file_range(fd, 0, 0,
>> > SYNC_FILE_RANGE_WRITE)?
>> That is much faster than using fadvise. It goes down to ~2s.
>
>> Unfortunately, that's non-portable. Any other ideas? 6.5s a little on
>> the annoying side (and causes some disconcerting sounds to come from my
>> disk), especially when we _know_ it can be done in 2s.
> Its not like posix_fadvise is actually portable. So I personally don't see a
> problem with that, but...

Well, sync_file_range only works on Linux, and will probably never
work anywhere else.  posix_fadvise() at least has a chance of being
supported on other platforms, being a standard and all that.  Though I
see that my Mac has neither.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: initdb and fsync

From

Andres Freund

Date:

16 March 2012, 12:51:26

On Friday, March 16, 2012 04:47:06 PM Robert Haas wrote:
> On Fri, Mar 16, 2012 at 6:25 AM, Andres Freund <andres@anarazel.de> wrote:
> >> > How are the results with sync_file_range(fd, 0, 0,
> >> > SYNC_FILE_RANGE_WRITE)?
> >> 
> >> That is much faster than using fadvise. It goes down to ~2s.
> >> 
> >> Unfortunately, that's non-portable. Any other ideas? 6.5s a little on
> >> the annoying side (and causes some disconcerting sounds to come from my
> >> disk), especially when we _know_ it can be done in 2s.
> > 
> > Its not like posix_fadvise is actually portable. So I personally don't
> > see a problem with that, but...
> 
> Well, sync_file_range only works on Linux, and will probably never
> work anywhere else.  posix_fadvise() at least has a chance of being
> supported on other platforms, being a standard and all that.  Though I
> see that my Mac has neither.  :-(
I would suggest adding a wrapper function like:
pg_hint_writeback_flush(fd, off, len);

which then is something like

#if HAVE_SYNC_FILE_RANGE
sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
#elseif HAVE_POSIX_FADVISE
posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
#else
#endif

To my knowledge posix_fadvise currently is only supported on linux btw...

Andres

Re: initdb and fsync

From

Jeff Davis

Date:

16 March 2012, 14:15:08

On Fri, 2012-03-16 at 11:25 +0100, Andres Freund wrote:
> > I take that back. There was something wrong with my test -- fadvise
> > helps, but it only takes it from ~10s to ~6.5s. Not quite as good as I
> > hoped.
> Thats surprising. I wouldn't expect such a big difference between fadvise + 
> sync_file_range. Rather strange.

I discussed this with my colleague who knows linux internals, and he
pointed me directly at the problem. fadvise and sync_file_range in this
case are both trying to put the data in the io scheduler queue (still in
the kernel, not on the device). The difference is that fadvise doesn't
wait, and sync_file_range does (keep in mind, this is waiting to get in
a queue to go to the device, not waiting for the device to write it or
even receive it).

He indicated that 4096 is a normal number that one might use for the
queue size. But on my workstation at home (ubuntu 11.10), the queue is
only 128. I bumped it up to 256 and now fadvise is just as fast!

This won't be a problem on production systems, but that doesn't help us
a lot. People setting up a production system don't care about 6.5
seconds of set-up time anyway. Casual users and developers do (the
latter problem can be solved with the --nosync switch, but the former
problem is the one we're discussing).

So, it looks like fadvise is the "right" thing to do, but I expect we'll
get some widely varying results from actual users. Then again, maybe
casual users don't care much about ~10s for initdb anyway? It's a fairly
rare operation for everyone except developers.

Regards,Jeff Davis

Re: initdb and fsync

From

Cédric Villemain

Date:

17 March 2012, 13:49:05

<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;
-qt-user-state:0;">Levendredi 16 mars 2012 16:51:04, Andres Freund a écrit :<p style=" margin-top:0px;
margin-bottom:0px;margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> On
Friday,March 16, 2012 04:47:06 PM Robert Haas wrote:<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;
margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > On Fri, Mar 16, 2012 at 6:25 AM,
AndresFreund <andres@anarazel.de> wrote:<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;
margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > >> > How are the results
withsync_file_range(fd, 0, 0,<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;
-qt-block-indent:0;text-indent:0px; -qt-user-state:0;">> > >> > SYNC_FILE_RANGE_WRITE)?<p style="
margin-top:0px;margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;
-qt-user-state:0;">>> >> <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;
-qt-block-indent:0;text-indent:0px; -qt-user-state:0;">> > >> That is much faster than using fadvise. It
goesdown to ~2s.<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;
text-indent:0px;-qt-user-state:0;">> > >> <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;
margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > >> Unfortunately, that's
non-portable.Any other ideas? 6.5s a little on<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;
margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > >> the annoying side (and
causessome disconcerting sounds to come from<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;
margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > >> my disk), especially when
we_know_ it can be done in 2s.<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;
-qt-block-indent:0;text-indent:0px; -qt-user-state:0;">> > > <p style=" margin-top:0px; margin-bottom:0px;
margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > > Its not like
posix_fadviseis actually portable. So I personally don't<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;
margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > > see a problem with that,
but...<pstyle=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;
text-indent:0px;-qt-user-state:0;">> > <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;
margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> > Well, sync_file_range only works on
Linux,and will probably never<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;
-qt-block-indent:0;text-indent:0px; -qt-user-state:0;">> > work anywhere else. posix_fadvise() at least has a
chanceof being<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;
text-indent:0px;-qt-user-state:0;">> > supported on other platforms, being a standard and all that. Though I<p
style="margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;
-qt-user-state:0;">>> see that my Mac has neither. :-(<p style=" margin-top:0px; margin-bottom:0px;
margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> <p style="
margin-top:0px;margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;
-qt-user-state:0;">>I would suggest adding a wrapper function like:<p style=" margin-top:0px; margin-bottom:0px;
margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">>
pg_hint_writeback_flush(fd,off, len);<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;
-qt-block-indent:0;text-indent:0px; -qt-user-state:0;">> <p style=" margin-top:0px; margin-bottom:0px;
margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> which then is something
like<pstyle=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0;
text-indent:0px;-qt-user-state:0;">> <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;
margin-right:0px;-qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> #if HAVE_SYNC_FILE_RANGE<p style="
margin-top:0px;margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;
-qt-user-state:0;">>sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);<p style=" margin-top:0px;
margin-bottom:0px;margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">>
#elseifHAVE_POSIX_FADVISE<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;
-qt-block-indent:0;text-indent:0px; -qt-user-state:0;">> posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);<p style="
margin-top:0px;margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;
-qt-user-state:0;">>#else<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;
-qt-block-indent:0;text-indent:0px; -qt-user-state:0;">> #endif<p style=" margin-top:0px; margin-bottom:0px;
margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">> <p style="
margin-top:0px;margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;
-qt-user-state:0;">>To my knowledge posix_fadvise currently is only supported on linux btw...<p
style="-qt-paragraph-type:empty;margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;
-qt-block-indent:0;text-indent:0px; -qt-user-state:0;"><br /><p style=" margin-top:0px; margin-bottom:0px;
margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">I agree with Andres.<p
style="-qt-paragraph-type:empty;margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;
-qt-block-indent:0;text-indent:0px; -qt-user-state:0;"><br /><p style=" margin-top:0px; margin-bottom:0px;
margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">I believe we should use
sync_file_range(_before?) with linux.<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;
-qt-block-indent:0;text-indent:0px; -qt-user-state:0;">And we can use posix_fadvise_dontneed on other kernels.<p
style="-qt-paragraph-type:empty;margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;
-qt-block-indent:0;text-indent:0px; -qt-user-state:0;"><br /><p style=" margin-top:0px; margin-bottom:0px;
margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">FADVISE_DONTNEED does start a
writeback(it may decide not to do it too), but the primer objective of posix_fadvise_dontneed is not to make sync()
faster.We just have writeback and sync() calls challenged together and we can face situation where linux does not
handlethat so well. (depends on linux 2.6.18 or 32 or 3.2 or ...)<p style="-qt-paragraph-type:empty; margin-top:0px;
margin-bottom:0px;margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;"><br /><p
style="-qt-paragraph-type:empty;margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;
-qt-block-indent:0;text-indent:0px; -qt-user-state:0;"><br /><p style=" margin-top:0px; margin-bottom:0px;
margin-left:0px;margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">-- <p style=" margin-top:0px;
margin-bottom:0px;margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; -qt-user-state:0;">Cédric
Villemain+33 (0)6 20 30 22 52<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px;
-qt-block-indent:0;text-indent:0px; -qt-user-state:0;">http://2ndQuadrant.fr/<p style=" margin-top:0px;
margin-bottom:0px;margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;
-qt-user-state:0;">PostgreSQL:Support 24x7 - Développement, Expertise et Formation

Re: initdb and fsync

From

Jeff Davis

Date:

25 March 2012, 23:59:57

On Sat, 2012-03-17 at 17:48 +0100, Cédric Villemain wrote:
> I agree with Andres.
>
>
> I believe we should use sync_file_range (_before?) with linux.
>
> And we can use posix_fadvise_dontneed on other kernels.
>
OK, updated patch attached. sync_file_range() is preferred,
posix_fadvise() is a fallback.

Regards,
    Jeff Davis

Attachment

initdb-fsync-20120325.patch.gz

Re: initdb and fsync

From

Jeff Davis

Date:

13 June 2012, 01:09:17

On Sun, 2012-03-25 at 19:59 -0700, Jeff Davis wrote:
> On Sat, 2012-03-17 at 17:48 +0100, Cédric Villemain wrote:
> > I agree with Andres.
> >
> >
> > I believe we should use sync_file_range (_before?) with linux.
> >
> > And we can use posix_fadvise_dontneed on other kernels.
> >
> OK, updated patch attached. sync_file_range() is preferred,
> posix_fadvise() is a fallback.
>

Rebased patch attached. No other changes.

Regards,
    Jeff Davis

Attachment

initdb-fsync-20120612.patch.gz

Re: initdb and fsync

From

Peter Eisentraut

Date:

13 June 2012, 07:53:22

On tis, 2012-06-12 at 21:09 -0700, Jeff Davis wrote:
> On Sun, 2012-03-25 at 19:59 -0700, Jeff Davis wrote:
> > On Sat, 2012-03-17 at 17:48 +0100, Cédric Villemain wrote:
> > > I agree with Andres.
> > > 
> > > 
> > > I believe we should use sync_file_range (_before?) with linux.
> > > 
> > > And we can use posix_fadvise_dontneed on other kernels.
> > > 
> > OK, updated patch attached. sync_file_range() is preferred,
> > posix_fadvise() is a fallback.
> > 
> 
> Rebased patch attached. No other changes.

The --help output for the -N option was copy-and-pasted wrongly.

The message issued when using -N is also a bit content-free.  Maybe
something like

"Running in nosync mode.  The data directory might become corrupt if the
operating system crashes.\n"

Which leads to the question, how does one get out of this state?  Is
running sync(1) enough?  Is starting the postgres server enough?

There are no updates to the initdb man page included in your patch,
which would be a suitable place to discuss this briefly.

Re: initdb and fsync

From

Jeff Davis

Date:

13 June 2012, 13:53:24

On Wed, 2012-06-13 at 13:53 +0300, Peter Eisentraut wrote:
> The --help output for the -N option was copy-and-pasted wrongly.
>
> The message issued when using -N is also a bit content-free.  Maybe
> something like
>
> "Running in nosync mode.  The data directory might become corrupt if the
> operating system crashes.\n"

Thank you, fixed.

> Which leads to the question, how does one get out of this state?  Is
> running sync(1) enough?  Is starting the postgres server enough?

sync(1) calls sync(2), and the man page says:

"According to the standard specification  (e.g.,  POSIX.1-2001),  sync()
schedules the writes, but may return before the actual writing is done.
However, since version 1.3.20 Linux does actually  wait.   (This  still
does not guarantee data integrity: modern disks have large caches.)"

So it looks like sync is enough if you are on linux *and* you have any
unprotected write cache disabled.

I don't think starting the postgres server is enough.

Before, I think we were safe because we could assume that the OS would
flush the buffers before you had time to store any important data. But
now, that window can be much larger.

> There are no updates to the initdb man page included in your patch,
> which would be a suitable place to discuss this briefly.

Thank you, I added that.

Regards,
    Jeff Davis

Attachment

initdb-fsync-20120613.patch.gz

Re: initdb and fsync

From

Andres Freund

Date:

18 June 2012, 13:05:50

On Wednesday, June 13, 2012 06:53:17 PM Jeff Davis wrote:
> On Wed, 2012-06-13 at 13:53 +0300, Peter Eisentraut wrote:
> > The --help output for the -N option was copy-and-pasted wrongly.
> > 
> > The message issued when using -N is also a bit content-free.  Maybe
> > something like
> > 
> > "Running in nosync mode.  The data directory might become corrupt if the
> > operating system crashes.\n"
> 
> Thank you, fixed.
> 
> > Which leads to the question, how does one get out of this state?  Is
> > running sync(1) enough?  Is starting the postgres server enough?
> 
> sync(1) calls sync(2), and the man page says:
> 
> "According to the standard specification  (e.g.,  POSIX.1-2001),  sync()
> schedules the writes, but may return before the actual writing is done.
> However, since version 1.3.20 Linux does actually  wait.   (This  still
> does not guarantee data integrity: modern disks have large caches.)"
> 
> So it looks like sync is enough if you are on linux *and* you have any
> unprotected write cache disabled.
Protection can include write barries, it doesn't need to be a BBU....

> I don't think starting the postgres server is enough.
Agreed.

> Before, I think we were safe because we could assume that the OS would
> flush the buffers before you had time to store any important data. But
> now, that window can be much larger.
I think to a large degree we didn't see any problem because of the old ext3 
"sync the whole world" type of behaviour which was common for a very long 
time.
So on ext3 (with data=ordered, the default) any fsync (checkpoints, commit, 
...) would lead to all the other files being synced as well which means 
starting the server once would have been enough on such a system...


Quick review:
- defaulting to initdb -N in the regression suite is not a good imo, because 
that way the buildfarm won't catch problems in that area...
- could the copydir.c and initdb.c versions of walkdir/sync_fname et al be 
unified?
- I personally would find it way nicer to put USE_PRE_SYNC into pre_sync_fname 
instead of cluttering the main function with it

Looks good otherwise!

Thanks,

Andres
-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: initdb and fsync

From

Peter Eisentraut

Date:

18 June 2012, 15:34:50

On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
> - defaulting to initdb -N in the regression suite is not a good imo,
> because that way the buildfarm won't catch problems in that area...
> 
The regression test suite also starts postgres with fsync off.  This is
good, and I don't want to have more overhead in the tests.  It's not
like we already have awesome test coverage for OS interaction.

Re: initdb and fsync

From

Jeff Davis

Date:

18 June 2012, 15:40:05

On Mon, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
> Quick review:
> - defaulting to initdb -N in the regression suite is not a good imo, because
> that way the buildfarm won't catch problems in that area...

I removed the -N as you suggest. How much does performance matter on the
buildfarm?

> - could the copydir.c and initdb.c versions of walkdir/sync_fname et al be
> unified?

There's a lot of backend-specific code in the copydir versions, like
using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at
unifying them before, and concluded that it wouldn't add to the
readability, so I just commented where they came from.

If you feel there will be a maintainability problem, I can give it
another shot.

> - I personally would find it way nicer to put USE_PRE_SYNC into pre_sync_fname
> instead of cluttering the main function with it

Done.

Regards,
    Jeff Davis

Attachment

initdb-fsync-20120618.patch.gz

Re: initdb and fsync

From

Jeff Davis

Date:

18 June 2012, 15:52:40

On Mon, 2012-06-18 at 21:34 +0300, Peter Eisentraut wrote:
> On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
> > - defaulting to initdb -N in the regression suite is not a good imo,
> > because that way the buildfarm won't catch problems in that area...
> >
> The regression test suite also starts postgres with fsync off.  This is
> good, and I don't want to have more overhead in the tests.  It's not
> like we already have awesome test coverage for OS interaction.

OK, changing back to using -N during regression tests due to performance
concerns.

Regards,
    Jeff Davis

Attachment

initdb-fsync-20120618-2.patch.gz

Re: initdb and fsync

From

Andres Freund

Date:

18 June 2012, 15:58:09

On Monday, June 18, 2012 08:39:47 PM Jeff Davis wrote:
> On Mon, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
> > Quick review:
> > - defaulting to initdb -N in the regression suite is not a good imo,
> > because that way the buildfarm won't catch problems in that area...
> I removed the -N as you suggest. How much does performance matter on the
> buildfarm?
I don't think the difference in initdb cost is relevant when running the 
regression tests. Should it prove to be we can re-add -N after a week or two 
in the buildfarm machines. I just remember that there were several OS specific 
regression when adding the pre-syncing for createdb.

> > - could the copydir.c and initdb.c versions of walkdir/sync_fname et al
> > be unified?
> There's a lot of backend-specific code in the copydir versions, like
> using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at
> unifying them before, and concluded that it wouldn't add to the
> readability, so I just commented where they came from.
Ok. Sensible reasons. I dislike that we know have two files using different 
logic (copydir.c only using fadvise, initdb using sync_file_range if 
available). Maybe we should just move the fadvise and sync_file_range calls 
into its own common function?

> If you feel there will be a maintainability problem, I can give it
> another shot.
Its not too bad yet I guess, so ...

> > - I personally would find it way nicer to put USE_PRE_SYNC into
> > pre_sync_fname instead of cluttering the main function with it
> Done.

Looks good to me.

Btw, I just want to have said this, although I don't think its particularly 
relevant as it doesn't affect correctness: Its possible to have a system where 
sync_file_range is in the system headers but the kernel during runtime doesn't 
support it. It is relatively new (2.6.17). It would be possible to fallback to 
posix_fadvise which has been around far longer in that case...

Greetings,

Andres
-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: initdb and fsync

From

Jeff Davis

Date:

18 June 2012, 16:32:40

On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote:
> > > - defaulting to initdb -N in the regression suite is not a good imo,
> > > because that way the buildfarm won't catch problems in that area...
> > I removed the -N as you suggest. How much does performance matter on the
> > buildfarm?
> I don't think the difference in initdb cost is relevant when running the 
> regression tests. Should it prove to be we can re-add -N after a week or two 
> in the buildfarm machines. I just remember that there were several OS specific 
> regression when adding the pre-syncing for createdb.

That sounds reasonable to me. Both patches are out there, so we can
figure out what the consensus is.

> > > - could the copydir.c and initdb.c versions of walkdir/sync_fname et al
> > > be unified?
> > There's a lot of backend-specific code in the copydir versions, like
> > using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at
> > unifying them before, and concluded that it wouldn't add to the
> > readability, so I just commented where they came from.
> Ok. Sensible reasons. I dislike that we know have two files using different 
> logic (copydir.c only using fadvise, initdb using sync_file_range if 
> available). Maybe we should just move the fadvise and sync_file_range calls 
> into its own common function?

I don't see fadvise in copydir.c, it looks like it just uses fsync. It
might speed it up to use a pre-sync call there, too -- database creation
does take a while.

If that's in the scope of this patch, I'll do it.

> Btw, I just want to have said this, although I don't think its particularly 
> relevant as it doesn't affect correctness: Its possible to have a system where 
> sync_file_range is in the system headers but the kernel during runtime doesn't 
> support it. It is relatively new (2.6.17). It would be possible to fallback to 
> posix_fadvise which has been around far longer in that case...

Interesting point, but I'm not too worried about it.

Regards,Jeff Davis

Re: initdb and fsync

From

Andres Freund

Date:

18 June 2012, 16:41:20

On Monday, June 18, 2012 09:32:25 PM Jeff Davis wrote:
> > > > - could the copydir.c and initdb.c versions of walkdir/sync_fname et
> > > > al be unified?
> > > 
> > > There's a lot of backend-specific code in the copydir versions, like
> > > using ereport() and CHECK_FOR_INTERRUPTS(). I gave a brief attempt at
> > > unifying them before, and concluded that it wouldn't add to the
> > > readability, so I just commented where they came from.
> > 
> > Ok. Sensible reasons. I dislike that we know have two files using
> > different logic (copydir.c only using fadvise, initdb using
> > sync_file_range if available). Maybe we should just move the fadvise and
> > sync_file_range calls into its own common function?
> 
> I don't see fadvise in copydir.c, it looks like it just uses fsync. It
> might speed it up to use a pre-sync call there, too -- database creation
> does take a while.
> 
> If that's in the scope of this patch, I'll do it.
It calls pg_flush_data inside of copy_file which does the posix_fadvise... So 
maybe just put the sync_file_range in pg_flush_data?

Greetings,

Andres

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: initdb and fsync

From

Alvaro Herrera

Date:

18 June 2012, 16:44:17

Excerpts from Jeff Davis's message of lun jun 18 15:32:25 -0400 2012:
> On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote:

> > Btw, I just want to have said this, although I don't think its particularly
> > relevant as it doesn't affect correctness: Its possible to have a system where
> > sync_file_range is in the system headers but the kernel during runtime doesn't
> > support it. It is relatively new (2.6.17). It would be possible to fallback to
> > posix_fadvise which has been around far longer in that case...
>
> Interesting point, but I'm not too worried about it.

Yeah.  2.6.17 was released on June 2006.  The latest stable release
prior to 2.6.17 was 2.6.16.62 in 2008 and it was abandoned at that time
in favor of 2.6.27 as the new stable branch.

I don't think this is a problem any sane person is going to face; and
telling them to upgrade to a newer kernel seems an acceptable answer.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: initdb and fsync

From

Jeff Davis

Date:

18 June 2012, 16:55:40

On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote:
> It calls pg_flush_data inside of copy_file which does the posix_fadvise... So 
> maybe just put the sync_file_range in pg_flush_data?

Oh, I didn't notice that, thank you.

In that case, it may be good to combine them if possible. I will look
into it. There may be performance implications when used one a larger
amount of data though. I can do some brief testing.

Regards,Jeff Davis

Re: initdb and fsync

From

David Fetter

Date:

19 June 2012, 12:33:52

On Mon, Jun 18, 2012 at 09:34:30PM +0300, Peter Eisentraut wrote:
> On mån, 2012-06-18 at 18:05 +0200, Andres Freund wrote:
> > - defaulting to initdb -N in the regression suite is not a good imo,
> > because that way the buildfarm won't catch problems in that area...
> >
> The regression test suite also starts postgres with fsync off.  This is
> good, and I don't want to have more overhead in the tests.  It's not
> like we already have awesome test coverage for OS interaction.

What sorts of coverage would we want?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: initdb and fsync

From

Jeff Davis

Date:

19 June 2012, 14:22:16

On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote:
> It calls pg_flush_data inside of copy_file which does the posix_fadvise... So 
> maybe just put the sync_file_range in pg_flush_data?

The functions in fd.c aren't linked to initdb, so it's a challenge to
share that code (I remember now: that's why I didn't use pg_flush_data
originally). I don't see a clean way to resolve that, do you?

Also, it seems that sync_file_range() doesn't help with creating a
database much (on my machine), so it doesn't seem important to use it
there.

Right now I'm inclined to leave the patch as-is.

Regards,Jeff Davis

Re: initdb and fsync

From

Andres Freund

Date:

19 June 2012, 14:25:57

On Tuesday, June 19, 2012 07:22:02 PM Jeff Davis wrote:
> On Mon, 2012-06-18 at 21:41 +0200, Andres Freund wrote:
> > It calls pg_flush_data inside of copy_file which does the
> > posix_fadvise... So maybe just put the sync_file_range in pg_flush_data?
> 
> The functions in fd.c aren't linked to initdb, so it's a challenge to
> share that code (I remember now: that's why I didn't use pg_flush_data
> originally). I don't see a clean way to resolve that, do you?
> 
> Also, it seems that sync_file_range() doesn't help with creating a
> database much (on my machine), so it doesn't seem important to use it
> there.
> 
> Right now I'm inclined to leave the patch as-is.
Fine with that, I wanted to bring it up and see it documented.

I have marked it with ready for committer. That committer needs to decide on -
N in the regression tests or not, but that shouldn't be much of a problem ;)

Cool!

Andres
-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: initdb and fsync

From

Robert Haas

Date:

22 June 2012, 11:04:57

On Sat, Feb 4, 2012 at 8:18 PM, Noah Misch <noah@leadboat.com> wrote:
> On Sat, Feb 04, 2012 at 03:41:27PM -0800, Jeff Davis wrote:
>> On Sat, 2012-01-28 at 13:18 -0500, Tom Lane wrote:
>> > Yeah.  Personally I would be sad if initdb got noticeably slower, and
>> > I've never seen or heard of a failure that this would fix.
>>
>> I worked up a patch, and it looks like it does about 6 file fsync's and
>> a 7th for the PGDATA directory. That degrades the time from about 1.1s
>> to 1.4s on my workstation.
>
>> So, is it worth it? Should we make it an option that can be specified?
>
> If we add fsync calls to the initdb process, they should cover the entire data
> directory tree.  This patch syncs files that initdb.c writes, but we ought to
> also sync files that bootstrap-mode backends had written.  An optimization
> like the pg_flush_data() call in copy_file() may reduce the speed penalty.
>
> initdb should do these syncs by default and offer an option to disable them.

This may be a stupid question, by why is it initdb's job to fsync the
files the server creates, rather than the server's job?  Normally we
rely on the server to make its own writes persistent.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: initdb and fsync

From

Noah Misch

Date:

22 June 2012, 22:47:17

On Fri, Jun 22, 2012 at 10:04:23AM -0400, Robert Haas wrote:
> On Sat, Feb 4, 2012 at 8:18 PM, Noah Misch <noah@leadboat.com> wrote:
> > If we add fsync calls to the initdb process, they should cover the entire data
> > directory tree. ?This patch syncs files that initdb.c writes, but we ought to
> > also sync files that bootstrap-mode backends had written. ?An optimization
> > like the pg_flush_data() call in copy_file() may reduce the speed penalty.
> >
> > initdb should do these syncs by default and offer an option to disable them.
> 
> This may be a stupid question, by why is it initdb's job to fsync the
> files the server creates, rather than the server's job?  Normally we
> rely on the server to make its own writes persistent.

Modularity would dictate having the server fsync its own work product, but I
expect that approach to perform materially worse.  initdb runs many
single-user server instances, and each would fsync independently.  When N
initdb steps change one file, it would see N fsyncs.  Using sync_file_range to
queue all writes is best for the typical interactive or quasi-interactive
initdb user.  It's not always a win for server fsyncs, so we would need to
either define special cases such that it's used during initdb or forgo the
optimization.  On the other hand, the server could skip some files, like fsm
forks, in a principled manner.

Overall, I think it will hard to improve modularity while retaining the
performance Jeff's approach achieves through exploiting initdb's big-picture
perspective.  So I favor how Jeff has implemented it.

nm

Re: initdb and fsync

From

Jeff Davis

Date:

23 June 2012, 06:18:00

On Fri, 2012-06-22 at 10:04 -0400, Robert Haas wrote:
> This may be a stupid question, by why is it initdb's job to fsync the
> files the server creates, rather than the server's job?  Normally we
> rely on the server to make its own writes persistent.

That was my first reaction as well:

http://archives.postgresql.org/message-id/1328468000.15224.32.camel@jdavis

However, from the discussion it seems like it will be harder to do it
that way (during normal operation, a checkpoint is what syncs the files;
but as Tom points out, bootstrap mode does not). Also, I would expect
the fsyncs to be in a less-controlled pattern, so it might perform more
poorly.

Regards,Jeff Davis

Re: initdb and fsync

From

Peter Eisentraut

Date:

23 June 2012, 18:34:46

On mån, 2012-06-18 at 20:57 +0200, Andres Freund wrote:
> I don't think the difference in initdb cost is relevant when running
> the regression tests. Should it prove to be we can re-add -N after a
> week or two in the buildfarm machines.

Keep in mind that the regression tests are not only run on the
buildfarm, and that pg_regress is not only used for the main regression
tests.  When you're dealing with things in contrib or src/pl or external
modules, then adding 3 seconds to a 5 second test is not so nice.

>  I just remember that there were several OS specific regression when
> adding the pre-syncing for createdb.

Sure, it would be useful to test this, but then we should also turn on
fsync in the backend.  Make it a separate slow mode (or a separate fast
mode).

Re: initdb and fsync

From

Tom Lane

Date:

13 July 2012, 15:20:11

Andres Freund <andres@2ndquadrant.com> writes:
> On Tuesday, June 19, 2012 07:22:02 PM Jeff Davis wrote:
>> Right now I'm inclined to leave the patch as-is.

> Fine with that, I wanted to bring it up and see it documented.

> I have marked it with ready for committer. That committer needs to decide on -
> N in the regression tests or not, but that shouldn't be much of a problem ;)

I'm picking up this patch now.  What I'm inclined to do about the -N
business is to commit without that, so that we get a round of testing
in the buildfarm and find out about any portability issues, but then
change to use -N after a week or so.  I agree that in the long run
we don't want regression tests to run with fsyncs by default.
        regards, tom lane

Re: initdb and fsync

From

Tom Lane

Date:

13 July 2012, 16:21:50

Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2012-06-18 at 20:57 +0200, Andres Freund wrote:
>> Ok. Sensible reasons. I dislike that we know have two files using different 
>> logic (copydir.c only using fadvise, initdb using sync_file_range if 
>> available). Maybe we should just move the fadvise and sync_file_range calls 
>> into its own common function?

> I don't see fadvise in copydir.c, it looks like it just uses fsync. It
> might speed it up to use a pre-sync call there, too -- database creation
> does take a while.

No, that's incorrect: the fadvise is there, inside pg_flush_data() which
is done during the writing phase.  It seems to me that if we think
sync_file_range is a win, we ought to be using it in pg_flush_data just
as much as in initdb.

However, I'm a bit confused because in
http://archives.postgresql.org/pgsql-hackers/2012-03/msg01098.php
you said

> So, it looks like fadvise is the "right" thing to do, but I expect we'll

Was that backwards?  If not, why are we bothering with taking any
portability risks by adding use of sync_file_range?
        regards, tom lane

Re: initdb and fsync

From

Tom Lane

Date:

13 July 2012, 18:35:30

I wrote:
> I'm picking up this patch now.  What I'm inclined to do about the -N
> business is to commit without that, so that we get a round of testing
> in the buildfarm and find out about any portability issues, but then
> change to use -N after a week or so.  I agree that in the long run
> we don't want regression tests to run with fsyncs by default.

Committed without the -N in pg_regress (for now).  I also stuck
sync_file_range into the backend's pg_flush_data --- it would be
interesting to hear measurements of whether that makes a noticeable
difference for CREATE DATABASE.
        regards, tom lane

Re: initdb and fsync

From

Jeff Davis

Date:

13 July 2012, 18:42:02

On Fri, 2012-07-13 at 15:21 -0400, Tom Lane wrote:
> No, that's incorrect: the fadvise is there, inside pg_flush_data() which
> is done during the writing phase.

Yeah, Andres pointed that out, also.

>   It seems to me that if we think
> sync_file_range is a win, we ought to be using it in pg_flush_data just
> as much as in initdb.

It was pretty clearly a win for initdb, but I wasn't convinced it was a
good idea for other use cases. 

The mechanism is outlined in the email you linked below. To paraphrase,
fadvise tries to put the data in the io scheduler queue (still in the
kernel before going to the device), and gives up if there is no room;
sync_file_range waits for there to be room if none is available.

For the case of initdb, the data needing to be fsync'd is effectively
constant, and it's a lot of small files. If the requests don't make it
to the io scheduler queue before fsync, the kernel doesn't have an
opportunity to schedule them properly.

But for larger amounts of data copying (like ALTER DATABASE SET
TABLESPACE), it seemed like there was more risk that sync_file_range
would starve out other processes by continuously filling up the io
scheduler queue (I'm not sure if there are protections against that or
not). Also, if the files are larger, a single fsync represents more
data, and the kernel may be able to schedule it well enough anyway.

I'm not an authority in this area though, so if you are comfortable
extending sync_file_range to copydir() as well, that's fine with me.

> However, I'm a bit confused because in
> http://archives.postgresql.org/pgsql-hackers/2012-03/msg01098.php
> you said
> 
> > So, it looks like fadvise is the "right" thing to do, but I expect we'll
> 
> Was that backwards?  If not, why are we bothering with taking any
> portability risks by adding use of sync_file_range?

That part of the email was less than conclusive, and I can't remember
exactly what point I was trying to make. sync_file_range is a big
practical win for the reasons I mentioned above (and also avoids some
unpleasant noises coming from the disk drive).

Regards,Jeff Davis

Re: initdb and fsync

From

Tom Lane

Date:

13 July 2012, 19:11:57

Jeff Davis <pgsql@j-davis.com> writes:
> For the case of initdb, the data needing to be fsync'd is effectively
> constant, and it's a lot of small files. If the requests don't make it
> to the io scheduler queue before fsync, the kernel doesn't have an
> opportunity to schedule them properly.

> But for larger amounts of data copying (like ALTER DATABASE SET
> TABLESPACE), it seemed like there was more risk that sync_file_range
> would starve out other processes by continuously filling up the io
> scheduler queue (I'm not sure if there are protections against that or
> not). Also, if the files are larger, a single fsync represents more
> data, and the kernel may be able to schedule it well enough anyway.

> I'm not an authority in this area though, so if you are comfortable
> extending sync_file_range to copydir() as well, that's fine with me.

It could use some performance testing, which I don't have the time
for (or proper equipment).  Anyone?

Also, I note that copy_file is set up so that it does sync_file_range or
fadvise for each 64K chunk of data, which seems mighty small.  I wonder
if it'd be better to take that out of the loop and do one whole-file
advise at the end of the copy loop.  Or at least use some larger
granularity for those calls.
        regards, tom lane

Re: initdb and fsync

From

Jeff Davis

Date:

13 July 2012, 20:26:02

On Fri, 2012-07-13 at 17:35 -0400, Tom Lane wrote:
> I wrote:
> > I'm picking up this patch now.  What I'm inclined to do about the -N
> > business is to commit without that, so that we get a round of testing
> > in the buildfarm and find out about any portability issues, but then
> > change to use -N after a week or so.  I agree that in the long run
> > we don't want regression tests to run with fsyncs by default.
> 
> Committed without the -N in pg_regress (for now).  I also stuck
> sync_file_range into the backend's pg_flush_data --- it would be
> interesting to hear measurements of whether that makes a noticeable
> difference for CREATE DATABASE.

Thank you.

One point about the commit message: fadvise does not block to go into
the request queue, sync_file_range does. The problem with fadvise is
that, when the request queue is small, it fills up so fast that most of
the requests never make it in, and fadvise is essentially a no-op.
sync_file_range waits for room in the queue, which is (based on my
tests) enough to improve the scheduling a lot.

Regards,Jeff Davis

Re: initdb and fsync

From

Tom Lane

Date:

13 July 2012, 20:43:36

Jeff Davis <pgsql@j-davis.com> writes:
> One point about the commit message: fadvise does not block to go into
> the request queue, sync_file_range does. The problem with fadvise is
> that, when the request queue is small, it fills up so fast that most of
> the requests never make it in, and fadvise is essentially a no-op.
> sync_file_range waits for room in the queue, which is (based on my
> tests) enough to improve the scheduling a lot.

I see.  I misunderstood your previous message.  In that case, it seems
quite likely that it might be helpful if copy_file were to aggregate
the fadvise/sync_file_range calls over larger pieces of the file.
(I'm assuming that the request queue isn't bright enough to aggregate
by itself, though that might be wrong.)
        regards, tom lane