Thread: Filesystem Direct I/O and WAL sync option

Filesystem Direct I/O and WAL sync option

From
Dimitri
Date:
All,

I'm very curious to know whether we can expect, or even guarantee, any data
consistency with fsync=off (WAL sync off) when the file system is mounted in
Direct I/O mode (meaning every write() system call issued by PG really writes
to disk before returning)...

So what level of data consistency may we expect:
   - none?
   - on a per-checkpoint basis?
   - full?...

Thanks a lot for any info!

Rgds,
-Dimitri

Re: Filesystem Direct I/O and WAL sync option

From
Heikki Linnakangas
Date:
Dimitri wrote:
> I'm very curious to know whether we can expect, or even guarantee, any data
> consistency with fsync=off (WAL sync off) when the file system is mounted in
> Direct I/O mode (meaning every write() system call issued by PG really writes
> to disk before returning)...

You'd have to turn that mode on on the data drives as well to get
consistency, because fsync=off disables checkpoint fsyncs of the data
files as well.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Filesystem Direct I/O and WAL sync option

From
Dimitri
Date:
Yes, the disk drives also have their cache disabled, or the cache is on the
controllers and battery-protected (in the case of higher-level
storage) - but is that enough to expect data consistency?... (I was
surprised about the checkpoint sync, but doesn't it always call write()
anyway? In that case it should work without fsync)...

Re: Filesystem Direct I/O and WAL sync option

From
Gregory Stark
Date:
"Dimitri" <dimitrik.fr@gmail.com> writes:

> Yes, the disk drives also have their cache disabled, or the cache is on the
> controllers and battery-protected (in the case of higher-level
> storage) - but is that enough to expect data consistency?... (I was
> surprised about the checkpoint sync, but doesn't it always call write()
> anyway? In that case it should work without fsync)...

Well if everything is mounted in sync mode then I suppose you have the same
guarantee as if fsync were called after every single write. If that's true
then surely that's at least as good. I'm curious how it performs though.

Actually it seems like in that configuration fsync should be basically
zero-cost. In other words, you should be able to leave fsync=on and get the
same performance (whatever that is) and not have to worry about any risks.
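
One rough way to check the "fsync should be basically zero-cost" idea is to time fsync() right after a write, once on an ordinary descriptor and once on a descriptor opened with O_SYNC. This is only a minimal sketch: O_SYNC on a single file is used here as a stand-in for a sync or direct I/O mount (an assumption, not the same mechanism as UFS 'forcedirectio'), and the function name, file names, and 8 kB payload are hypothetical.

```python
import os
import tempfile
import time


def time_fsync_after_write(path, extra_flags=0, payload=b"x" * 8192):
    """Write one block, then return the seconds spent inside fsync()."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags
    fd = os.open(path, flags, 0o600)
    try:
        os.write(fd, payload)
        start = time.perf_counter()
        os.fsync(fd)  # with O_SYNC the data should already be on stable storage
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        buffered = time_fsync_after_write(os.path.join(tmp, "buffered"))
        # O_SYNC is not defined on every platform; fall back to no extra flag.
        synced = time_fsync_after_write(
            os.path.join(tmp, "synced"), getattr(os, "O_SYNC", 0)
        )
        print(f"fsync after buffered write: {buffered * 1000:.3f} ms")
        print(f"fsync after O_SYNC write:   {synced * 1000:.3f} ms")
```

If the second number is not clearly smaller than the first on real storage, that would hint that the sync path and fsync() are not giving the same guarantee.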

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Filesystem Direct I/O and WAL sync option

From
Dimitri
Date:
Yes Gregory, that's why I'm asking: from 1800 transactions/sec
I'm jumping to 2800 transactions/sec! That's a more than significant
performance increase :))

Rgds,
-Dimitri

On 7/4/07, Gregory Stark <stark@enterprisedb.com> wrote:
>
> "Dimitri" <dimitrik.fr@gmail.com> writes:
>
> > Yes, the disk drives also have their cache disabled, or the cache is on the
> > controllers and battery-protected (in the case of higher-level
> > storage) - but is that enough to expect data consistency?... (I was
> > surprised about the checkpoint sync, but doesn't it always call write()
> > anyway? In that case it should work without fsync)...
>
> Well if everything is mounted in sync mode then I suppose you have the same
> guarantee as if fsync were called after every single write. If that's true
> then surely that's at least as good. I'm curious how it performs though.
>
> Actually it seems like in that configuration fsync should be basically
> zero-cost. In other words, you should be able to leave fsync=on and get the
> same performance (whatever that is) and not have to worry about any risks.
>
> --
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
>
>

Re: Filesystem Direct I/O and WAL sync option

From
Gregory Stark
Date:
"Dimitri" <dimitrik.fr@gmail.com> writes:

> Yes Gregory, that's why I'm asking: from 1800 transactions/sec
> I'm jumping to 2800 transactions/sec! That's a more than significant
> performance increase :))

Wow. That's kind of suspicious though. Does the new configuration take
advantage of the lack of the filesystem cache by increasing the size of
shared_buffers? Even then I wouldn't expect such a big boost unless you got
very lucky with the size of your working set compared to the two sizes of
shared_buffers.

It seems likely that somehow this change is not providing the same guarantees
as fsync. Perhaps fsync is actually implementing IDE write barriers and the
sync mode is just flushing buffers to the hard drive cache and then returning.

What transaction rate do you get if you just have a single connection
streaming inserts in autocommit mode? What kind of transaction rate do you get
with both sync mode on and fsync=on in Postgres?
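
The point of the single-connection test can be sketched with some arithmetic. Under a deliberately simple model (an assumption, not a measurement from this thread), a connection whose every commit must reach a bare spinning platter can commit at most about once per rotation, so a rate far above that bound suggests some cache is absorbing the writes. The drive speeds below are illustrative assumptions, since the thread does not state the hardware, and the model ignores seek time and group commit.

```python
def max_serial_commits_per_sec(rpm):
    """Upper bound on commits/sec for one connection when each commit has to
    wait for roughly one platter rotation (no write cache in the path)."""
    return rpm / 60.0


def relative_gain(before_tps, after_tps):
    """Fractional throughput increase going from before_tps to after_tps."""
    return (after_tps - before_tps) / before_tps


if __name__ == "__main__":
    for rpm in (7200, 10000, 15000):  # assumed drive speeds, for illustration
        bound = max_serial_commits_per_sec(rpm)
        print(f"{rpm} rpm: at most about {bound:.0f} commits/sec per connection")
    # The two throughput figures reported earlier in the thread.
    print(f"1800 -> 2800 tps is a {relative_gain(1800, 2800):.0%} increase")
```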

And did you say this is with a battery-backed cache? In theory fsync=on/off
shouldn't make much difference at all with a battery-backed cache. Stranger
and stranger.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Filesystem Direct I/O and WAL sync option

From
Dimitri
Date:
Gregory, thanks for the good questions! :))
They shed more light on my throughput here :))

The OS is Solaris 9 (the customer is still not ready to upgrade to
Sol10), and I think the main "sync" issue comes from the old UFS
implementation... UFS mounted with the 'forcedirectio' option uses
different "sync" logic and also accepts concurrent writes to the
same file, which gives a higher performance level here. I really did not
expect such a big gain, so I did not think to replay the same test
with direct I/O on and fsync=on too. To my big surprise, it also
reached 2800 tps, the same as with fsync=off!!! So the initial question is
no longer valid :))

Also, my tests are run just to validate the server + storage
capabilities, and honestly it's a real pity to see them used under an old
Solaris version :))
But well, at least we know what kind of performance they may expect
currently, and can think about migrating before the end of this year...

Seeing at least 10,000 random writes/sec on the storage sub-system during
the live database test was very pleasing to the customer and made them feel
comfortable about their production...

Thanks a lot for all your help!

Best regards!
-Dimitri

Re: Filesystem Direct I/O and WAL sync option

From
"Jim C. Nasby"
Date:
On Tue, Jul 03, 2007 at 04:06:29PM +0100, Heikki Linnakangas wrote:
> Dimitri wrote:
> >I'm very curious to know whether we can expect, or even guarantee, any data
> >consistency with fsync=off (WAL sync off) when the file system is mounted in
> >Direct I/O mode (meaning every write() system call issued by PG really writes
> >to disk before returning)...
>
> You'd have to turn that mode on on the data drives as well to get
> consistency, because fsync=off disables checkpoint fsyncs of the data
> files as well.

BTW, it might be worth trying the different wal_sync_methods. IIRC,
Jonah's seen some good results from open_datasync.
--
Jim Nasby                                      decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Filesystem Direct I/O and WAL sync option

From
"Jonah H. Harris"
Date:
On 7/9/07, Jim C. Nasby <decibel@decibel.org> wrote:
> BTW, it might be worth trying the different wal_sync_methods. IIRC,
> Jonah's seen some good results from open_datasync.

On Linux, using ext3, reiser, or jfs, I've seen open_sync perform
quite a bit better than fsync/fdatasync in most of my tests.  But I haven't
done significant testing with direct I/O lately.
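
For anyone wanting to compare the methods on their own storage, here is a minimal sketch of the kind of micro-benchmark involved. It maps four wal_sync_method values onto the OS mechanism each one corresponds to (an open() flag or a sync call after every write) and counts synchronous writes per second. The function name, the 8 kB block, and the write count are assumptions for illustration; this is not PostgreSQL's own code, and fsync_writethrough is left out.

```python
import os
import tempfile
import time

# wal_sync_method value -> (extra open() flag name, sync call made after each write)
SYNC_METHODS = {
    "open_datasync": ("O_DSYNC", None),
    "fdatasync": (None, "fdatasync"),
    "fsync": (None, "fsync"),
    "open_sync": ("O_SYNC", None),
}


def writes_per_second(path, method, count=200, block=b"\0" * 8192):
    """Return synchronous writes/sec for one method, or None if the platform
    does not provide the flag or call that the method needs."""
    flag_name, call_name = SYNC_METHODS[method]
    if flag_name and not hasattr(os, flag_name):
        return None
    if call_name and not hasattr(os, call_name):
        return None
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if flag_name:
        flags |= getattr(os, flag_name)
    sync = getattr(os, call_name) if call_name else None
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            if sync:
                sync(fd)  # explicit flush for the fsync/fdatasync methods
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in SYNC_METHODS:
            rate = writes_per_second(os.path.join(tmp, name), name)
            shown = "not available" if rate is None else f"{rate:.0f} writes/sec"
            print(f"{name:14s} {shown}")
```

The numbers only mean something when the temporary directory sits on the same file system and storage as the WAL, and results can differ a lot between operating systems and mount options.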

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 3rd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/

Re: Filesystem Direct I/O and WAL sync option

From
Dimitri
Date:
Yes, I tried all the WAL sync methods, but there was no difference...
However, there was a huge difference when I ran the same tests under
Solaris 10 - the 'fdatasync' option gave the best performance level. At the
same time, direct I/O made no difference on Solaris 10 :)

So the main rule is that there is no universal rule :)
Just adapt the system options to your workload...

Direct I/O will generally speed up write operations by avoiding buffer
flushing overhead and by allowing concurrent writes (breaking the POSIX
limitation of a single writer per file at any given time). But at
the same time it may slow down your read operations, and you may need
a 64-bit PG version to use a big cache and still keep the same performance
level on SELECT queries. Then again, there are other file systems like QFS
(for example) which may give you the best of both worlds: direct writes and
buffered reads at the same time! etc. etc. etc. :)

Rgds,
-Dimitri
