Thread: Simplifying wal_sync_method

Simplifying wal_sync_method

From

Bruce Momjian

Date:

08 August 2005, 16:56:49

Currently, here are the options available for wal_sync_method:
    #wal_sync_method = fsync        # the default varies across platforms:
                                    # fsync, fdatasync, fsync_writethrough,
                                    # open_sync, open_datasync
 

I don't understand why we support so many values.  It seems 'fsync'
should be fdatasync(), and if that is not available, fsync().  Same with
open_sync and open_datasync.

In fact, 8.1 uses O_DIRECT if available, and I don't see why we don't
just use the "data" options automatically if available too, rather than
have users guess which options their OS supports.  We might need an
option to print the actual features used, but I am not sure.

Is this something for 8.1 or 8.2?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Simplifying wal_sync_method

From

Marko Kreen

Date:

08 August 2005, 17:39:35

On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote:
> Currently, here are the options available for wal_sync_method:
> 
>     #wal_sync_method = fsync        # the default varies across platforms:
>                                     # fsync, fdatasync, fsync_writethrough,
>                                     # open_sync, open_datasync

On same topic:
 http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php

Why does win32 PostgreSQL allow data corruption by default?

-- 
marko

Re: Simplifying wal_sync_method

From

Tom Lane

Date:

08 August 2005, 18:09:20

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Currently, here are the options available for wal_sync_method:
>     #wal_sync_method = fsync        # the default varies across platforms:
>                                     # fsync, fdatasync, fsync_writethrough,
>                                     # open_sync, open_datasync

> I don't understand why we support so many values.

Because there are so many platforms with different subsets of these APIs
and different performance characteristics for the ones they do have.

> It seems 'fsync' should be fdatasync(), and if that is not available,
> fsync().

I have yet to see anyone do any systematic testing of the different
options on different platforms.  In the absence of hard data, proposing
that we don't need some of the options is highly premature.

> In fact, 8.1 uses O_DIRECT if available,

That's a decision that hasn't got a shred of evidence to justify
imposing it on every platform.
        regards, tom lane

Re: Simplifying wal_sync_method

From

Bruce Momjian

Date:

08 August 2005, 18:38:17

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Currently, here are the options available for wal_sync_method:
> >     #wal_sync_method = fsync        # the default varies across platforms:
> >                                     # fsync, fdatasync, fsync_writethrough,
> >                                     # open_sync, open_datasync
> 
> > I don't understand why we support so many values.
> 
> Because there are so many platforms with different subsets of these APIs
> and different performance characteristics for the ones they do have.

Right, and our current behavior makes it harder for people to even know
the supported options.

> > It seems 'fsync' should be fdatasync(), and if that is not available,
> > fsync().
> 
> I have yet to see anyone do any systematic testing of the different
> options on different platforms.  In the absence of hard data, proposing
> that we don't need some of the options is highly premature.

No one is ever going to do it, so we might as well make the best guess
we have.  I think any platform where the *data* options are slower than
the non-*data* options is broken, and if that logic holds, we might as
well just use *data* by default if we can, which is my proposal.

> > In fact, 8.1 uses O_DIRECT if available,
> 
> That's a decision that hasn't got a shred of evidence to justify
> imposing it on every platform.

Right, and there is no evidence it hurts, so we do our best until
someone comes up with data to suggest we are wrong.  The same should be
done with *data*.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Simplifying wal_sync_method

From

Bruce Momjian

Date:

08 August 2005, 18:42:27

Marko Kreen wrote:
> On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote:
> > Currently, here are the options available for wal_sync_method:
> > 
> >     #wal_sync_method = fsync        # the default varies across platforms:
> >                                     # fsync, fdatasync, fsync_writethrough,
> >                                     # open_sync, open_datasync
> 
> On same topic:
> 
>   http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
> 
> Why does win32 PostgreSQL allow data corruption by default?

It behaves the same on Unix as Win32, and if you have battery-backed
cache, you don't need writethrough, so we don't have it as default.  I
am going to write a section in the manual for 8.1 about these
reliability issues.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Simplifying wal_sync_method

From

Bruce Momjian

Date:

08 August 2005, 18:44:15

In summary, we added all those wal_sync_method values in hopes of
getting some data on which is best on which platform, but having gone
several years with few reports, I am thinking we should just choose the
best ones we can and move on, rather than expose a confusing API to the
users.

Does anyone show a platform where the *data* options are slower than the
non-*data* ones?

---------------------------------------------------------------------------

pgman wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Currently, here are the options available for wal_sync_method:
> > >     #wal_sync_method = fsync        # the default varies across platforms:
> > >                                     # fsync, fdatasync, fsync_writethrough,
> > >                                     # open_sync, open_datasync
> > 
> > > I don't understand why we support so many values.
> > 
> > Because there are so many platforms with different subsets of these APIs
> > and different performance characteristics for the ones they do have.
> 
> Right, and our current behavior makes it harder for people to even know
> the supported options.
> 
> > > It seems 'fsync' should be fdatasync(), and if that is not available,
> > > fsync().
> > 
> > I have yet to see anyone do any systematic testing of the different
> > options on different platforms.  In the absence of hard data, proposing
> > that we don't need some of the options is highly premature.
> 
> No one is ever going to do it, so we might as well make the best guess
> we have.  I think any platform where the *data* options are slower than
> the non-*data* options is broken, and if that logic holds, we might as
> well just use *data* by default if we can, which is my proposal.
> 
> > > In fact, 8.1 uses O_DIRECT if available,
> > 
> > That's a decision that hasn't got a shred of evidence to justify
> > imposing it on every platform.
> 
> Right, and there is no evidence it hurts, so we do our best until
> someone comes up with data to suggest we are wrong.  The same should be
> done with *data*.
> 
> -- 
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Simplifying wal_sync_method

From

Marko Kreen

Date:

08 August 2005, 18:51:43

On Mon, Aug 08, 2005 at 05:38:59PM -0400, Bruce Momjian wrote:
> Marko Kreen wrote:
> > On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote:
> > > Currently, here are the options available for wal_sync_method:
> > > 
> > >     #wal_sync_method = fsync        # the default varies across platforms:
> > >                                     # fsync, fdatasync, fsync_writethrough,
> > >                                     # open_sync, open_datasync
> > 
> > On same topic:
> > 
> >   http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
> > 
> > Why does win32 PostgreSQL allow data corruption by default?
> 
> It behaves the same on Unix as Win32, and if you have battery-backed
> cache, you don't need writethrough, so we don't have it as default.  I
> am going to write a section in the manual for 8.1 about these
> reliability issues.

For some reason I don't see "corrupted database after crash"
reports on Unixen.  Why?

Also, why can't win32 be safe without battery-backed cache?
I can't see such requirement on other platforms.

-- 
marko

Re: Simplifying wal_sync_method

From

Bruce Momjian

Date:

08 August 2005, 19:02:26

Marko Kreen wrote:
> On Mon, Aug 08, 2005 at 05:38:59PM -0400, Bruce Momjian wrote:
> > Marko Kreen wrote:
> > > On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote:
> > > > Currently, here are the options available for wal_sync_method:
> > > > 
> > > >     #wal_sync_method = fsync        # the default varies across platforms:
> > > >                                     # fsync, fdatasync, fsync_writethrough,
> > > >                                     # open_sync, open_datasync
> > > 
> > > On same topic:
> > > 
> > >   http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
> > > 
> > > Why does win32 PostgreSQL allow data corruption by default?
> > 
> > It behaves the same on Unix as Win32, and if you have battery-backed
> > cache, you don't need writethrough, so we don't have it as default.  I
> > am going to write a section in the manual for 8.1 about these
> > reliability issues.
> 
> For some reason I don't see "corrupted database after crash"
> reports on Unixen.  Why?

They use SCSI or battery-backed RAID cards more often?

> Also, why can't win32 be safe without battery-backed cache?
> I can't see such requirement on other platforms.

If it uses SCSI, it is secure, just like Unix.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Simplifying wal_sync_method

From

Bruce Momjian

Date:

08 August 2005, 19:02:51

Alvaro Herrera wrote:
> On Mon, Aug 08, 2005 at 05:38:59PM -0400, Bruce Momjian wrote:
> > Marko Kreen wrote:
> > > On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote:
> > > > Currently, here are the options available for wal_sync_method:
> > > > 
> > > >     #wal_sync_method = fsync        # the default varies across platforms:
> > > >                                     # fsync, fdatasync, fsync_writethrough,
> > > >                                     # open_sync, open_datasync
> > > 
> > > On same topic:
> > > 
> > >   http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
> > > 
> > > Why does win32 PostgreSQL allow data corruption by default?
> > 
> > It behaves the same on Unix as Win32, and if you have battery-backed
> > cache, you don't need writethrough, so we don't have it as default.  I
> > am going to write a section in the manual for 8.1 about these
> > reliability issues.
> 
> I think we should offer the reliable option by default, and mention the
> fast option for those who have battery-backed cache in the manual.

But only on Win32?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Simplifying wal_sync_method

From

Josh Berkus

Date:

08 August 2005, 19:09:32

Marko,

> Also, why can't win32 be safe without battery-backed cache?
> I can't see such requirement on other platforms.

Read the referenced message again.   It's only an issue if you want to use 
open_datasync.   fsync_writethrough should be safe.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Simplifying wal_sync_method

From

Alvaro Herrera

Date:

08 August 2005, 19:15:58

On Mon, Aug 08, 2005 at 06:02:37PM -0400, Bruce Momjian wrote:
> Alvaro Herrera wrote:
> > On Mon, Aug 08, 2005 at 05:38:59PM -0400, Bruce Momjian wrote:
> > > Marko Kreen wrote:
> > > > On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote:
> > > > > Currently, here are the options available for wal_sync_method:
> > > > > 
> > > > >     #wal_sync_method = fsync        # the default varies across platforms:
> > > > >                                     # fsync, fdatasync, fsync_writethrough,
> > > > >                                     # open_sync, open_datasync
> > > > 
> > > > On same topic:
> > > > 
> > > >   http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
> > > > 
> > > > Why does win32 PostgreSQL allow data corruption by default?
> > > 
> > > It behaves the same on Unix as Win32, and if you have battery-backed
> > > cache, you don't need writethrough, so we don't have it as default.  I
> > > am going to write a section in the manual for 8.1 about these
> > > reliability issues.
> > 
> > I think we should offer the reliable option by default, and mention the
> > fast option for those who have battery-backed cache in the manual.
> 
> But only on Win32?

Yes, because that's the only place where that option works, right?

-- 
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"I dream about dreams about dreams", sang the nightingale
under the pale moon (Sandman)

Re: Simplifying wal_sync_method

From

Marko Kreen

Date:

08 August 2005, 19:26:22

On Mon, Aug 08, 2005 at 06:02:37PM -0400, Bruce Momjian wrote:
> Alvaro Herrera wrote:
> > I think we should offer the reliable option by default, and mention the
> > fast option for those who have battery-backed cache in the manual.
> 
> But only on Win32?

We should do what's possible with what's given to us.

On Win32:

1.  We can write through cache.
2.  We have unreliable OS with unreliable filesystem.
3.  The probability of mediocre hardware is higher.

Regular POSIX:
1.  We can't write through cache.
2.  We have good OS with good filesystem (probably even journaled).
3.  The probability of mediocre hardware is lower.

Why shouldn't we offer reliable option to win32?

Options:

-  Win32 guy complains that PG is a bit slow.  We tell him to RTFM.
-  Win32 guy complains he lost database.  We tell him he didn't RTFM.

Which way do you make more friends?

-- 
marko

PS.  Yeah, I was the guy who helped him to restore what's left.
I'd say he wasn't exactly happy.

Re: Simplifying wal_sync_method

From

Marko Kreen

Date:

08 August 2005, 19:30:22

On Mon, Aug 08, 2005 at 03:10:54PM -0700, Josh Berkus wrote:
> Marko,
> > Also, why can't win32 be safe without battery-backed cache?
> > I can't see such requirement on other platforms.
> 
> Read the referenced message again.   It's only an issue if you want to use 
> open_datasync.   fsync_writethrough should be safe.

But that's the point.  Why isn't fsync_writethrough default?

-- 
marko

Re: Simplifying wal_sync_method

From

Josh Berkus

Date:

08 August 2005, 19:32:14

Bruce,

> No one is ever going to do it, so we might as well make the best guess
> we have.  I think any platform where the *data* options are slower than
> the non-*data* options is broken, and if that logic holds, we might as
> well just use *data* by default if we can, which is my proposal.

Changing the defaults is fine with me.    I just don't think that we can
afford to prune options without more testing.   And we will be getting
more testing (from companies) in the future, so I don't think this is
completely out of the question.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Simplifying wal_sync_method

From

"Joshua D. Drake"

Date:

08 August 2005, 19:41:37

>>>I think we should offer the reliable option by default, and mention the
>>>fast option for those who have battery-backed cache in the manual.
>>
>>But only on Win32?
> 
> 
> Yes, because that's the only place where that option works, right?

fsync_writethrough only works on Win32; the postgresql.conf should
reflect that.


Re: Simplifying wal_sync_method

From

Simon Riggs

Date:

08 August 2005, 20:12:44

On Mon, 2005-08-08 at 17:44 -0400, Bruce Momjian wrote:
> In summary, we added all those wal_sync_method values in hopes of
> getting some data on which is best on which platform, but having gone
> several years with few reports, I am thinking we should just choose the
> best ones we can and move on, rather than expose a confusing API to the
> users.

I agree this should be attempted over the 8.1 beta period.

This is a good case for having a Port Coordinator assigned for each
port, so we could ask them to hunt out the solution for their platform.
Maybe this is something that we can broadcast to the BuildFarm team, so
each person can reflect on the appropriate settings?

Best Regards, Simon Riggs

Re: Simplifying wal_sync_method

From

Tom Lane

Date:

08 August 2005, 20:44:08

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> No one is ever going to do it, so we might as well make the best guess
> we have.  I think any platform where the *data* options are slower than
> the non-*data* options is broken, and if that logic holds, we might as
> well just use *data* by default if we can, which is my proposal.

Adjusting the default settings I don't have a problem with.  Removing
options I have a problem with --- and that appeared to be what you
were proposing.
        regards, tom lane

Re: Simplifying wal_sync_method

From

Andrew Dunstan

Date:

08 August 2005, 20:46:03

Simon Riggs wrote:

>On Mon, 2005-08-08 at 17:44 -0400, Bruce Momjian wrote:
>
>>In summary, we added all those wal_sync_method values in hopes of
>>getting some data on which is best on which platform, but having gone
>>several years with few reports, I am thinking we should just choose the
>>best ones we can and move on, rather than expose a confusing API to the
>>users.
>
>I agree this should be attempted over the 8.1 beta period.
>
>This is a good case for having a Port Coordinator assigned for each
>port, so we could ask them to hunt out the solution for their platform.
>Maybe this is something that we can broadcast to the BuildFarm team, so
>each person can reflect on the appropriate settings?

It might be possible to build a new set of tests that we could perform. 
That would have to be built into the buildfarm script, as the PL tests 
were, but they were picked up pretty quickly by the community. 
Unfortunately it doesn't sound like these would fit into the pg_regress 
setup, so we'll have to devise a different test harness - probably not a 
bad idea for automated performance testing anyway.

So the short answer is possibly "You build the tests and we'll run 'em."

cheers

andrew

Re: Simplifying wal_sync_method

From

Tom Lane

Date:

08 August 2005, 20:59:16

Andrew Dunstan <andrew@dunslane.net> writes:
> So the short answer is possibly "You build the tests and we'll run 'em."

The availability of the buildfarm certainly makes it a lot more feasible
to do performance tests on a variety of platforms.  So, who wants to
knock something together?

I suppose we would usually be interested in one-time tests, rather than
something repeated every time CVS is touched.  How might that sort of
requirement fit into the buildfarm software design?
        regards, tom lane

Re: Simplifying wal_sync_method

From

Tom Lane

Date:

08 August 2005, 21:04:46

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Marko Kreen wrote:
>> On same topic:
>> http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
>> Why does win32 PostgreSQL allow data corruption by default?

> It behaves the same on Unix as Win32, and if you have battery-backed
> cache, you don't need writethrough, so we don't have it as default.  I
> am going to write a section in the manual for 8.1 about these
> reliability issues.

I thought we had changed the default for Windows to be fsync_writethrough
in 8.1?  We didn't have that code in 8.0, but now that we do, it surely
seems like the sanest default.
        regards, tom lane

Re: Simplifying wal_sync_method

From

Kris Jurka

Date:

08 August 2005, 21:15:57

On Mon, 8 Aug 2005, Andrew Dunstan wrote:

> So the short answer is possibly "You build the tests and we'll run 'em."
> 

Automated performance testing seems like a bad idea for the buildfarm.  
Consider in my particular case I've got three members that all happen to 
be running in virtual machines on the same host.  What virtualization does 
for performance and what happens when all three members are running at the 
same time renders any results beyond useless.  Certainly soliciting the 
pgbuildfarm-members@pgfoundry.org list is a good idea, but I don't think 
automating this testing is a good idea without more knowledge of the 
machines and their other workloads.

Kris Jurka

Re: Simplifying wal_sync_method

From

Andrew Dunstan

Date:

08 August 2005, 21:22:48


Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>>So the short answer is possibly "You build the tests and we'll run 'em."
>
>The availability of the buildfarm certainly makes it a lot more feasible
>to do performance tests on a variety of platforms.  So, who wants to
>knock something together?
>
>I suppose we would usually be interested in one-time tests, rather than
>something repeated every time CVS is touched.  How might that sort of
>requirement fit into the buildfarm software design?

I'll give it some thought. Maybe a unique name would do the trick.

cheers

andrew

Re: Simplifying wal_sync_method

From

Tom Lane

Date:

08 August 2005, 21:29:48

Kris Jurka <books@ejurka.com> writes:
> Automated performance testing seems like a bad idea for the buildfarm.  
> Consider in my particular case I've got three members that all happen to 
> be running in virtual machines on the same host.  What virtualization does 
> for performance and what happens when all three members are running at the 
> same time renders any results beyond useless.

Certainly a good point --- but as I noted to Andrew, we'd probably be
more interested in one-off tests than repetitive testing anyway.  So
possibly this could be handled with a different protocol, and buildfarm
machine owners could be careful to schedule slots for such tests at
times when their machine is otherwise idle.

Anyway it all needs some thought ...
        regards, tom lane

Re: Simplifying wal_sync_method

From

"Andrew Dunstan"

Date:

08 August 2005, 21:51:29

Tom Lane said:
> Kris Jurka <books@ejurka.com> writes:
>> Automated performance testing seems like a bad idea for the buildfarm.
>>   Consider in my particular case I've got three members that all
>> happen to  be running in virtual machines on the same host.  What
>> virtualization does  for performance and what happens when all three
>> members are running at the  same time renders any results beyond
>> useless.
>
> Certainly a good point --- but as I noted to Andrew, we'd probably be
> more interested in one-off tests than repetitive testing anyway.  So
> possibly this could be handled with a different protocol, and buildfarm
> machine owners could be careful to schedule slots for such tests at
> times when their machine is otherwise idle.
>
> Anyway it all needs some thought ...
>

Well, of course running tests would be optional.

But it's also possible that we would create a similar but separate setup to
run performance tests. Creating it would be lots easier this time around ;-)

Let's come up with something we can run by hand, decide the parameters, and
set about automating and distributing it.

cheers

andrew

Re: Simplifying wal_sync_method

From

Bruce Momjian

Date:

09 August 2005, 00:09:22

Joshua D. Drake wrote:
> 
> >>>I think we should offer the reliable option by default, and mention the
> >>>fast option for those who have battery-backed cache in the manual.
> >>
> >>But only on Win32?
> > 
> > 
> > Yes, because that's the only place where that option works, right?
> 
> fsync_writethrough only works on Win32; the postgresql.conf should 
> reflect that.

Right now what wal_sync_method supports isn't clear at all.  If you have
fdatasync or O_DSYNC (and it has a different value from O_SYNC/O_FSYNC),
you have those, if not, you get an error.  For example, my system
doesn't have fdatasync(), so if I try to use that value I get this in my
server logs:

    FATAL:  invalid value for parameter "wal_sync_method": "fdatasync"

and the server does not start.  Also, writethrough is supported in 8.1
by both Win32 and OS X.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Simplifying wal_sync_method

From

Tom Lane

Date:

09 August 2005, 00:55:45

Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> fsync_writethrough only works on Win32; the postgresql.conf should 
>> reflect that.

> Right now what wal_sync_method supports isn't clear at all.

Yeah.  I think we had a TODO to figure out a way for the assign_hook to
report back exactly which values *are* allowed on the current platform.
Constructing the message for this doesn't seem very difficult, but the
rules about when assign_hooks can issue their own elog message seem
to constrain the usefulness...
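
As a rough illustration of the message being described, here is a minimal Python sketch (not PostgreSQL code) of a check that names the values allowed on the current platform when it rejects a setting. The function name check_wal_sync_method, the allowed list passed in, and the exact wording after the quoted value are all hypothetical.

```python
def check_wal_sync_method(value, allowed):
    """Return value if it is allowed; otherwise raise an error that lists
    every value accepted on this platform (hypothetical message format)."""
    if value in allowed:
        return value
    raise ValueError(
        'invalid value for parameter "wal_sync_method": "%s" '
        "(values available on this platform: %s)"
        % (value, ", ".join(allowed))
    )
```

For example, calling check_wal_sync_method("fdatasync", ["fsync", "open_sync"]) raises an error that names both fsync and open_sync, so the user does not have to guess.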
        regards, tom lane

Re: Simplifying wal_sync_method

From

"Jeffrey W. Baker"

Date:

09 August 2005, 02:03:31

On Mon, 2005-08-08 at 17:03 -0400, Tom Lane wrote:
> 
> That's a decision that hasn't got a shred of evidence to justify
> imposing it on every platform.

This option has its uses on Linux, however.  In my testing it's good for
a large speedup (20%) on a 10-client pgbench, and a minor improvement
with 100 clients.  See my mail of July 14th "O_DIRECT for WAL writes".

-jwb

Re: Simplifying wal_sync_method

From

Bruce Momjian

Date:

09 August 2005, 02:08:04

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > No one is ever going to do it, so we might as well make the best guess
> > we have.  I think any platform where the *data* options are slower than
> > the non-*data* options is broken, and if that logic holds, we might as
> > well just use *data* by default if we can, which is my proposal.
> 
> Adjusting the default settings I don't have a problem with.  Removing
> options I have a problem with --- and that appeared to be what you
> were proposing.

Well, right now we support:

    * open_datasync (write WAL files with open() option O_DSYNC)
    * fdatasync (call fdatasync() at each commit)
    * fsync (call fsync() at each commit)
    * fsync_writethrough (force write-through of any disk write cache)
    * open_sync (write WAL files with open() option O_SYNC)

and we pick the first supported item as the default.  I have updated our
documentation to clarify this.
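
The "first supported item" rule can be sketched as follows. This is a minimal Python illustration, not the server's C code: it uses what Python's os module exposes as a stand-in for the real platform checks, and the names CANDIDATES, supported_methods and default_method are hypothetical. The fsync_writethrough check is only a placeholder based on the platforms named in this thread.

```python
import os
import sys

# Candidates in the order listed above. Each check is an approximation of
# "does this platform offer the call or flag?" using Python's os module.
CANDIDATES = [
    # O_DSYNC must exist and differ from O_SYNC to count as a separate option.
    ("open_datasync",
     lambda: hasattr(os, "O_DSYNC")
     and os.O_DSYNC != getattr(os, "O_SYNC", None)),
    ("fdatasync", lambda: hasattr(os, "fdatasync")),
    ("fsync", lambda: hasattr(os, "fsync")),
    # Placeholder: the thread says writethrough exists on Win32 and OS X.
    ("fsync_writethrough", lambda: sys.platform in ("win32", "darwin")),
    ("open_sync", lambda: hasattr(os, "O_SYNC")),
]


def supported_methods():
    """Names of the methods whose availability check passes here."""
    return [name for name, available in CANDIDATES if available()]


def default_method():
    """The first supported candidate, mirroring the rule described above."""
    return supported_methods()[0]
```

Printing supported_methods() on a given machine also answers the earlier question of which values a user may actually set there.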

My proposal is to remove fdatasync and open_datasync, and have
fsync _prefer_ fdatasync, and open_sync prefer open_datasync, but fall
back to fsync and open_sync if the *data* versions are not supported.
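
A minimal sketch of that fallback for the fsync case, written in Python as a stand-in for the C calls. The function name sync_wal_fd is hypothetical, and returning the name of the call used is only one possible answer to the problem of not knowing which call was made.

```python
import os


def sync_wal_fd(fd):
    """Flush fd to disk, preferring fdatasync() and falling back to fsync().

    Returns the name of the call that was used, so the caller can report it.
    """
    fdatasync = getattr(os, "fdatasync", None)  # absent on some platforms
    if fdatasync is not None:
        fdatasync(fd)
        return "fdatasync"
    os.fsync(fd)
    return "fsync"
```

The same pattern would apply to the open flags: use O_DSYNC when the platform defines it, otherwise O_SYNC.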

We have flexibility by having more options, but we also have complexity
of having options that have never proven to be useful in the years we
have had them, namely using fsync if fdatasync is supported.

If we remove the *data* spellings, we can probably support both
open_sync and fsync on all platforms because the *data* varieties are
the ones that are not always supported.

One problem is that by removing the *data* versions, you would never
know if you were calling fsync or fdatasync internally.

We also need to re-test these defaults because we now have O_DIRECT and
grouped writes of WAL.

If we test using the build farm, if we test two options and alternate
the tests, and one is always faster than the other, I think we can
conclude that that one is faster, even if there are other loads on the
system.
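
The alternating test could look roughly like this. It is a hand-run Python sketch, not a buildfarm module: the names timed_write_sync and alternate are hypothetical, the 8192-byte block and the round count are assumed values, and it only counts which method wins each round rather than reporting absolute timings.

```python
import os
import tempfile
import time


def timed_write_sync(sync, fd, block):
    """Write one block, sync it with the given function, return seconds taken."""
    start = time.perf_counter()
    os.write(fd, block)
    sync(fd)
    return time.perf_counter() - start


def alternate(sync_a, sync_b, rounds=20, block=b"\0" * 8192):
    """Run the two sync methods alternately and count the rounds each wins.

    The order within a round is swapped every round so that neither method
    always goes first.
    """
    wins_a = wins_b = 0
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        for i in range(rounds):
            if i % 2 == 0:
                ta = timed_write_sync(sync_a, fd, block)
                tb = timed_write_sync(sync_b, fd, block)
            else:
                tb = timed_write_sync(sync_b, fd, block)
                ta = timed_write_sync(sync_a, fd, block)
            if ta < tb:
                wins_a += 1
            elif tb < ta:
                wins_b += 1
    return wins_a, wins_b
```

On a platform that has both calls, alternate(os.fdatasync, os.fsync) gives a win count per method; a lopsided count across many rounds is the "always faster" signal described above.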

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Simplifying wal_sync_method

From

Bruce Momjian

Date:

09 August 2005, 02:17:19

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Marko Kreen wrote:
> >> On same topic:
> >> http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
> >> Why does win32 PostgreSQL allow data corruption by default?
> 
> > It behaves the same on Unix as Win32, and if you have battery-backed
> > cache, you don't need writethrough, so we don't have it as default.  I
> > am going to write a section in the manual for 8.1 about these
> > reliability issues.
> 
> I thought we had changed the default for Windows to be fsync_writethrough
> in 8.1?  We didn't have that code in 8.0, but now that we do, it surely
> seems like the sanest default.

Well, 8.0 shipped with commit() for fsync(), which in fact is
writethrough, but we decided that that wasn't a good default because:

    o  it didn't match Unix
    o  Oracle doesn't use that method for fsync
    o  we would be slower than Oracle on Win32
    o  it is a loss for battery backed RAID

so we moved commit() to fsync_writethrough, and found a way to do real
fdatasync as the default on Win32 in 8.0.2.  This is clearly mentioned
in the release notes:

    * Enable the wal_sync_method setting of "open_datasync" on Windows, and
      make it the default for that platform (Magnus, Bruce)

      Because the default is no longer "fsync_writethrough", data loss is
      possible during a power failure if the disk drive has write caching
      enabled. To turn off the write cache on Windows, from the Device
      Manager, choose the drive properties, then Policies.

This was discussed on the lists extensively.

One problem with writethrough is that drives that don't do writethrough
by default are often the ones with the worst performance for this,
namely IDE drives.

Also, in FreeBSD, if you add "hw.ata.wc=0" to /boot/loader.conf, you get
write-through, but for all ATA drives.  Should we recommend that?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Simplifying wal_sync_method

From

Tom Lane

Date:

09 August 2005, 02:24:56

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> My proposal is to remove fdatasync and open_datasync, and have
> fsync _prefer_ fdatasync, and open_sync prefer open_datasync, but fall
> back to fsync and open_sync if the *data* versions are not supported.

And this will buy us what, other than lack of flexibility?

The "data" options already are the default when available, I think
(if not, I have no objection to making them so).  That does not
equate to saying we should remove access to the other options.
Your argument that they are useless only holds up in a perfect
world where there are no hardware bugs and no kernel bugs ...
and last I checked, we do not live in such a world.
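
For reference, the "prefer the data variant, fall back otherwise" behaviour
under discussion can be sketched as follows. This is a minimal illustration
in Python rather than the backend's C, and sync_wal_file is a hypothetical
helper name, not PostgreSQL code:

```python
import os
import tempfile


def sync_wal_file(fd: int) -> str:
    """Flush fd to disk, preferring fdatasync() and falling back to fsync().

    Returns the name of the call that was actually used.
    """
    # os.fdatasync is only defined on platforms that provide fdatasync().
    fdatasync = getattr(os, "fdatasync", None)
    if fdatasync is not None:
        try:
            fdatasync(fd)
            return "fdatasync"
        except OSError:
            # Assumed fallback path: the call exists but fails for this file.
            pass
    os.fsync(fd)
    return "fsync"


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(b"wal record")
        f.flush()
        print(sync_wal_file(f.fileno()))
```

Note that a hard-wired fallback like this leaves the user no way to choose
plain fsync() on a platform where fdatasync() exists but misbehaves.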
        regards, tom lane

Re: Simplifying wal_sync_method

From

Bruce Momjian

Date:

09 August 2005, 02:28:37

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > My proposal is to remove fdatasync and open_datasync, and have
> > fsync _prefer_ fdatasync, and open_sync prefer open_datasync, but fall
> > back to fsync and open_sync if the *data* versions are not supported.
> 
> And this will buy us what, other than lack of flexibility?

Clarity in testing options.

> The "data" options already are the default when available, I think
> (if not, I have no objection to making them so).  That does not

They are.

> equate to saying we should remove access to the other options.
> Your argument that they are useless only holds up in a perfect
> world where there are no hardware bugs and no kernel bugs ...
> and last I checked, we do not live in such a world.

Is it useful to have the option of using non-*data* options when *data*
options are available?  I have never heard of anyone wanting to do that,
nor do I imagine anyone doing that.  Is there a real use case?


Re: Simplifying wal_sync_method

From

Marko Kreen

Date:

09 August 2005, 05:01:21

On Mon, Aug 08, 2005 at 08:04:44PM -0400, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Marko Kreen wrote:
> >> On same topic:
> >> http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
> >> Why does win32 PostgreSQL allow data corruption by default?
> 
> > It behaves the same on Unix as Win32, and if you have battery-backed
> > cache, you don't need writethrough, so we don't have it as default.  I
> > am going to write a section in the manual for 8.1 about these
> > reliability issues.
> 
> I thought we had changed the default for Windows to be fsync_writethrough
> in 8.1?  We didn't have that code in 8.0, but now that we do, it surely
> seems like the sanest default.

Seems it _was_ default in 8.0 and 8.0.1 (called fsync) but
renamed to fsync_writethrough in 8.0.2 and moved away from being
default.

Now, 8.0.2 was released on 2005-04-07 and first destruction
happened in 2005-07-20.  If this says anything about future,
I don't think PostgreSQL will stay known as 'reliable' database.

-- 
marko

Re: Simplifying wal_sync_method

From

"Magnus Hagander"

Date:

09 August 2005, 05:02:56

> > > > Currently, here are the options available for wal_sync_method:
> > > >
> > > >     #wal_sync_method = fsync        # the default varies across platforms:
> > > >                                     # fsync, fdatasync, fsync_writethrough,
> > > >                                     # open_sync, open_datasync
> > >
> > > On same topic:
> > >
> > > http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
> > >
> > > Why does win32 PostgreSQL allow data corruption by default?
> >
> > It behaves the same on Unix as Win32, and if you have battery-backed
> > cache, you don't need writethrough, so we don't have it as default.  I

Correction, if you have bbwc, you *should not* have writethrough. Not
only do you not need it, enabling it will drastically lower performance.


> > am going to write a section in the manual for 8.1 about these
> > reliability issues.
>
> For some reason I don't see "corrupted database after crash"
> reports on Unixen.  Why?

Because you don't read the lists often enough? I see it happen quite
often.


> Also, why can't win32 be safe without battery-backed cache?
> I can't see such requirement on other platforms.

It can, you just need to learn how to configure your system. There are
two different options to make it safe on win32 without battery backed
cache:

1) Use the postgresql option for fsync write through

2) Configure Windows to disable write caching. This affects all Windows
operations, including the filesystem itself, so I hope you already do it
on every Windows server that lacks a battery-backed write cache. Once it
is done, you are safe with the default settings in postgresql.



I think what a lot of people don't realise is how easy option 2 is. It's
in traditional windows style *a single checkbox* in the harddisk
configuration.
(Granted, you need a modern windows for that. On older windows it's a
registry key)


I have some code floating in my tree to issue a WARNING on startup if
write cache is enabled and postgresql is not using writethrough. It's
not quite ready yet, but if such a thing would be accepted post
feature-freeze I can have it finished in good time before 8.1. It would
be quite simple (looking at just the main data directory for example,
ignoring tablespaces), but if you're dealing with complex installations
you'd better have a clue about how windows works anyway...
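
The decision behind such a startup check can be sketched as below. This is
only an illustration in Python, not the code from my tree: the function name
and message text are hypothetical, and detecting whether the drive's write
cache is enabled is assumed to happen elsewhere and is passed in as a flag.

```python
from typing import Optional

# Settings assumed to force writes through the drive cache.
WRITETHROUGH_METHODS = {"fsync_writethrough"}


def write_cache_warning(write_cache_enabled: bool,
                        wal_sync_method: str) -> Optional[str]:
    """Return a WARNING message, or None if the configuration looks safe."""
    if write_cache_enabled and wal_sync_method not in WRITETHROUGH_METHODS:
        return (
            "WARNING: disk write cache is enabled and wal_sync_method = "
            f"{wal_sync_method} does not write through it"
        )
    return None


if __name__ == "__main__":
    cases = [
        (True, "open_datasync"),
        (True, "fsync_writethrough"),
        (False, "open_datasync"),
    ]
    for cache, method in cases:
        print(cache, method, "->", write_cache_warning(cache, method))
```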


//Magnus

Re: Simplifying wal_sync_method

From

"Magnus Hagander"

Date:

09 August 2005, 05:08:35

> > > I think we should offer the reliable option by default, and mention
> > > the fast option for those who have battery-backed cache in the manual.
> >
> > But only on Win32?
>
> We should do what's possible with what's given to us.
>
> On Win32:
>
> 1.  We can write through cache.

Yes.

> 2.  We have unreliable OS with unreliable filesystem.

That can definitely be debated. Properly maintained on proper hardware,
it's quite reliable these days.
Most filesystem corruptions that happen on windows are because people
enable write caching on drives without battery backup. The same issue
we're facing here, it's *not* a problem in the fs, it's a problem in the
admin. Sure, there are lots of things that could be better with ntfs,
but I would definitely not call it unreliable.


> 3.  The probability of mediocre hardware is higher.

I would say it's actually *lower*. If you look in the average
datacenter, I bet you'll find a lot more linux boxes running on
built-at-home-with-the-cheapest-parts boxes. Whereas your windows boxes
will run on HP or IBM or whatever real server-grade hardware.

I don't know anybody who claims to run a professional business who uses
IDE drives in a Windows server, for example. I know several who run
linux or freebsd on it.


> Regular POSIX:
> 1.  We can't write through cache.
> 2.  We have good OS with good filesystem (probably even
>     journaled).

NTFS is journaled, BTW. And I've seen a lot more corruption on ext2,
ext3 or reiser than I've seen on NTFS in my datacenter - and I have
about 5 times more Windows servers than linux...
Granted other unixen might be more stable, I don't run any of those..


> 3.  The probability of mediocre hardware is lower.

See above.


> Why shouldn't we offer reliable option to win32?

*we do offer a reliable option*.
Same as on POSIX, we don't enable it by default for *non-server
hardware*.


> Options:
>
> -  Win32 guy complains that PG is bit slow.
>    We tell him to RTFM.

What most often happens here is:
Win32 guy notices PG is very slow, changes to mysql or mssql.


> PS.  Yeah, I was the guy who helped him to restore what's left.
> I'd say he wasn't exactly happy.

I bet. Has he looked over all his other windows servers that are
improperly configured with regards to write cache?

//Magnus

Re: Simplifying wal_sync_method

From

Marko Kreen

Date:

09 August 2005, 06:15:48

On Tue, Aug 09, 2005 at 10:02:44AM +0200, Magnus Hagander wrote:
> > > It behaves the same on Unix as Win32, and if you have battery-backed
> > > cache, you don't need writethrough, so we don't have it as default.  I
> 
> Correction, if you have bbwc, you *should not* have writethrough. Not
> only do you not need it, enabling it will drastically lower performance.

So what?  User should read docs how to get good performance.

> > Also, why can't win32 be safe without battery-backed cache?
> > I can't see such requirement on other platforms.
> 
> It can, you just need to learn how to configure your system. There are
> two different options to make it safe on win32 without battery backed
> cache:

I personally do not use PostgreSQL in win32 (yet - this may
change).  I just felt the pain of a guy who tried...

> in traditional windows style *a single checkbox* in the harddisk
> configuration.
> (Granted, you need a modern windows for that. On older windows it's a
> registry key)

I think PostgreSQL should be reliable by default.

Now with the Windows port there are a lot of people who just try it out
on a regular desktop machine.

With point-n-click installer there's no need to read docs and
after experiencing the unreliability they won't take it as a
serious database.

> I have some code floating in my tree to issue a WARNING on startup if
> write cache is enabled and postgresql is not using writethrough. It's
> not quite ready yet, but if such a thing would be accepted post
> feature-freeze I can have it finished in good time before 8.1. It would
> be quite simple (looking at just the main data directory for example,
> ignoring tablespaces), but if you're dealing with complex installations
> you'd better have a clue about how windows works anyway...

Hey, that's a good idea, irrespective of whether the default changes or not.

I think if it's just a couple of checks and then printf, it should
not meet much resistance.

-- 
marko

Re: Simplifying wal_sync_method

From

"Magnus Hagander"

Date:

09 August 2005, 06:20:02

> > > Also, why can't win32 be safe without battery-backed cache?
> > > I can't see such requirement on other platforms.
> >
> > It can, you just need to learn how to configure your system. There are
> > two different options to make it safe on win32 without battery backed
> > cache:
>
> I personally do not use PostgreSQL in win32 (yet - this may
> change).  I just felt the pain of a guy who tried...

Didn't mean "you" as in you personally, meant "you" as in the user.
Sorry.


> > in traditional windows style *a single checkbox* in the harddisk
> > configuration.
> > (Granted, you need a modern windows for that. On older windows it's a
> > registry key)
>
> I think PostgreSQL should be reliable by default.

For that I think we need to set it to fsync() on all platforms. It's the
least unsafe one on POSIX and it's the safe one on Win32.


> Now with the Windows port there are a lot of people who just
> try it out on a regular desktop machine.

Sure, but if you're just trying it out, it's not going to kill you if
you lose the data...


> With point-n-click installer there's no need to read docs and
> after experiencing the unreliability they won't take it as a
> serious database.

Well the same reasoning applies to the fact that they won't take it as a
serious database because it's too slow.

Perhaps we need to provide an option in the installer to control what
goes in the initialized database. With an explanation ("don't enable
this if you use IDE disks and care about your data").


> > I have some code floating in my tree to issue a WARNING on startup if
> > write cache is enabled and postgresql is not using writethrough. It's
> > not quite ready yet, but if such a thing would be accepted post
> > feature-freeze I can have it finished in good time before 8.1. It
> > would be quite simple (looking at just the main data directory for
> > example, ignoring tablespaces), but if you're dealing with complex
> > installations you'd better have a clue about how windows works anyway...
>
> Hey, that's a good idea, irrespective of whether the default
> changes or not.
>
> I think if it's just a couple of checks and then printf, it
> should not meet much resistance.

That's the general idea - I'm hoping it will be that simple at least :-)

//Magnus

Re: Simplifying wal_sync_method

From

Marko Kreen

Date:

09 August 2005, 06:26:05

On Tue, Aug 09, 2005 at 10:08:25AM +0200, Magnus Hagander wrote:
> That can definitely be debated. Properly maintained on proper hardware,
> it's quite reliable these days.
> Most filesystem corruptions that happen on windows are because people
> enable write caching on drives without battery backup. The same issue
> we're facing here, it's *not* a problem in the fs, it's a problem in the
> admin. Sure, there are lots of things that could be better with ntfs,
> but I would definitely not call it unreliable.

People enable?  Isn't it the default?

> > 3.  The probability of mediocre hardware is higher.
> 
> I would say it's actually *lower*. If you look in the average
> datacenter, I bet you'll find a lot more linux boxes running on
> built-at-home-with-the-cheapest-parts boxes. Whereas your windows boxes
> will run on HP or IBM or whatever real server-grade hardware.
> 
> I don't know anybody who claims to run a professional business who uses
> IDE drives in a Windows server, for example. I know several who run
> linux or freebsd on it.

The professional probably tests it on his own desktop.  I don't
think PostgreSQL reaches the data center before passing the run
on desktop.

> > Regular POSIX:
> > 1.  We can't write through cache.
> > 2.  We have good OS with good filesystem (probably even
> >     journaled).
> 
> NTFS is journaled, BTW. And I've seen a lot more corruption on ext2,
> ext3 or reiser than I've seen on NTFS in my datacenter - and I have
> about 5 times more Windows servers than linux...
> Granted other unixen might be more stable, I don't run any of those..
> 
> > 3.  The probability of mediocre hardware is lower.
> 
> See above.

Ok, comparing impressions is not productive.


> > Why shouldn't we offer reliable option to win32?
> 
> *we do offer a reliable option*.
> Same as on POSIX, we don't enable it by default for *non-server
> hardware*.

What do you mean here?  AFAIK we try to be reliable on POSIX too.


> > Options:
> > 
> > -  Win32 guy complains that PG is bit slow.
> >    We tell him to RTFM.
> 
> What most often happens here is:
> Win32 guy notices PG is very slow, changes to mysql or mssql.

But lost database is no problem?

-- 
marko

Re: Simplifying wal_sync_method

From

"Magnus Hagander"

Date:

09 August 2005, 07:14:17

> > That can definitely be debated. Properly maintained on proper hardware,
> > it's quite reliable these days.
> > Most filesystem corruptions that happen on windows are because people
> > enable write caching on drives without battery backup. The same issue
> > we're facing here, it's *not* a problem in the fs, it's a problem in
> > the admin. Sure, there are lots of things that could be better with
> > ntfs, but I would definitely not call it unreliable.
>
> People enable?  Isn't it the default?

I dunno about workstation OS, but on the server OSes it certainly isn't
default.


> > > 3.  The probability of mediocre hardware is higher.
> >
> > I would say it's actually *lower*. If you look in the average
> > datacenter, I bet you'll find a lot more linux boxes running on
> > built-at-home-with-the-cheapest-parts boxes. Whereas your windows
> > boxes will run on HP or IBM or whatever real server-grade hardware.
> >
> > I don't know anybody who claims to run a professional business who
> > uses IDE drives in a Windows server, for example. I know several who
> > run linux or freebsd on it.
>
> The professional probably tests it on his own desktop.  I
> don't think PostgreSQL reaches the data center before passing
> the run on desktop.

I can't speak for others, but I would always test a server product on a
server OS on server hardware. Certainly not as beefy as eventual
production server, but the same level. Otherwise the test is not fully
relevant.

> > > Why shouldn't we offer reliable option to win32?
> >
> > *we do offer a reliable option*.
> > Same as on POSIX, we don't enable it by default for *non-server
> > hardware*.
>
> What do you mean here?  AFAIK we try to be reliable on POSIX too.

AFAIK fsync is slightly safer than open_sync, because it also flushes
the metadata. We don't default to that.
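
To make the two styles of call concrete, here is a rough sketch in Python
(not backend code; both function names are hypothetical). The first writes
and then flushes explicitly, the second asks the OS to make every write
synchronous at open time. Whether the two really differ in how they treat
metadata depends on the platform, so treat this only as an illustration of
the call shapes:

```python
import os
import tempfile


def write_then_fsync(path: str, data: bytes) -> None:
    """Roughly the 'fsync' style: ordinary write, then an explicit flush."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)  # explicit flush request after the write
    finally:
        os.close(fd)


def write_open_sync(path: str, data: bytes) -> None:
    """Roughly the 'open_sync' style: the file is opened for synchronous I/O."""
    # O_SYNC is not defined on every platform; where it is missing this
    # degrades to an ordinary, unsynchronised write.
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        os.write(fd, data)  # no separate flush call is issued
    finally:
        os.close(fd)


if __name__ == "__main__":
    directory = tempfile.mkdtemp()
    write_then_fsync(os.path.join(directory, "a"), b"record")
    write_open_sync(os.path.join(directory, "b"), b"record")
```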

> > > Options:
> > >
> > > -  Win32 guy complains that PG is bit slow.
> > >    We tell him to RTFM.
> >
> > What most often happens here is:
> > Win32 guy notices PG is very slow, changes to mysql or mssql.
>
> But lost database is no problem?
>

It certainly is. That's not what I'm arguing. What I'm saying is that
you shouldn't expect server grade reliability on desktop hardware and
desktop OS. Regardless of platform.

//Magnus

Re: Simplifying wal_sync_method

From

Marko Kreen

Date:

09 August 2005, 07:30:59

On Tue, Aug 09, 2005 at 12:14:09PM +0200, Magnus Hagander wrote:
> > > That can definitely be debated. Properly maintained on proper hardware,
> > > it's quite reliable these days.
> > > Most filesystem corruptions that happen on windows are because people
> > > enable write caching on drives without battery backup. The same issue
> > > we're facing here, it's *not* a problem in the fs, it's a problem in
> > > the admin. Sure, there are lots of things that could be better with
> > > ntfs, but I would definitely not call it unreliable.
> > 
> > People enable?  Isn't it the default?
> 
> I dunno about workstation OS, but on the server OSes it certainly isn't
> default.

At least on XP Pro it is default.


> > The professional probably tests it on his own desktop.  I 
> > don't think PostgreSQL reaches the data center before passing 
> > the run on desktop.
> 
> I can't speak for others, but I would always test a server product on a
> server OS on server hardware. Certainly not as beefy as eventual
> production server, but the same level. Otherwise the test is not fully
> relevant.

You are right, but it does not always happen so.  Also think of
developers who run a dev-server on a desktop.


> > > > Why shouldn't we offer reliable option to win32?
> > > 
> > > *we do offer a reliable option*.
> > > Same as on POSIX, we don't enable it by default for *non-server 
> > > hardware*.
> > 
> > What do you mean here?  AFAIK we try to be reliable on POSIX too.
> 
> AFAIK fsync is slightly safer than open_sync, because it also flushes
> the metadata. We don't default to that.

At least for WAL, the metadata does not change so it should not matter.

Now thinking about it, the guy had corrupt table, not WAL log.
How is WAL->tables synched?  Does the 'wal_sync_method' affect
it or not?

Of course, postgres could get corrupt data from WAL and put it
into table.  (AFAIK NTFS does not log data, so we are back on
wal_sync_method.)

> > > > Options:
> > > > 
> > > > -  Win32 guy complains that PG is bit slow.
> > > >    We tell him to RTFM.
> > > 
> > > What most often happens here is:
> > > Win32 guy notices PG is very slow, changes to mysql or mssql.
> > 
> > But lost database is no problem?
> 
> It certainly is. That's not what I'm arguing. What I'm saying is that
> you shouldn't expect server grade reliability on desktop hardware and
> desktop OS. Regardless of platform.

But we should expect server-grade speed?  ;)

-- 
marko

Re: Simplifying wal_sync_method

From

"Magnus Hagander"

Date:

09 August 2005, 07:58:36

> > I dunno about workstation OS, but on the server OSes it certainly
> > isn't default.
>
> At least on XP Pro it is default.

Yuck.


> > > The professional probably tests it on his own desktop.  I don't
> > > think PostgreSQL reaches the data center before passing the run on
> > > desktop.
> >
> > I can't speak for others, but I would always test a server product on
> > a server OS on server hardware. Certainly not as beefy as eventual
> > production server, but the same level. Otherwise the test is not fully
> > relevant.
>
> You are right, but it does not always happen so.  Also think
> of developers who run a dev-server on a desktop.

Well, with developers losing your data really isn't all that bad. It's a
lot easier to deal with than losing a server :-)


> > > > > Why shouldn't we offer reliable option to win32?
> > > >
> > > > *we do offer a reliable option*.
> > > > Same as on POSIX, we don't enable it by default for *non-server
> > > > hardware*.
> > >
> > > What do you mean here?  AFAIK we try to be reliable on POSIX too.
> >
> > AFAIK fsync is slightly safer than open_sync, because it also flushes
> > the metadata. We don't default to that.
>
> At least for WAL, the metadata does not change so it should
> not matter.

In most cases, right. In some cases it does (creating a new WAL log
segment, for example). It's not a very common scenario, but I've seen
error reports saying that an entire WAL segment is missing, which is
probably from metadata not being on disk at crash time.
(This is one thing that's "better" with the dbs that stuff everything in a
single precreated file (for example mssql) - the only metadata in the
filesystem there is the "latest write time", which is completely
irrelevant to the data)
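
One common way to close that gap on POSIX systems is to flush the containing
directory after creating the file, so that the new directory entry is on
disk too. The sketch below is in Python, is POSIX-only (directories cannot be
opened this way on Windows), and is not a description of what PostgreSQL
does; create_segment_durably is a hypothetical name:

```python
import os
import tempfile


def create_segment_durably(directory: str, name: str, size: int) -> str:
    """Create a zero-filled file and flush both the file and its directory."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, b"\0" * size)  # pre-allocate the segment
        os.fsync(fd)                # flush the file itself
    finally:
        os.close(fd)

    # Flush the directory so the new entry survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return path


if __name__ == "__main__":
    segment = create_segment_durably(tempfile.mkdtemp(), "segment_0001", 8192)
    print(segment, os.path.getsize(segment))
```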


> Now thinking about it, the guy had corrupt table, not WAL log.
> How is WAL->tables synched?  Does the 'wal_sync_method'
> affect it or not?

I *think* it always fsyncs() there as it is now, but I'm not 100% sure.


> Of course, postgres could get corrupt data from WAL and put it
> into table.  (AFAIK NTFS does not log data, so we are back on
> wal_sync_method.)

Correct, and I believe that's true for most Unix journaling fs:s as well -
they only journal metadata.
Also, once a checkpoint has occurred, postgresql will discard the WAL log.
If the sync came through for the checkpoint record in the WAL file but not
in the contents of the datafile, the recovery process will think that the
file is ok even though it isn't.

> > It certainly is. That's not what I'm arguing. What I'm saying is that
> > you shouldn't expect server grade reliability on desktop hardware and
> > desktop OS. Regardless of platform.
>
> But we should expect server-grade speed?  ;)

Touché :-)

//Magnus

Re: Simplifying wal_sync_method

From

Alvaro Herrera

Date:

09 August 2005, 11:04:31

On Tue, Aug 09, 2005 at 12:58:31PM +0200, Magnus Hagander wrote:

> > Now thinking about it, the guy had corrupt table, not WAL log.
> > How is WAL->tables synched?  Does the 'wal_sync_method' 
> > affect it or not?
> 
> I *think* it always fsyncs() there as it is now, but I'm not 100% sure.

No.  If fsync is off, then no fsync is done to the data files on
checkpoint either.  (See mdsync() on src/backend/storage/smgr/md.c)

-- 
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
A male gynecologist is like an auto mechanic who never owned a car.
(Carrie Snow)

Re: Simplifying wal_sync_method

From

"Magnus Hagander"

Date:

09 August 2005, 11:06:14

> > > Now thinking about it, the guy had corrupt table, not WAL log.
> > > How is WAL->tables synched?  Does the 'wal_sync_method'
> > > affect it or not?
> >
> > I *think* it always fsyncs() there as it is now, but I'm not 100% sure.
>
> No.  If fsync is off, then no fsync is done to the data files
> on checkpoint either.  (See mdsync() on src/backend/storage/smgr/md.c)

Right, but we're not talking fsync=off, we're talking when you are using
fdatasync, O_SYNC etc.

If you turn off fsync you're on your own, no matter the OS or other
settings...

//Magnus

Re: Simplifying wal_sync_method

From

Alvaro Herrera

Date:

09 August 2005, 11:18:44

On Tue, Aug 09, 2005 at 04:05:28PM +0200, Magnus Hagander wrote:
> > > > Now thinking about it, the guy had corrupt table, not WAL log.
> > > > How is WAL->tables synched?  Does the 'wal_sync_method' 
> > > > affect it or not?
> > > 
> > > I *think* it always fsyncs() there as it is now, but I'm not 100% sure.
> > 
> > No.  If fsync is off, then no fsync is done to the data files 
> > on checkpoint either.  (See mdsync() on src/backend/storage/smgr/md.c)
> 
> Right, but we're not talking fsync=off, we're talking when you are using
> fdatasync, O_SYNC etc. 

Oh, sorry :-)  At that point, pg_fsync is called, which can invoke
commit() or fsync() depending on whether you have writethrough enabled.

pg_fsync() on storage/file/fd.c

-- 
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
FOO MANE PADME HUM

Re: Simplifying wal_sync_method

From

mark@mark.mielke.cc

Date:

09 August 2005, 11:36:59

On Tue, Aug 09, 2005 at 12:25:36PM +0300, Marko Kreen wrote:
> On Tue, Aug 09, 2005 at 10:08:25AM +0200, Magnus Hagander wrote:
> > Most filesystem corruptions that happen on windows are because people
> > enable write caching on drives without battery backup. The same issue
> > we're facing here, it's *not* a problem in the fs, it's a problem in the
> > admin. Sure, there are lots of things that could be better with ntfs,
> > but I would definitely not call it unreliable.
> People enable?  Isn't it the default?

I think a little too much speculation in this thread, and not enough real
data... :-)

I only have Windows notebooks, and pre-configured systems by the company
I work for to judge. The notebooks of course have it 'on' (battery backed,
and if it wasn't on, I would have enabled it myself). I won't bother to
check the corporate systems, as whatever they are, they may not be the
Windows system default. Who knows for real?

In any case - I disagreed with the conclusions presented that
suggested that Windows had a poor file system, or should be linked
with poor hardware. Seems like FUD to me, and doesn't match my
experiences. I agree with the other poster that Windows hardware is
usually better in actual professional server environments. It might be
because people feel Windows requires better hardware to be stable, or
it might be that Windows applications tend to use more memory and disk
space, therefore the recommended entry level system is of higher
quality. It doesn't matter why people do it - or even if their reasons
are valid - what does matter, is that it isn't a fair conclusion that
Windows boxes will use poorer hardware. The opposite may be true, or
neither may be true.

> > I don't know anybody who claims to run a professional business who uses
> > IDE drives in a Windows server, for example. I know several who run
> > linux or freebsd on it.
> The professional probably tests it on his own desktop.  I don't
> think PostgreSQL reaches the data center before passing the run
> on desktop.

I don't know why this would be relevant.  The 'professional' may do
some sort of local testing, but this doesn't negate the requirement
for server testing, as it should be well known that the environment is
sufficiently different, and therefore the expectations should be
sufficiently different. The 'professional' may choose to enable write
caching, because they don't care about reliability on their local
system. If it crashes, they re-clone their system, and re-populate the
database. In any case, this is more speculation, and not productive.

> > > Options:
> > > -  Win32 guy complains that PG is bit slow.
> > >    We tell him to RTFM.
> > What most often happens here is:
> > Win32 guy notices PG is very slow, changes to mysql or mssql.
> But lost database is no problem?

Personally, my only complaint regarding either choice is the
assumption that a 'WIN32' guy is stupid, and that 'WIN32' itself is
deficient. As long as the default is well documented, I don't have a
problem with either 'faster but less reliable on systems configured
for speed over reliability at the operating system level (write
caching enabled)' or 'slower, but reliable, just in case the system is
configured for speed over reliability at the operating system level
(write caching enabled)'. As long as it is well documented, either is
fine. I'm not convinced that Linux is really that much safer anyways,
and when it comes to a standard WIN32 configuration option, I assume
that the WIN32 administrator is somewhat competent.

You guys are too deep-rooted in UNIX-land. I can't entirely blame you
- but the world is bigger than UNIX. :-)

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all
                      and in the darkness bind them...

                          http://mark.mielke.cc/

Re: Simplifying wal_sync_method

From

Bruce Momjian

Date:

09 August 2005, 11:54:20

Magnus Hagander wrote:
> > Now thinking about it, the guy had corrupt table, not WAL log.
> > How is WAL->tables synched?  Does the 'wal_sync_method' 
> > affect it or not?
> 
> I *think* it always fsyncs() there as it is now, but I'm not 100% sure.

wal_sync_method is also used to flush pages during a checkpoint, so it
could lead to table corruption too, not just WAL corruption.

However, on Unix, 99% of corruption is caused by bad disk or RAM.


Re: Simplifying wal_sync_method

From

Bruce Momjian

Date:

09 August 2005, 11:54:20

Magnus Hagander wrote:
> > > I dunno about workstation OS, but on the server OSes it certainly 
> > > isn't default.
> > 
> > At least on XP Pro it is default.
> 
> Yuck.

I see "enable write caching" as enabled by default on my XP Pro laptop,
though laptops can be said to already have battery-backed disks.


Re: Simplifying wal_sync_method

From

Bruce Momjian

Date:

09 August 2005, 12:13:45

Magnus Hagander wrote:
> > > > Now thinking about it, the guy had corrupt table, not WAL log.
> > > > How is WAL->tables synched?  Does the 'wal_sync_method' 
> > > > affect it or not?
> > > 
> > > I *think* it always fsyncs() there as it is now, but I'm not 100% sure.
> > 
> > wal_sync_method is also used to flush pages during a 
> > checkpoint, so it could lead to table corruption too, not 
> > just WAL corruption.
> > 
> > However, on Unix, 99% of corruption is caused by bad disk or RAM.
> 
> ... or IDE disks with write cache enabled. I've certainly seen more than
> what I'd call 1% (though I haven't studied it to be sure) that's because
> of write-cached disks...

Personally, I can't remember a case that was caused by something other
than bad RAM or bad disk.

Let me write up a section in the manual on this for 8.1, and link it to
the wal_sync_method documentation section, and see how it looks.  Even
re-ordering the items in the docs and making bullets has made it clearer
to me what is happening, and what is the default.


Re: Simplifying wal_sync_method

From

"Magnus Hagander"

Date:

09 August 2005, 12:15:21

> > > Now thinking about it, the guy had corrupt table, not WAL log.
> > > How is WAL->tables synched?  Does the 'wal_sync_method'
> > > affect it or not?
> >
> > I *think* it always fsyncs() there as it is now, but I'm not 100% sure.
>
> wal_sync_method is also used to flush pages during a
> checkpoint, so it could lead to table corruption too, not
> just WAL corruption.
>
> However, on Unix, 99% of corruption is caused by bad disk or RAM.

... or IDE disks with write cache enabled. I've certainly seen more than
what I'd call 1% (though I haven't studied it to be sure) that's because
of write-cached disks...

//Magnus

Re: Simplifying wal_sync_method

From

Andrew - Supernews

Date:

09 August 2005, 16:35:17

On 2005-08-09, "Magnus Hagander" <mha@sollentuna.net> wrote:
> ... or IDE disks with write cache enabled. I've certainly seen more than
> what I'd call 1% (though I haven't studied it to be sure) that's because
> of write-cached disks...

Every SCSI disk I've looked at recently has had write cache enabled by
default, fwiw.

Turning it off isn't quite the performance killer that it is on IDE, of
course, but it is there.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

Re: Simplifying wal_sync_method

From

Tom Lane

Date:

09 August 2005, 17:08:13

Andrew - Supernews <andrew+nonews@supernews.com> writes:
> On 2005-08-09, "Magnus Hagander" <mha@sollentuna.net> wrote:
>> ... or IDE disks with write cache enabled. I've certainly seen more than
>> what I'd call 1% (though I haven't studied it to be sure) that's because
>> of write-cached disks...

> Every SCSI disk I've looked at recently has had write cache enabled by
> default, fwiw.

On SCSI, write caching is default because the protocol is actually
designed to support it: the drive can take the data, and then take some
more, without giving the impression that the write has been done.

If a SCSI drive reports write complete when it hasn't actually put the
bits on the platter yet, then it's simply broken.
        regards, tom lane

Re: Simplifying wal_sync_method

From

Andrew - Supernews

Date:

09 August 2005, 22:09:58

On 2005-08-09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew - Supernews <andrew+nonews@supernews.com> writes:
>> On 2005-08-09, "Magnus Hagander" <mha@sollentuna.net> wrote:
>>> ... or IDE disks with write cache enabled. I've certainly seen more than
>>> what I'd call 1% (though I haven't studied it to be sure) that's because
>>> of write-cached disks...
>
>> Every SCSI disk I've looked at recently has had write cache enabled by
>> default, fwiw.
>
> On SCSI, write caching is default because the protocol is actually
> designed to support it: the drive can take the data, and then take some
> more, without giving the impression that the write has been done.

Wrong. Write caching as controlled by the WCE parameter on mode page 8
for direct-access devices does in fact report the write operation as
complete before the bits are on the disk. The protocol supplies a number
of additional commands to flush the cache, etc., for which you'll have
to consult the specs.
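
A minimal sketch of what "WCE on mode page 8" means at the byte level. It assumes the raw caching mode page has already been read from the drive with a MODE SENSE command and that any mode parameter header has been stripped; the function name and the returned dictionary are illustrative only. The bit positions (WCE as bit 2 and RCD as bit 0 of byte 2) follow the usual caching mode page layout and should be checked against the spec for the device in question.

```python
def parse_caching_mode_page(page: bytes) -> dict:
    """Decode the WCE and RCD bits from a raw SCSI caching mode page (code 0x08).

    Hypothetical helper: `page` must start at the page itself, not at the
    mode parameter header returned by MODE SENSE.
    """
    if len(page) < 3:
        raise ValueError("caching mode page needs at least 3 bytes")
    page_code = page[0] & 0x3F  # low six bits of byte 0 hold the page code
    if page_code != 0x08:
        raise ValueError(f"not a caching mode page: code 0x{page_code:02x}")
    flags = page[2]
    return {
        "wce": bool(flags & 0x04),  # bit 2: write cache enable
        "rcd": bool(flags & 0x01),  # bit 0: read cache disable
    }
```

With `wce` set, the drive may report a write as complete while the data is still only in its cache, which is the behaviour described above.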

The reason it's not so much of a performance killer to turn it off is that
tag-queueing (which is what you are referring to) provides for some
optimization of concurrent requests even with the cache off.

> If a SCSI drive reports write complete when it hasn't actually put the
> bits on the platter yet, then it's simply broken.

I guess you haven't read the spec much, then.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

Re: Simplifying wal_sync_method

From

Tom Lane

Date:

10 August 2005, 00:02:49

Andrew - Supernews <andrew+nonews@supernews.com> writes:
>> If a SCSI drive reports write complete when it hasn't actually put the
>> bits on the platter yet, then it's simply broken.

> I guess you haven't read the spec much, then.

[ shrug... ]  I have seen that spec before: I was making a living by
implementing SCSI device drivers in the mid-80's.  I think that anyone
who uses WCE in place of tagged command queueing is not someone whose
code I would care to rely on for mission-critical applications.  TCQ
is a design that just works; WCE is someone's attempt to emulate all
the worst features of IDE.
        regards, tom lane

Re: Simplifying wal_sync_method

From

mark@mark.mielke.cc

Date:

10 August 2005, 03:07:17

On Tue, Aug 09, 2005 at 11:01:36PM -0400, Tom Lane wrote:
> Andrew - Supernews <andrew+nonews@supernews.com> writes:
> >> If a SCSI drive reports write complete when it hasn't actually put the
> >> bits on the platter yet, then it's simply broken.
> > I guess you haven't read the spec much, then.
> [ shrug... ]  I have seen that spec before: I was making a living by
> implementing SCSI device drivers in the mid-80's.  I think that anyone
> who uses WCE in place of tagged command queueing is not someone whose
> code I would care to rely on for mission-critical applications.  TCQ
> is a design that just works; WCE is someone's attempt to emulate all
> the worst features of IDE.

They're relying on you, not you on them.

Is their reliance founded upon reasonable logic, or are they unreasonably
putting the fault in your court? Depends on the issue...

Many people would not like to need to know these 'under the hood' type
issues. This doesn't mean they deserve to have their databases
corrupted to teach them the hard way why these 'under the hood' type
details are useful to know... :-)

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all
                      and in the darkness bind them...
 
                          http://mark.mielke.cc/

Re: Simplifying wal_sync_method

From

Adrian Maier

Date:

10 August 2005, 03:50:38

On 8/9/05, mark@mark.mielke.cc <mark@mark.mielke.cc> wrote:
> Personally, my only complaint regarding either choice is the
> assumption that a 'WIN32' guy is stupid, and that 'WIN32' itself is
> deficient. As long as the default is well documented, I don't have a
> problem with either 'faster but less reliable on systems configured
> for speed over reliability at the operating system level (write
> caching enabled)' or 'slower, but reliable, just in case the system is
> configured for speed over reliability at the operating system level
> (write caching enabled)'. As long as it is well documented, either is
> fine. I'm not convinced that Linux is really that much safer anyways,
> and when it comes to a standard WIN32 configuration option, I assume
> that the WIN32 administrator is somewhat competent.

Hello guys,

There seem to be arguments for both possible default configurations
"faster but less reliable" and "slower but reliable". I personally think
that the safer configuration is better.

Anyway, I have an idea:

What do you think about letting the person who installs PostgreSQL
on Win32 decide?  For Windows, we have the graphical installer
that can be improved so that the user is asked to choose between
the two possible configurations.

This way the user will be aware of this choice even if he/she does not
read the docs.

If we let this choice be made at installation time, it would matter less
which value is the default, because I think fewer users install
PostgreSQL from sources on Win32.
And we can expect that, after bothering to install mingw and compile
PostgreSQL, they will also bother to configure it according to
their needs.

Cheers,
Adrian Maier

Re: Simplifying wal_sync_method

From

"Thomas F. O'Connell"

Date:

10 August 2005, 04:11:59

I was recently witness to a benchmark of 7.4.5 on Solaris 9 wherein
it was apparently demonstrated that fsync was the fastest option
among the 7.4.x wal_sync_method options.

If there's a way to make this information more useful by providing
more data, please let me know, and I'll see what I can do.
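
One way to gather comparable data is a small timing loop over the sync primitives that the wal_sync_method names refer to. The sketch below is only an approximation under stated assumptions: it is not PostgreSQL code, the function names are made up for illustration, the 8 kB block matches PostgreSQL's default block size, and it only measures what the OS reports, so a drive with write caching enabled can make every method look fast.

```python
import os
import tempfile
import time

BLOCK = b"\0" * 8192  # one 8 kB block (PostgreSQL's default block size)


def sync_methods():
    """Return {name: (extra_open_flags, sync_function_or_None)} for this platform."""
    methods = {"fsync": (0, os.fsync)}
    if hasattr(os, "fdatasync"):
        methods["fdatasync"] = (0, os.fdatasync)
    if hasattr(os, "O_SYNC"):
        methods["open_sync"] = (os.O_SYNC, None)
    if hasattr(os, "O_DSYNC"):
        methods["open_datasync"] = (os.O_DSYNC, None)
    return methods


def time_method(path, extra_flags, sync_fn, writes=20):
    """Rewrite the same block `writes` times, syncing each write; return seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | extra_flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.lseek(fd, 0, os.SEEK_SET)
            os.write(fd, BLOCK)
            if sync_fn is not None:  # open_* methods sync inside write()
                sync_fn(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def run_benchmark(directory=None, writes=20):
    """Time every sync method available here; return {name: seconds}."""
    results = {}
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        path = os.path.join(tmp, "sync_test.dat")
        for name, (flags, fn) in sync_methods().items():
            results[name] = time_method(path, flags, fn, writes)
    return results
```

Calling `run_benchmark(directory="/path/on/the/disk/under/test")` returns the elapsed seconds per method; the directory argument is a placeholder and should point at the filesystem that would hold the WAL.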

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC

Strategic Open Source: Open Your i™

http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-469-5150
615-469-5151 (fax)

On Aug 8, 2005, at 4:44 PM, Bruce Momjian wrote:

> In summary, we added all those wal_sync_method values in hopes of
> getting some data on which is best on which platform, but having gone
> several years with few reports, I am thinking we should just choose the
> best ones we can and move on, rather than expose a confusing API to the
> users.
>
> Does anyone show a platform where the *data* options are slower than the
> non-*data* ones?

Re: Simplifying wal_sync_method

From

Andrew - Supernews

Date:

10 August 2005, 13:13:45

On 2005-08-10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew - Supernews <andrew+nonews@supernews.com> writes:
>>> If a SCSI drive reports write complete when it hasn't actually put the
>>> bits on the platter yet, then it's simply broken.
>
>> I guess you haven't read the spec much, then.
>
> [ shrug... ]  I have seen that spec before: I was making a living by
> implementing SCSI device drivers in the mid-80's.  I think that anyone
> who uses WCE in place of tagged command queueing is not someone whose
> code I would care to rely on for mission-critical applications.  TCQ
> is a design that just works; WCE is someone's attempt to emulate all
> the worst features of IDE.

1) Tag queueing and WCE are orthogonal concepts. It's not a question of
using one "in place of" the other. My comment was that my recent
observation of actual SCSI drives is that WCE is enabled by default and
as such _will_ be used unless either you disable it manually, or the host
OS does so.

2) What OSes in common use adapt to the WCE setting, either by turning it
off, or using FUA or issuing SYNCHRONIZE CACHE commands? Since it is
entirely transparent to the host OS, I do not believe any are, though it
looks like very recent Linux development is moving in this direction.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

Re: Simplifying wal_sync_method

From

Andrew Sullivan

Date:

11 August 2005, 18:18:45

On Wed, Aug 10, 2005 at 02:11:48AM -0500, Thomas F. O'Connell wrote:
> I was recently witness to a benchmark of 7.4.5 on Solaris 9 wherein  
> it was apparently demonstrated that fsync was the fastest option  
> among the 7.4.x wal_sync_method options.
> 
> If there's a way to make this information more useful by providing  
> more data, please let me know, and I'll see what I can do.

What would be really interesting to me to know is what Sun did
between 8 and 9 to make that so.  We don't use Solaris for databases
any more, but fsync was a lot slower than whatever we ended up using
on 8.  I wouldn't be surprised if they'd wired fsync directly to
something else; but I can hardly believe it'd be faster than any
other option.  (Mind, we were using Veritas filesystem with this, as
well, which was at least half the headache.)

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the marketplace.    --Philip Greenspun

Re: Simplifying wal_sync_method

From

"Thomas F. O'Connell"

Date:

14 August 2005, 16:45:39

UFS was the filesystem on the Solaris 9 box.

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC

Strategic Open Source: Open Your i™

http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-469-5150
615-469-5151 (fax)

On Aug 11, 2005, at 4:18 PM, Andrew Sullivan wrote:

> On Wed, Aug 10, 2005 at 02:11:48AM -0500, Thomas F. O'Connell wrote:
>
>> I was recently witness to a benchmark of 7.4.5 on Solaris 9 wherein
>> it was apparently demonstrated that fsync was the fastest option
>> among the 7.4.x wal_sync_method options.
>>
>> If there's a way to make this information more useful by providing
>> more data, please let me know, and I'll see what I can do.
>>
>
> What would be really interesting to me to know is what Sun did
> between 8 and 9 to make that so.  We don't use Solaris for databases
> any more, but fsync was a lot slower than whatever we ended up using
> on 8.  I wouldn't be surprised if they'd wired fsync directly to
> something else; but I can hardly believe it'd be faster than any
> other option.  (Mind, we were using Veritas filesystem with this, as
> well, which was at least half the headache.)
>
> A

Re: Simplifying wal_sync_method

From

"Jim C. Nasby"

Date:

21 August 2005, 21:27:53

On Mon, Aug 08, 2005 at 07:45:38PM -0400, Andrew Dunstan wrote:
> So the short answer is possibly "You build the tests and we'll run 'em."

Would some version of dbt2/3 work for this?
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software        http://pervasive.com        512-569-9461

Re: Simplifying wal_sync_method

From

Mark Wong

Date:

29 August 2005, 12:28:42

On Sun, 21 Aug 2005 19:27:35 -0500
"Jim C. Nasby" <jnasby@pervasive.com> wrote:

> On Mon, Aug 08, 2005 at 07:45:38PM -0400, Andrew Dunstan wrote:
> > So the short answer is possibly "You build the tests and we'll run 'em."
> 
> Would some version of dbt2/3 work for this?

Yeah, trying...  On the larger system I'm using I'm not seeing much of a
performance difference but I'm looking for a way to see if we can
identify any benefit to bypassing the kernel cache.  I've been
re-arranging disks due to failures and trying to tweak a couple of
profiling things, but I'll try to get some data to share within a few
days.

Mark