Thread: Simplifying wal_sync_method
Currently, here are the options available for wal_sync_method:
#wal_sync_method = fsync        # the default varies across platforms:
                                # fsync, fdatasync, fsync_writethrough,
                                # open_sync, open_datasync
I don't understand why we support so many values. It seems 'fsync' should be fdatasync(), and if that is not available, fsync(). Same with open_sync and open_datasync.
In fact, 8.1 uses O_DIRECT if available, and I don't see why we don't just use the "data" options automatically if available too, rather than have users guess which options their OS supports. We might need an option to print the actual features used, but I am not sure.
Is this something for 8.1 or 8.2?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote: > Currently, here are the options available for wal_sync_method: > > #wal_sync_method = fsync # the default varies across platforms: > # fsync, fdatasync, fsync_writethrough, > # open_sync, open_datasync On same topic: http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php Why does win32 PostgreSQL allow data corruption by default? -- marko
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Currently, here are the options available for wal_sync_method: > #wal_sync_method = fsync # the default varies across platforms: > # fsync, fdatasync, fsync_writethrough, > # open_sync, open_datasync > I don't understand why we support so many values. Because there are so many platforms with different subsets of these APIs and different performance characteristics for the ones they do have. > It seems 'fsync' should be fdatasync(), and if that is not available, > fsync(). I have yet to see anyone do any systematic testing of the different options on different platforms. In the absence of hard data, proposing that we don't need some of the options is highly premature. > In fact, 8.1 uses O_DIRECT if available, That's a decision that hasn't got a shred of evidence to justify imposing it on every platform. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Currently, here are the options available for wal_sync_method: > > #wal_sync_method = fsync # the default varies across platforms: > > # fsync, fdatasync, fsync_writethrough, > > # open_sync, open_datasync > > > I don't understand why we support so many values. > > Because there are so many platforms with different subsets of these APIs > and different performance characteristics for the ones they do have. Right, and our current behavior makes it harder for people to even know the supported options. > > It seems 'fsync' should be fdatasync(), and if that is not available, > > fsync(). > > I have yet to see anyone do any systematic testing of the different > options on different platforms. In the absence of hard data, proposing > that we don't need some of the options is highly premature. No one is ever going to do it, so we might as well make the best guess we have. I think any platform where the *data* options are slower than the non-*data* options is broken, and if that logic holds, we might as well just use *data* by default if we can, which is my proposal. > > In fact, 8.1 uses O_DIRECT if available, > > That's a decision that hasn't got a shred of evidence to justify > imposing it on every platform. Right, and there is no evidence it hurts, so we do our best until someone comes up with data to suggest we are wrong. The same should be done with *data*. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Marko Kreen wrote: > On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote: > > Currently, here are the options available for wal_sync_method: > > > > #wal_sync_method = fsync # the default varies across platforms: > > # fsync, fdatasync, fsync_writethrough, > > # open_sync, open_datasync > > On same topic: > > http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php > > Why does win32 PostgreSQL allow data corruption by default? It behaves the same on Unix as Win32, and if you have battery-backed cache, you don't need writethrough, so we don't have it as default. I am going to write a section in the manual for 8.1 about these reliability issues. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
In summary, we added all those wal_sync_method values in hopes of getting some data on which is best on which platform, but having gone several years with few reports, I am thinking we should just choose the best ones we can and move on, rather than expose a confusing API to the users. Does anyone show a platform where the *data* options are slower than the non-*data* ones? --------------------------------------------------------------------------- pgman wrote: > Tom Lane wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Currently, here are the options available for wal_sync_method: > > > #wal_sync_method = fsync # the default varies across platforms: > > > # fsync, fdatasync, fsync_writethrough, > > > # open_sync, open_datasync > > > > > I don't understand why we support so many values. > > > > Because there are so many platforms with different subsets of these APIs > > and different performance characteristics for the ones they do have. > > Right, and our current behavior makes it harder for people to even know > the supported options. > > > > It seems 'fsync' should be fdatasync(), and if that is not available, > > > fsync(). > > > > I have yet to see anyone do any systematic testing of the different > > options on different platforms. In the absence of hard data, proposing > > that we don't need some of the options is highly premature. > > No one is ever going to do it, so we might as well make the best guess > we have. I think any platform where the *data* options are slower than > the non-*data* options is broken, and if that logic holds, we might as > well just use *data* by default if we can, which is my proposal. > > > > In fact, 8.1 uses O_DIRECT if available, > > > > That's a decision that hasn't got a shred of evidence to justify > > imposing it on every platform. > > Right, and there is no evidence it hurts, so we do our best until > someone comes up with data to suggest we are wrong. The same should be > done with *data*. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Mon, Aug 08, 2005 at 05:38:59PM -0400, Bruce Momjian wrote: > Marko Kreen wrote: > > On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote: > > > Currently, here are the options available for wal_sync_method: > > > > > > #wal_sync_method = fsync # the default varies across platforms: > > > # fsync, fdatasync, fsync_writethrough, > > > # open_sync, open_datasync > > > > On same topic: > > > > http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php > > > > Why does win32 PostgreSQL allow data corruption by default? > > It behaves the same on Unix as Win32, and if you have battery-backed > cache, you don't need writethrough, so we don't have it as default. I > am going to write a section in the manual for 8.1 about these > reliability issues. For some reason I don't see "corrupted database after crash" reports on Unixen. Why? Also, why can't win32 be safe without battery-backed cache? I can't see such requirement on other platforms. -- marko
Marko Kreen wrote: > On Mon, Aug 08, 2005 at 05:38:59PM -0400, Bruce Momjian wrote: > > Marko Kreen wrote: > > > On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote: > > > > Currently, here are the options available for wal_sync_method: > > > > > > > > #wal_sync_method = fsync # the default varies across platforms: > > > > # fsync, fdatasync, fsync_writethrough, > > > > # open_sync, open_datasync > > > > > > On same topic: > > > > > > http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php > > > > > > Why does win32 PostgreSQL allow data corruption by default? > > > > It behaves the same on Unix as Win32, and if you have battery-backed > > cache, you don't need writethrough, so we don't have it as default. I > > am going to write a section in the manual for 8.1 about these > > reliability issues. > > For some reason I don't see "corrupted database after crash" > reports on Unixen. Why? They use SCSI or battery-backed RAID cards more often? > Also, why can't win32 be safe without battery-backed cache? > I can't see such requirement on other platforms. If it uses SCSI, it is secure, just like Unix. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Alvaro Herrera wrote: > On Mon, Aug 08, 2005 at 05:38:59PM -0400, Bruce Momjian wrote: > > Marko Kreen wrote: > > > On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote: > > > > Currently, here are the options available for wal_sync_method: > > > > > > > > #wal_sync_method = fsync # the default varies across platforms: > > > > # fsync, fdatasync, fsync_writethrough, > > > > # open_sync, open_datasync > > > > > > On same topic: > > > > > > http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php > > > > > > Why does win32 PostgreSQL allow data corruption by default? > > > > It behaves the same on Unix as Win32, and if you have battery-backed > > cache, you don't need writethrough, so we don't have it as default. I > > am going to write a section in the manual for 8.1 about these > > reliability issues. > > I think we should offer the reliable option by default, and mention the > fast option for those who have battery-backed cache in the manual. But only on Win32? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Marko, > Also, why can't win32 be safe without battery-backed cache? > I can't see such requirement on other platforms. Read the referenced message again. It's only an issue if you want to use open_datasync. fsync_writethrough should be safe. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
On Mon, Aug 08, 2005 at 06:02:37PM -0400, Bruce Momjian wrote: > Alvaro Herrera wrote: > > On Mon, Aug 08, 2005 at 05:38:59PM -0400, Bruce Momjian wrote: > > > Marko Kreen wrote: > > > > On Mon, Aug 08, 2005 at 03:56:39PM -0400, Bruce Momjian wrote: > > > > > Currently, here are the options available for wal_sync_method: > > > > > > > > > > #wal_sync_method = fsync # the default varies across platforms: > > > > > # fsync, fdatasync, fsync_writethrough, > > > > > # open_sync, open_datasync > > > > > > > > On same topic: > > > > > > > > http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php > > > > > > > > Why does win32 PostgreSQL allow data corruption by default? > > > > > > It behaves the same on Unix as Win32, and if you have battery-backed > > > cache, you don't need writethrough, so we don't have it as default. I > > > am going to write a section in the manual for 8.1 about these > > > reliability issues. > > > > I think we should offer the reliable option by default, and mention the > > fast option for those who have battery-backed cache in the manual. > > But only on Win32? Yes, because that's the only place where that option works, right? -- Alvaro Herrera (<alvherre[a]alvh.no-ip.org>) "I dream about dreams about dreams", sang the nightingale under the pale moon (Sandman)
On Mon, Aug 08, 2005 at 06:02:37PM -0400, Bruce Momjian wrote:
> Alvaro Herrera wrote:
> > I think we should offer the reliable option by default, and mention the
> > fast option for those who have battery-backed cache in the manual.
>
> But only on Win32?
We should do what's possible with what's given to us.
On Win32:
1. We can write through cache.
2. We have unreliable OS with unreliable filesystem.
3. The probability of mediocre hardware is higher.
Regular POSIX:
1. We can't write through cache.
2. We have good OS with good filesystem (probably even journaled).
3. The probability of mediocre hardware is lower.
Why shouldn't we offer reliable option to win32?
Options:
- Win32 guy complains that PG is bit slow. We tell him to RTFM.
- Win32 guy complains he lost database. We tell him he didn't RTFM.
Which way you make more friends?
--
marko
PS. Yeah, I was the guy who helped him to restore what's left. I'd say he wasn't exactly happy.
On Mon, Aug 08, 2005 at 03:10:54PM -0700, Josh Berkus wrote: > Marko, > > Also, why can't win32 be safe without battery-backed cache? > > I can't see such requirement on other platforms. > > Read the referenced message again. It's only an issue if you want to use > open_datasync. fsync_writethrough should be safe. But thats the point. Why isn't fsync_writethrough default? -- marko
Bruce, > No one is every going to do it, so we might as well make the best guess > we have. I think any platform where the *data* options are slower than > the non-*data* options is broken, and if that logic holds, we might as > well just use *data* by default if we can, which is my proposal. Changing the defaults is fine with me. I just don't think that we can afford to prune options without more testing. And we will be getting more testing (from companies) in the future, so I don't think this is completely out of the question. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
>>>I think we should offer the reliable option by default, and mention the >>>fast option for those who have battery-backed cache in the manual. >> >>But only on Win32? > > > Yes, because that's the only place where that option works, right? fsync_writethrough only works on Win32; the postgresql.conf should reflect that.
On Mon, 2005-08-08 at 17:44 -0400, Bruce Momjian wrote: > In summary, we added all those wal_sync_method values in hopes of > getting some data on which is best on which platform, but having gone > several years with few reports, I am thinking we should just choose the > best ones we can and move on, rather than expose a confusing API to the > users. I agree this should be attempted over the 8.1 beta period. This is a good case for having a Port Coordinator assigned for each port, so we could ask them to hunt out the solution for their platform. Maybe this is something that we can broadcast to the BuildFarm team, so each person can reflect on the appropriate settings? Best Regards, Simon Riggs
Bruce Momjian <pgman@candle.pha.pa.us> writes: > No one is every going to do it, so we might as well make the best guess > we have. I think any platform where the *data* options are slower than > the non-*data* options is broken, and if that logic holds, we might as > well just use *data* by default if we can, which is my proposal. Adjusting the default settings I don't have a problem with. Removing options I have a problem with --- and that appeared to be what you were proposing. regards, tom lane
Simon Riggs wrote: >On Mon, 2005-08-08 at 17:44 -0400, Bruce Momjian wrote: > > >>In summary, we added all those wal_sync_method values in hopes of >>getting some data on which is best on which platform, but having gone >>several years with few reports, I am thinking we should just choose the >>best ones we can and move on, rather than expose a confusing API to the >>users. >> >> > >I agree this should be attempted over the 8.1 beta period. > >This is a good case for having a Port Coordinator assigned for each >port, so we could ask them to hunt out the solution for their platform. >Maybe this is something that we can broadcast to the BuildFarm team, so >each person can reflect on the appropriate settings? > > > > It might be possible to build a new set of tests that we could perform. That would have to be built into the buildfarm script, as the PL tests were, but they were picked up pretty quickly by the community. Unfortunately it doesn't sound like these would fit into the pg_regress setup, so we'll have to devise a different test harness - probably not a bad idea for automated performance testing anyway. So the short answer is possibly "You build the tests and we'll run 'em." cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > So the short answer is possibly "You build the tests and we'll run 'em." The availability of the buildfarm certainly makes it a lot more feasible to do performance tests on a variety of platforms. So, who wants to knock something together? I suppose we would usually be interested in one-time tests, rather than something repeated every time CVS is touched. How might that sort of requirement fit into the buildfarm software design? regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Marko Kreen wrote: >> On same topic: >> http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php >> Why does win32 PostgreSQL allow data corruption by default? > It behaves the same on Unix as Win32, and if you have battery-backed > cache, you don't need writethrough, so we don't have it as default. I > am going to write a section in the manual for 8.1 about these > reliability issues. I thought we had changed the default for Windows to be fsync_writethrough in 8.1? We didn't have that code in 8.0, but now that we do, it surely seems like the sanest default. regards, tom lane
On Mon, 8 Aug 2005, Andrew Dunstan wrote: > So the short answer is possibly "You build the tests and we'll run 'em." > Automated performance testing seems like a bad idea for the buildfarm. Consider in my particular case I've got three members that all happen to be running in virtual machines on the same host. What virtualization does for performance and what happens when all three members are running at the same time renders any results beyond useless. Certainly soliciting the pgbuildfarm-members@pgfoundry.org list is good idea, but I don't think automating this testing is a good idea without more knowledge of the machines and their other workloads. Kris Jurka
Tom Lane wrote: >Andrew Dunstan <andrew@dunslane.net> writes: > > >>So the short answer is possibly "You build the tests and we'll run 'em." >> >> > >The availability of the buildfarm certainly makes it a lot more feasible >to do performance tests on a variety of platforms. So, who wants to >knock something together? > >I suppose we would usually be interested in one-time tests, rather than >something repeated every time CVS is touched. How might that sort of >requirement fit into the buildfarm software design? > > > > I'll give it some thought. Maybe a unique name would do the trick. cheers andrew
Kris Jurka <books@ejurka.com> writes: > Automated performance testing seems like a bad idea for the buildfarm. > Consider in my particular case I've got three members that all happen to > be running in virtual machines on the same host. What virtualization does > for performance and what happens when all three members are running at the > same time renders any results beyond useless. Certainly a good point --- but as I noted to Andrew, we'd probably be more interested in one-off tests than repetitive testing anyway. So possibly this could be handled with a different protocol, and buildfarm machine owners could be careful to schedule slots for such tests at times when their machine is otherwise idle. Anyway it all needs some thought ... regards, tom lane
Tom Lane said: > Kris Jurka <books@ejurka.com> writes: >> Automated performance testing seems like a bad idea for the buildfarm. >> Consider in my particular case I've got three members that all >> happen to be running in virtual machines on the same host. What >> virtualization does for performance and what happens when all three >> members are running at the same time renders any results beyond >> useless. > > Certainly a good point --- but as I noted to Andrew, we'd probably be > more interested in one-off tests than repetitive testing anyway. So > possibly this could be handled with a different protocol, and buildfarm > machine owners could be careful to schedule slots for such tests at > times when their machine is otherwise idle. > > Anyway it all needs some thought ... > Well, of course running tests would be optional. But it's also possible that we would create a similar but separate setup to run performance tests. Creating it would be lots easier this time around ;-) Let's come up with something we can run by hand, decide the parameters, and set about automating and distributing it. cheers andrew
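As a rough illustration of the kind of by-hand test being discussed here, the sketch below times repeated write-then-sync cycles with fsync() and, where the platform has it, fdatasync(). It is not PostgreSQL code and not a tool anyone in the thread wrote: the function names, the 8192-byte block size, and the write counts are assumptions chosen for the example. The methods are alternated within each round so that background load falls on all of them roughly equally.

```python
import os
import tempfile
import time


def time_sync_method(sync_call, path, writes=20, block=8192):
    """Time `writes` write+sync cycles on `path`; return elapsed seconds."""
    payload = b"\0" * block  # block size is an assumption for this sketch
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.lseek(fd, 0, os.SEEK_SET)  # rewrite the same block each time
            os.write(fd, payload)
            sync_call(fd)                 # force the write out before continuing
        return time.perf_counter() - start
    finally:
        os.close(fd)


def compare(rounds=3):
    """Return total seconds per method, alternating methods in every round."""
    candidates = {"fsync": os.fsync}
    if hasattr(os, "fdatasync"):          # not every platform provides fdatasync()
        candidates["fdatasync"] = os.fdatasync
    totals = {name: 0.0 for name in candidates}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal_test")
        for _ in range(rounds):
            for name, call in candidates.items():
                totals[name] += time_sync_method(call, path)
    return totals


if __name__ == "__main__":
    for name, secs in compare().items():
        print(f"{name}: {secs:.4f}s")
```

Numbers from a run like this only mean something on an otherwise idle machine with the test file on the same kind of storage as the WAL; on a drive with write caching enabled, both calls may return before the data is actually on the platter.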
Joshua D. Drake wrote: > > >>>I think we should offer the reliable option by default, and mention the > >>>fast option for those who have battery-backed cache in the manual. > >> > >>But only on Win32? > > > > > > Yes, because that's the only place where that option works, right? > > fsync_writethrough only works on Win32; the postgresql.conf should > reflect that. Right now what wal_sync_method supports isn't clear at all. If you have fdatasync or O_DSYNC (and it has a different value from O_SYNC/O_FSYNC), you have those; if not, you get an error. For example, my system doesn't have fdatasync(), so if I try to use that value I get this in my server logs: FATAL: invalid value for parameter "wal_sync_method": "fdatasync" and the server does not start. Also, writethrough is supported in 8.1 by both Win32 and OS X. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: >> fsync_writethrough only works on Win32 the postgresql.conf should >> reflect that. > Right now what wal_sync_method supports isn't clear at all. Yeah. I think we had a TODO to figure out a way for the assign_hook to report back exactly which values *are* allowed on the current platform. Constructing the message for this doesn't seem very difficult, but the rules about when assign_hooks can issue their own elog message seem to constrain the usefulness... regards, tom lane
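The idea of reporting which values are allowed on the current platform can be sketched outside the server. The example below is only an illustration in Python, not the assign_hook itself: `available_sync_methods` and `check_setting` are hypothetical names, and the probe covers only what Python's `os` module exposes. It follows the rule described in the thread that open_datasync counts only when O_DSYNC exists and differs from O_SYNC. fsync_writethrough is left out because the sketch has no portable way to detect it.

```python
import os


def available_sync_methods():
    """List the wal_sync_method-style names this platform appears to support."""
    methods = ["fsync"]                       # os.fsync is always present
    if hasattr(os, "fdatasync"):
        methods.append("fdatasync")
    o_sync = getattr(os, "O_SYNC", None)
    o_dsync = getattr(os, "O_DSYNC", None)
    if o_sync is not None:
        methods.append("open_sync")
    # open_datasync only makes sense if O_DSYNC is a distinct flag.
    if o_dsync is not None and o_dsync != o_sync:
        methods.append("open_datasync")
    return methods


def check_setting(value):
    """Reject an unsupported value and say which ones would have worked."""
    allowed = available_sync_methods()
    if value not in allowed:
        raise ValueError(
            'invalid value for parameter "wal_sync_method": "%s" '
            "(available here: %s)" % (value, ", ".join(allowed))
        )
    return value
```

Calling `check_setting("fdatasync")` on a system without fdatasync() would then fail with a message that also names the usable alternatives, which is the extra information the plain FATAL message lacks.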
On Mon, 2005-08-08 at 17:03 -0400, Tom Lane wrote: > > That's a decision that hasn't got a shred of evidence to justify > imposing it on every platform. This option has its uses on Linux, however. In my testing it's good for a large speedup (20%) on a 10-client pgbench, and a minor improvement with 100 clients. See my mail of July 14th "O_DIRECT for WAL writes". -jwb
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > No one is ever going to do it, so we might as well make the best guess
> > we have. I think any platform where the *data* options are slower than
> > the non-*data* options is broken, and if that logic holds, we might as
> > well just use *data* by default if we can, which is my proposal.
>
> Adjusting the default settings I don't have a problem with. Removing
> options I have a problem with --- and that appeared to be what you
> were proposing.
Well, right now we support:
* open_datasync (write WAL files with open() option O_DSYNC)
* fdatasync (call fdatasync() at each commit)
* fsync (call fsync() at each commit)
* fsync_writethrough (force write-through of any disk write cache)
* open_sync (write WAL files with open() option O_SYNC)
and we pick the first supported item as the default. I have updated our documentation to clarify this.
My proposal is to remove fdatasync and open_datasync, and have fsync _prefer_ fdatasync, and open_sync prefer open_datasync, but fall back to fsync and open_sync if the *data* versions are not supported.
We have flexibility by having more options, but we also have the complexity of having options that have never proven to be useful in the years we have had them, namely using fsync if fdatasync is supported. If we remove the *data* spellings, we can probably support both open_sync and fsync on all platforms because the *data* varieties are the ones that are not always supported. One problem is that by removing the *data* versions, you would never know if you were calling fsync or fdatasync internally.
We also need to re-test these defaults because we now have O_DIRECT and group writes of WAL. If we test using the build farm, alternating two options, and one is always faster than the other, I think we can conclude that it is faster, even if there are other loads on the system.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
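To make the difference between the current rule and the proposal concrete, here is a small Python sketch. It models only the selection logic described in the message above; the function names and the use of a plain set for "what this platform supports" are assumptions made for the example, not PostgreSQL internals.

```python
# Order in which the message says the default is picked today.
PREFERENCE = ["open_datasync", "fdatasync", "fsync",
              "fsync_writethrough", "open_sync"]


def current_default(supported):
    """Current rule: the first supported method in the listed order wins."""
    for method in PREFERENCE:
        if method in supported:
            return method
    raise ValueError("no supported wal_sync_method")


def proposed_resolve(setting, supported):
    """Proposed rule: only 'fsync' and 'open_sync' are user-visible; each
    silently prefers its *data* variant when the platform has it."""
    prefer = {"fsync": "fdatasync", "open_sync": "open_datasync"}
    if setting not in prefer:
        raise ValueError("unknown setting: %s" % setting)
    data_variant = prefer[setting]
    return data_variant if data_variant in supported else setting


# Example: a platform with fdatasync() but no usable O_DSYNC.
platform = {"fsync", "fdatasync", "open_sync"}
print(current_default(platform))             # fdatasync
print(proposed_resolve("fsync", platform))   # fdatasync
print(proposed_resolve("open_sync", platform))  # open_sync
```

Because `proposed_resolve` returns the method actually in effect, a server following this scheme could log that value, which would address the concern about never knowing whether fsync or fdatasync is being called internally.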
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Marko Kreen wrote:
> >> On same topic:
> >> http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php
> >> Why does win32 PostgreSQL allow data corruption by default?
>
> > It behaves the same on Unix as Win32, and if you have battery-backed
> > cache, you don't need writethrough, so we don't have it as default. I
> > am going to write a section in the manual for 8.1 about these
> > reliability issues.
>
> I thought we had changed the default for Windows to be fsync_writethrough
> in 8.1? We didn't have that code in 8.0, but now that we do, it surely
> seems like the sanest default.
Well, 8.0 shipped with commit() for fsync(), which in fact is writethrough, but we decided that that wasn't a good default because:
o it didn't match Unix
o Oracle doesn't use that method for fsync
o we would be slower than Oracle on Win32
o it is a loss for battery backed RAID
so we moved commit() to fsync_writethrough, and found a way to do real fdatasync as the default on Win32 in 8.0.2. This is clearly mentioned in the release notes:
* Enable the wal_sync_method setting of "open_datasync" on Windows, and make it the default for that platform (Magnus, Bruce)
  Because the default is no longer "fsync_writethrough", data loss is possible during a power failure if the disk drive has write caching enabled. To turn off the write cache on Windows, from the Device Manager, choose the drive properties, then Policies.
This was discussed on the lists extensively.
One problem with writethrough is that drives that don't do writethrough by default are often the ones with the worst performance for this, namely IDE drives.
Also, in FreeBSD, if you add "hw.ata.wc=0" to /boot/loader.conf, you get write-through, but for all ATA drives. Should we recommend that?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > My proposal is to remove fdatasync and open_datasync, and have have > fsync _prefer_ fdatasync, and open_sync prefer open_datastync, but fall > back to fsync and open_sync if the *data* version are not supported. And this will buy us what, other than lack of flexibility? The "data" options already are the default when available, I think (if not, I have no objection to making them so). That does not equate to saying we should remove access to the other options. Your argument that they are useless only holds up in a perfect world where there are no hardware bugs and no kernel bugs ... and last I checked, we do not live in such a world. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > My proposal is to remove fdatasync and open_datasync, and have > > fsync _prefer_ fdatasync, and open_sync prefer open_datasync, but fall > > back to fsync and open_sync if the *data* versions are not supported. > > And this will buy us what, other than lack of flexibility? Clarity in testing options. > The "data" options already are the default when available, I think > (if not, I have no objection to making them so). That does not They are. > equate to saying we should remove access to the other options. > Your argument that they are useless only holds up in a perfect > world where there are no hardware bugs and no kernel bugs ... > and last I checked, we do not live in such a world. Is it useful to have the option of using non-*data* options when *data* options are available? I have never heard of anyone wanting to do that, nor do I imagine anyone doing that. Is there a real use case? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Mon, Aug 08, 2005 at 08:04:44PM -0400, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Marko Kreen wrote: > >> On same topic: > >> http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php > >> Why does win32 PostgreSQL allow data corruption by default? > > > It behaves the same on Unix as Win32, and if you have battery-backed > > cache, you don't need writethrough, so we don't have it as default. I > > am going to write a section in the manual for 8.1 about these > > reliability issues. > > I thought we had changed the default for Windows to be fsync_writethrough > in 8.1? We didn't have that code in 8.0, but now that we do, it surely > seems like the sanest default. Seems it _was_ default in 8.0 and 8.0.1 (called fsync) but renamed to fsync_writethrough in 8.0.2 and moved away from being default. Now, 8.0.2 was released on 2005-04-07 and first destruction happened in 2005-07-20. If this says anything about future, I don't think PostgreSQL will stay known as 'reliable' database. -- marko
> > > > Currently, here are the options available for wal_sync_method: > > > > > > > > #wal_sync_method = fsync # the default > varies across platforms: > > > > # fsync, > fdatasync, fsync_writethrough, > > > > # open_sync, > open_datasync > > > > > > On same topic: > > > > > > > http://archives.postgresql.org/pgsql-general/2005-07/msg00811.php > > > > > > Why does win32 PostgreSQL allow data corruption by default? > > > > It behaves the same on Unix as Win32, and if you have > battery-backed > > cache, you don't need writethrough, so we don't have it as > default. I Correction, if you have bbwc, you *should not* have writethrough. Not only do you not need it, enabling it will drastically lower performance. > > am going to write a section in the manual for 8.1 about these > > reliability issues. > > For some reason I don't see "corruped database after crash" > reports on Unixen. Why? Because you don't read the lists often enough? I see it happen quite often. > Also, why can't win32 be safe without battery-backed cache? > I can't see such requirement on other platforms. It can, you just need to learn how to configure your system. There are two different options to make it safe on win32 without battery backed cache: 1) Use the postgresql option for fsync write through 2) Configure windows to disable write caching. If you do this, which you of course already do on all your windows servers without write cache I hope since it affects all windows operations including the filesystem itself, you are safe with the default settings in postgresql. I think what a lot of people don't realise is how easy option 2 is. It's in traditional windows style *a single checkbox* in the harddisk configuration. (Granted, you need a modern windows for that. On older windows it's a registry key) I have some code floating in my tree to issue a WARNING on startup if write cache is enabled and postgresql is not using writethrough. 
It's not quite ready yet, but if such a thing would be accepted post feature-freeze I can have it finished in good time before 8.1. It would be quite simple (looking at just the main data directory for example, ignoring tablespaces), but if you're dealing with complex installations you'd better have a clue about how windows works anyway... //Magnus
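The startup check Magnus describes could be sketched roughly as below. This is only an illustration of the decision, not his patch: the function name and the message text are made up here, and actually detecting whether the drive's write cache is enabled (the hard part on Windows) is left as an input.

```python
# Hypothetical sketch of the proposed startup check; not PostgreSQL source.
# The caller is assumed to have already worked out whether the drive holding
# the main data directory has its write cache enabled.

WRITETHROUGH_METHODS = {"fsync_writethrough"}


def startup_warning(write_cache_enabled, wal_sync_method):
    """Return a warning string, or None when no warning is needed."""
    if write_cache_enabled and wal_sync_method not in WRITETHROUGH_METHODS:
        return (
            "WARNING: disk write cache is enabled and wal_sync_method is "
            f"'{wal_sync_method}'; committed data may be lost on power failure"
        )
    return None
```

As in the proposal, this looks at a single setting for the main data directory only and ignores tablespaces.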
> > > I think we should offer the reliable option by default, and mention > > > the fast option for those who have battery-backed cache in the manual. > > > > But only on Win32? > > We should do what's possible with what's given to us. > > On Win32: > > 1. We can write through cache. Yes. > 2. We have unreliable OS with unreliable filesystem. That can definitely be debated. Properly maintained on proper hardware, it's quite reliable these days. Most filesystem corruptions that happen on windows are because people enable write caching on drives without battery backup. The same issue we're facing here, it's *not* a problem in the fs, it's a problem in the admin. Sure, there are lots of things that could be better with ntfs, but I would definitely not call it unreliable. > 3. The probability of mediocre hardware is higher. I would say it's actually *lower*. If you look in the average datacenter, I bet you'll find a lot more linux boxes running on built-at-home-with-the-cheapest-parts boxes. Whereas your windows boxes will run on HP or IBM or whatever real server-grade hardware. I don't know anybody who claims to run a professional business who uses IDE drives in a Windows server, for example. I know several who run linux or freebsd on it. > Regular POSIX: > 1. We can't write through cache. > 2. We have good OS with good filesystem (probably even > journaled). NTFS is journaled, BTW. And I've seen a lot more corruption on ext2, ext3 or reiser than I've seen on NTFS in my datacenter - and I have about 5 times more Windows servers than linux... Granted other unixen might be more stable, I don't run any of those.. > 3. The probably of mediocre hardware is lower. See above. > Why shouldn't we offer reliable option to win32? *We do offer a reliable option*. Same as on POSIX, we don't enable it by default for *non-server hardware*. > Options: > > - Win32 guy complains that PG is bit slow. > We tell him to RTFM. 
What most often happens here is: Win32 guy notices PG is very slow, changes to mysql or mssql. > PS. Yeah, I was the guy who helped him to restore what's left. > I'd say he wasn't exactly happy. I bet. Has he looked over all his other windows servers that are improperly configured with regards to write cache? //Magnus
On Tue, Aug 09, 2005 at 10:02:44AM +0200, Magnus Hagander wrote: > > > It behaves the same on Unix as Win32, and if you have > > battery-backed > > > cache, you don't need writethrough, so we don't have it as > > default. I > > Correction, if you have bbwc, you *should not* have writethrough. Not > only do you not need it, enabling it will drastically lower performance. So what? The user should read the docs on how to get good performance. > > Also, why can't win32 be safe without battery-backed cache? > > I can't see such requirement on other platforms. > > It can, you just need to learn how to configure your system. There are > two different options to make it safe on win32 without battery backed > cache: I personally do not use PostgreSQL in win32 (yet - this may change). I just felt the pain of a guy who tried... > in traditional windows style *a single checkbox* in the harddisk > configuration. > (Granted, you need a modern windows for that. On older windows it's a > registry key) I think PostgreSQL should be reliable by default. Now with the Windows port there are a lot of people who just try it out on a regular desktop machine. With the point-n-click installer there's no need to read the docs, and after experiencing the unreliability they won't take it as a serious database. > I have some code floating in my tree to issue a WARNING on startup if > write cache is enabled and postgresql is not using writethrough. It's > not quite ready yet, but if such a thing would be accepted post > feature-freeze I can have it finished in good time before 8.1. It would > be quite simple (looking at just the main data directory for example, > ignoring tablespaces), but if you're dealing with complex installations > you'd better have a clue about how windows works anyway... Hey, that's a good idea, irrespective of whether the default changes or not. I think if it's just a couple of checks and then printf, it should not meet much resistance. -- marko
> > > Also, why can't win32 be safe without battery-backed cache? > > > I can't see such requirement on other platforms. > > > > It can, you just need to learn how to configure your > system. There are > > two different options to make it safe on win32 without > battery backed > > cache: > > I personally do not use PostgreSQL in win32 (yet - this may > change). I just felt the pain of a guy who tried... Didn't mean "you" as in you personally, meant "you" as in the user. Sorry. > > in traditional windows style *a single checkbox* in the harddisk > > configuration. > > (Granted, you need a modern windows for that. On older > windows it's a > > registry key) > > I think PostgreSQL should reliable by default. For that I think we need to set it to fsync() on all platforms. It's the least unsafe one on POSIX and it's the safe one on Win32. > Now with the Windows port there are lot of people who just > try it out on regular desktop machine. Sure, but if you're just trying it out, it's not going to kill you if you lose the data... > With point-n-click installer there's no need to read docs and > after experiencing the unreliability they won't take it as > serious database. Well, the same reasoning applies to the fact that they won't take it as a serious database because it's too slow. Perhaps we need to provide an option in the installer to control what goes in the initialized database. With an explanation ("don't enable this if you use IDE disks and care about your data"). > > I have some code floating in my tree to issue a WARNING on > startup if > > write cache is enabled and postgresql is not using > writethrough. It's > > not quite ready yet, but if such a thing would be accepted post > > feature-freeze I can have it finished in good time before 8.1. 
It > > would be quite simple (looking at just the main data directory for > > example, ignoring tablespaces), but if you're dealing with complex > > installations you'd better have a clue about how windows > works anyway... > > Hey, thats a good idea, irrespective whether the default > changes or not. > > I think if it's just couple of checks and then printf, it > should not meet much resistance. That's the general idea - I'm hoping it will be that simple at least :-) //Magnus
On Tue, Aug 09, 2005 at 10:08:25AM +0200, Magnus Hagander wrote: > That can definitly be debated. Properly maintaned on proper hardware, > it's quite reliable these days. > Most filesystem corruptions that happen on windows are because people > enable write caching on drives without battery backup. The same issue > we're facing here, it's *not* a problem in the fs, it's a problem in the > admin. Sure, there are lots of things that could be better with ntfs, > but I would definitly not call it unreliable. People enable? Isn't it the default? > > 3. The probability of mediocre hardware is higher. > > I would say it's actually *lower*. If you look in the average > datacenter, I bet you'll find a lot more linux boxes running on > built-at-home-with-the-cheapest-parts boxes. Whereas your windows boxes > will run on HP or IBM or whatever real server-grade hardware. > > I don't know anybody who claims to run a professional business who uses > IDE drives in a Windows server, for example. I know several who run > linux or freebsd on it. The professional probably tests it on his own desktop. I don't think PostgreSQL reaches the data center before passing the run on desktop. > > Regular POSIX: > > 1. We can't write through cache. > > 2. We have good OS with good filesystem (probably even > > journaled). > > NTFS is journaled, BTW. And I've seen a lot more corruption on ext2, > extr3 or reiser than I'ev seen on NTFS in my datacenter - and I have > about 5 times more Windows server than linux... > Granted other unixen might be more stable, I don't run any of those.. > > > 3. The probably of mediocre hardware is lower. > > See above. Ok, comparing impressions is not productive. > > Why shouldn't we offer reliable option to win32? > > *we do offer a reliabel option*. > Same as on POSIX, we don't enable it by default for *non-server > hardware*. What do you mean here? AFAIK we try to be reliable on POSIX too. > > Options: > > > > - Win32 guy complains that PG is bit slow. 
> > We tell him to RTFM. > > What most often happens here is: > Win32 guy notices PG is very slow, changes to mysql or mssql. But lost database is no problem? -- marko
> > That can definitly be debated. Properly maintaned on proper > hardware, > > it's quite reliable these days. > > Most filesystem corruptions that happen on windows are > because people > > enable write caching on drives without battery backup. The > same issue > > we're facing here, it's *not* a problem in the fs, it's a > problem in > > the admin. Sure, there are lots of things that could be better with > > ntfs, but I would definitly not call it unreliable. > > People enable? Isn't it the default? I dunno about workstation OS, but on the server OSes it certainly isn't default. > > > 3. The probability of mediocre hardware is higher. > > > > I would say it's actually *lower*. If you look in the average > > datacenter, I bet you'll find a lot more linux boxes running on > > built-at-home-with-the-cheapest-parts boxes. Whereas your windows > > boxes will run on HP or IBM or whatever real server-grade hardware. > > > > I don't know anybody who claims to run a professional business who > > uses IDE drives in a Windows server, for example. I know > several who > > run linux or freebsd on it. > > The professional probably tests it on his own desktop. I > don't think PostgreSQL reaches the data center before passing > the run on desktop. I can't speak for others, but I would always test a server product on a server OS on server hardware. Certainly not as beefy as eventual production server, but the same level. Otherwise the test is not fully relevant. > > > Why shouldn't we offer reliable option to win32? > > > > *we do offer a reliabel option*. > > Same as on POSIX, we don't enable it by default for *non-server > > hardware*. > > What do you mean here? AFAIK we try to be reliable on POSIX too. AFAIK fsync is slightly safer than open_sync, because it also flushes the metadata. We don't default to that. > > > Options: > > > > > > - Win32 guy complains that PG is bit slow. > > > We tell him to RTFM. 
> > > > What most often happens here is: > > Win32 guy notices PG is very slow, changes to mysql or mssql. > > But lost database is no problem? > It certainly is. That's not what I'm arguing. What I'm saying is that you shouldn't expect server-grade reliability on desktop hardware and desktop OS. Regardless of platform. //Magnus
On Tue, Aug 09, 2005 at 12:14:09PM +0200, Magnus Hagander wrote: > > > That can definitly be debated. Properly maintaned on proper > > hardware, > > > it's quite reliable these days. > > > Most filesystem corruptions that happen on windows are > > because people > > > enable write caching on drives without battery backup. The > > same issue > > > we're facing here, it's *not* a problem in the fs, it's a > > problem in > > > the admin. Sure, there are lots of things that could be better with > > > ntfs, but I would definitly not call it unreliable. > > > > People enable? Isn't it the default? > > I dunno about workstation OS, but on the server OSes it certainly isn't > default. At least on XP Pro it is default. > > The professional probably tests it on his own desktop. I > > don't think PostgreSQL reaches the data center before passing > > the run on desktop. > > I can't speak for others, but I would always test a server product on a > server OS on server hardware. Certainly not as beefy as eventual > production server, but the same level. Otherwise the test is not fully > relevant. You are right, but it does not always happen that way. Also think of developers who run a dev-server on a desktop. > > > > Why shouldn't we offer reliable option to win32? > > > > > > *we do offer a reliabel option*. > > > Same as on POSIX, we don't enable it by default for *non-server > > > hardware*. > > > > What do you mean here? AFAIK we try to be reliable on POSIX too. > > AFAIK fsync is slightly safer than open_sync, because it also flushes > the metadata. We don't default to that. At least for WAL, the metadata does not change so it should not matter. Now thinking about it, the guy had a corrupt table, not a corrupt WAL log. How is WAL->tables synced? Does the 'wal_sync_method' affect it or not? Of course, postgres could get corrupt data from WAL and put it into the table. (AFAIK NTFS does not log data, so we are back on wal_sync_method.) 
> > > > Options: > > > > > > > > - Win32 guy complains that PG is bit slow. > > > > We tell him to RTFM. > > > > > > What most often happens here is: > > > Win32 guy notices PG is very slow, changes to mysql or mssql. > > > > But lost database is no problem? > > It certainly is. That's not what I'm arguing. What I'm saying is that > you shouldn't expect server grade reliabilty on desktop hardware and > desktop OS. Regardless of platform. But we should expect server-grade speed? ;) -- marko
> > I dunno about workstation OS, but on the server OSes it certainly > > isn't default. > > At least on XP Pro it is default. Yuck. > > > The professional probably tests it on his own desktop. I don't > > > think PostgreSQL reaches the data center before passing > the run on > > > desktop. > > > > I can't speak for others, but I would always test a server > product on > > a server OS on server hardware. Certainly not as beefy as eventual > > production server, but the same level. Otherwise the test > is not fully > > relevant. > > You are right, but it always does not happen so. Also think > of developers who run a dev-server on a desktop. Well, with developers, losing your data really isn't all that bad. It's a lot easier to deal with than losing a server :-) > > > > > Why shouldn't we offer reliable option to win32? > > > > > > > > *we do offer a reliabel option*. > > > > Same as on POSIX, we don't enable it by default for *non-server > > > > hardware*. > > > > > > What do you mean here? AFAIK we try to be reliable on POSIX too. > > > > AFAIK fsync is slightly safer than open_sync, because it > also flushes > > the metadata. We don't default to that. > > At least for WAL, the metadata does not change so it should > not matter. In most cases, right. In some cases it does (create a new WAL log segment for example). It's not a very common scenario, but I've seen error reports saying that an entire WAL segment is missing, which is probably from metadata not being on disk at crash time. (This is one thing that's "better" with the dbs that stuff everything in a single precreated file (for example mssql) - the only metadata in the filesystem there is the "latest write time", which is completely irrelevant to the data) > Now thinking about it, the guy had corrupt table, not WAL log. > How is WAL->tables synched? Does the 'wal_sync_method' > affect it or not? I *think* it always fsyncs() there as it is now, but I'm not 100% sure. 
> Ofcourse, postgres could get corrupt data from WAL and put it > into table. (AFAIK NTFS does not log data, so we are back on > wal_sync_method.) Correct, and I believe that's true for most Unix journaling fs:s as well - they only journal metadata. Also, once a checkpoint has occurred, postgresql will discard the WAL log. If the sync came through for the checkpoint record in the WAL file but not in the contents of the datafile, the recovery process will think that the file is ok even though it isn't. > > It certainly is. That's not what I'm arguing. What I'm > saying is that > > you shouldn't expect server grade reliabilty on desktop > hardware and > > desktop OS. Regardless of platform. > > But we should expect server-grade speed? ;) Touché :-) //Magnus
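The checkpoint hazard described above can be shown with a toy model. The sketch below is not how PostgreSQL recovery is implemented; the record shapes and the `recover` function are invented purely to illustrate why a checkpoint record that reached the disk, combined with a data-file write that did not, leaves recovery with nothing to replay.

```python
# Toy model only: WAL records are (kind, page, value) tuples, and recovery
# replays just the records that follow the last checkpoint found on disk.

def recover(durable_wal, durable_pages):
    """Return page contents after replaying WAL written after the last checkpoint."""
    pages = dict(durable_pages)
    start = 0
    for i, (kind, _page, _value) in enumerate(durable_wal):
        if kind == "checkpoint":
            start = i + 1  # everything before this is assumed to be in the data files
    for kind, page, value in durable_wal[start:]:
        if kind == "update":
            pages[page] = value
    return pages


wal = [("update", "p1", "new"), ("checkpoint", None, None)]

# Sync honoured: the data file really holds the new value, so all is well.
honest = recover(wal, {"p1": "new"})

# Write cache lied: checkpoint record is on disk, the data page is not.
# Recovery skips the update because it precedes the checkpoint.
lost = recover(wal, {"p1": "old"})
```

In the second case `lost` still holds the old value for `p1`, and nothing in the surviving WAL says otherwise, which is the silent corruption being discussed.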
On Tue, Aug 09, 2005 at 12:58:31PM +0200, Magnus Hagander wrote: > > Now thinking about it, the guy had corrupt table, not WAL log. > > How is WAL->tables synched? Does the 'wal_sync_method' > > affect it or not? > > I *think* it always fsyncs() there as it is now, but I'm not 100% sure. No. If fsync is off, then no fsync is done to the data files on checkpoint either. (See mdsync() in src/backend/storage/smgr/md.c) -- Alvaro Herrera (<alvherre[a]alvh.no-ip.org>) A male gynecologist is like an auto mechanic who never owned a car. (Carrie Snow)
> > > Now thinking about it, the guy had corrupt table, not WAL log. > > > How is WAL->tables synched? Does the 'wal_sync_method' > > > affect it or not? > > > > I *think* it always fsyncs() there as it is now, but I'm > not 100% sure. > > No. If fsync is off, then no fsync is done to the data files > on checkpoint either. (See mdsync() on src/backend/storage/smgr/md.c) Right, but we're not talking fsync=off, we're talking when you are using fdatasync, O_SYNC etc. If you turn off fsync you're on your own, no matter the OS or other settings... //Magnus
On Tue, Aug 09, 2005 at 04:05:28PM +0200, Magnus Hagander wrote: > > > > Now thinking about it, the guy had corrupt table, not WAL log. > > > > How is WAL->tables synched? Does the 'wal_sync_method' > > > > affect it or not? > > > > > > I *think* it always fsyncs() there as it is now, but I'm > > not 100% sure. > > > > No. If fsync is off, then no fsync is done to the data files > > on checkpoint either. (See mdsync() on src/backend/storage/smgr/md.c) > > Right, but we're not talking fsync=off, we're talking when you are using > fdatasync, O_SYNC etc. Oh, sorry :-) At that point, pg_fsync is called, which can invoke commit() or fsync() depending on whether you have writethrough enabled. See pg_fsync() in storage/file/fd.c -- Alvaro Herrera (<alvherre[a]alvh.no-ip.org>) FOO MANE PADME HUM
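Putting the last two replies together, the data-file flush at checkpoint time behaves roughly like the sketch below. This is a paraphrase of the description in the thread, not a transcription of fd.c: the function name, the return strings, and the `writethrough_flush` callback (standing in for whatever platform call performs a write-through flush) are all hypothetical.

```python
import os
import tempfile


def flush_data_file(fd, fsync_enabled, writethrough, writethrough_flush=None):
    """Flush one data file and report which path was taken.

    fsync_enabled      -- models the 'fsync' setting; when off, nothing is flushed
    writethrough       -- models a write-through sync method being selected
    writethrough_flush -- hypothetical platform-specific flush, supplied by the caller
    """
    if not fsync_enabled:
        return "skipped"            # fsync=off: you are on your own
    if writethrough and writethrough_flush is not None:
        writethrough_flush(fd)      # ask the drive to empty its cache too
        return "writethrough"
    os.fsync(fd)                    # ordinary flush of OS buffers
    return "fsync"
```

The point of the sketch is only that the data files follow the fsync on/off switch and the write-through choice, rather than the open_sync or fdatasync variants used for WAL writes.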
On Tue, Aug 09, 2005 at 12:25:36PM +0300, Marko Kreen wrote: > On Tue, Aug 09, 2005 at 10:08:25AM +0200, Magnus Hagander wrote: > > Most filesystem corruptions that happen on windows are because people > > enable write caching on drives without battery backup. The same issue > > we're facing here, it's *not* a problem in the fs, it's a problem in the > > admin. Sure, there are lots of things that could be better with ntfs, > > but I would definitly not call it unreliable. > People enable? Isn't it the default? I think a little too much speculation in this thread, and not enough real data... :-) I only have Windows notebooks, and pre-configured systems by the company I work for to judge. The notebooks of course have it 'on' (battery packed, and if it wasn't on, I would have enabled it myself). I won't bother to check the corporate systems, as whatever they are, they may not be the Windows system default. Who knows for real? In any case - I disagreed with the conclusions presented that suggested that Windows had a poor file system, or should be linked with poor hardware. Seems like FUD to me, and doesn't match my experiences. I agree with the other poster that Windows hardware is usually better in actual professional server environments. It might be because people feel Windows requires better hardware to be stable, or it might be that Windows applications tend to use more memory and disk space, therefore the recommended entry level system is of higher quality. It doesn't matter why people do it - or even if their reasons are valid - what does matter, is that it isn't a fair conclusion that Windows boxes will use poorer hardware. The opposite may be true, or neither may be true. > > I don't know anybody who claims to run a professional business who uses > > IDE drives in a Windows server, for example. I know several who run > > linux or freebsd on it. > The professional probably tests it on his own desktop. 
I don't > think PostgreSQL reaches the data center before passing the run > on desktop. I don't know why this would be relevant. The 'professional' may do some sort of local testing, but this doesn't negate the requirement for server testing, as it should be well known that the environment is sufficiently different, and therefore the expectations should be sufficiently different. The 'professional' may choose to enable write caching, because they don't care about reliability on their local system. If it crashes, they re-clone their system, and re-populate the database. In any case, this is more speculation, and not productive. > > > Options: > > > - Win32 guy complains that PG is bit slow. > > > We tell him to RTFM. > > What most often happens here is: > > Win32 guy notices PG is very slow, changes to mysql or mssql. > But lost database is no problem? Personally, my only complaint regarding either choice is the assumption that a 'WIN32' guy is stupid, and that 'WIN32' itself is deficient. As long as the default is well documented, I don't have a problem with either 'faster but less reliable on systems configured for speed over reliability at the operating system level (write caching enabled)' or 'slower, but reliable, just in case the system is configured for speed over reliability at the operating system level (write caching enabled)'. As long as it is well documented, either is fine. I'm not convinced that Linux is really that much safer anyways, and when it comes to a standard WIN32 configuration option, I assume that the WIN32 administrator is somewhat competent. You guys are too deep-rooted in UNIX-land. I can't entirely blame you - but the world is bigger than UNIX. :-) Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com | Neighbourhood Coder | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
Magnus Hagander wrote: > > Now thinking about it, the guy had corrupt table, not WAL log. > > How is WAL->tables synched? Does the 'wal_sync_method' > > affect it or not? > > I *think* it always fsyncs() there as it is now, but I'm not 100% sure. wal_sync_method is also used to flush pages during a checkpoint, so it could lead to table corruption too, not just WAL corruption. However, on Unix, 99% of corruption is caused by bad disk or RAM. -- Bruce Momjian
Magnus Hagander wrote: > > > I dunno about workstation OS, but on the server OSes it certainly > > > isn't default. > > > > At least on XP Pro it is default. > > Yuck. I see "enable write caching" as enabled by default on my XP Pro laptop, though laptops can be said to already have battery-backed disks. -- Bruce Momjian
Magnus Hagander wrote: > > > > Now thinking about it, the guy had corrupt table, not WAL log. > > > > How is WAL->tables synched? Does the 'wal_sync_method' > > > > affect it or not? > > > > > > I *think* it always fsyncs() there as it is now, but I'm > > not 100% sure. > > > > wal_sync_method is also used to flush pages during a > > checkpoint, so it could lead to table corruption too, not > > just WAL corruption. > > > > However, on Unix, 99% of corruption is caused by bad disk or RAM. > > ... or iDE disks with write cache enabled. I've certainly seen more than > what I'd call 1% (though I haven't studied it to be sure) that's because > of write-cached disks... Personally, I can't remember a case that was caused by something other than bad RAM or bad disk. Let me write up a section in the manual on this for 8.1, and link it to the wal_sync_method documentation section, and see how it looks. Even re-ordering the items in the docs and making bullets has made it clearer to me what is happening, and what is the default. -- Bruce Momjian
> > > Now thinking about it, the guy had corrupt table, not WAL log. > > > How is WAL->tables synched? Does the 'wal_sync_method' > > > affect it or not? > > > > I *think* it always fsyncs() there as it is now, but I'm > not 100% sure. > > wal_sync_method is also used to flush pages during a > checkpoint, so it could lead to table corruption too, not > just WAL corruption. > > However, on Unix, 99% of corruption is caused by bad disk or RAM. ... or IDE disks with write cache enabled. I've certainly seen more than what I'd call 1% (though I haven't studied it to be sure) that's because of write-cached disks... //Magnus
On 2005-08-09, "Magnus Hagander" <mha@sollentuna.net> wrote: > ... or iDE disks with write cache enabled. I've certainly seen more than > what I'd call 1% (though I haven't studied it to be sure) that's because > of write-cached disks... Every SCSI disk I've looked at recently has had write cache enabled by default, fwiw. Turning it off isn't quite the performance killer that it is on IDE, of course, but it is there. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services
Andrew - Supernews <andrew+nonews@supernews.com> writes: > On 2005-08-09, "Magnus Hagander" <mha@sollentuna.net> wrote: >> ... or iDE disks with write cache enabled. I've certainly seen more than >> what I'd call 1% (though I haven't studied it to be sure) that's because >> of write-cached disks... > Every SCSI disk I've looked at recently has had write cache enabled by > default, fwiw. On SCSI, write caching is default because the protocol is actually designed to support it: the drive can take the data, and then take some more, without giving the impression that the write has been done. If a SCSI drive reports write complete when it hasn't actually put the bits on the platter yet, then it's simply broken. regards, tom lane
On 2005-08-09, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew - Supernews <andrew+nonews@supernews.com> writes: >> On 2005-08-09, "Magnus Hagander" <mha@sollentuna.net> wrote: >>> ... or iDE disks with write cache enabled. I've certainly seen more than >>> what I'd call 1% (though I haven't studied it to be sure) that's because >>> of write-cached disks... > >> Every SCSI disk I've looked at recently has had write cache enabled by >> default, fwiw. > > On SCSI, write cacheing is default because the protocol is actually > designed to support it: the drive can take the data, and then take some > more, without giving the impression that the write has been done. Wrong. Write caching as controlled by the WCE parameter on mode page 8 for direct-access devices does in fact report the write operation as complete before the bits are on the disk. The protocol supplies a number of additional commands to flush the cache, etc., for which you'll have to consult the specs. The reason it's not so much of a performance killer to turn it off is that tag-queueing (which is what you are referring to) provides for some optimization of concurrent requests even with the cache off. > If a SCSI drive reports write complete when it hasn't actually put the > bits on the platter yet, then it's simply broken. I guess you haven't read the spec much, then. -- Andrew, Supernews http://www.supernews.com - individual and corporate NNTP services
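For anyone wanting to check the WCE setting being argued about, the bit can be read out of a raw caching mode page along the lines below. This is a sketch under assumptions: the byte layout used (page code 0x08 in the low six bits of byte 0, WCE as bit 2 and RCD as bit 0 of byte 2) is my recollection of the SCSI block command spec and should be verified against it, and fetching the page from a real drive (a MODE SENSE command) is not shown.

```python
# Hypothetical helper: decode the write-cache flags from a raw caching mode page.
# Assumed layout: byte 0 = page code (low 6 bits), byte 1 = page length,
# byte 2 = flags, with WCE at bit 2 and RCD at bit 0.

CACHING_PAGE_CODE = 0x08
WCE_BIT = 0x04
RCD_BIT = 0x01


def parse_caching_mode_page(page):
    """Return {'wce': bool, 'rcd': bool} for a caching mode page given as bytes."""
    if len(page) < 3:
        raise ValueError("mode page too short")
    page_code = page[0] & 0x3F          # mask off the top two bits of byte 0
    if page_code != CACHING_PAGE_CODE:
        raise ValueError(f"not a caching mode page: 0x{page_code:02x}")
    flags = page[2]
    return {"wce": bool(flags & WCE_BIT), "rcd": bool(flags & RCD_BIT)}
```

A drive reporting `wce` as true will acknowledge writes from its cache, which is exactly the case where the host has to issue an explicit cache flush to get durability.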
Andrew - Supernews <andrew+nonews@supernews.com> writes: >> If a SCSI drive reports write complete when it hasn't actually put the >> bits on the platter yet, then it's simply broken. > I guess you haven't read the spec much, then. [ shrug... ] I have seen that spec before: I was making a living by implementing SCSI device drivers in the mid-80's. I think that anyone who uses WCE in place of tagged command queueing is not someone whose code I would care to rely on for mission-critical applications. TCQ is a design that just works; WCE is someone's attempt to emulate all the worst features of IDE. regards, tom lane
On Tue, Aug 09, 2005 at 11:01:36PM -0400, Tom Lane wrote: > Andrew - Supernews <andrew+nonews@supernews.com> writes: > >> If a SCSI drive reports write complete when it hasn't actually put the > >> bits on the platter yet, then it's simply broken. > > I guess you haven't read the spec much, then. > [ shrug... ] I have seen that spec before: I was making a living by > implementing SCSI device drivers in the mid-80's. I think that anyone > who uses WCE in place of tagged command queueing is not someone whose > code I would care to rely on for mission-critical applications. TCQ > is a design that just works; WCE is someone's attempt to emulate all > the worst features of IDE. They're relying on you, not you on them. Is their reliance founded upon reasonable logic, or are they unreasonably putting the fault in your court? Depends on the issue... Many people would not like to need to know these 'under the hood' type issues. This doesn't mean they deserve to have their databases corrupted to teach them the hard way why these 'under the hood' type details are useful to know... :-) Cheers, mark
On 8/9/05, mark@mark.mielke.cc <mark@mark.mielke.cc> wrote: > Personally, my only complaint regarding either choice is the > assumption that a 'WIN32' guy is stupid, and that 'WIN32' itself is > deficient. As long as the default is well documented, I don't have a > problem with either 'faster but less reliable on systems configured > for speed over reliability at the operating system level (write > caching enabled)' or 'slower, but reliable, just in case the system is > configured for speed over reliability at the operating system level > (write caching enabled)'. As long as it is well documented, either is > fine. I'm not convinced that Linux is really that much safer anyways, > and when it comes to a standard WIN32 configuration option, I assume > that the WIN32 administrator is somewhat competent. Hello guys, There seem to be arguments for both possible default configurations, "faster but less reliable" and "slower but reliable". I personally think that the safer configuration is better. Anyway, I have an idea: what do you think about letting the person who installs PostgreSQL on Win32 decide? For Windows, we have the graphical installer, which can be improved so that the user is asked to choose between the two possible configurations. This way the user will be aware of this choice even if he/she does not read the docs. If we let this choice be made at installation time, it would matter less which value is the default, because I think that fewer users install PostgreSQL from sources on Win32. And we can expect that, after bothering to install mingw and compile PostgreSQL, they will also bother to configure it according to their needs. Cheers, Adrian Maier
I was recently witness to a benchmark of 7.4.5 on Solaris 9 wherein it was apparently demonstrated that fsync was the fastest option among the 7.4.x wal_sync_method options. If there's a way to make this information more useful by providing more data, please let me know, and I'll see what I can do. -- Thomas F. O'Connell Co-Founder, Information Architect Sitening, LLC Strategic Open Source: Open Your i™ http://www.sitening.com/ 110 30th Avenue North, Suite 6 Nashville, TN 37203-6320 615-469-5150 615-469-5151 (fax) On Aug 8, 2005, at 4:44 PM, Bruce Momjian wrote: > In summary, we added all those wal_sync_method values in hopes of > getting some data on which is best on which platform, but having gone > several years with few reports, I am thinking we should just choose > the > best ones we can and move on, rather than expose a confusing API to > the > users. > > Does anyone show a platform where the *data* options are slower > than the > non-*data* ones?
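Since the recurring complaint in this thread is the lack of hard numbers, a rough harness like the one below is one way to collect some. It is a sketch, not a substitute for testing PostgreSQL itself: the function names and the 8 kB block size are choices made here, it covers only the methods Python exposes on the running platform (fsync_writethrough is left out), and results depend heavily on the drive's write cache setting.

```python
import os
import tempfile
import time


def available_methods():
    """List (name, extra open flags, sync function or None) usable on this platform."""
    methods = [("fsync", 0, os.fsync)]
    if hasattr(os, "fdatasync"):
        methods.append(("fdatasync", 0, os.fdatasync))
    if hasattr(os, "O_SYNC"):
        methods.append(("open_sync", os.O_SYNC, None))
    if hasattr(os, "O_DSYNC"):
        methods.append(("open_datasync", os.O_DSYNC, None))
    return methods


def time_method(extra_flags, sync, writes=100, block=8192):
    """Time `writes` synchronous block overwrites of a preallocated file."""
    payload = b"\0" * block
    tmp_fd, path = tempfile.mkstemp()
    try:
        # Preallocate so the timed loop overwrites existing blocks, loosely
        # mimicking a precreated WAL segment whose size does not change.
        os.write(tmp_fd, payload * writes)
        os.fsync(tmp_fd)
    finally:
        os.close(tmp_fd)
    fd = os.open(path, os.O_WRONLY | extra_flags)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, payload)
            if sync is not None:
                sync(fd)        # open_* methods sync as part of the write itself
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)


def run_benchmark(writes=100):
    """Return {method name: elapsed seconds} for every available method."""
    return {name: time_method(flags, sync, writes)
            for name, flags, sync in available_methods()}
```

Calling `run_benchmark()` on the filesystem that will hold the WAL, with the write cache in its production setting, gives a first answer to whether the *data* variants are ever slower on a given platform.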
On 2005-08-10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew - Supernews <andrew+nonews@supernews.com> writes:
>>> If a SCSI drive reports write complete when it hasn't actually put the
>>> bits on the platter yet, then it's simply broken.
>
>> I guess you haven't read the spec much, then.
>
> [ shrug... ] I have seen that spec before: I was making a living by
> implementing SCSI device drivers in the mid-80's. I think that anyone
> who uses WCE in place of tagged command queueing is not someone whose
> code I would care to rely on for mission-critical applications. TCQ
> is a design that just works; WCE is someone's attempt to emulate all
> the worst features of IDE.

1) Tag queueing and WCE are orthogonal concepts. It's not a question of using one "in place of" the other. My comment was that my recent observation of actual SCSI drives is that WCE is enabled by default and as such _will_ be used unless either you disable it manually, or the host OS does so.

2) What OSes in common use adapt to the WCE setting, either by turning it off, or using FUA or issuing SYNCHRONIZE CACHE commands? Since it is entirely transparent to the host OS, I do not believe any are, though it looks like very recent Linux development is moving in this direction.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
On Wed, Aug 10, 2005 at 02:11:48AM -0500, Thomas F. O'Connell wrote:
> I was recently witness to a benchmark of 7.4.5 on Solaris 9 wherein
> it was apparently demonstrated that fsync was the fastest option
> among the 7.4.x wal_sync_method options.
>
> If there's a way to make this information more useful by providing
> more data, please let me know, and I'll see what I can do.

What would be really interesting to me to know is what Sun did between 8 and 9 to make that so. We don't use Solaris for databases any more, but fsync was a lot slower than whatever we ended up using on 8. I wouldn't be surprised if they'd wired fsync directly to something else; but I can hardly believe it'd be faster than any other option. (Mind, we were using Veritas filesystem with this, as well, which was at least half the headache.)

A

--
Andrew Sullivan | ajs@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the marketplace.
--Philip Greenspun
UFS was the filesystem on the Solaris 9 box.

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC

Strategic Open Source: Open Your i™

http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-469-5150
615-469-5151 (fax)

On Aug 11, 2005, at 4:18 PM, Andrew Sullivan wrote:

> On Wed, Aug 10, 2005 at 02:11:48AM -0500, Thomas F. O'Connell wrote:
>
>> I was recently witness to a benchmark of 7.4.5 on Solaris 9 wherein
>> it was apparently demonstrated that fsync was the fastest option
>> among the 7.4.x wal_sync_method options.
>>
>> If there's a way to make this information more useful by providing
>> more data, please let me know, and I'll see what I can do.
>>
>
> What would be really interesting to me to know is what Sun did
> between 8 and 9 to make that so. We don't use Solaris for databases
> any more, but fsync was a lot slower than whatever we ended up using
> on 8. I wouldn't be surprised if they'd wired fsync directly to
> something else; but I can hardly believe it'd be faster than any
> other option. (Mind, we were using Veritas filesystem with this, as
> well, which was at least half the headache.)
>
> A
On Mon, Aug 08, 2005 at 07:45:38PM -0400, Andrew Dunstan wrote:
> So the short answer is possibly "You build the tests and we'll run 'em."

Would some version of dbt2/3 work for this?

--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com 512-569-9461
On Sun, 21 Aug 2005 19:27:35 -0500 "Jim C. Nasby" <jnasby@pervasive.com> wrote:
> On Mon, Aug 08, 2005 at 07:45:38PM -0400, Andrew Dunstan wrote:
> > So the short answer is possibly "You build the tests and we'll run 'em."
>
> Would some version of dbt2/3 work for this?

Yeah, trying... On the larger system I'm using I'm not seeing much of a performance difference but I'm looking for a way to see if we can identify any benefit to bypassing the kernel cache. I've been re-arranging disks due to failures and trying to tweak a couple of profiling things, but I'll try to get some data to share within a few days.

Mark