Thread: The dangers of "-F"

The dangers of "-F"

From
Don Baccus
Date:
I've mentioned in the past that the fsync following
every select, even when no data is modified, is a 
killer for high-volume web sites that make many short,
read-only hits on the database (for page customization,
for example).

I know that fixing this is on the "to do" list.  I've
known of the "-F" switch for some time, but the recent
round of posts triggered by someone observing lots of
disk thrashing and the fact that I'm getting close to
going online with my first round of web services based
on Postgres motivated me to give it a try.

It's very, very nice to have the disk silent when 
hitting it with a bunch of simultaneous "selects"
from different http connections.  It really increases
throughput, and is much, much kinder to the disk.
The difference for lots of short hits is very high.

So obviously I'm really looking forward to the day
when a read-only select doesn't trigger a write to
pg_log (which apparently is the problem?) and an
"fsync the world" operation.

In the interim, just how dangerous is it to run with
"-F"? 

Am I risking corruption of the db and a total rebuild,
or will I just lose transactions but be left with a
consistent database if the machine goes down?




- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, and other goodies at
  http://donb.photo.net


Re: [HACKERS] The dangers of "-F"

From
Bruce Momjian
Date:
> So obviously I'm really looking forward to the day
> when a read-only select doesn't trigger a write to
> pg_log (which apparently is the problem?) and an
> "fsync the world" operation.
> 
> In the interim, just how dangerous is it to run with
> "-F"? 
> 
> Am I risking corruption of the db and a total rebuild,
> or will I just lose transactions but be left with a
> consistent database if the machine goes down?

Running without fsync is only dangerous if your OS or hardware crashes
without flushing the disk. Anything else is unaffected, and is just as
reliable.

The database could be inconsistent, in the sense that partial
transactions are recorded as completed.

I think it is a major issue too.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: [HACKERS] The dangers of "-F"

From
Don Baccus
Date:
At 06:38 PM 6/22/99 -0400, Bruce Momjian wrote:

>No Fsync is only dangerous if your OS or hardware crashes without
>flushing the disk. Anything else is unaffected, and is just as reliable.

Yes, this much I realize...

>The database could be inconsistent, in the sense that partial
>transactions are recorded as completed.

With recovery possible without a rebuild?  Or is rebuilding
from dumps required?  (I dump nightly and copy the results
to a second machine for additional safety, and soon will
be ftp'ing dump files to the east coast for even more
safety).  

Perhaps fsync'ing then is only LESS dangerous, since
a system can crash while blocks are being written even
when fsync is enabled.  The window of evil opportunity
for a system crash is much smaller than if the data's sitting
around for a lengthy time in the Linux FS cache, of course,
but not absent.

Or does the fact that the backend loses control over the
order in which stuff is written (in other words, blocks
are written whenever and in what order Linux chooses rather
than fsync'd a file at a time) mean that the kind of 
inconsistency that might result is different?  I.E.
log file written before datablocks are, that kind of
thing.

>I think it is a major issue too.

Is there any estimate of the difficulty of fixing it?
From previous discussions, it sounded as though new
bookkeeping would be needed to determine which queries
actually result in a change in data.





Re: [HACKERS] The dangers of "-F"

From
Bruce Momjian
Date:
> At 06:38 PM 6/22/99 -0400, Bruce Momjian wrote:
> 
> >No Fsync is only dangerous if your OS or hardware crashes without
> >flushing the disk. Anything else is unaffected, and is just as reliable.
> 
> Yes, this much I realize...
> 
> >The database could be inconsistent, in the sense that partial
> >transactions are recorded as completed.
> 
> With recovery possible without a rebuild?  Or is rebuilding
> from dumps required?  (I dump nightly and copy the results
> to a second machine for additional safety, and soon will
> be ftp'ing dump files to the east coast for even more
> safety).  


> 
> Perhaps fsync'ing then is only LESS dangerous, since
> a system can crash while blocks are being written even
> when fsync is enabled.  The window of evil opportunity
> for a system crash is much smaller than if the data's sitting
> around for a lengthy time in the Linux FS cache, of course,
> but not absent.

Yes, this is true, but it is much less likely, because with fsync the
data is flushed before the transaction is marked as completed.

> 
> Or does the fact that the backend loses control over the
> order in which stuff is written (in other words, blocks
> are written whenever and in what order Linux chooses rather
> than fsync'd a file at a time) mean that the kind of 
> inconsistency that might result is different?  I.E.
> log file written before datablocks are, that kind of
> thing.

Yes.  It is not a problem that a given transaction aborts while it is
being done because it couldn't have been marked as completed, but the
previous transaction was marked as completed, and only some blocks could
be on the disk.
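
A toy model may make that failure concrete. This is only a sketch: ToyDisk, run_transaction, is_consistent and count_inconsistent are hypothetical names, not Postgres code, and the "OS" here simply flushes each cached block with 50% probability before the crash. It assumes the crash happens after the transaction has finished.

```python
import random


class ToyDisk:
    """Toy model of an OS cache in front of a disk (hypothetical, not Postgres)."""

    def __init__(self):
        self.cache = {}  # blocks written but not yet on disk
        self.disk = {}   # blocks that would survive a crash

    def write(self, block, value):
        self.cache[block] = value

    def fsync(self):
        # Force everything written so far onto the disk.
        self.disk.update(self.cache)
        self.cache.clear()

    def crash(self, rng):
        # Before going down, the OS had flushed an arbitrary subset of the
        # cached blocks, in an order of its own choosing.
        for block in list(self.cache):
            if rng.random() < 0.5:
                self.disk[block] = self.cache[block]
        self.cache.clear()


def run_transaction(disk, use_fsync):
    disk.write("data-1", "row A")
    disk.write("data-2", "row B")
    if use_fsync:
        disk.fsync()  # data is on disk before the commit mark is written
    disk.write("log", "committed")
    if use_fsync:
        disk.fsync()


def is_consistent(disk):
    # A commit mark on disk must imply that both data blocks are on disk.
    if disk.disk.get("log") != "committed":
        return True
    return "data-1" in disk.disk and "data-2" in disk.disk


def count_inconsistent(use_fsync, trials=1000):
    bad = 0
    for seed in range(trials):
        disk = ToyDisk()
        run_transaction(disk, use_fsync)
        disk.crash(random.Random(seed))
        if not is_consistent(disk):
            bad += 1
    return bad
```

In this model count_inconsistent(True) is always zero, while count_inconsistent(False) is not: some crashes leave the commit mark on disk with one or both data blocks missing.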


> 
> >I think it is a major issue too.
> 
> Is there any estimate of the difficulty of fixing it?
> From previous discussions, it sounded as though new
> bookkeeping would be needed to determine which queries
> actually result in a change in data.

I hope for every release.  I tried to propose some solutions, but
couldn't code it.

 


Re: [HACKERS] The dangers of "-F"

From
Philip Warner
Date:
Is there any chance each database could be set up differently? Some of
my databases are updated once a month (literally), while others are
updated daily. It would be nice to have the -F setting on the
read-mostly DBs...
 

----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: +61-03-5367 7422            |                 _________  \
Fax: +61-03-5367 7430            |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: [HACKERS] The dangers of "-F"

From
Tom Lane
Date:
Philip Warner <pjw@rhyme.com.au> writes:
> Is there any chance each database could be set up differently? Some of
> my databases are updated once a month (literally), while others are
> updated daily. It would be nice to have the -F setting on the
> read-mostly DBs...

I don't think this is practical, because all the backends in a given
installation will be sharing the same buffer cache and the same pg_log
file; you can't run some with -F and some without and expect to get
the behavior you want.  Problem is that any of the backends might be
the one that writes out a particular disk block from cache.

You could run the two sets of databases as different installations
(ie, two postmasters, two listen ports, two working directories)
but that'd require all your clients knowing which port to connect to
for each database; probably not worth the trouble.

In practice, if you have a reliable OS, reliable hardware, and a
reliable power supply (read UPS), I think the risks introduced by
running with -F are negligible compared to other sources of trouble
(ie backend bugs)...
        regards, tom lane


Re: [HACKERS] The dangers of "-F"

From
Don Baccus
Date:
At 08:36 PM 6/22/99 -0400, Bruce Momjian wrote:

>> 
>> Or does the fact that the backend loses control over the
>> order in which stuff is written (in other words, blocks
>> are written whenever and in what order Linux chooses rather
>> than fsync'd a file at a time) mean that the kind of 
>> inconsistency that might result is different?  I.E.
>> log file written before datablocks are, that kind of
>> thing.

>Yes.  It is not a problem that a given transaction aborts while it is
>being done because it couldn't have been marked as completed, but the
>previous transaction was marked as completed, and only some blocks could
>be on the disk.

OK, this was what I suspected, and of course is the intuitively
obvious scenario.

In other words, "-F" considered - and proven! - harmful :)

>I hope for every release.  I tried to propose some solutions, but
>couldn't code it.

There was a bit of discussion about the cause of the problem
in this list earlier, so part of my re-raising it was an attempt
to encourage more discussion.  Not that I know enough about the
code to be of any help, I'm afraid.  When I first learned of
this problem (via my own experimentation) I dug around a bit
and it became clear that it wasn't obvious.  I.E. the disk
cache knows about dirty/not dirty buffers and takes great
care to only flush dirty ones, that level of stuff.  When I
heard that updating pg_log was apparently involved I realized
it was more of a higher-level than lower-level problem.

Sigh...

Or am I wrong?





Re: [HACKERS] The dangers of "-F"

From
Bruce Momjian
Date:
> >Yes.  It is not a problem that a given transaction aborts while it is
> >being done because it couldn't have been marked as completed, but the
> >previous transaction was marked as completed, and only some blocks could
> >be on the disk.
> 
> OK, this was what I suspected, and of course is the intuitively
> obvious scenario.
> 
> In other words, "-F" considered - and proven! - harmful :)
> 
> >I hope for every release.  I tried to propose some solutions, but
> >couldn't code it.
> 
> There was a bit of discussion about the cause of the problem
> in this list earlier, so part of my re-raising it was an attempt
> to encourage more discussion.  Not that I know enough about the
> code to be of any help, I'm afraid.  When I first learned of
> this problem (via my own experimentation) I dug around a bit
> and it became clear that it wasn't obvious.  I.E. the disk
> cache knows about dirty/not dirty buffers and takes great
> care to only flush dirty ones, that level of stuff.  When I
> heard that updating pg_log was apparently involved I realized
> it was more of a higher-level than lower-level problem.
> 
> Sigh...
> 
> Or am I wrong?

Writing the buffers to a file, and making sure they are on the disk are
different issues.  Also, fsync only comes into play in an OS crash, so
if that only happens once a year, and you are willing to restore from
tape in that case (or check the integrity of the data on reboot), -F
may be fine.

 


Re: [HACKERS] The dangers of "-F"

From
Bruce Momjian
Date:
> Is there any chance each database could be set up differently?
> Some of my databases are updated once a month (literally), while
> others are updated daily. It would be nice to have the -F setting
> on the read-mostly DBs...

Not sure.  pg_log is shared by all databases, so it would be hard.

 


Re: [HACKERS] The dangers of "-F"

From
Don Baccus
Date:
At 11:40 AM 6/23/99 -0400, Bruce Momjian wrote:

>Writing the buffers to a file, and making sure they are on the disk are
>different issues.  Also, fsync only comes into play in an OS crash, so
>if that only happens once a year, and you are willing to restore from
>tape in that case (or check the integrity of the data on reboot), -F
>may be fine.

Ironically, I ran all day yesterday with -F and my nightly
dump failed on table "foo", "couldn't read block 0".

I've seen this once before without use of -F so I think it's
mere coincidence.

I realize that writing buffers to a file and making sure they're
on disk are two different issues.  My point is that without the
fsync, the backend loses control over the order in which blocks
are written to the disk.

For instance, if there are assumptions that all data blocks are
written before this fact is recorded in a log file, then
"write data blocks" "fsync" "write log" "fsync" doesn't break
that assumption, where "write data blocks" (no fsync) "write log"
might, as the operating system's free to write the "write log"
blocks to disk before any of the data blocks are (though an
LRU algorithm most likely wouldn't).  You could end up in a
case where the log records a successful write of data, without
any data actually being on disk.
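
The ordered sequence can be sketched with ordinary files. This is a minimal illustration, not how the backend is written: write_and_sync and commit are hypothetical helpers, and the file layout (one data file, one log file with a text line per transaction) is an assumption made for the example.

```python
import os


def write_and_sync(path, payload):
    """Append payload to path and return only after fsync has completed."""
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()             # push Python's buffer down to the OS
        os.fsync(f.fileno())  # ask the OS to push its cache down to the disk


def commit(data_path, log_path, rows, xid):
    # "write data blocks" "fsync": the data is forced out first.
    write_and_sync(data_path, b"".join(rows))
    # "write log" "fsync": only now is the transaction recorded as done.
    write_and_sync(log_path, f"{xid} committed\n".encode())
```

Dropping the os.fsync call gives the "-F" behaviour: both writes sit in the OS cache and nothing in the program decides which reaches the disk first.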

I don't know how postgres works internally.  So my question is
really "are any such assumptions broken by the use of -F, and
does breaking such assumptions lead to a more serious form
of failure if there's a crash?"

I agree that the risks of running -F are low with reliable
hardware and a UPS.  I'm just trying to get a handle on just
what a user might be facing in terms of corruption compared
to a crash with fsync'ing enabled.  I can live with "the
database might well become corrupted and you'll have to
reload your latest dump".

My current plan is to implement a set of queries that do
fairly detailed consistency checks on my database every
night, before doing the nightly dump and copy to a second
machine, as well as each time I restart the web server
(typically only after crashes).  In this way I'll know
quickly if any harm's been done after a crash, I'll have
some assurance the database is in good shape before dumps
(my code, not just the backend, might have bugs!), etc.
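
A rough sketch of what such a nightly check might look like follows. Everything in it is an assumption for illustration: the users and orders tables are hypothetical, and Python's sqlite3 module stands in for whatever client library talks to the real database. Each check is a query that should return no rows when the data is healthy.

```python
import sqlite3

# Hypothetical checks: each query should return zero rows on a good database.
CHECKS = {
    "orders without a user": """
        SELECT o.id FROM orders o
        LEFT JOIN users u ON u.id = o.user_id
        WHERE u.id IS NULL
    """,
    "duplicate user emails": """
        SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1
    """,
}


def run_checks(conn):
    """Return {check name: offending rows} for every check that fails."""
    failures = {}
    for name, sql in CHECKS.items():
        rows = conn.execute(sql).fetchall()
        if rows:
            failures[name] = rows
    return failures
```

The dump and copy would then run only when run_checks returns an empty dict.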



Re: [HACKERS] The dangers of "-F"

From
Bruce Momjian
Date:
> I realize that writing buffers to a file and making sure they're
> on disk are two different issues.  My point is that without the
> fsync, the backend loses control over the order in which blocks
> are written to the disk.

Yes, that is the problem.  One solution is to fsync all modified file
descriptors every ~30 seconds, then write and fsync pg_log, so you
only do an fsync every 30 seconds.  Of course you have to make sure
pg_log doesn't get put on disk until after all the file descriptors are
fsync'ed.  Of course, you have a 30-second window of loss, but most file
systems do this every 30 seconds, so it is no less reliable than that.
(Well, most OS's sync on file close, so you could say the file system
has less loss.)  Anyway, this is how most commercial db's do it.  (One
easy way to do it would be to issue a "sync" every 30 seconds to flush
the whole OS, but that seems a little extreme.)
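
As a sketch only, the batching idea might look like this. GroupFlusher and its methods are hypothetical names, not anything in the Postgres source, and the example uses plain files with a text log line per transaction. The point it illustrates is the ordering: all modified data files are fsync'ed first, and the commit records are written and fsync'ed afterwards.

```python
import os
import time


class GroupFlusher:
    """Hypothetical batching of commits; not taken from Postgres."""

    def __init__(self, log_path, interval=30.0, clock=time.monotonic):
        self.log_path = log_path
        self.interval = interval
        self.clock = clock
        self.dirty = {}    # path -> open file with unsynced writes
        self.pending = []  # transaction ids not yet recorded as committed
        self.last_flush = clock()

    def write(self, path, payload):
        f = self.dirty.get(path)
        if f is None:
            f = self.dirty[path] = open(path, "ab")
        f.write(payload)

    def commit(self, xid):
        # The transaction is not durable until the next flush.
        self.pending.append(xid)
        if self.clock() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        # 1. Force every modified data file to disk.
        for f in self.dirty.values():
            f.flush()
            os.fsync(f.fileno())
            f.close()
        self.dirty.clear()
        # 2. Only then write and fsync the commit records.
        if self.pending:
            with open(self.log_path, "ab") as log:
                for xid in self.pending:
                    log.write(f"{xid} committed\n".encode())
                log.flush()
                os.fsync(log.fileno())
            self.pending.clear()
        self.last_flush = self.clock()
```

A crash between flushes loses up to one interval of transactions, but the log never records a commit whose data files were not synced first.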


> For instance, if there are assumptions that all data blocks are
> written before this fact is recorded in a log file, then
> "write data blocks" "fsync" "write log" "fsync" doesn't break
> that assumption, where "write data blocks" (no fsync) "write log"
> might, as the operating system's free to write the "write log"
> blocks to disk before any of the data blocks are (though an
> LRU algorithm most likely wouldn't).  You could end up in a
> case where the log records a successful write of data, without
> any data actually being on disk.
> 
> I don't know how postgres works internally.  So my question is
> really "are any such assumptions broken by the use of -F, and
> does breaking such assumptions lead to a more serious form
> of failure if there's a crash?"

It is possible in an OS crash because we don't have any info about what
order stuff is written to disk with -F.

> I agree that the risks of running -F are low with reliable
> hardware and a UPS.  I'm just trying to get a handle on just
> what a user might be facing in terms of corruption compared
> to a crash with fsync'ing enabled.  I can live with "the
> database might well become corrupted and you'll have to
> reload your latest dump".
> 
> My current plan is to implement a set of queries that do
> fairly detailed consistency checks on my database every
> night, before doing the nightly dump and copy to a second
> machine, as well as each time I restart the web server
> (typically only after crashes).  In this way I'll know
> quickly if any harm's been done after a crash, I'll have
> some assurance the database is in good shape before dumps
> (my code, not just the backend, might have bugs!), etc.
> 

Sounds like a good plan.

 


Re: [HACKERS] The dangers of "-F"

From
Don Baccus
Date:
At 02:29 PM 6/23/99 -0400, Bruce Momjian wrote:

>> I don't know how postgres works internally.  So my question is
>> really "are any such assumptions broken by the use of -F, and
>> does breaking such assumptions lead to a more serious form
>> of failure if there's a crash?"

>It is possible in an OS crash because we don't have any info about what
>order stuff is written to disk with -F.

OK.  This answers my question, thanks.


