
Mount options for Ext3?

From: Josh Berkus
Folks,

What mount options do people use for Ext3?  In particular, what do you set
"data=" to for a high-transaction database?  I'm used to ReiserFS ("noatime,
notail") and am not really sure where to go with Ext3.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Mount options for Ext3?

From: Kevin Brown
Josh Berkus wrote:
> Folks,
>
> What mount options do people use for Ext3?  In particular, what do you set
> "data=" to for a high-transaction database?  I'm used to ReiserFS ("noatime,
> notail") and am not really sure where to go with Ext3.

For ReiserFS, I can certainly understand using "noatime", but I'm not
sure why you use "notail" except to allow LILO to operate properly on
it.

The default for ext3 is to do ordered writes: data is written before
the associated metadata transaction commits, but the data itself isn't
journalled.  But because PostgreSQL synchronously writes the
transaction log (using fsync() by default, if I'm not mistaken) and
uses sync() during a savepoint, I would think that ordered writes at
the filesystem level would probably buy you very little in the way of
additional data integrity in the event of a crash.

So if I'm right about that, then you might consider using the
"data=writeback" option for the filesystem that contains the actual
data (usually /usr/local/pgsql/data), but I'd use the default
("data=ordered") at the very least (I suppose there's no harm in using
"data=journal" if you're willing to put up with the performance hit,
but it's not clear to me what benefit, if any, there is) for
everything else.
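
For concreteness, that split might look like the following /etc/fstab entries (device names and mount points are examples only; adjust to your own layout):

```
# hypothetical /etc/fstab entries -- devices and mount points are examples
/dev/sda2   /usr/local/pgsql/data   ext3   noatime,data=writeback   1 2
/dev/sda3   /home                   ext3   noatime,data=ordered     1 2
```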


I use ReiserFS also, so I'm basing the above on what knowledge I have
of the ext3 filesystem and the way PostgreSQL writes data.


The more interesting question in my mind is: if you use PostgreSQL on
an ext3 filesystem with "data=ordered" or "data=journal", can you get
away with turning off PostgreSQL's fsync altogether and still get the
same kind of data integrity that you'd get with fsync enabled?  If the
operating system is able to guarantee data integrity, is it still
necessary to worry about it at the database level?

I suspect the answer to that is that you can safely turn off fsync
only if the operating system will guarantee that write transactions
from a process are actually committed in the order they arrive from
that process.  Otherwise you'd have to worry about write transactions
to the transaction log committing before the writes to the data files
during a savepoint, which would leave the overall database in an
inconsistent state if the system were to crash after the transaction
log write (which marks the savepoint as completed) committed but
before the data file writes committed.  And my suspicion is that the
operating system rarely makes any such guarantee, journalled
filesystem or not.



--
Kevin Brown                          kevin@sysexperts.com

Re: Mount options for Ext3?

From: Tom Lane
Kevin Brown <kevin@sysexperts.com> writes:
> I suspect the answer to that is that you can safely turn off fsync
> only if the operating system will guarantee that write transactions
> from a process are actually committed in the order they arrive from
> that process.

Yeah.  We use fsync partly so that when we tell a client a transaction
is committed, it really is committed (ie, down to disk) --- but also
as a means of controlling write order.  I strongly doubt that any modern
filesystem will promise to execute writes exactly in the order issued,
unless prodded by means such as fsync.
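
Put concretely (a minimal Python sketch; the file names and record formats are invented for illustration): the fsync() between the log write and the data writes is what enforces the ordering.

```python
import os

def commit_with_barrier(log_path, data_path):
    """Write a commit record, force it to disk, then write the data.

    The fsync() is the ordering barrier: the kernel may reorder
    buffered writes freely, but it cannot report fsync() success
    until the commit record is physically on disk.
    """
    log_fd = os.open(log_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        os.write(log_fd, b"COMMIT txn 42\n")
        os.fsync(log_fd)  # barrier: commit record is durable before data is touched
    finally:
        os.close(log_fd)

    data_fd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(data_fd, b"row data for txn 42\n")
    finally:
        os.close(data_fd)
```

Without the fsync() there is no guarantee about which write reaches the platter first, journalled filesystem or not.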

> Otherwise you'd have to worry about write transactions
> to the transaction log committing before the writes to the data files
> during a savepoint,

Actually, the other way around is the problem.  The WAL algorithm works
so long as log writes hit disk before the data-file changes they
describe (that's why it's called write *ahead* log).

            regards, tom lane

Re: Mount options for Ext3?

From: Kevin Brown
Tom Lane wrote:
> > Otherwise you'd have to worry about write transactions
> > to the transaction log committing before the writes to the data files
> > during a savepoint,
>
> Actually, the other way around is the problem.  The WAL algorithm works
> so long as log writes hit disk before the data-file changes they
> describe (that's why it's called write *ahead* log).

Hmm...a case where the transaction data gets written to the files
before the transaction itself even manages to get written to the log?
True.  But I was thinking about the following:

I was presuming that when a savepoint occurs, a marker is written to
the log indicating which transactions had been committed to the data
files, and that this marker was paid attention to during database
startup.

So suppose the marker makes it to the log but not all of the data the
marker refers to makes it to the data files.  Then the system crashes.

When the database starts back up, the savepoint marker in the
transaction log shows that the transactions had already been committed
to disk.  But because the OS wrote the requested data (including the
savepoint marker) out of order, the savepoint marker made it to the
disk before some of the data made it to the data files.  And so, the
database is in an inconsistent state and it has no way to know about
it.

But then, I guess the easy way around the above problem is to always
commit all the transactions in the log to disk when the database comes
up, which renders the savepoint marker moot...and leads back to the
scenario you were referring to...

If the savepoint only commits the older transactions in the log (and
not all of them) to disk, the possibility of the situation you're
referring to would, I'd think, be reduced (possibly quite considerably).



...or is my understanding of how all this works completely off?




--
Kevin Brown                          kevin@sysexperts.com

Re: Mount options for Ext3?

From: Josh Berkus
Kevin,

> So if I'm right about that, then you might consider using the
> "data=writeback" option for the filesystem that contains the actual
> data (usually /usr/local/pgsql/data), but I'd use the default
> ("data=ordered") at the very least (I suppose there's no harm in using
> "data=journal" if you're willing to put up with the performance hit,
> but it's not clear to me what benefit, if any, there is) for
> everything else.

Well, the only reason I use Ext3 rather than Ext2 is to prevent fsck's on
restart after a crash.    So I'm interested in the data option that gives the
minimum performance hit, even if it means that I sacrifice some reliability.
I'm running with fsync on, and the DB is on a mirrored drive array, so I'm
not too worried about filesystem-level errors.

So would that be "data=writeback"?

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Mount options for Ext3?

From: Kevin Brown
Josh Berkus wrote:
> Well, the only reason I use Ext3 rather than Ext2 is to prevent fsck's on
> restart after a crash.    So I'm interested in the data option that gives the
> minimum performance hit, even if it means that I sacrifice some reliability.
> I'm running with fsync on, and the DB is on a mirrored drive array, so I'm
> not too worried about filesystem-level errors.
>
> So would that be "data=writeback"?

Yes.  That should give almost the same semantics as ext2 does by
default, except that metadata is journalled, so no fsck needed.  :-)

In fact, I believe that's exactly how ReiserFS works, if I'm not
mistaken (I saw someone claim that it does data journalling, but I've
never seen any references to how to get ReiserFS to journal data).


BTW, why exactly are you running ext3?  It has some nice journalling
features but it sounds like you don't want to use them.  But at the
same time, it uses pre-allocated inodes just like ext2 does, so it's
possible to run out of inodes on ext2/3 while AFAIK that's not
possible under ReiserFS.  That's not likely to be a problem unless
you're running a news server or something, though.  :-)

On the other hand, ext3 with data=writeback will probably be faster
than ReiserFS for a number of things.

No idea how stable ext3 is versus ReiserFS...



--
Kevin Brown                          kevin@sysexperts.com

Re: Mount options for Ext3?

From: Tom Lane
Kevin Brown <kevin@sysexperts.com> writes:
> I was presuming that when a savepoint occurs, a marker is written to
> the log indicating which transactions had been committed to the data
> files, and that this marker was paid attention to during database
> startup.

Not quite.  The marker says that all datafile updates described by
log entries before point X have been flushed to disk by the checkpoint
--- and, therefore, if we need to restart we need only replay log
entries occurring after the last checkpoint's point X.

This has nothing directly to do with which transactions are committed
or not committed.  If we based checkpoint behavior on that, we'd need
to maintain an indefinitely large amount of WAL log to cope with
long-running transactions.

The actual checkpoint algorithm is

    take note of current logical end of WAL (this will be point X)
    write() all dirty buffers in shared buffer arena
    sync() to ensure that above writes, as well as previous ones,
        are on disk
    put checkpoint record referencing point X into WAL; write and
        fsync WAL
    update pg_control with new checkpoint record, fsync it
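
Rendered as a toy Python sketch (the file layout and text-format records here are invented; the real implementation is C and binary, but the ordering of the write/sync/fsync calls is the point):

```python
import os

def checkpoint(wal_path, pg_control_path, dirty_buffers):
    """Sketch of the five checkpoint steps above.  dirty_buffers maps
    data-file paths to page contents.  Returns point X."""
    # 1. note the current logical end of WAL (this will be point X)
    point_x = os.path.getsize(wal_path) if os.path.exists(wal_path) else 0

    # 2. write() all dirty buffers in the shared buffer arena
    for path, page in dirty_buffers.items():
        with open(path, "wb") as f:
            f.write(page)

    # 3. sync() so those writes, and earlier ones, reach disk
    os.sync()

    # 4. append a checkpoint record referencing point X; fsync the WAL
    with open(wal_path, "ab") as wal:
        wal.write(b"CHECKPOINT at %d\n" % point_x)
        wal.flush()
        os.fsync(wal.fileno())

    # 5. update pg_control with the new checkpoint record and fsync it;
    #    the checkpoint commits the instant this write hits disk
    with open(pg_control_path, "wb") as ctl:
        ctl.write(b"last checkpoint: %d\n" % point_x)
        ctl.flush()
        os.fsync(ctl.fileno())
    return point_x
```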

Since pg_control is what's examined after restart, the checkpoint is
effectively committed when the pg_control write hits disk.  At any
instant before that, a crash would result in replaying from the
prior checkpoint's point X.  The algorithm is correct if and only if
the pg_control write hits disk after all the other writes mentioned.

The key assumption we are making about the filesystem's behavior is that
writes scheduled by the sync() will occur before the pg_control write
that's issued after it.  People have occasionally faulted this algorithm
by quoting the sync() man page, which saith (in the Gospel According To
HP)

     The writing, although scheduled, is not necessarily complete upon
     return from sync.

This, however, is not a problem in itself.  What we need to know is
whether the filesystem will allow writes issued after the sync() to
complete before those "scheduled" by the sync().


> So suppose the marker makes it to the log but not all of the data the
> marker refers to makes it to the data files.  Then the system crashes.

I think that this analysis is not relevant to what we're doing.

            regards, tom lane

Re: Mount options for Ext3?

From: Kevin Brown
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > I was presuming that when a savepoint occurs, a marker is written to
> > the log indicating which transactions had been committed to the data
> > files, and that this marker was paid attention to during database
> > startup.
>
> Not quite.  The marker says that all datafile updates described by
> log entries before point X have been flushed to disk by the checkpoint
> --- and, therefore, if we need to restart we need only replay log
> entries occurring after the last checkpoint's point X.
>
> This has nothing directly to do with which transactions are committed
> or not committed.  If we based checkpoint behavior on that, we'd need
> to maintain an indefinitely large amount of WAL log to cope with
> long-running transactions.

Ah.  My apologies for my imprecise wording.  I should have said
"...indicating which transactions had been written to the data files"
instead of "...had been committed to the data files", and meant to say
"checkpoint" but instead said "savepoint".  I'll try to do better
here.

> The actual checkpoint algorithm is
>
>     take note of current logical end of WAL (this will be point X)
>     write() all dirty buffers in shared buffer arena
>     sync() to ensure that above writes, as well as previous ones,
>         are on disk
>     put checkpoint record referencing point X into WAL; write and
>         fsync WAL
>     update pg_control with new checkpoint record, fsync it
>
> Since pg_control is what's examined after restart, the checkpoint is
> effectively committed when the pg_control write hits disk.  At any
> instant before that, a crash would result in replaying from the
> prior checkpoint's point X.  The algorithm is correct if and only if
> the pg_control write hits disk after all the other writes mentioned.

[...]

> > So suppose the marker makes it to the log but not all of the data the
> > marker refers to makes it to the data files.  Then the system crashes.
>
> I think that this analysis is not relevant to what we're doing.

Agreed.  The context of that analysis is when synchronous writes by
the database are turned off and one is left to rely on the operating
system to do the right thing.  Clearly it doesn't apply when
synchronous writes are enabled.  As long as only one process handles a
checkpoint, an operating system that guarantees that a process' writes
are committed to disk in the same order that they were requested,
combined with a journalling filesystem that at least wrote all data
prior to committing the associated metadata transactions, would be
sufficient to guarantee the integrity of the database even if all
synchronous writes by the database were turned off.  This would hold
even if the operating system reordered writes from multiple processes.
It suggests an operating system feature that could be considered
highly desirable (and relates to the discussion elsewhere about
trading off shared buffers against OS file cache: it's often better to
rely on the abilities of the OS rather than roll your own mechanism).

One question I have is: in the event of a crash, why not simply replay
all the transactions found in the WAL?  Is the startup time of the
database that badly affected if pg_control is ignored?

If there exists somewhere a reasonably succinct description of the
reasoning behind the current transaction management scheme (including
an analysis of the pros and cons), I'd love to read it and quit
bugging you.  :-)


--
Kevin Brown                          kevin@sysexperts.com

WAL replay logic (was Re: Mount options for Ext3?)

From: Tom Lane
Kevin Brown <kevin@sysexperts.com> writes:
> One question I have is: in the event of a crash, why not simply replay
> all the transactions found in the WAL?  Is the startup time of the
> database that badly affected if pg_control is ignored?

Interesting thought, indeed.  Since we truncate the WAL after each
checkpoint, seems like this approach would no more than double the time
for restart.  The win is it'd eliminate pg_control as a single point of
failure.  It's always bothered me that we have to update pg_control on
every checkpoint --- it should be a write-pretty-darn-seldom file,
considering how critical it is.

I think we'd have to make some changes in the code for deleting old
WAL segments --- right now it's not careful to delete them in order.
But surely that can be coped with.

OTOH, this might just move the locus for fatal failures out of
pg_control and into the OS' algorithms for writing directory updates.
We would have no cross-check that the set of WAL file names visible in
pg_xlog is sensible or aligned with the true state of the datafile area.
We'd have to take it on faith that we should replay the visible files
in their name order.  This might mean we'd have to abandon the current
hack of recycling xlog segments by renaming them --- which would be a
nontrivial performance hit.

Comments anyone?

> If there exists somewhere a reasonably succinct description of the
> reasoning behind the current transaction management scheme (including
> an analysis of the pros and cons), I'd love to read it and quit
> bugging you.  :-)

Not that I know of.  Would you care to prepare such a writeup?  There
is a lot of material in the source-code comments, but no coherent
presentation.

            regards, tom lane

Re: WAL replay logic (was Re: Mount options for Ext3?)

From: Kevin Brown
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > One question I have is: in the event of a crash, why not simply replay
> > all the transactions found in the WAL?  Is the startup time of the
> > database that badly affected if pg_control is ignored?
>
> Interesting thought, indeed.  Since we truncate the WAL after each
> checkpoint, seems like this approach would no more than double the time
> for restart.

Hmm...truncating the WAL after each checkpoint minimizes the amount of
disk space eaten by the WAL, but on the other hand keeping older
segments around buys you some safety in the event that things get
really hosed.  But your later comments make it sound like the older
WAL segments are kept around anyway, just rotated.

> The win is it'd eliminate pg_control as a single point of
> failure.  It's always bothered me that we have to update pg_control on
> every checkpoint --- it should be a write-pretty-darn-seldom file,
> considering how critical it is.
>
> I think we'd have to make some changes in the code for deleting old
> WAL segments --- right now it's not careful to delete them in order.
> But surely that can be coped with.

Even that might not be necessary.  See below.

> OTOH, this might just move the locus for fatal failures out of
> pg_control and into the OS' algorithms for writing directory updates.
> We would have no cross-check that the set of WAL file names visible in
> pg_xlog is sensible or aligned with the true state of the datafile
> area.

Well, what we somehow need to guarantee is that there is always WAL
data that is older than the newest consistent data in the datafile
area, right?  Meaning that if the datafile area gets scribbled on in
an inconsistent manner, you always have WAL data to fill in the gaps.

Right now we do that by using fsync() and sync().  But I think it
would be highly desirable to be able to more or less guarantee
database consistency even if fsync were turned off.  The price for
that might be too high, though.

> We'd have to take it on faith that we should replay the visible files
> in their name order.  This might mean we'd have to abandon the current
> hack of recycling xlog segments by renaming them --- which would be a
> nontrivial performance hit.

It's probably a bad idea for the replay to be based on the filenames.
Instead, it should probably be based strictly on the contents of the
xlog segment files.  Seems to me the beginning of each segment file
should have some kind of header information that makes it clear where
in the scheme of things it belongs.  Additionally, writing some sort
of checksum, either at the beginning or the end, might not be a bad
idea either (doesn't have to be a strict checksum, but it needs to be
something that's reasonably likely to catch corruption within a
segment).

Do that, and you don't have to worry about renaming xlog segments at
all: you simply move on to the next logical segment in the list (a
replay just reads the header info for all the segments and orders the
list as it sees fit, and discards all segments prior to any gap it
finds.  It may be that you simply have to bail out if you find a gap,
though).  As long as the xlog segment checksum information is
consistent with the contents of the segment and as long as its
transactions pick up where the previous segment's left off (assuming
it's not the first segment, of course), you can safely replay the
transactions it contains.
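
A toy version of that header-plus-checksum scheme (the header format here is entirely invented, and real xlog segments look nothing like this; it only illustrates ordering by content rather than by filename, plus gap detection):

```python
import hashlib
import json
import os

def write_segment(path, seqno, payload):
    """Write a segment whose header records its sequence number and a
    checksum of its contents (hypothetical format)."""
    digest = hashlib.sha256(payload).hexdigest()
    header = json.dumps({"seqno": seqno, "sha256": digest}).encode() + b"\n"
    with open(path, "wb") as f:
        f.write(header + payload)

def segments_to_replay(directory):
    """Order segments by the seqno in their headers, skip any whose
    checksum doesn't match, and stop at the first gap (one possible
    policy; the text above mentions others)."""
    headers = []
    for name in os.listdir(directory):
        with open(os.path.join(directory, name), "rb") as f:
            header_line = f.readline()
            payload = f.read()
        meta = json.loads(header_line)
        if hashlib.sha256(payload).hexdigest() != meta["sha256"]:
            continue  # corrupt segment: ignore it
        headers.append((meta["seqno"], name))
    headers.sort()
    replay = []
    for seqno, name in headers:
        if replay and seqno != replay[-1][0] + 1:
            break  # gap found: nothing after it can be replayed safely
        replay.append((seqno, name))
    return [name for _, name in replay]
```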

I presume we're recycling xlog segments in order to avoid file
creation and unlink overhead?  Otherwise you can simply create new
segments as needed and unlink old segments as policy dictates.

> Comments anyone?
>
> > If there exists somewhere a reasonably succinct description of the
> > reasoning behind the current transaction management scheme (including
> > an analysis of the pros and cons), I'd love to read it and quit
> > bugging you.  :-)
>
> Not that I know of.  Would you care to prepare such a writeup?  There
> is a lot of material in the source-code comments, but no coherent
> presentation.

Be happy to.  Just point me to any non-obvious source files.

Thus far on my plate:

    1.  PID file locking for postmaster startup (doesn't strictly need
    to be the PID file but it may as well be, since we're already
    messing with it anyway).  I'm currently looking at how to do
    the autoconf tests, since I've never developed using autoconf
    before.

    2.  Documenting the transaction management scheme.

I was initially interested in implementing the explicit JOIN
reordering but based on your recent comments I think you have a much
better handle on that than I.  I'll be very interested to see what you
do, to see if it's anything close to what I figure has to happen...


--
Kevin Brown                          kevin@sysexperts.com

Re: Mount options for Ext3?

From: pgsql.spam@vinz.nl
On 2003-01-24 21:58:55 -0500, Tom Lane wrote:
> The key assumption we are making about the filesystem's behavior is that
> writes scheduled by the sync() will occur before the pg_control write
> that's issued after it.  People have occasionally faulted this algorithm
> by quoting the sync() man page, which saith (in the Gospel According To
> HP)
>
>      The writing, although scheduled, is not necessarily complete upon
>      return from sync.
>
> This, however, is not a problem in itself.  What we need to know is
> whether the filesystem will allow writes issued after the sync() to
> complete before those "scheduled" by the sync().
>

Certain Linux 2.4.* kernels (not sure which, newer ones don't seem to have
it) have the following kernel config option:

Use the NOOP Elevator (WARNING)
CONFIG_BLK_DEV_ELEVATOR_NOOP
  If you are using a raid class top-level driver above the ATA/IDE core,
  one may find a performance boost by preventing a merging and re-sorting
  of the new requests.

  If unsure, say N.

If one were certain his OS wouldn't do any re-ordering of writes, would it be
safe to run with fsync = off? (not that I'm going to try this, but I'm just
curious)


Vincent van Leeuwen
Media Design

Re: Mount options for Ext3?

From: Tom Lane
pgsql.spam@vinz.nl writes:
> If one were certain his OS wouldn't do any re-ordering of writes, would it be
> safe to run with fsync = off? (not that I'm going to try this, but I'm just
> curious)

I suppose so ... but if your OS doesn't do *any* re-ordering of writes,
I'd say you need a better OS.  Even in Postgres, we'd often like the OS
to collapse multiple writes of the same disk page into one write.  And
we certainly want the various writes forced by a sync() to be done with
some intelligence about disk layout, not blindly in order of issuance.

            regards, tom lane

Re: Mount options for Ext3?

From: Ron Johnson
On Sat, 2003-01-25 at 23:34, Tom Lane wrote:
> pgsql.spam@vinz.nl writes:
> > If one were certain his OS wouldn't do any re-ordering of writes, would it be
> > safe to run with fsync = off? (not that I'm going to try this, but I'm just
> > curious)
>
> I suppose so ... but if your OS doesn't do *any* re-ordering of writes,
> I'd say you need a better OS.  Even in Postgres, we'd often like the OS
> to collapse multiple writes of the same disk page into one write.  And
> we certainly want the various writes forced by a sync() to be done with
> some intelligence about disk layout, not blindly in order of issuance.

And anyway, wouldn't SCSI's Tagged Command Queueing override it all,
no matter if the OS did re-ordering or not?

But then, the drive really means it when it reports that fsync() has
completed, so does TCQ matter in this case?

--
+---------------------------------------------------------------+
| Ron Johnson, Jr.        mailto:ron.l.johnson@cox.net          |
| Jefferson, LA  USA      http://members.cox.net/ron.l.johnson  |
|                                                               |
| "Fear the Penguin!!"                                          |
+---------------------------------------------------------------+


Re: Mount options for Ext3?

From: Josh Berkus
Kevin,

> BTW, why exactly are you running ext3?  It has some nice journalling
> features but it sounds like you don't want to use them.

Because our RAID array controller, an Adaptec 2200S, is only compatible with
RedHat 8.0, without some fancy device driver hacking.  It certainly wasn't my
first choice, I've been using Reiser for 4 years and am very happy with it.

Warning to anyone following this thread:  The web site info for the 2200S says
"Redhat and SuSE", but drivers are only available for RedHat.   Adaptec's
Linux guru, Brian, has been unable to get the web site maintainers to correct
the information on the site.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Mount options for Ext3?

From: Ron Johnson
On Mon, 2003-01-27 at 13:23, Josh Berkus wrote:
> Kevin,
>
> > BTW, why exactly are you running ext3?  It has some nice journalling
> > features but it sounds like you don't want to use them.
>
> Because our RAID array controller, an Adaptec 2200S, is only compatible with
> RedHat 8.0, without some fancy device driver hacking.  It certainly wasn't my

Binary-only, or OSS and just tuned to their kernels?

--
+---------------------------------------------------------------+
| Ron Johnson, Jr.        mailto:ron.l.johnson@cox.net          |
| Jefferson, LA  USA      http://members.cox.net/ron.l.johnson  |
|                                                               |
| "Fear the Penguin!!"                                          |
+---------------------------------------------------------------+


Re: Mount options for Ext3?

From: Bruce Momjian
Let me add that I have heard that on Linux XFS is better for PostgreSQL
than either ext3 or Reiser.

---------------------------------------------------------------------------

Kevin Brown wrote:
> Josh Berkus wrote:
> > Well, the only reason I use Ext3 rather than Ext2 is to prevent fsck's on
> > restart after a crash.    So I'm interested in the data option that gives the
> > minimum performance hit, even if it means that I sacrifice some reliability.
> > I'm running with fsync on, and the DB is on a mirrored drive array, so I'm
> > not too worried about filesystem-level errors.
> >
> > So would that be "data=writeback"?
>
> Yes.  That should give almost the same semantics as ext2 does by
> default, except that metadata is journalled, so no fsck needed.  :-)
>
> In fact, I believe that's exactly how ReiserFS works, if I'm not
> mistaken (I saw someone claim that it does data journalling, but I've
> never seen any references to how to get ReiserFS to journal data).
>
>
> BTW, why exactly are you running ext3?  It has some nice journalling
> features but it sounds like you don't want to use them.  But at the
> same time, it uses pre-allocated inodes just like ext2 does, so it's
> possible to run out of inodes on ext2/3 while AFAIK that's not
> possible under ReiserFS.  That's not likely to be a problem unless
> you're running a news server or something, though.  :-)
>
> On the other hand, ext3 with data=writeback will probably be faster
> than ReiserFS for a number of things.
>
> No idea how stable ext3 is versus ReiserFS...
>
>
>
> --
> Kevin Brown                          kevin@sysexperts.com
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: WAL replay logic (was Re: Mount options for Ext3?)

From: Bruce Momjian
Is there a TODO here?  I like the idea of not writing pg_control, or
at least allowing it not to be read, perhaps with a pg_resetxlog flag so
we can cleanly recover from a corrupt pg_control if the WAL files
are OK.

We don't want to get rid of the WAL file rename optimization because
those are 16mb files and keeping them from checkpoint to checkpoint is
probably a win.  I also like the idea of allowing something between our
"at the instant" recovery, and no recovery with fsync off.  A "recover
from last checkpoint time" option would be really valuable for some.

---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > Kevin Brown <kevin@sysexperts.com> writes:
> > > One question I have is: in the event of a crash, why not simply replay
> > > all the transactions found in the WAL?  Is the startup time of the
> > > database that badly affected if pg_control is ignored?
> >
> > Interesting thought, indeed.  Since we truncate the WAL after each
> > checkpoint, seems like this approach would no more than double the time
> > for restart.
>
> Hmm...truncating the WAL after each checkpoint minimizes the amount of
> disk space eaten by the WAL, but on the other hand keeping older
> segments around buys you some safety in the event that things get
> really hosed.  But your later comments make it sound like the older
> WAL segments are kept around anyway, just rotated.
>
> > The win is it'd eliminate pg_control as a single point of
> > failure.  It's always bothered me that we have to update pg_control on
> > every checkpoint --- it should be a write-pretty-darn-seldom file,
> > considering how critical it is.
> >
> > I think we'd have to make some changes in the code for deleting old
> > WAL segments --- right now it's not careful to delete them in order.
> > But surely that can be coped with.
>
> Even that might not be necessary.  See below.
>
> > OTOH, this might just move the locus for fatal failures out of
> > pg_control and into the OS' algorithms for writing directory updates.
> > We would have no cross-check that the set of WAL file names visible in
> > pg_xlog is sensible or aligned with the true state of the datafile
> > area.
>
> Well, what we somehow need to guarantee is that there is always WAL
> data that is older than the newest consistent data in the datafile
> area, right?  Meaning that if the datafile area gets scribbled on in
> an inconsistent manner, you always have WAL data to fill in the gaps.
>
> Right now we do that by using fsync() and sync().  But I think it
> would be highly desirable to be able to more or less guarantee
> database consistency even if fsync were turned off.  The price for
> that might be too high, though.
>
> > We'd have to take it on faith that we should replay the visible files
> > in their name order.  This might mean we'd have to abandon the current
> > hack of recycling xlog segments by renaming them --- which would be a
> > nontrivial performance hit.
>
> It's probably a bad idea for the replay to be based on the filenames.
> Instead, it should probably be based strictly on the contents of the
> xlog segment files.  Seems to me the beginning of each segment file
> should have some kind of header information that makes it clear where
> in the scheme of things it belongs.  Additionally, writing some sort
> of checksum, either at the beginning or the end, might not be a bad
> idea either (doesn't have to be a strict checksum, but it needs to be
> something that's reasonably likely to catch corruption within a
> segment).
>
> Do that, and you don't have to worry about renaming xlog segments at
> all: you simply move on to the next logical segment in the list (a
> replay just reads the header info for all the segments and orders the
> list as it sees fit, and discards all segments prior to any gap it
> finds.  It may be that you simply have to bail out if you find a gap,
> though).  As long as the xlog segment checksum information is
> consistent with the contents of the segment and as long as its
> transactions pick up where the previous segment's left off (assuming
> it's not the first segment, of course), you can safely replay the
> transactions it contains.
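
[Editor's note: the content-based ordering described above can be sketched roughly as follows. This is a hypothetical illustration, not PostgreSQL's actual on-disk format -- the header layout, `MAGIC` value, and use of CRC32 are all invented for the sketch.]

```python
import struct
import zlib

# Hypothetical segment header: magic tag, starting sequence number,
# CRC32 of the payload that follows the header.
HEADER_FMT = "<4sQI"
HEADER_LEN = struct.calcsize(HEADER_FMT)
MAGIC = b"XLOG"

def read_header(blob):
    """Parse the header and verify the payload checksum.
    Returns the segment's sequence number, or None if the file is
    too short, foreign, or fails its checksum."""
    if len(blob) < HEADER_LEN:
        return None
    magic, seqno, crc = struct.unpack_from(HEADER_FMT, blob)
    if magic != MAGIC or zlib.crc32(blob[HEADER_LEN:]) != crc:
        return None
    return seqno

def replayable_run(blobs):
    """Order segments purely by their contents (never by filename) and
    return the contiguous run ending at the newest segment.  Anything
    before a gap is discarded, since replay cannot safely jump over
    missing transactions."""
    seqs = sorted(s for s in map(read_header, blobs) if s is not None)
    run = []
    for s in seqs:
        if run and s != run[-1] + 1:
            run = []          # gap found: older segments are unusable
        run.append(s)
    return run
```

Corrupt or truncated files simply drop out at the `read_header` stage, which is the "reasonably likely to catch corruption" property the checksum is there for.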
>
> I presume we're recycling xlog segments in order to avoid file
> creation and unlink overhead?  Otherwise you can simply create new
> segments as needed and unlink old segments as policy dictates.
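
[Editor's note: the create/unlink cost being guessed at here is real -- recycling turns segment turnover into a single directory rename, while a fresh segment means allocating a whole 16 MB file again. A hypothetical sketch of the two policies, with invented names and layout:]

```python
import os

SEG_SIZE = 16 * 1024 * 1024   # PostgreSQL uses 16 MB xlog segments

def provide_segment(xlog_dir, retired, new_name):
    """Supply the next xlog segment.  If a retired segment is available,
    rename it into place -- a cheap, metadata-only operation.  Otherwise
    build a new preallocated file, paying the full allocation cost."""
    if retired:
        os.rename(os.path.join(xlog_dir, retired.pop()),
                  os.path.join(xlog_dir, new_name))
        return "recycled"
    with open(os.path.join(xlog_dir, new_name), "wb") as f:
        # Real code would write zeros so blocks are truly allocated;
        # truncate() merely reserves the size and is enough for a sketch.
        f.truncate(SEG_SIZE)
    return "created"
```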
>
> > Comments anyone?
> >
> > > If there exists somewhere a reasonably succinct description of the
> > > reasoning behind the current transaction management scheme (including
> > > an analysis of the pros and cons), I'd love to read it and quit
> > > bugging you.  :-)
> >
> > Not that I know of.  Would you care to prepare such a writeup?  There
> > is a lot of material in the source-code comments, but no coherent
> > presentation.
>
> Be happy to.  Just point me to any non-obvious source files.
>
> Thus far on my plate:
>
>     1.  PID file locking for postmaster startup (doesn't strictly need
>     to be the PID file but it may as well be, since we're already
>     messing with it anyway).  I'm currently looking at how to do
>     the autoconf tests, since I've never developed using autoconf
>     before.
>
>     2.  Documenting the transaction management scheme.
>
> I was initially interested in implementing the explicit JOIN
> reordering but based on your recent comments I think you have a much
> better handle on that than I.  I'll be very interested to see what you
> do, to see if it's anything close to what I figure has to happen...
>
>
> --
> Kevin Brown                          kevin@sysexperts.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: WAL replay logic (was Re: Mount options for Ext3?)

From
Bruce Momjian
Date:
Is there a TODO here, like "Allow recovery from corrupt pg_control via
WAL"?

---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > Kevin Brown <kevin@sysexperts.com> writes:
> > > One question I have is: in the event of a crash, why not simply replay
> > > all the transactions found in the WAL?  Is the startup time of the
> > > database that badly affected if pg_control is ignored?
> >
> > Interesting thought, indeed.  Since we truncate the WAL after each
> > checkpoint, seems like this approach would no more than double the time
> > for restart.
>
> Hmm...truncating the WAL after each checkpoint minimizes the amount of
> disk space eaten by the WAL, but on the other hand keeping older
> segments around buys you some safety in the event that things get
> really hosed.  But your later comments make it sound like the older
> WAL segments are kept around anyway, just rotated.
>
> > The win is it'd eliminate pg_control as a single point of
> > failure.  It's always bothered me that we have to update pg_control on
> > every checkpoint --- it should be a write-pretty-darn-seldom file,
> > considering how critical it is.
> >
> > I think we'd have to make some changes in the code for deleting old
> > WAL segments --- right now it's not careful to delete them in order.
> > But surely that can be coped with.
>
> Even that might not be necessary.  See below.
>
> > OTOH, this might just move the locus for fatal failures out of
> > pg_control and into the OS' algorithms for writing directory updates.
> > We would have no cross-check that the set of WAL file names visible in
> > pg_xlog is sensible or aligned with the true state of the datafile
> > area.
>
> Well, what we somehow need to guarantee is that there is always WAL
> data that is older than the newest consistent data in the datafile
> area, right?  Meaning that if the datafile area gets scribbled on in
> an inconsistent manner, you always have WAL data to fill in the gaps.
>
> Right now we do that by using fsync() and sync().  But I think it
> would be highly desirable to be able to more or less guarantee
> database consistency even if fsync were turned off.  The price for
> that might be too high, though.
>
> > We'd have to take it on faith that we should replay the visible files
> > in their name order.  This might mean we'd have to abandon the current
> > hack of recycling xlog segments by renaming them --- which would be a
> > nontrivial performance hit.
>
> It's probably a bad idea for the replay to be based on the filenames.
> Instead, it should probably be based strictly on the contents of the
> xlog segment files.  Seems to me the beginning of each segment file
> should have some kind of header information that makes it clear where
> in the scheme of things it belongs.  Additionally, writing some sort
> of checksum, either at the beginning or the end, might not be a bad
> idea either (doesn't have to be a strict checksum, but it needs to be
> something that's reasonably likely to catch corruption within a
> segment).
>
> Do that, and you don't have to worry about renaming xlog segments at
> all: you simply move on to the next logical segment in the list (a
> replay just reads the header info for all the segments and orders the
> list as it sees fit, and discards all segments prior to any gap it
> finds.  It may be that you simply have to bail out if you find a gap,
> though).  As long as the xlog segment checksum information is
> consistent with the contents of the segment and as long as its
> transactions pick up where the previous segment's left off (assuming
> it's not the first segment, of course), you can safely replay the
> transactions it contains.
>
> I presume we're recycling xlog segments in order to avoid file
> creation and unlink overhead?  Otherwise you can simply create new
> segments as needed and unlink old segments as policy dictates.
>
> > Comments anyone?
> >
> > > If there exists somewhere a reasonably succinct description of the
> > > reasoning behind the current transaction management scheme (including
> > > an analysis of the pros and cons), I'd love to read it and quit
> > > bugging you.  :-)
> >
> > Not that I know of.  Would you care to prepare such a writeup?  There
> > is a lot of material in the source-code comments, but no coherent
> > presentation.
>
> Be happy to.  Just point me to any non-obvious source files.
>
> Thus far on my plate:
>
>     1.  PID file locking for postmaster startup (doesn't strictly need
>     to be the PID file but it may as well be, since we're already
>     messing with it anyway).  I'm currently looking at how to do
>     the autoconf tests, since I've never developed using autoconf
>     before.
>
>     2.  Documenting the transaction management scheme.
>
> I was initially interested in implementing the explicit JOIN
> reordering but based on your recent comments I think you have a much
> better handle on that than I.  I'll be very interested to see what you
> do, to see if it's anything close to what I figure has to happen...
>
>
> --
> Kevin Brown                          kevin@sysexperts.com
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: [HACKERS] WAL replay logic (was Re: Mount options for

From
Curt Sampson
Date:
On Fri, 14 Feb 2003, Bruce Momjian wrote:

> Is there a TODO here, like "Allow recovery from corrupt pg_control via
> WAL"?

Isn't that already in section 12.2.1 of the documentation?

     Using pg_control to get the checkpoint position speeds up the
     recovery process, but to handle possible corruption of pg_control,
     we should actually implement the reading of existing log segments
     in reverse order -- newest to oldest -- in order to find the last
     checkpoint. This has not been implemented, yet.
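
[Editor's note: the reverse scan the documentation describes could look something like this sketch. The record representation is invented for illustration; real WAL records are binary.]

```python
def find_last_checkpoint(segments):
    """segments: (seqno, records) pairs ordered oldest to newest, where
    records is the decoded list of WAL records in that segment.  Walk
    backwards from the newest record and return (seqno, index) of the
    most recent checkpoint, so recovery can locate its starting point
    without trusting pg_control."""
    for seqno, records in reversed(segments):
        for idx in range(len(records) - 1, -1, -1):
            if records[idx][0] == "CHECKPOINT":
                return seqno, idx
    return None   # no checkpoint anywhere: replay must start from scratch
```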

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC

Re: [HACKERS] WAL replay logic (was Re: Mount options for

From
Bruce Momjian
Date:
Added to TODO:

    * Allow WAL information to recover corrupted pg_controldata

---------------------------------------------------------------------------

Curt Sampson wrote:
> On Fri, 14 Feb 2003, Bruce Momjian wrote:
>
> > Is there a TODO here, like "Allow recovery from corrupt pg_control via
> > WAL"?
>
> Isn't that already in section 12.2.1 of the documentation?
>
>      Using pg_control to get the checkpoint position speeds up the
>      recovery process, but to handle possible corruption of pg_control,
>      we should actually implement the reading of existing log segments
>      in reverse order -- newest to oldest -- in order to find the last
>      checkpoint. This has not been implemented, yet.
>
> cjs
> --
> Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
>     Don't you know, in this new Dark Age, we're all light.  --XTC
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073