Home > mailing lists
Re: Issues Outstanding for Point In Time Recovery (PITR) - Mailing list pgsql-hackers

From	J. R. Nield
Subject	Re: Issues Outstanding for Point In Time Recovery (PITR)
Date	July 5, 2002 06:03:25
Msg-id	1025863303.11285.276.camel@localhost.localdomain Whole thread
In response to	Re: Issues Outstanding for Point In Time Recovery (PITR) (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses	Re: Issues Outstanding for Point In Time Recovery (PITR) Re: Issues Outstanding for Point In Time Recovery (PITR)
List	pgsql-hackers
Tree view
On Fri, 2002-07-05 at 01:42, Bruce Momjian wrote:
> 
> We have needed
> point-in-time recovery for a long time, 

Most thanks should go to vadim (and whoever else worked on this), since
his WAL code already does most of the work. The key thing is auditing
the backend to look for every case where we assume some action is not
visible until after commit, and therefore don't log its effects. Those
are the main cases that must be changed.


> ---------------------------------------------------------------------------
> 
> J. R. Nield wrote:
> > Hello:
> > 
> > I've got the logging system to the point where I can take a shutdown
> > consistent copy of a system, and play forward through multiple
> > checkpoints. It seems to handle CREATE TABLE/DROP TABLE/TRUNCATE

But notably not for the btree indexes! It looked like they were working,
because the files were there, and all indexes created before the backup
would work under insert/delete (including sys catalog indexes). This is
because btree insert/delete is logged, just not during build. So I
missed that one case.

You will end-up with up-to-date table data though, so it is something.

Adding logging support to btbuild is the next step, and I don't think it
should be too hard. I am working this now.

It is also a major advantage that most everything in the system gets
stored in the catalog tables, and so is logged already.


> Uh, we do log pre-page writes to WAL to recover from partial page
> writes to disk.  Is there something more we need here?
> 
> As for bad block detection, we have thought about adding a CRC to each
> page header, or at least making it optional.  WAL already has a CRC.
>

Yes this should be last to do, because it is not necessary for PITR,
only for performance (the option not to write pre-images without fear of
data loss). 
> Yes, there are a few places where we actually create a file, and if the
> server crashes, the file remains out there forever.  We need to track that
> better.  

OK, there is a bigger problem then just tracking the file though. We
sometimes do stuff to that file that we don't log. We assume that if we
commit, the file must be OK and will not need replay because the
transaction would not have committed if the file was not in a commitable
state. If we abort, the system never sees the file, so in a sense we
undo everything we did to the file. It is a kind of poor-man's rollback
for certain operations, like btbuild, create table, etc. But it means
that we cannot recover the file from the log, even after a commit.

> 
> > 
> >     ?1.1.1 - CREATE DATABASE is also unlogged
> > 
> >       This will cause the same replay problems as above.
> 
> Yep.  Again, seems a master cleanup on startup is needed.

The cleanup is not the problem, only a nuisance. Creating the files
during replay is the problem. I must recreate CREATE DATABASE from the
log exactly as it was done originally. I think just logging the
parameters to the command function should be sufficient, but I need to
think more about it.

> 
> >     b) If TRUNCATE TABLE fails, the system must PANIC. Otherwise, the table
> >     may be used in a future command, and a replay-recovered database may
> >     end-up with different data than the original.
> 
> We number based on oids.  You mean oid wraparound could cause the file
> to be used again?

That's not what I meant. Let's say I issue 'TRUNCATE TABLE foo'. Then,
right before smgrtruncate is called, I do an XLogInsert saying "Redo a
TRUNCATE TABLE on foo to nblocks if we crash". Then smgrtruncate fails
and we do an elog(ERROR)

Now the user decides that since TRUNCATE TABLE didn't work, he might as
well use the table, so he inserts some records into it, generating log
entries.

When I replay this log sequence later, what happens if the TRUNCATE
succeeds instead of failing?

I admit that there are other ways of handling it than to PANIC if the
truncate fails. All the ones I can come up with seem to amount to some
kind of ad-hoc UNDO log.


> 
> >     WAL must be flushed before truncate as well.
> > 
> >     WAL does not need to be flushed before create, if we don't mind 
> >     leaking files sometimes.
> 
> Cleanup?

Yes, we could garbage-collect leaked files. XLogFlush is not that
expensive though, so I don't have an opinion on this yet.

> 
> >     c) Redo code should treat writes to non-existent files as an error.
> >     Changes affect heap & nbtree AM's. [Check others]
> 
> Yep, once you log create/drop, if something doesn't match, it is an
> error, while before, we could ignore it.
> 
> >     d) rtree [and GiST? WTF is GiST? ] is not logged. A replay recovery of
> >       a database should mark all the rtree indices as corrupt.
> >       [ actually we should do that now, are we? ]
> 
> Known problem.  Not sure what is being done.  TODO has:
> 
>     * Add WAL index reliability improvement to non-btree indexes
> 
> so it is a known problem, and we aren't doing anything about it.  What
> more can I say?  ;-)

Once the other stuff works reliably, I will turn to rtree logging, which
I have looked at somewhat, although I could really use a copy of the
paper it is supposed to be based on. I have not figured out GiST enough
to work on it yet.

> 
> >     e) CREATE DATABASE must be logged properly, not use system(cp...)
> 
> OK, should be interesting.
> 
> >   ?1.3 - Status:
> > 
> >     All logged SMGR operations are now in a START_CRIT_SECTION()/
> >     END_CRIT_SECTION() pair enclosing the XLogInsert() and the underlying fs
> >     operations.
> > 
> >     Code has been added to smgr and xact modules to log:
> >         create (no XLogFlush)
> >         truncate (XLogFlush)
> >         pending deletes on commit record
> >         files to delete on abort record
> > 
> >     Code added to md.c to support redo ops
> > 
> >     Code added to smgr for RMGR redo/desc callbacks
> > 
> >     Code added to xact RMGR callbacks for redo/desc
> > 
> >     Database will do infinite shutdown consistent system recovery from the
> >     online logs, if you manually munge the control file to set state ==
> >     DB_IN_PRODUCTION instead of DB_SHUTDOWNED.
> 
> Wow, how did you get so far?

Because it was almost there to start with :-)

Besides, it sounded better before I realized there was still a remaining
problem with btree logging to fix. 


> >      In the case of LSNLast, we check to see if pd_lsn == the lsn in the
> >      last 64 bits of the page. If not, we assume the page is corrupt from
> >      a partial write (although it could be something else).
> 
> LSN?

Log Sequence Number (XLogRecPtr)

> 
> >      IMPORTANT ASSUMPTION:
> >         The OS/disk device will never write both the first part and
> >         last part of a block without writing the middle as well.
> >         This might be wrong in some cases, but at least it's fast.
> > 
> >     ?2.2.4 - GUC Variables
> > 
> >      The user should be able to configure what method is used:
> > 
> >        block_checking_write_method  = [ checksum | torn_page_flag | none ]
> > 
> >           Which method should be used for blocks we write?
> 
> Do we want torn page flag?  Seems like a pain to get that on every 512
> byte section of the 8k page.

Ok, this section (2.2) was badly written and hard to understand. What I
am proposing is that we put a copy of the log sequence number, which is
at the head of the page, into the 8 byte field that we are creating at
the end of the page, in place of the CRC. The log sequence number
increases every time the page is written (it already does this). I have
called this method 'LSNLast' internally, and the user would call it the
'torn_page_flag' method.

So when we read the page, we compare the Log Sequence Number at the
beginning and end of the page, and if they are different we assume a
torn page.

This version is weaker than the MS one we were talking about, because it
is not on every 512 byte section of the page, only the beginning and the
end. I'm simply looking for a fast alternative to CRC64, that doesn't
require massive reorganization of the page layout code.


> 
> >   ?2.3 - Status:
> > 
> >     Waiting for input from pgsql-hackers.
> > 
> >     Questions:
> > 
> >         Should we allow the user to have more detailed control over
> >         which parts of a database use block checking?
> 
> I don't think that is needed;  installation-wide settings are fine.
> 
> >         For example: use 'checksum' on all system catalogs in all databases, 
> >         'torn_page_flag' on the non-catalog parts of the production database,
> >         and 'none' on everything else?
> 
> Too complicated.  Let's get it implemented and in the field and see what
> people ask for.

Ok. I agree.

> 
> > ?3 - Detecting Shutdown Consistent System Recovery

> > 
> >     ?3.3 - Status:
> > 
> >        In progress.
> 
> Sorry, I was confused by this.

Let me re-write it, and I'll post it in the next version. The section
dealt with what to do when you have a valid restored controlfile from a
backup system, which is in the DB_SHUTDOWNED state, and that points to a
valid shutdown/checkpoint record in the log; only the checkpoint record
happens not to be the last one in the log. This is a situation that
could never happen now, but would in PITR.


> 
> > ?4 - Interactive Play-Forward Recovery for an Entire System
> > 
> >     Play-Forward File Recovery from a backup file must be interactive,
> >     because not all log files that we need are necessarily in the 
> >     archive directory. It may be possible that not all the archive files
> >     we need can even fit on disk at one time.
> > 
> >     The system needs to be able to prompt the system administrator to feed
> >     it more log files.
> > 
> >     TODO: More here
> 
> Yes, we can have someone working on the GUI once the command-line
> interface is defined.

Yes, and the system must not allow any concurrent activity during
recovery either. So it looks like a standalone backend operation.

> 
> > ?5 - Individual file consistent recovery
> > 
> >   ?5.1 - Problem:
> > 
> >     If a file detects corruption, and we restore it from backup, how do 
> >     we know what archived files we need for recovery?
> > 
> >     Should file corruption (partial write, bad disk block, etc.) outside 
> >     the system catalog cause us to abort the system, or should we just 
> >     take the relation or database off-line?
> 
> Offline is often best so they can get in there and recover if needed. 
> We usually allow them in with a special flag or utility like
> pg_resetxlog.
> 
> 
> >     Given a backup file, how do we determine the point in the log 
> >     where we should start recovery for the file? What is the highest LSN
> >     we can use that will fully recover the file?
> 
> That is tricky.  We have discussed it and your backup has to deal with
> some pretty strange things that can happen while 'tar' is traversing the
> directory.

Even if we shutdown before we copy the file, we don't want a file that
hasn't been written to in 5 weeks before it was backed up to require
five weeks of old log files to recover. So we need to track that
information somehow, because right now if we scanned the blocks in the
file looking for at the page LSN's, we greatest LSN we would see might
be much older than where it would be safe to recover from. That is the
biggest problem, I think.

> 
> 
> >   ?5.2 - Proposal: 
> >     
> >     Put a file header on each file, and update that header to the last
> >     checkpoint LSN at least once every 'file_lsn_time_slack' minutes, or
> >     at least once every dbsize/'file_lsn_log_slack' megabytes of log
> >     written, where dbsize is the estimated size of the database. Have
> >     these values be settable from the config file. These updates would be
> >     distributed throughout the hour, or interspersed between regular
> >     amounts of log generation.
> > 
> >     If we have a database backup program or command, it can update the
> >     header on the file before backup to the greatest value it can assure
> >     to be safe.
> 
> Not sure.
> 
> >   ?5.3 - Status:
> > 
> >     Waiting for input from pgsql-hackers.
> > 
> >     Questions:
> > 
> >         There are alternate methods than using a file header to get a
> >         known-good LSN lower bound for the starting point to recover a backup
> >         file. Is this the best way?
> 
> Not sure.
> 
> I am sure others will chime in with more information.
> 
> -- 
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> 
-- 
J. R. Nield
jrnield@usol.com
pgsql-hackers by date:
From: "Christopher Kings-Lynne"
Date: 05 July 2002, 04:13:25
Subject: Re: BETWEEN Node & DROP COLUMN
From: Alessio Bragadini
Date: 05 July 2002, 08:37:53
Subject: Re: Should next release by 8.0 (Was: Re: [GENERAL] I am
Re: Issues Outstanding for Point In Time Recovery (PITR) - Mailing list pgsql-hackers

Previous

Next