Thread: xlog loose ends, continued

xlog loose ends, continued

From
Tom Lane
Date:
There is another loose end that I forgot I needed to discuss with you.

xlog.c's ReadRecord formerly contained code that would zero out the rest
of the log segment (and delete the next log segment, if any) upon
detecting a missing or corrupted xlog record.  I removed that code
because I considered it horribly dangerous where it was.  If there is
anything wrong with either the xlog or pg_control's pointers to it,
that code was quite capable of wiping out all hope of recovery *and*
all evidence of what went wrong.

I think it's really bad to automatically destroy log data, especially
when we do not yet know if we are capable of recovering.  If we need
this functionality, it should be invoked only at the completion of
StartupXLOG, after we have finished the recovery phase.  However,
I'd be a lot happier if we could avoid wholesale zeroing at all.

I presume the point of this code was that if we recover and then suffer
a later crash at a point where we've just written an xlog record that
exactly fills an xlog page, a subsequent scan of the log might continue
on from that point and pick up xlog records from the prior (failed)
system run.  Is there a way to guard against that scenario without
having to zero out data during recovery?

One thought that comes to mind is to store StartUpID in XLOG page
headers, and abort log scanning if we come to a page with StartUpID
less than what came before.  Is that secure/sufficient?  Is there
a better way?
        regards, tom lane


RE: xlog loose ends, continued

From
"Mikheev, Vadim"
Date:
> I presume the point of this code was that if we recover and 
> then suffer
> a later crash at a point where we've just written an xlog record that
> exactly fills an xlog page, a subsequent scan of the log 
> might continue
> on from that point and pick up xlog records from the prior (failed)
> system run.  Is there a way to guard against that scenario without
> having to zero out data during recovery?
> 
> One thought that comes to mind is to store StartUpID in XLOG page
> headers, and abort log scanning if we come to a page with StartUpID
> less than what came before.  Is that secure/sufficient?  Is there
> a better way?

This code was from the old days when there was no CRC in log records.
Should we try to read log up to the *physical end* - ie end of last
log file - regardless invalid CRC-s/zero pages with attempt to
re-apply interim valid records? (Or do we already do this?)
This way we'll know where is actual end of log (last valid record)
to begin production from there. (Unfortunately, we'll have to read
empty files pre-created by checkpointer -:().
Anyway I like idea of StartUpID in page headers - this will help
if some log files disappeared. Should we add CRC to page header?
Hm, maybe XLogFileInit should initialize files with StartUpID & CRC
in pages? We would avoid reading empty files.

Vadim


Re: xlog loose ends, continued

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> This code was from the old days when there was no CRC in log records.

Ah, right.  The CRC makes things safer ... but there's still a risk
that old log pages could look like a valid continuation.

> Should we try to read log up to the *physical end* - ie end of last
> log file - regardless invalid CRC-s/zero pages with attempt to
> re-apply interim valid records? (Or do we already do this?)

That doesn't seem like a good idea --- once we fail to read an XLOG
record, it's probably best to stop there rather than continue on.
I think we want to try for a consistent recovery to a past point in
time (ie, wherever the xlog gap is) not a partial recovery to a later
time.

> Anyway I like idea of StartUpID in page headers - this will help
> if some log files disappeared. Should we add CRC to page header?

That seems like overkill.  I was hoping to keep the page header overhead
at eight bytes.  We could do that either by storing just the two LSBs
of StartUpID (and doing the sequence checking mod 64k) or by reducing
the magic number to two bytes so there's room for four bytes of
StartUpID.  I think I like the first alternative better --- comments?

> Hm, maybe XLogFileInit should initialize files with StartUpID & CRC
> in pages? We would avoid reading empty files.

We already stop when we hit a zeroed page (because it's not got the
right magic number).  That seems sufficient.
        regards, tom lane


RE: xlog loose ends, continued

From
"Mikheev, Vadim"
Date:
> > Should we try to read log up to the *physical end* - ie end of last
> > log file - regardless invalid CRC-s/zero pages with attempt to
> > re-apply interim valid records? (Or do we already do this?)
> 
> That doesn't seem like a good idea --- once we fail to read an XLOG
> record, it's probably best to stop there rather than continue on.
> I think we want to try for a consistent recovery to a past point in
> time (ie, wherever the xlog gap is) not a partial recovery to a later
> time.

No way for consistent recovery if there is gap in log due to
disk write re-ordering anyway (and we can't know what was
the reason of the gap). I thought that you wanted apply as much of log
as we can. If you don't then I missed your point in first message:

> xlog.c's ReadRecord formerly contained code that would zero 
> out the rest of the log segment (and delete the next log segment,
> if any) upon detecting a missing or corrupted xlog record.
> I removed that code because I considered it horribly dangerous
> where it was. If there is anything wrong with either the xlog or
> pg_control's pointers to it, that code was quite capable of wiping
> out all hope of recovery *and* all evidence of what went wrong. ^^^^^^^^^^^^^^^^^^^^^^^^

So, if we are not going to re-apply as much valid records as
we can read from log then zeroing is no more dangerous than
SUI in headers. But I totaly agreed that SUI is much better.

> > Anyway I like idea of StartUpID in page headers - this will help
> > if some log files disappeared. Should we add CRC to page header?
> 
> That seems like overkill. I was hoping to keep the page 
> header overhead at eight bytes. We could do that either by storing just
> the two LSBs of StartUpID (and doing the sequence checking mod 64k) or
> by reducing the magic number to two bytes so there's room for four bytes
of
> StartUpID.  I think I like the first alternative better --- comments?

I don't think a few additional bytes in header is a problem.
BTW, why not use CRC32 in header instead of magic?
Or just StartUpID instead of magic if you don't want to calculate
CRC for header - xlp_magic doesn't seem to be more useful than SUI.

> > Hm, maybe XLogFileInit should initialize files with StartUpID & CRC
> > in pages? We would avoid reading empty files.
> 
> We already stop when we hit a zeroed page (because it's not got the
> right magic number).  That seems sufficient.

What if the next page after zeroed one is correct (due to write
re-ordering)?
(But I take back SUI+CRC in XLogFileInit - useless -:))

Vadim


Re: xlog loose ends, continued

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> That doesn't seem like a good idea --- once we fail to read an XLOG
>> record, it's probably best to stop there rather than continue on.
>> I think we want to try for a consistent recovery to a past point in
>> time (ie, wherever the xlog gap is) not a partial recovery to a later
>> time.

> No way for consistent recovery if there is gap in log due to
> disk write re-ordering anyway (and we can't know what was
> the reason of the gap). I thought that you wanted apply as much of log
> as we can. If you don't then I missed your point in first message:

>> xlog.c's ReadRecord formerly contained code that would zero 
>> out the rest of the log segment (and delete the next log segment,
>> if any) upon detecting a missing or corrupted xlog record.
>> I removed that code because I considered it horribly dangerous
>> where it was. If there is anything wrong with either the xlog or
>> pg_control's pointers to it, that code was quite capable of wiping
>> out all hope of recovery *and* all evidence of what went wrong.
>   ^^^^^^^^^^^^^^^^^^^^^^^^

What I was thinking about in that last paragraph was manual analysis and
recovery.  I don't think it's a good idea for automatic system startup
to skip over gaps in the log.

> So, if we are not going to re-apply as much valid records as
> we can read from log then zeroing is no more dangerous than
> SUI in headers. But I totaly agreed that SUI is much better.

Okay, I will change the page headers to include SUI (or the low-order
bits of it anyway), and make ReadRecord stop if it notices a backwards
jump in SUI.

> I don't think a few additional bytes in header is a problem.
> BTW, why not use CRC32 in header instead of magic?

There is so little changeable information in a page header that a CRC
wouldn't be much more than an eight-byte magic number.  And we don't
need eight bytes worth of magic number (even four is more than enough,
really).  So I'd just as soon keep the headers simple and small.
        regards, tom lane


RE: xlog loose ends, continued

From
"Mikheev, Vadim"
Date:
> >> xlog.c's ReadRecord formerly contained code that would zero 
> >> out the rest of the log segment (and delete the next log segment,
> >> if any) upon detecting a missing or corrupted xlog record.
> >> I removed that code because I considered it horribly dangerous
> >> where it was. If there is anything wrong with either the xlog or
> >> pg_control's pointers to it, that code was quite capable of wiping
> >> out all hope of recovery *and* all evidence of what went wrong.
> >   ^^^^^^^^^^^^^^^^^^^^^^^^
> 
> What I was thinking about in that last paragraph was manual 
> analysis and recovery. I don't think it's a good idea for automatic
> system startup to skip over gaps in the log.

But if we'll not try to read after gap then after restart system will
not notice gap and valid records after it and just rewrite log space
with new records. Not much chance for manual analysis - ppl will
not report any problems.

Vadim


Re: xlog loose ends, continued

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> What I was thinking about in that last paragraph was manual 
>> analysis and recovery. I don't think it's a good idea for automatic
>> system startup to skip over gaps in the log.

> But if we'll not try to read after gap then after restart system will
> not notice gap and valid records after it and just rewrite log space
> with new records. Not much chance for manual analysis - ppl will
> not report any problems.

That'll be true in any case, unless we refuse to start up at all upon
detecting xlog corruption (which doesn't seem like the way to fly).
Not sure what we can do about that.
        regards, tom lane


RE: xlog loose ends, continued

From
"Mikheev, Vadim"
Date:
> >> What I was thinking about in that last paragraph was manual 
> >> analysis and recovery. I don't think it's a good idea for automatic
> >> system startup to skip over gaps in the log.
> 
> > But if we'll not try to read after gap then after restart 
> > system will not notice gap and valid records after it and
> > just rewrite log space with new records. Not much chance for
> > manual analysis - ppl will not report any problems.
> 
> That'll be true in any case, unless we refuse to start up at all upon
> detecting xlog corruption (which doesn't seem like the way to fly).
> Not sure what we can do about that.

What I would refuse in the event of log corruption is continuing
normal database operations. It's ok to dump such database for manual
recovery, but continuing to use it is VERY BAD THING. The fact that
users will use inconsistent DB may become big headache for us - just
imagine compains about index scans returning incorrect results
(index tuples pointing to free heap space was left and then that space
was used for tuple with different keys).

Failing to restart was bad but silent restart in the event of log
corruption is bad too. In first case we had at least chance to
discover original problem.

Vadim


RE: RE: xlog loose ends, continued

From
"Mikheev, Vadim"
Date:
> > It will be sufficient if DB will not use all 2^32 XIDs 
> > without shutdown.
> 
> I liked the xid wraparound idea, won't that be sufficient here too ?
> I don't like the idea to reuse a xid sooner than absolutely necessary.

We need it to reduce pg_log size requirements.

> This would complicate the search for potentially inconsistent pages
> after crash.

There is no such search currently and I can't imagine how/when/for what
to do such search.
?

Vadim


Re: xlog loose ends, continued

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> That'll be true in any case, unless we refuse to start up at all upon
>> detecting xlog corruption (which doesn't seem like the way to fly).
>> Not sure what we can do about that.

> What I would refuse in the event of log corruption is continuing
> normal database operations.

Hmm.  We could do that if we had some notion of a read-only operating
mode, perhaps.  But we don't have one now and I don't want to add it
for 7.1.  Can we agree to look at this more for 7.2?

If we did have that, it would make sense to scan the rest of the log
(after the last valid XLOG record) to see if we find any more records.
If we do then --- whether they're valid or not --- we have a corrupted
DB and we should go into the read-only state.  But for the moment I
think it's best not to make such a scan.
        regards, tom lane


RE: xlog loose ends, continued

From
"Mikheev, Vadim"
Date:
> > What I would refuse in the event of log corruption is continuing
> > normal database operations.
> 
> Hmm.  We could do that if we had some notion of a read-only operating
> mode, perhaps.  But we don't have one now and I don't want to add it
> for 7.1.  Can we agree to look at this more for 7.2?

We need not in full support of read-only mode - just set some flag in
shmem and disallow write ops. I think 7.1.1 or so is good for that.

Vadim


Re: xlog loose ends, continued

From
Thomas Lockhart
Date:
> >> That'll be true in any case, unless we refuse to start up at all upon
> >> detecting xlog corruption (which doesn't seem like the way to fly).
> >> Not sure what we can do about that.
> > What I would refuse in the event of log corruption is continuing
> > normal database operations.
> Hmm.  We could do that if we had some notion of a read-only operating
> mode, perhaps.  But we don't have one now and I don't want to add it
> for 7.1.  Can we agree to look at this more for 7.2?

I'd like to have a readonly mode driven by integrity requirements for
corruption recovery for database tables, for replication, and (in the
future) for distributed databases, so perhaps we can do a trial
implementation fairly soon. Not sure how it would impact the backend(s),
but istm that we might be able to do a first implementation for 7.1.x.
I'll bring it up again when appropriate...
                  - Thomas


Re: Re: xlog loose ends, continued

From
Justin Clift
Date:
Maybe there should be an error message like :

"PostgreSQL has detected severe xlog corruption.  Please fix this with
pg_recover (or similar) manually before restarting the database"?

Guess I'm suggesting a separate xlog recovery tool for "bad cases" of
xlog corruption, so decisions can by manually made by a DBA where
necessary.  Not everything has to be automatic I'm thinking.  There are
probably times where the dBA would prefer behaviour that doesn't seem
intuitive anyway.

Regards and best wishes,

Justin Clift

Tom Lane wrote:
> 
> "Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> >> What I was thinking about in that last paragraph was manual
> >> analysis and recovery. I don't think it's a good idea for automatic
> >> system startup to skip over gaps in the log.
> 
> > But if we'll not try to read after gap then after restart system will
> > not notice gap and valid records after it and just rewrite log space
> > with new records. Not much chance for manual analysis - ppl will
> > not report any problems.
> 
> That'll be true in any case, unless we refuse to start up at all upon
> detecting xlog corruption (which doesn't seem like the way to fly).
> Not sure what we can do about that.
> 
>                         regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
> http://www.postgresql.org/search.mpl


Re: RE: xlog loose ends, continued

From
Hiroshi Inoue
Date:
"Mikheev, Vadim" wrote:
> 
> > >> What I was thinking about in that last paragraph was manual
> > >> analysis and recovery. I don't think it's a good idea for automatic
> > >> system startup to skip over gaps in the log.
> >
> > > But if we'll not try to read after gap then after restart
> > > system will not notice gap and valid records after it and
> > > just rewrite log space with new records. Not much chance for
> > > manual analysis - ppl will not report any problems.
> >
> > That'll be true in any case, unless we refuse to start up at all upon
> > detecting xlog corruption (which doesn't seem like the way to fly).
> > Not sure what we can do about that.
> 
> What I would refuse in the event of log corruption is continuing
> normal database operations. 

Log corruption is never an unique cause of a recovery failure.
If there's a bug in redo stuff the result would also be a recovery
failure. Currently the redo stuff has to accomplish redo operations
completely. No matter how trivial the bug may be, it's always serious
unfortunately.

> It's ok to dump such database for manual
> recovery, but continuing to use it is VERY BAD THING. The fact that
> users will use inconsistent DB may become big headache for us - just

> imagine compains about index scans returning incorrect results
> (index tuples pointing to free heap space was left and then that space
> was used for tuple with different keys).
> 

Hmm this seems nothing worse than 7.0.
I would complain if postmaster couldn't restart due to this reason.
IIRC few ppl mind the (even system) index corruption.
> Failing to restart was bad but silent restart in the event of log
> corruption is bad too.

Agreed.

regards,
Hiroshi Inoue


Re: Re: xlog loose ends, continued

From
Andrew McMillan
Date:
Tom Lane wrote:
> 
> "Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> >> That'll be true in any case, unless we refuse to start up at all upon
> >> detecting xlog corruption (which doesn't seem like the way to fly).
> >> Not sure what we can do about that.
> 
> > What I would refuse in the event of log corruption is continuing
> > normal database operations.
> 
> Hmm.  We could do that if we had some notion of a read-only operating
> mode, perhaps.  But we don't have one now and I don't want to add it
> for 7.1.  Can we agree to look at this more for 7.2?

I'd love to see PostgreSQL have a read-only mode of some kind that let
enquiry function against a possibly otherwise corrupted database,
without the stress of worrying that you might be making things worse.

I know other DB servers that have this sort of thing, and it has been a
life-saver for me on occasion to allow critical information to be
extracted before you nuke it all and start over.

Cheers,            Andrew.
-- 
_____________________________________________________________________          Andrew McMillan, e-mail:
Andrew@catalyst.net.nz
Catalyst IT Ltd, PO Box 10-225, Level 22, 105 The Terrace, Wellington
Me: +64 (21) 635 694, Fax: +64 (4) 499 5596, Office: +64 (4) 499 2267