> I've reported the major problems to the mailing lists
> but gotten almost no feedback about what to do.
I can't comment without access to the code -:(
> commit: 2001-02-26 17:19:57
> 0/0059996C: prv 0/00599948; xprv 0/00000000; xid 0;
> RM 0 info 00 len 32
> checkpoint: redo 0/0059996C; undo 0/00000000; sui 29;
> nextxid 18903; nextoid 35195; online
> -- this is the last normal-looking checkpoint record.
> -- Judging from the commit timestamps surrounding prior
> -- checkpoints, checkpoints were happening every five
> -- minutes approximately on the 5-minute mark, so
You can't count on this: the postmaster starts the checkpoint
"maker" 5 minutes *after* the previous checkpoint was created,
not 5 minutes after the previous "maker" was started. And a
checkpoint can take *minutes*.
> -- this one happened about 17:20.
> -- (There really should be a timestamp
> -- in the checkpoint records...)
Agreed.
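As a rough illustration of why checkpoints need not land on the 5-minute mark, here is a toy model, assuming (as described above) that the next checkpoint is scheduled 5 minutes after the previous one *finishes*. The start time and the checkpoint durations are made-up values, not measurements from this log, and `checkpoint_times` is a hypothetical helper, not postmaster code.

```python
from datetime import datetime, timedelta

# Assumption (toy model): the next checkpoint is scheduled INTERVAL after the
# previous checkpoint *finishes*.  Start time and durations are illustrative.
INTERVAL = timedelta(minutes=5)


def checkpoint_times(first_begin, durations):
    """Return (begin, end) pairs for successive checkpoints."""
    times = []
    begin = first_begin
    for duration in durations:
        end = begin + duration
        times.append((begin, end))
        # Counted from the end of this checkpoint, so a slow checkpoint
        # pushes every later one off the 5-minute grid.
        begin = end + INTERVAL
    return times


schedule = checkpoint_times(
    datetime(2001, 1, 1, 12, 0, 0),
    [timedelta(seconds=20), timedelta(minutes=2), timedelta(seconds=45)],
)
for begin, end in schedule:
    print(begin.time(), "->", end.time())
```

With these numbers the checkpoints begin at 12:00:00, 12:05:20 and 12:12:20, so after one slow checkpoint the schedule has already drifted well away from the 5-minute marks.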
> commit: 2001-02-26 17:26:02
> ReadRecord: record with zero len at 0/005A4B4C
> -- My dump program is unhappy here because the rest
> -- of the page is zero. Given that there is a
> -- continuation record at the start of the next
> -- page, there certainly should have been record(s)
> -- here. But it's worse than that: check the commit
> -- timestamps and the xid numbers before and after the
> -- discontinuity. Did time go backwards here?
Commit timestamps are taken *before* the XLogInsert call,
which can suspend a backend for some time (in a multi-user
environment). Out-of-order xids are also OK, generally.
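A small simulation of that effect, assuming a backend stamps its commit time first and may then be suspended inside the insert call before its record reaches the log. The xids and timings are hypothetical; the point is only that timestamps read in log order can go "backwards" without anything being corrupt.

```python
from dataclasses import dataclass

# Toy model (assumption): the commit timestamp is taken first, then the backend
# may be suspended inside the log insert call.  xids and timings are made up.


@dataclass
class Commit:
    xid: int
    stamped_at: float  # when the commit timestamp was taken (seconds)
    insert_delay: float  # time suspended before the record reaches the log


def log_order(commits):
    """Return commits in the order their records would appear in the log."""
    return sorted(commits, key=lambda c: c.stamped_at + c.insert_delay)


def timestamps_monotonic(commits):
    """True if commit timestamps never decrease when read in log order."""
    stamps = [c.stamped_at for c in commits]
    return all(a <= b for a, b in zip(stamps, stamps[1:]))


commits = [
    Commit(xid=101, stamped_at=0.0, insert_delay=3.0),  # suspended for a while
    Commit(xid=102, stamped_at=1.0, insert_delay=0.1),
    Commit(xid=103, stamped_at=2.0, insert_delay=0.1),
]
in_log = log_order(commits)
print([(c.xid, c.stamped_at) for c in in_log])
print(timestamps_monotonic(in_log))
```

Here xid 101 takes its timestamp first but is written last, so the dump would show both its timestamp and its xid out of sequence.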
> -- Also notice the back-pointers in the first valid
> -- record on the next page; they point not into the
> -- zeroed space, which would suggest a mere failure
> -- to write a buffer after filling it, but into the
> -- middle of one of the valid records on the prior
> -- page. It almost looks like page 5A6000 came from
> -- a completely different run than page 5A4000.
> Unexpected page info flags 0001 at offset 5A6000
> Skipping unexpected continuation record at offset 5A6000
> 0/005A6904: prv 0/005A48B4(?); xprv 0/005A48B4; xid 19047;
                  ^^^^^^^^^^          ^^^^^^^^^^
They are the same. So TX 19047 really did insert a record at
position 0/005A48B4.
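To make the back-pointer argument mechanical, one could classify where a prv/xprv pointer lands relative to the records the dump program parsed and the zeroed tail of the page. This is a hypothetical helper, not part of the dump program. It assumes positions are printed as "xlogid/xrecoff" in hex, takes the zeroed range from the dump above (0/005A4B4C up to the next page at 0/005A6000), and uses a record boundary at 0/005A4880 that is invented purely for illustration.

```python
# Hypothetical helper; positions are written "xlogid/xrecoff" in hex.


def parse_pos(text):
    """Turn "0/005A48B4" into one integer offset (xlogid in the high 32 bits)."""
    xlogid, xrecoff = text.split("/")
    return (int(xlogid, 16) << 32) | int(xrecoff, 16)


def classify_backpointer(ptr, records, zeroed):
    """Say where a prv/xprv pointer lands.

    records: list of (start, length) for records the dump program parsed.
    zeroed:  (start, end) half-open range found to be all zeroes.
    """
    for start, length in records:
        if ptr == start:
            return "record start"
        if start < ptr < start + length:
            return "inside record"
    if zeroed[0] <= ptr < zeroed[1]:
        return "zeroed space"
    return "unknown"


ptr = parse_pos("0/005A48B4")
zeroed = (parse_pos("0/005A4B4C"), parse_pos("0/005A6000"))
records = [(parse_pos("0/005A4880"), 0x40)]  # hypothetical boundary, not from the dump
print(classify_backpointer(ptr, records, zeroed))
```

With the invented boundary the pointer falls inside a record rather than in the zeroed space, which is the situation described in the report; with the real record list from the dump, the same check would show whether 0/005A48B4 is a record start.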
> -- What's even nastier (and the immediate cause of
> -- Scott's inability to restart) is that the pg_control
> -- file's checkPoint pointer points to 0/005AF9F0, which
> -- is *not* the location of this checkpoint, but of
> -- the record after it.
Well, well. The checkpoint position is taken from
MyLastRecord; I wonder how this internal variable could
pick up "invalid" data from a concurrent backend.
OK, we're leaving Krasnoyarsk in 8 hours and should
arrive in SF on Feb 5 around 10pm.
Vadim