Thread: Errors during recovery of a postgres. Need some help understanding them...

From: "Dhaval Shah"

Here is the situation:

I have a standby postgres that is fed a WAL file every 2 minutes.
Whenever it is fed a WAL file, it logs the following:


---
LOG:  restored log file "000000010000000000000070" from archive
pg_restore::copyWALFile: Moving
/opt/data/mirror/000000010000000000000071 to pg_xlog/RECOVERYXLOG
LOG:  restored log file "000000010000000000000071" from archive
pg_restore::copyWALFile: Moving
/opt/data/mirror/000000010000000000000072 to pg_xlog/RECOVERYXLOG
LOG:  restored log file "000000010000000000000072" from archive
...
...
pg_restore::copyWALFile: Moving
/opt/data/mirror/000000010000000000000082 to pg_xlog/RECOVERYXLOG
LOG:  restored log file "000000010000000000000082" from archive
---

I assume that the above shows a happy postgres in recovery mode. The
"copyWALFile" lines are my own messages in the server log.

After a while, the primary gives up; that is, it goes down and I am no
longer able to pull any WAL files from it. So I tell the standby that
I do not have any WAL file to give.

---
LOG:  could not open file "pg_xlog/000000010000000000000083" (log file
0, segment 131): No such file or directory
LOG:  redo done at 0/8200D280
Main: Triggering recovery
PANIC:  could not open file "pg_xlog/000000010000000000000082" (log
file 0, segment 130): No such file or directory
---

The issue above is that I do not have the "001...0083" file, so I
return "file not found". Furthermore, when postgres then asks me for
"001...0082", I do not have that either, since in the intervening
minutes I have moved that file out of /opt/data/mirror into the
/opt/data/tape directory for long-term tape storage. So how do I make
my standby postgres happy?
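One way to avoid serving "file not found" for a segment that was already handed over is sketched below. It assumes the restore program is run by the server as its restore_command with %f (the requested file name) and %p (the target path, pg_xlog/RECOVERYXLOG in the log above), and that a nonzero exit status means "not available". It copies instead of moving, and falls back to /opt/data/tape, so the same segment can be served a second time at the end of recovery. The function name is hypothetical, and the wait-and-trigger logic behind "Main: Triggering recovery" is left out.

```python
import shutil
import sys
from pathlib import Path

# Directories taken from the post; the search order is an assumption.
SEARCH_DIRS = [Path("/opt/data/mirror"), Path("/opt/data/tape")]


def restore_wal(wal_name, target, search_dirs=SEARCH_DIRS):
    """Copy wal_name to target; return 0 on success, 1 if it is not found.

    Copying (not moving) leaves the segment in place, so it can be served
    again if the server re-requests it when recovery ends.
    """
    for d in search_dirs:
        src = Path(d) / wal_name
        if src.is_file():
            shutil.copyfile(src, target)
            return 0
    # Nonzero exit status: tell the server this segment is not available.
    return 1


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical invocation from restore_command: script.py %f %p
    sys.exit(restore_wal(sys.argv[1], sys.argv[2]))
```

With this approach, moving a file to the tape directory no longer makes it invisible to the standby; it only has to survive there until the standby has finished recovering.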

Having run into that situation, the standby also spits out the following:

---
LOG:  could not open file "pg_xlog/000000010000000000000082" (log file
0, segment 130): No such file or directory
LOG:  invalid primary checkpoint record
LOG:  could not open file "pg_xlog/000000010000000000000080" (log file
0, segment 128): No such file or directory
LOG:  invalid secondary checkpoint record
---

What is happening is that postgres is looking back in time for the
"0001...0082" and "0001...0080" files.
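The numbers in these log lines map directly onto the file names: a WAL segment name is three 8-digit hex fields (timeline, log file, segment), so "...0083" is "log file 0, segment 131". The sketch below decodes names and maps a location such as "redo done at 0/8200D280" to the segment that contains it, which shows why the PANIC asks for "...0082" again: the last applied record lives in that file. It assumes 16 MB segments, and the function names are made up for illustration.

```python
SEG_SIZE = 16 * 1024 * 1024  # assumed default WAL segment size (16 MB)


def parse_wal_name(name: str) -> tuple:
    """Split a 24-hex-digit WAL file name into (timeline, log file, segment)."""
    if len(name) != 24:
        raise ValueError(f"not a WAL segment name: {name!r}")
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def wal_name(timeline: int, log: int, seg: int) -> str:
    """Build a WAL file name from its three fields (upper-case hex)."""
    return f"{timeline:08X}{log:08X}{seg:08X}"


def segment_for_location(timeline: int, location: str) -> str:
    """Name of the segment holding an 'X/Y' location such as 0/8200D280."""
    log_hex, offset_hex = location.split("/")
    return wal_name(timeline, int(log_hex, 16), int(offset_hex, 16) // SEG_SIZE)
```

For example, `parse_wal_name("000000010000000000000083")` gives `(1, 0, 131)`, and `segment_for_location(1, "0/8200D280")` gives `"000000010000000000000082"`.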

The question I have is, how far does it look behind in time? That
determines when I can safely move a WAL file out to tape. If I know
how far back I have to keep my WAL files, I can devise an effective
strategy for removing older files. For example, if I generate a WAL
file every 2 minutes, do I keep 10 files or 120 files?

Any insight on this will help.

Regards
Dhaval

"Dhaval Shah" <dhaval.shah.m@gmail.com> writes:
> The question I have is, how far does it look behind in time?

I think you only need to hang onto the immediately preceding file;
it only backs up to the last applied WAL record, and that's certainly
not going to span multiple segment files.  The attempt to back up to the
last checkpoint isn't going to happen if you keep it from crashing at
the REDO DONE point.

            regards, tom lane

Re: Errors during recovery of a postgres. Need some help understanding them...

From: "Dhaval Shah"

I am still learning the ropes, I guess. I am not able to understand
the following:

> The attempt to back up to the
> last checkpoint isn't going to happen if you keep it from crashing at
> the REDO DONE point.

Does the above statement mean that I am crashing my primary server at
the REDO DONE point? If so, how do I avoid crashing at REDO DONE? Or
is this something to be done on the standby?

Regards
Dhaval

On 4/9/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Dhaval Shah" <dhaval.shah.m@gmail.com> writes:
> > The question I have is, how far does it look behind in time?
>
> I think you only need to hang onto the immediately preceding file;
> it only backs up to the last applied WAL record, and that's certainly
> not going to span multiple segment files.  The attempt to back up to the
> last checkpoint isn't going to happen if you keep it from crashing at
> the REDO DONE point.
>
>                         regards, tom lane
>


--
Dhaval Shah