Thread: Failing start-up archive recovery at Standby mode in PG9.2.4

Failing start-up archive recovery at Standby mode in PG9.2.4

From
KONDO Mitsumasa
Date:
Hi,

I found a problem where start-up archive recovery fails in Standby mode with PG9.2.4 streaming replication.
I ran the same test on PG9.2.3, but the problem does not occur there...

We searched the PostgreSQL git repository for the change that introduced this problem.
It may have been introduced by this commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ee8b95e8c807f31220f31bcff445d1dcb20c28b4

The attached script (replication_test.sh) reproduces this problem.
To use it, you should set PGPATH and have pgbench installed.
A summary of what the script does:

 1. Create Master and Standby instances.
 2. Insert a lot of records into the Master using pgbench, then stop the Master with an immediate shutdown.
 3. Start the Master and the Standby.
 4. The Standby fails start-up archive recovery in ValidXLOGHeader, xlog.c:4395.
 5. The Standby thinks archive recovery is finished. That is a mistake!
 6. The Standby connects to the Master for streaming replication, but that fails.

> cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: restored log file "000000020000000000000013" from archive
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  RestoreArchivedFile, xlog.c:3273
> [Standby] 2013-04-22 01:27:25 EDTDEBUG:  00000: got WAL segment from archive
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  XLogFileReadAnyTLI, xlog.c:2930
> [Standby] 2013-04-22 01:27:25 EDTDEBUG:  00000: compacted fsync request queue from 4096 entries to 2 entries
> [Standby] 2013-04-22 01:27:25 EDTCONTEXT:  xlog redo multi-insert (init): rel 1663/12870/24588; blk 41575; 61 tuples
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  CompactCheckpointerRequestQueue, checkpointer.c:1288
> [Standby] 2013-04-22 01:27:25 EDTDEBUG:  00000: checkpoint sync: number=1 file=base/12870/24588 time=822.092 msec
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  mdsync, md.c:1161
> [Standby] 2013-04-22 01:27:25 EDTDEBUG:  00000: checkpoint sync: number=2 file=base/12870/24588_fsm time=0.908 msec
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  mdsync, md.c:1161
> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: restartpoint complete: wrote 6 buffers (0.1%); 0 transaction log
file(s)added, 3 removed, 4 recycled; write=0.437 s, sync=0.823 s, total=1.274 s; sync files=2, longest=0.822 s,
average=0.411s 
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  LogCheckpointEnd, xlog.c:7893
> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: recovery restart point at 0/F0FF8D0
> [Standby] 2013-04-22 01:27:25 EDTDETAIL:  last completed transaction was at log time 2013-04-22 01:26:32.439575-04
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  CreateRestartPoint, xlog.c:8601
> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: restartpoint starting: xlog
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  LogCheckpointStart, xlog.c:7821
> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
[* This is where it goes wrong] > [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> [Master] 2013-04-22 01:27:25 EDTDEBUG:  00000: received replication command: IDENTIFY_SYSTEM
> [Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:25 EDTDEBUG:  00000: received replication command: START_REPLICATION 0/14000000
> [Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:25 EDTFATAL:  58P01: requested WAL segment 000000030000000000000014 has already been
removed
> [Master] 2013-04-22 01:27:25 EDTLOCATION:  XLogRead, walsender.c:1023
> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: streaming replication successfully connected to primary
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_connect, libpqwalreceiver.c:171
> [Standby] 2013-04-22 01:27:25 EDTFATAL:  XX000: could not receive data from WAL stream: FATAL:  requested WAL segment
000000030000000000000014has already been removed 
>
> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_receive, libpqwalreceiver.c:389
>
> [mitsu-ko@localhost postgresql-9.2.4]$ bincp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> [Standby] 2013-04-22 01:27:30 EDTDEBUG:  00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> [Standby] 2013-04-22 01:27:30 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> [Master] 2013-04-22 01:27:30 EDTDEBUG:  00000: received replication command: IDENTIFY_SYSTEM
> [Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:30 EDTDEBUG:  00000: received replication command: START_REPLICATION 0/14000000
> [Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:30 EDTFATAL:  58P01: requested WAL segment 000000030000000000000014 has already been
removed
> [Master] 2013-04-22 01:27:30 EDTLOCATION:  XLogRead, walsender.c:1023
> [Standby] 2013-04-22 01:27:30 EDTLOG:  00000: streaming replication successfully connected to primary
> [Standby] 2013-04-22 01:27:30 EDTLOCATION:  libpqrcv_connect, libpqwalreceiver.c:171
> [Standby] 2013-04-22 01:27:30 EDTFATAL:  XX000: could not receive data from WAL stream: FATAL:  requested WAL segment
000000030000000000000014has already been removed 
>
> [Standby] 2013-04-22 01:27:30 EDTLOCATION:  libpqrcv_receive, libpqwalreceiver.c:389
> /pg_ctl -D Standby cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> [Standby] 2013-04-22 01:27:35 EDTDEBUG:  00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> [Standby] 2013-04-22 01:27:35 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> [Master] 2013-04-22 01:27:35 EDTDEBUG:  00000: received replication command: IDENTIFY_SYSTEM
> [Master] 2013-04-22 01:27:35 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:35 EDTDEBUG:  00000: received replication command: START_REPLICATION 0/14000000
> [Master] 2013-04-22 01:27:35 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:35 EDTFATAL:  58P01: requested WAL segment 000000030000000000000014 has already been
removed

My colleagues and I tried to fix this problem,
but the WAL control and validation code is too difficult and complicated for us...

The problem occurs at xlog.c:10692.
> 10692         if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
> 10693                 goto next_record_is_invalid;

And the failing check is this one in ValidXLOGHeader():
>  4390         if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
>  4391         {
>  4392                 ereport(emode_for_corrupt_record(emode, recaddr),
>  4393                                 (errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
>  4394                                                 hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
>  4395                                                 readId, readSeg, readOff)));
>  4396                 return false;
>  4397         }

We think hdr->xlp_pageaddr.xrecoff has a wrong value. This is a very rare case!
We cannot work out why this value is incorrect!
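
For reference, the page header being validated here looks like this in the 9.2
sources (quoted from memory, so please treat it as a sketch rather than the
authoritative definition):

    /* access/xlog_internal.h, 9.2-era layout */
    typedef struct XLogPageHeaderData
    {
        uint16      xlp_magic;      /* magic value for correctness checks */
        uint16      xlp_info;       /* flag bits */
        TimeLineID  xlp_tli;        /* TimeLineID of first record on page */
        XLogRecPtr  xlp_pageaddr;   /* XLOG address of this page */
    } XLogPageHeaderData;

The failing check simply compares xlp_pageaddr with the address the page should
have, given the file and offset it was read from.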


Please help us fix this problem!

Best regards

--
NTT OSS Center
Mitsumasa KONDO

Attachment

Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Andres Freund
Date:
Hello,

On 2013-04-24 17:43:39 +0900, KONDO Mitsumasa wrote:
> Hi,
>
> I found a problem where start-up archive recovery fails in Standby mode with PG9.2.4 streaming replication.
> I ran the same test on PG9.2.3, but the problem does not occur there...

> > cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
> > [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: restored log file "000000020000000000000013" from archive

I can't read the error message here, but this looks suspicious. The
recovery command seems to be returning success although it prints such
an error message?

> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> [* This is where it goes wrong] > [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> > [Standby] 2013-04-22 01:27:25 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> > [Master] 2013-04-22 01:27:25 EDTDEBUG:  00000: received replication command: IDENTIFY_SYSTEM
> > [Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> > [Master] 2013-04-22 01:27:25 EDTDEBUG:  00000: received replication command: START_REPLICATION 0/14000000
> > [Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> > [Master] 2013-04-22 01:27:25 EDTFATAL:  58P01: requested WAL segment 000000030000000000000014 has already been
removed
> > [Master] 2013-04-22 01:27:25 EDTLOCATION:  XLogRead, walsender.c:1023
> > [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: streaming replication successfully connected to primary
> > [Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_connect, libpqwalreceiver.c:171
> > [Standby] 2013-04-22 01:27:25 EDTFATAL:  XX000: could not receive data from WAL stream: FATAL:  requested WAL
segment000000030000000000000014 has already been removed 
> >
> > [Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_receive, libpqwalreceiver.c:389
> >
> > [mitsu-ko@localhost postgresql-9.2.4]$ bincp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > [Standby] 2013-04-22 01:27:30 EDTDEBUG:  00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> > [Standby] 2013-04-22 01:27:30 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> > [Master] 2013-04-22 01:27:30 EDTDEBUG:  00000: received replication command: IDENTIFY_SYSTEM
> > [Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> > [Master] 2013-04-22 01:27:30 EDTDEBUG:  00000: received replication command: START_REPLICATION 0/14000000
> > [Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
> > [Master] 2013-04-22 01:27:30 EDTFATAL:  58P01: requested WAL segment 000000030000000000000014 has already been
removed
> > [Master] 2013-04-22 01:27:30 EDTLOCATION:  XLogRead, walsender.c:1023
> > [Standby] 2013-04-22 01:27:30 EDTLOG:  00000: streaming replication successfully connected to primary
> > [Standby] 2013-04-22 01:27:30 EDTLOCATION:  libpqrcv_connect, libpqwalreceiver.c:171
> > [Standby] 2013-04-22 01:27:30 EDTFATAL:  XX000: could not receive data from WAL stream: FATAL:  requested WAL
segment000000030000000000000014 has already been removed 

And this seems to be the second problem. You probably need to configure
a higher wal_keep_segments on the primary or you need to fix your
restore_command.


> My colleagues and I tried to fix this problem,
> but the WAL control and validation code is too difficult and complicated for us...
>
> The problem occurs at xlog.c:10692.
> > 10692         if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
> > 10693                 goto next_record_is_invalid;
>
> And the failing check is this one in ValidXLOGHeader():
> >  4390         if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
> >  4391         {
> >  4392                 ereport(emode_for_corrupt_record(emode, recaddr),
> >  4393                                 (errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
> >  4394                                                 hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
> >  4395                                                 readId, readSeg, readOff)));
> >  4396                 return false;
> >  4397         }
>
> We think hdr->xlp_pageaddr.xrecoff has a wrong value. This is a very rare case!
> We cannot work out why this value is incorrect!

I think the test is just noticing that we don't have valid WAL - we are
probably reading old preallocated WAL files from before the node was started
as a standby. Normally it will get the correct data from the primary after
that (you can see the START_REPLICATION commands in the log), but that failed.


Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Kyotaro HORIGUCHI
Date:
Hello,

> > cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
> > [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: restored log file "000000020000000000000013" from archive

> I can't read the error message here, but this looks suspicious.

The error message is "No such file or directory" in Japanese :-).
Perhaps restore_command is 'cp ../arc/%f %p'.

> The recovery command seems to be returning success although it prints such
> an error message?

It seems that, on 9.2.4, RestoreArchivedFile returns the expected xlog
filename even if the restore command fails. The caller, XLogFileOpen, then
successfully gets and returns an fd for that filename if the file already
existed before the restoration attempt. That does not matter as long as the
pre-existing file and the file that should have been restored are identical
in content.

If the segment includes the point where the timeline was incremented,
something odd might happen, I suppose..

regards,

-- 
Kyotaro Horiguchi



Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Kyotaro HORIGUCHI
Date:
Sorry,

> The caller, XLogFileOpen, then successfully gets and returns an fd for that filename

The caller is XLogFileRead in this case, not XLogFileOpen.

regards,
-- 
Kyotaro Horiguchi



Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Andres Freund
Date:
On 2013-04-24 19:16:12 +0900, Kyotaro HORIGUCHI wrote:
> Hello,
>
> > > cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
> > > [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: restored log file "000000020000000000000013" from archive
>
> > I can't read the error message here, but this looks suspicious.
>
> The error message is "No such file or directory" in Japanese :-).
> Perhaps restore_command is 'cp ../arc/%f %p'.
>
> > The recovery command seems to be returning success although it prints such
> > an error message?
>
> It seems that, on 9.2.4, RestoreArchivedFile returns the expected xlog
> filename even if the restore command fails. The caller, XLogFileOpen, then
> successfully gets and returns an fd for that filename if the file already
> existed before the restoration attempt. That does not matter as long as the
> pre-existing file and the file that should have been restored are identical
> in content.

But that's not what's happening here; AFAICS the "restored log file ..."
message is only printed if the return code is 0.

And the return value from RestoreArchivedFile() would be different in
the case you describe, which would make XLogFileRead() fail, so it should
result in different errors.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Kyotaro HORIGUCHI
Date:
Oops,

> But that's not what's happening here; AFAICS the "restored log file ..."
> message is only printed if the return code is 0.

You're right. 'cp <nonexistent> <somewhere>' exits with the status
code 1 (or 256?).

The quoted log lines simply show that the segment for tli=3 did not exist
and that the one for tli=2 was eventually retrieved. That is quite normal and
irrelevant to the trouble mentioned. Sorry for the confusion.

Unfortunately, the attached script didn't reproduce the situation for
me. But in the attached log,

> [Standby] 2013-04-22 01:27:25 EDTFATAL:  XX000: could not receive data from WAL stream: FATAL:  requested WAL segment
000000030000000000000014has already been removed
 
> cp: cannot stat `../arc/000000030000000000000014': No such file or directory.
> [Standby] 2013-04-22 01:27:30 EDTDEBUG:  00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0

these seem to show that the standby received a negative acknowledgement for
its request for the 20th segment, and that 5 seconds later it tried to get
that segment from the archive, somehow thought it had succeeded, yet the
header it read belongs to the 6th segment (pageaddr 0/6000000 with 16MB
segments).

regards,

-- 
Kyotaro Horiguchi



Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Kyotaro HORIGUCHI
Date:
I had a bit of a look at it and came up with a hypothesis.. umm, or a
scenario.

==
Just after a restartpoint, too-old xlog files are recycled, but their
page headers still carry old content, specifically the old xlogid and xrecoff.

In addition, if the master's LSN is at the head of a new segment, the
file for that segment may not have been created yet, and the WAL send
request for that LSN from the standby might fail with "requested
WAL segment %s has already been removed", even though the segment
is really "not yet created"...
So the standby receives a nack for the request and looks in the
archive, where the file rightly does not exist. But a file with
that segment's name is certainly found in the pg_xlog directory, so
XLogFileRead grabs the file whose contents are stale and...

> these seem to show that the standby received a negative acknowledgement for
> its request for the 20th segment, and that 5 seconds later it tried to get
> that segment from the archive, somehow thought it had succeeded, yet the
> header it read belongs to the 6th segment.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
KONDO Mitsumasa
Date:
Hi,

>> I found a problem where start-up archive recovery fails in Standby mode with PG9.2.4 streaming replication.
>> I ran the same test on PG9.2.3, but the problem does not occur there...
>
>>> cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
>>> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: restored log file "000000020000000000000013" from archive
>
> I can't read the error message here, but this looks suspicious. The
> recovery command seems to be returning success although it prints such
> an error message?
Sorry.. That suspicious message is written in Japanese; it means "No such file or directory".
That is because the starting point of the timeline 3 WAL files is 00000003000000000000002A.
>>> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
>> [* This is where it goes wrong] > [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
>>> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
>>> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
>>> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
>>> [Master] 2013-04-22 01:27:25 EDTDEBUG:  00000: received replication command: IDENTIFY_SYSTEM
>>> [Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
>>> [Master] 2013-04-22 01:27:25 EDTDEBUG:  00000: received replication command: START_REPLICATION 0/14000000
>>> [Master] 2013-04-22 01:27:25 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
>>> [Master] 2013-04-22 01:27:25 EDTFATAL:  58P01: requested WAL segment 000000030000000000000014 has already been
removed
>>> [Master] 2013-04-22 01:27:25 EDTLOCATION:  XLogRead, walsender.c:1023
>>> [Standby] 2013-04-22 01:27:25 EDTLOG:  00000: streaming replication successfully connected to primary
>>> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_connect, libpqwalreceiver.c:171
>>> [Standby] 2013-04-22 01:27:25 EDTFATAL:  XX000: could not receive data from WAL stream: FATAL:  requested WAL
segment000000030000000000000014 has already been removed 
>>>
>>> [Standby] 2013-04-22 01:27:25 EDTLOCATION:  libpqrcv_receive, libpqwalreceiver.c:389
>>>
>>> [mitsu-ko@localhost postgresql-9.2.4]$ bincp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
>>> [Standby] 2013-04-22 01:27:30 EDTDEBUG:  00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
>>> [Standby] 2013-04-22 01:27:30 EDTLOCATION:  ValidXLOGHeader, xlog.c:4395
>>> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
>>> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
>>> [Master] 2013-04-22 01:27:30 EDTDEBUG:  00000: received replication command: IDENTIFY_SYSTEM
>>> [Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
>>> [Master] 2013-04-22 01:27:30 EDTDEBUG:  00000: received replication command: START_REPLICATION 0/14000000
>>> [Master] 2013-04-22 01:27:30 EDTLOCATION:  HandleReplicationCommand, walsender.c:449
>>> [Master] 2013-04-22 01:27:30 EDTFATAL:  58P01: requested WAL segment 000000030000000000000014 has already been
removed
>>> [Master] 2013-04-22 01:27:30 EDTLOCATION:  XLogRead, walsender.c:1023
>>> [Standby] 2013-04-22 01:27:30 EDTLOG:  00000: streaming replication successfully connected to primary
>>> [Standby] 2013-04-22 01:27:30 EDTLOCATION:  libpqrcv_connect, libpqwalreceiver.c:171
>>> [Standby] 2013-04-22 01:27:30 EDTFATAL:  XX000: could not receive data from WAL stream: FATAL:  requested WAL
segment000000030000000000000014 has already been removed 
>
> And this seems to be the second problem. You probably need to configure
> a higher wal_keep_segments on the primary or you need to fix your
> restore_command.
In this situation I am trying to advance the timelineID through archive recovery, not only to recover the WAL files.
If archive recovery worked correctly on the Standby, the recovery end point would be 000000020000000000000029 on timelineID 2.
Please see the details in pg9.2.4.log; the situation of the archive files is shown there.
>> My colleagues and I tried to fix this problem,
>> but the WAL control and validation code is too difficult and complicated for us...
>>
>> The problem occurs at xlog.c:10692.
>>> 10692         if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
>>> 10693                 goto next_record_is_invalid;
>>
>> And the failing check is this one in ValidXLOGHeader():
>>>   4390         if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
>>>   4391         {
>>>   4392                 ereport(emode_for_corrupt_record(emode, recaddr),
>>>   4393                                 (errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
>>>   4394                                                 hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
>>>   4395                                                 readId, readSeg, readOff)));
>>>   4396                 return false;
>>>   4397         }
>>
>> We think hdr->xlp_pageaddr.xrecoff has a wrong value. This is a very rare case!
>> We cannot work out why this value is incorrect!
>
> I think the test is just noticing that we don't have valid WAL - we are
> probably reading old preallocated WAL files from before the node was
> started as a standby. Normally it will get the correct data from the
> primary after that (you can see the START_REPLICATION commands in the
> log), but that failed.
I hope so! I will continue trying to analyze this problem..

Best regards

--
NTT Open Source Software Center
Mitsumasa KONDO



Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Amit Langote
Date:
I also had a similar observation when I could reproduce this.
I tried to find why restartpoint causes the recycled segment to be named
after timeline 3, but have not been able to determine that.

When I looked at the source, I found that the function XLogFileReadAnyTLI,
which returns a segment file for reading an XLog page, iterates over a list
expectedTLIs, which starts with 3 in the case where this error happens.
XLogFileReadAnyTLI checks for the segment in both the archive and pg_xlog. So
even if such a segment (that is, one with timeline 3) is not in the archive,
it may be in pg_xlog due to recycling, as we have observed. Such a recycled
segment may then be returned by XLogFileReadAnyTLI as though it were the next
segment to recover from, resulting in the "unexpected pageaddr ..." error; a
rough sketch of that loop follows.
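
The relevant loop, roughly as it stands in the 9.2 sources (reproduced from
memory, so a sketch rather than the exact code); note that XLogFileRead() sets
curFileTLI to the TLI of any file it successfully opens:

    /* inside XLogFileReadAnyTLI(): try each expected timeline, newest first */
    foreach(cell, expectedTLIs)
    {
        TimeLineID  tli = (TimeLineID) lfirst_int(cell);

        if (tli < curFileTLI)
            break;              /* don't look below the last file's TLI */

        if (sources & XLOG_FROM_ARCHIVE)
        {
            fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_ARCHIVE, true);
            if (fd != -1)
                return fd;
        }

        if (sources & XLOG_FROM_PG_XLOG)
        {
            fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_PG_XLOG, true);
            if (fd != -1)
                return fd;
        }
    }

Since expectedTLIs has the newest timeline at its head, a recycled file in
pg_xlog whose name carries timeline 3 satisfies the very first iteration.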

What I could not understand is why, in the case where this error happens, the
expectedTLIs list has 3 at its head (for example, when XLogFileReadAnyTLI used
it, as we observed), whereas at other times (when the error does not happen)
it has 2 at its head until all the segments of timeline 2 have been recovered
from. Also, how does a recycled segment get timeline 3 when 2 is expected in
this case?

Is this the right way to look at the problem and its possible fix?

I have tried going through the source regarding this but have not been able
to determine where this could accidentally happen, partly because I do not
understand the recovery process (and its code) very well. Will post if I
find anything useful.

regards,
Amit Langote








Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Kyotaro HORIGUCHI
Date:
Hmm. I think I have caught the tail of the problem..

The script with a small change reproduced the situation for me.

The freshly started standby uses 3 as its TLI after reading the history
file 00000003.history, which it could get from the archive. So the WAL
files recycled on this standby are named with TLI=3.
Even though the LSN of the recycled segment still belongs to TLI=2 on the
master, the standby makes its first request for each segment using that
LSN but TLI=3, because the standby runs with recoveryTargetTLI=3. The
master reasonably doesn't have that segment, and the standby then finds
the wrongly named WAL file in its own pg_xlog directory before a second
request with TLI=2 is ever made.
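
(For reference, and paraphrasing the 9.2 sources from memory rather than
quoting them: recycling just renames an old segment file, and the new name is
built from whatever ThisTimeLineID happens to be at restartpoint time, via
something like the macro below; the page contents are not rewritten.)

    /* xlog_internal.h, 9.2-era: the TLI baked into a segment's file name is
     * simply whatever TLI the caller passes in. */
    #define XLogFileName(fname, tli, log, seg)  \
        snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli, log, seg)

Because the contents are not rewritten, the first page of such a file still
carries the pageaddr of the segment it used to be, which is what
ValidXLOGHeader later trips over.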

In conclusion, the standby should name recycled WAL segments using the
TLI that the master uses for that LSN. Or it should never recycle WAL
files at all..

regards,

--
Kyotaro Horiguchi



Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Heikki Linnakangas
Date:
On 25.04.2013 17:55, Kyotaro HORIGUCHI wrote:
> Hmm. I think that I caught the tail of the problem..
>
> Script with small change reproduced the situation for me.

Can you share the modified script, please?

> The freshly started standby uses 3 as its TLI after reading the history
> file 00000003.history, which it could get from the archive. So the WAL
> files recycled on this standby are named with TLI=3.
> Even though the LSN of the recycled segment still belongs to TLI=2 on the
> master, the standby makes its first request for each segment using that
> LSN but TLI=3, because the standby runs with recoveryTargetTLI=3. The
> master reasonably doesn't have that segment, and the standby then finds
> the wrongly named WAL file in its own pg_xlog directory before a second
> request with TLI=2 is ever made.

I'm not sure I understand what the problem is, though. When the standby 
opens the bogus, recycled file in pg_xlog, it will notice that the 
header is incorrect, and retry reading the file from the archive.

> In conclusion, the standby should name recycled WAL segments using the
> TLI that the master uses for that LSN. Or it should never recycle WAL
> files at all..

Perhaps, but it should nevertheless not get confused by recycled segments.

- Heikki



Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Kyotaro HORIGUCHI
Date:
I forgot to add:

> In conclusion, the standby should name recycled WAL segments using the
> TLI that the master uses for that LSN. Or it should never recycle WAL
> files.

Or the standby should make the first request with the correct TLI by
consulting the timeline history.
Or the standby should retry with smaller TLIs after it gets a broken
segment.

regards,

--
Kyotaro Horiguchi



Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Kyotaro HORIGUCHI
Date:
> Can you share the modified script, please?

Please find the attached files:
  test.sh : test script. The most significant change is the load;
               I used simple inserts instead of pgbench.
               It might need some more adjustment for other environments,
               as usual.
  xlog.c.diff : additional log output I thought would be useful for diagnosis.

> I'm not sure I understand what the problem is, though. When the standby
> opens the bogus, recycled file in pg_xlog, it will notice that the header
> is incorrect, and retry reading the file from the archive.

It looks at the corrupted header only once. After that it keeps retrying
without ever looking at it again.

> Perhaps, but it should nevertheless not get confused by recycled segments.

regards, and good night.
--
Kyotaro Horiguchi

Attachment

Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Heikki Linnakangas
Date:
On 25.04.2013 18:56, Kyotaro HORIGUCHI wrote:
>> Can you share the modified script, please?
>
> Please find the attached files:
>    test.sh : test script. The most significant change is the load;
>                 I used simple inserts instead of pgbench.
>                 It might need some more adjustment for other environments,
>                 as usual.
>    xlog.c.diff : additional log output I thought would be useful for diagnosis.

Ok, thanks, I see what's going on now. The problem is that once
XLogFileRead() finds a file with tli X, it immediately sets curFileTLI =
X. XLogFileReadAnyTLI() never tries to read files with tli < curFileTLI.
So, if recovery finds a file with the right filename, e.g.
000000030000000000000008, it never tries to open
000000020000000000000008 anymore, even if the contents of
000000030000000000000008 later turn out to be bogus.

One idea to fix this is to not set curFileTLI, until the page header on
the just-opened file has been verified. Another idea is to change the
check in XLogFileReadAnyTLI() that currently forbids curFileTLI from
moving backwards. We could allow curFileTLI to move backwards, as long
as the tli is >= ThisTimeLineID (ThisTimeLineID is the current timeline
we're recovering records from).

Attached is a patch for the 2nd approach. With the patch, the test
script works for me. Thoughts?
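
To make the second idea concrete, here is a rough sketch of how the check in
XLogFileReadAnyTLI() would change (my sketch, not the attached patch, and
paraphrasing the 9.2 code from memory):

    foreach(cell, expectedTLIs)
    {
        TimeLineID  tli = (TimeLineID) lfirst_int(cell);

        /*
         * Old rule: never look at a TLI below that of the last file we
         * opened (XLogFileRead sets curFileTLI on success):
         *
         *     if (tli < curFileTLI)
         *         break;
         *
         * Relaxed rule: curFileTLI may move backwards, but never below the
         * timeline we are actually replaying records from.
         */
        if (tli < ThisTimeLineID)
            break;

        /* ... then try XLOG_FROM_ARCHIVE and XLOG_FROM_PG_XLOG as before ... */
    }

That way, if 000000030000000000000008 turns out to contain garbage, a later
retry can still fall back to 000000020000000000000008.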

PS. This wasn't caused by the 9.2.4 change to do crash recovery before
entering archive recovery. The test script fails in the same way with
9.2.3 as well.

- Heikki

Attachment

Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Kyotaro HORIGUCHI
Date:
Thank you for the patch.

The test script finishes successfully with that, and it looks reasonable at a
short glance.

On Fri, Apr 26, 2013 at 4:34 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> One idea to fix this is to not set curFileTLI, until the page header on the
> just-opened file has been verified. Another idea is to change the check in
> XLogFileReadAnyTLI() that currently forbids curFileTLI from moving
> backwards. We could allow curFileTLI to move backwards, as long as the
> tli is >= ThisTimeLineID (ThisTimeLineID is the current timeline we're
> recovering records from).
>
> Attached is a patch for the 2nd approach. With the patch, the test script
> works for me. Thoughts?

I am a bit uncertain whether it is necessary to move curFileTLI for anything
that is read randomly. At a short glance, such random access also occurs when
reading checkpoint-related records.
Also I don't have a clear distinction between lastSegmentTLI and curFileTLI
after the patch is applied. Although, I need to look closer around them to
understand.

> PS. This wasn't caused by the 9.2.4 change to do crash recovery before
> entering archive recovery. The test script fails in the same way with 9.2.3
> as well.

--
Kyotaro Horiguchi

Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Amit Langote
Date:
How would code after applying this patch behave if a recycled segment gets
renamed using the newest timeline (say 3) while we are still recovering from
a lower timeline (say 2)? In that case, since XLogFileReadAnyTLI returns
that recycled segment as the next segment to recover from, we get the error.
And since XLogFileReadAnyTLI iterates over expectedTLIs (whose head seems to
be recoveryTargetTLI at all times, is that right?), it will return that
wrong (recycled) segment in the first iteration itself.

The code for renaming the recycled segment remains unaffected by this
patch, right? That is, the code which assigns timelineID to newly created /
recycled segments is not changed. 

(a snippet from CreateRestartPoint())

/*
 * Update ThisTimeLineID to the recovery target timeline, so that
 * we install any recycled segments on the correct timeline.
 */

>>>> ThisTimeLineID = GetRecoveryTargetTLI();  <<<<<

RemoveOldXlogFiles(_logId, _logSeg, endptr);

/*
 * Make more log segments if needed.  (Do this after recycling old log
 * segments, since that may supply some of the needed files.)
 */

PreallocXlogFiles(endptr);


So, would we still get a similar behavior (wherein the wrongly named
recycled segments are produced) and cause the error to happen again? Do we
need to look here?

What do you think? 



Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Heikki Linnakangas
Date:
On 26.04.2013 07:02, Kyotaro HORIGUCHI wrote:
> I am a bit uncertain whether it is necessary to move curFileTLI for anything
> that is read randomly. At a short glance, such random access also occurs
> when reading checkpoint-related records.

I didn't understand that.

> Also I don't have a clear distinction between lastSegmentTLI and curFileTLI
> after the patch is applied.

In short, lastSegmentTLI is the TLI in the first page of the last opened 
segment, while curFileTLI is the TLI in the filename of the last opened 
segment. Usually they're the same, but in a segment that contains a 
timeline switch, they differ. For example, if you perform PITR to a 
point in the middle of segment 000000010000000000000062, you end up with 
two segments:

000000010000000000000062
000000020000000000000062

The first halves of those files are identical, but the latter contains a 
timeline-changing checkpoint record in the middle, and from that point 
on the contents are different. When we open file 
000000020000000000000062, lastSegmentTLI is 1, because the first half of 
that segment contains WAL from timeline 1, but curFileTLI is 2, because 
that's the TLI in the filename.
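
A minimal illustration of where the two values come from (my paraphrase, not
code from the attached patch):

    /* TLI taken from the name of the file we decided to open;
     * XLogFileRead() sets this on success. */
    curFileTLI = tli;

    /* TLI recorded in the header of the segment's first page. */
    lastSegmentTLI = ((XLogPageHeader) readBuf)->xlp_tli;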

- Heikki



Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Amit Langote
Date:
What would happen if a recycled segment gets assigned a newer timeline than
the one we are currently recovering from? In the reported erroneous
behavior, that is what happens, causing XLogFileReadAnyTLI() to return such a
bogus segment and hence the error. Since expectedTLIs contains the newer
timeline at its head, and such a wrongly recycled segment (that is, one
bearing a newer timeline than curFileTLI) exists in pg_xlog,
XLogFileReadAnyTLI() returns the fd of that segment, causing the error in
question. On the next retry the same thing happens, preventing standby
startup from proceeding any further.

So, should we also look at the segment recycling code to see how it names
the recycled segments (what timelineID it uses)? For example, in
CreateRestartPoint(), where segments are recycled?






Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Heikki Linnakangas
Date:
On 26.04.2013 07:47, Amit Langote wrote:
>   How would code after applying this patch behave if a recycled segment gets
> renamed using the newest timeline (say 3) while we are still recovering from
> a lower timeline (say 2)? In that case, since XLogFileReadAnyTLI returns
> that recycled segment as the next segment to recover from, we get the error.
> And since XLogFileReadAnyTLI iterates over expectedTLIs (whose head seems to
> be recoveryTargetTLI at all times, is that right?), it will return that
> wrong (recycled) segment in the first iteration itself.

As long as the right segment is present in the archive, that's OK. Even
if a recycled segment with higher TLI is in pg_xlog, with the patch
we'll still read the segment with lower TLI from the archive. But there
is a corner-case where a recycled segment with a higher TLI masks a
segment with lower TLI in pg_xlog. For example, if you try to recover by
copying all the required WAL files directly into pg_xlog, without using
restore_command, you could run into problems.

So yeah, I think you're right and we need to rethink the recycling. The
first question is, do we have to recycle WAL segments during recovery at
all? It's pointless when we're restoring from archive with
restore_command; the recycled files will just get replaced with files
from the archive. It does help when walreceiver is active, but I wonder
how significant it is in practice.

I guess the safest, smallest change is to use a lower TLI when
installing the recycled files. So, instead of using the current recovery
target timeline, use the ID of the timeline we're currently recovering.
That way the recycled segments will never have a higher TLI than other
files that recovery will try to replay. See attached patch.
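
In terms of the CreateRestartPoint() snippet quoted earlier, the intent is
roughly the following (a sketch only, not the attached patch; replayTLI is a
stand-in for however the currently-replayed timeline is obtained):

    /*
     * Install recycled segments on the timeline we are currently replaying,
     * not on the recovery target timeline, so that their names can never
     * mask WAL that recovery still has to fetch.
     */
    ThisTimeLineID = replayTLI;     /* was: GetRecoveryTargetTLI() */

    RemoveOldXlogFiles(_logId, _logSeg, endptr);
    PreallocXlogFiles(endptr);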

- Heikki

Attachment

Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
KONDO Mitsumasa
Date:
Hi,

I discovered the cause of the problem. I think that Horiguchi's discovery is a different problem...
The problem is in CreateRestartPoint. In recovery mode, PG should not write WAL records,
because PG does not know the latest WAL location.
But in this problem case, PG (the standby) writes WAL files at a restartpoint during archive recovery.
In recovery mode, I think a restartpoint should write only the MinRecoveryPoint.

Here is the Standby's pg_xlog directory when the problem occurred.
> [mitsu-ko@localhost postgresql-9.2.4-c]$ ls Standby/pg_xlog/
> 000000020000000000000003  000000020000000000000007  00000002000000000000000B  00000003.history
> 000000020000000000000004  000000020000000000000008  00000002000000000000000C  00000003000000000000000E
> 000000020000000000000005  000000020000000000000009  00000002000000000000000D  00000003000000000000000F
> 000000020000000000000006  00000002000000000000000A  00000002000000000000000E  archive_status

The problem case is here:
> [Standby] 2013-04-26 04:26:44 EDT DEBUG:  00000: attempting to remove WAL segments older than log file
000000030000000000000002
> [Standby] 2013-04-26 04:26:44 EDT LOCATION:  RemoveOldXlogFiles, xlog.c:3568
> [Standby] 2013-04-26 04:26:44 EDT DEBUG:  00000: recycled transaction log file "000000010000000000000002"
> [Standby] 2013-04-26 04:26:44 EDT LOCATION:  RemoveOldXlogFiles, xlog.c:3607
> [Standby] 2013-04-26 04:26:44 EDT DEBUG:  00000: recycled transaction log file "000000020000000000000002"
> [Standby] 2013-04-26 04:26:44 EDT LOCATION:  RemoveOldXlogFiles, xlog.c:3607
> [Standby] 2013-04-26 04:26:44 EDT LOG:  00000: restartpoint complete: wrote 9 buffers (0.2%); 0 transaction log
file(s)added, 0 removed, 2 recycled; write=0.601 s, sync=1.178 s, total=1.781 s; sync files=3, longest=1.176 s,
average=0.392s 
> [Standby] 2013-04-26 04:26:44 EDT LOCATION:  LogCheckpointEnd, xlog.c:7893
> [Standby] 2013-04-26 04:26:44 EDT LOG:  00000: recovery restart point at 0/90FE448
> [Standby] 2013-04-26 04:26:44 EDT DETAIL:  last completed transaction was at log time 2013-04-26 04:25:53.203725-04
> [Standby] 2013-04-26 04:26:44 EDT LOCATION:  CreateRestartPoint, xlog.c:8601
> [Standby] 2013-04-26 04:26:44 EDT LOG:  00000: restartpoint starting: xlog
> [Standby] 2013-04-26 04:26:44 EDT LOCATION:  LogCheckpointStart, xlog.c:7821
> cp: cannot stat `../arc/00000003000000000000000F': そのようなファイルやディレクトリはありません
> [Standby] 2013-04-26 04:26:44 EDT DEBUG:  00000: could not restore file "00000003000000000000000F" from archive:
returncode 256 
> [Standby] 2013-04-26 04:26:44 EDT LOCATION:  RestoreArchivedFile, xlog.c:3323
> [Standby] 2013-04-26 04:26:44 EDT LOG:  00000: unexpected pageaddr 0/2000000 in log file 0, segment 15, offset 0
> [Standby] 2013-04-26 04:26:44 EDT LOCATION:  ValidXLOGHeader, xlog.c:4395
> cp: cannot stat `../arc/00000003000000000000000F': そのようなファイルやディレクトリはありません
> [Standby] 2013-04-26 04:26:44 EDT DEBUG:  00000: could not restore file "00000003000000000000000F" from archive:
returncode 256 

In recovery, PG normally looks for the WAL file in the archive first.
If the proper WAL file is not there (the archive command failed), it next searches the pg_xlog directory.
Normally the properly named WAL file does not exist in pg_xlog, but in this case a WAL file with the proper name (though with wrong, bogus contents) does exist in pg_xlog.
So recovery fails and the Standby is broken.

So I fixed CreateRestartPoint at the branch that updates the MinRecoveryPoint.
It seems to fix this problem, but the attached patch is only a rough one.


Best Regard,
--
NTT Open Source Software Center
Mitsumasa KONDO

Attachment

Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Mitsumasa KONDO
Date:
Let me explain this problem in more detail.

This problem is caused by the restartpoint creating a bogus WAL file during archive recovery. But I cannot see why the bogus WAL file is created in CreateRestartPoint(). My attached patch is really rough…

In the problem case, the first check in XLogFileReadAnyTLI() does not get an fd for the WAL file, because the properly named WAL file does not exist in the archive directory.

XLogFileReadAnyTLI()
>     if (sources & XLOG_FROM_ARCHIVE)
>     {
>       fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_ARCHIVE, true);
>        if (fd != -1)
>        {
>           elog(DEBUG1, "got WAL segment from archive");
>           return fd;
>        }
>     }

Next it searches for the WAL file in pg_xlog. The bogus WAL file is there in pg_xlog, and its fd is returned.

XLogFileReadAnyTLI()
>      if (sources & XLOG_FROM_PG_XLOG)
>      {
>         fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_PG_XLOG, true);
>         if (fd != -1)
>            return fd;
>      }

The returned fd becomes the readFile value. Of course readFile is now >= 0, so we break out of the for-loop.

XLogPageRead
>              readFile = XLogFileReadAnyTLI(readId, readSeg, DEBUG2,
>                                      sources);
>               switched_segment = true;
>               if (readFile >= 0)
>                  break;

Next comes the point where the problem shows up: the bogus WAL file is read, and validation fails.

XLogPageRead
>   if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
>  {
>      ereport(emode_for_corrupt_record(emode, *RecPtr),
>            (errcode_for_file_access(),
>       errmsg("could not seek in log file %u, segment %u to offset %u: %m",
>            readId, readSeg, readOff)));
>      goto next_record_is_invalid;
>   }
>   if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
>   {
>      ereport(emode_for_corrupt_record(emode, *RecPtr),
>            (errcode_for_file_access(),
>       errmsg("could not read from log file %u, segment %u, offset %u: %m",
>            readId, readSeg, readOff)));
>      goto next_record_is_invalid;
>   }
>   if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
>      goto next_record_is_invalid;


I think that Horiguchi's discovery concerns what happens after this point.
We must fix CreateRestartPoint() so that it does not create bogus WAL files.

Best regards,

--
Mitsumasa KONDO 

Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
Heikki Linnakangas
Date:
On 26.04.2013 11:51, KONDO Mitsumasa wrote:
> Hi,
>
> I discovered the cause of the problem. I think that Horiguchi's discovery
> is a different problem...
> The problem is in CreateRestartPoint. In recovery mode, PG should not write
> WAL records, because PG does not know the latest WAL location.
> But in this problem case, PG (the standby) writes WAL files at a
> restartpoint during archive recovery.
> In recovery mode, I think a restartpoint should write only the
> MinRecoveryPoint.
>
> Here is the Standby's pg_xlog directory when the problem occurred.
>> [mitsu-ko@localhost postgresql-9.2.4-c]$ ls Standby/pg_xlog/
>> 000000020000000000000003 000000020000000000000007
>> 00000002000000000000000B 00000003.history
>> 000000020000000000000004 000000020000000000000008
>> 00000002000000000000000C 00000003000000000000000E
>> 000000020000000000000005 000000020000000000000009
>> 00000002000000000000000D 00000003000000000000000F
>> 000000020000000000000006 00000002000000000000000A
>> 00000002000000000000000E archive_status
>
> The problem case is here:
>> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: attempting to remove
>> WAL segments older than log file 000000030000000000000002
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: RemoveOldXlogFiles,
>> xlog.c:3568
>> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: recycled transaction
>> log file "000000010000000000000002"
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: RemoveOldXlogFiles,
>> xlog.c:3607
>> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: recycled transaction
>> log file "000000020000000000000002"
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: RemoveOldXlogFiles,
>> xlog.c:3607
>> [Standby] 2013-04-26 04:26:44 EDT LOG: 00000: restartpoint complete:
>> wrote 9 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 2
>> recycled; write=0.601 s, sync=1.178 s, total=1.781 s; sync files=3,
>> longest=1.176 s, average=0.392 s
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: LogCheckpointEnd, xlog.c:7893
>> [Standby] 2013-04-26 04:26:44 EDT LOG: 00000: recovery restart point
>> at 0/90FE448
>> [Standby] 2013-04-26 04:26:44 EDT DETAIL: last completed transaction
>> was at log time 2013-04-26 04:25:53.203725-04
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: CreateRestartPoint,
>> xlog.c:8601
>> [Standby] 2013-04-26 04:26:44 EDT LOG: 00000: restartpoint starting: xlog
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: LogCheckpointStart,
>> xlog.c:7821
>> cp: cannot stat `../arc/00000003000000000000000F': そのようなファイル
>> やディレクトリはありません
>> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: could not restore file
>> "00000003000000000000000F" from archive: return code 256
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: RestoreArchivedFile,
>> xlog.c:3323
>> [Standby] 2013-04-26 04:26:44 EDT LOG: 00000: unexpected pageaddr
>> 0/2000000 in log file 0, segment 15, offset 0
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: ValidXLOGHeader, xlog.c:4395
>> cp: cannot stat `../arc/00000003000000000000000F': そのようなファイル
>> やディレクトリはありません
>> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: could not restore file
>> "00000003000000000000000F" from archive: return code 256
>
> In recovery, PG normally looks for the WAL file in the archive first.
> If the proper WAL file is not there (the archive command failed), it next
> searches the pg_xlog directory.
> Normally the properly named WAL file does not exist in pg_xlog, but in this
> case a WAL file with the proper name (though with wrong, bogus contents)
> does exist in pg_xlog.
> So recovery fails and the Standby is broken.
>
> So I fixed CreateRestartPoint at the branch that updates the
> MinRecoveryPoint.
> It seems to fix this problem, but the attached patch is only a rough one.

I didn't understand this. I committed a fix for the issue where recycled
WAL files get in the way of recovery, but I'm not sure if you're
describing the same issue or something else. But before we dig any
deeper into this, can you verify if you're still seeing any issues on
9.3beta1?

- Heikki



Re: Failing start-up archive recovery at Standby mode in PG9.2.4

From
KONDO Mitsumasa
Date:
(2013/05/07 22:40), Heikki Linnakangas wrote:
> On 26.04.2013 11:51, KONDO Mitsumasa wrote:
>> So I fixed CreateRestartPoint at the branch that updates the
>> MinRecoveryPoint.
>> It seems to fix this problem, but the attached patch is only a rough one.
>
> I didn't understand this. I committed a fix for the issue where recycled WAL
> files get in the way of recovery, but I'm not sure if you're describing the same
> issue or something else. But before we dig any deeper into this, can you verify
> if you're still seeing any issues on 9.3beta1?
Yes, my fix was very rough and it patched the wrong place. I have seen your
patch, and your fix is right. I posted another problem, where promotion and
PITR do not work. It might be a different problem, but I could not reproduce
it...
Please wait a bit more! I will analyze and test 9.2.4 with your patch, and
9.3beta1.

Best regards,
-- 
NTT Open Source Software Center
Mitsumasa KONDO