PITR COPY Failure (was Point in Time Recovery) - Mailing list pgsql-hackers

From Mark Kirkwood
Subject PITR COPY Failure (was Point in Time Recovery)
Date
Msg-id 40FA0F93.8010607@coretech.co.nz
Whole thread Raw
In response to Re: Point in Time Recovery  (markw@osdl.org)
Responses Re: PITR COPY Failure (was Point in Time Recovery)  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: PITR COPY Failure (was Point in Time Recovery)  (Mark Kirkwood <markir@coretech.co.nz>)
Re: PITR COPY Failure (was Point in Time Recovery)  (Mark Kirkwood <markir@coretech.co.nz>)
List pgsql-hackers
I decided to produce a nice simple example, so that anyone could 
hopefully replicate what I am seeing.

The scenario is the same as before (the 11 steps), but the CREATE TABLE 
and COPY step has been reduced to:

CREATE TABLE test0 (filler  VARCHAR(120));
COPY test0 FROM '/data0/dump/test0.dat' USING DELIMITERS ',';

Now the file 'test0.dat' consists of (128293) identical lines, each of 
109 'a' charactors (plus end of line)


A script to run the whole business can be found here :

http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz

(It will need a bit of editing for things like location of Pg, PGDATA, 
and you will need to make your own data file)

The main points of interest are:
- anything <=128392 rows in test0.dat results in 1 archived log, and the 
recovery succeeds
- anything >=128393 rows in test0.dat results in 2 or more archived 
logs, and recovery fails on the second log (and gives the zero length 
redo at 0/1FFFFE0 message).

Let me know if I can do any more legwork on this (I am considering 
re-compiling with WAL_DEBUG now that example is simpler)

regards

Mark

Simon Riggs wrote:

>On Thu, 2004-07-15 at 10:47, Mark Kirkwood wrote:
>  
>
>>I tried what I thought was a straightforward scenario, and seem to have 
>>broken it :-(
>>
>>Here is the little tale
>>
>>1) initdb
>>2) set archive_mode and archive_dest in postgresql.conf
>>3) startup
>>4) create database called 'test'
>>5) connect to 'test' and type 'checkpoint'
>>6) backup PGDATA using 'tar -zcvf'
>>7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
>>8) shutdown and remove PGDATA
>>9)  recover using 'tar -zxvf'
>>10) copy recovery.conf into PGDATA
>>11) startup
>>
>>This is what I get :
>>
>>LOG:  database system was interrupted at 2004-07-15 21:24:04 NZST
>>LOG:  recovery command file found...
>>LOG:  restore_program = cp %s/%s %s
>>LOG:  recovery_target_inclusive = true
>>LOG:  recovery_debug_log = true
>>LOG:  starting archive recovery
>>LOG:  restored log file "0000000000000000" from archive
>>LOG:  checkpoint record is at 0/A48054
>>LOG:  redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
>>LOG:  next transaction ID: 496; next OID: 25419
>>LOG:  database system was not properly shut down; automatic recovery in 
>>progress
>>LOG:  redo starts at 0/A48094
>>LOG:  restored log file "0000000000000001" from archive
>>LOG:  record with zero length at 0/1FFFFE0
>>LOG:  redo done at 0/1FFFF30
>>LOG:  restored log file "0000000000000001" from archive
>>LOG:  restored log file "0000000000000001" from archive
>>PANIC:  concurrent transaction log activity while database system is 
>>shutting down
>>LOG:  startup process (PID 13492) was terminated by signal 6
>>LOG:  aborting startup due to startup process failure
>>
>>The concurrent access is a bit of a puzzle, as this is my home machine 
>>(i.e. I am *sure* noone else is connected!)
>>
>>    
>>
>
>I can see what is wrong now, but you'll have to help me on details your
>end...
>
>The log shows that xlog 1 was restored from archive. It contains a zero
>length record, which indicates that it isn't yet full (or thats what the
>existing recovery code assumes it means). Which also indicates that it
>should never have been archived in the first place, and should not
>therefore be a candidate for a restore from archive.
>
>The double message "restored log file" can only occur after you've
>retrieved a partially full file from archive - which as I say, shouldn't
>be there.
>
>Other messages are essentially spurious in those circumstances.
>
>Either:
>- somehow the files have been mixed up in the archive directory, which
>is possible if the filing discipline is not strict - various ways,
>unfortunately I would guess this to be the most likely, somehow
>- the file that has been restored has been damaged in some way
>- the archiver has archived a file too early (very unlikely, IMHO -
>thats the most robust bit of the code)
>- some aspect of the code has written a zero length record to WAL (which
>is supposed to not be possible, but we musn't discount an error in
>recent committed work)
>
>- there may also be an effect going on with checkpoints that I don't
>understand...spurious checkpoint warning messages have already been
>observed and reported,
>
>Best regards, Simon Riggs
>
>
>
>
>  
>


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: NT + deadlock intended behaviour ?
Next
From: Alvaro Herrera
Date:
Subject: Re: NT + deadlock intended behaviour ?