Re: ERROR: could not open relation base/2757655/6930168: No such file or directory -- during warm standby setup - Mailing list pgsql-general

From bricklen
Subject Re: ERROR: could not open relation base/2757655/6930168: No such file or directory -- during warm standby setup
Date
Msg-id AANLkTinLZ=9J4TV+gRo=vgnKVAsQ6gG5CpGqeVtJOa47@mail.gmail.com
Whole thread Raw
In response to Re: ERROR: could not open relation base/2757655/6930168: No such file or directory -- during warm standby setup  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: ERROR: could not open relation base/2757655/6930168: No such file or directory -- during warm standby setup  (bricklen <bricklen@gmail.com>)
List pgsql-general
On Wed, Dec 29, 2010 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> bricklen <bricklen@gmail.com> writes:
>> On Wed, Dec 29, 2010 at 11:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> What can you tell us about what was happening on the source DB while
>>> the backup was being taken?
>
>> The source db has between 1000 and 3000 transactions/s, so is
>> reasonably volatile. The two tables in question are not accessed very
>> heavily though.
>
>> Looking at the ctid and xmin between both databases, no, they don't
>> seem to match exactly. Pardon my ignorance, but would those have
>> changed due to vacuums, analyze, or any other forms of access?
>
> The difference in ctid, and the values of xmin and relfrozenxid,
> seems to confirm my suspicion that this wasn't just random cosmic rays.
> You did something on the source DB that rewrote the table with a new
> relfilenode (possibly CLUSTER or some form of ALTER TABLE; plain VACUUM
> or ANALYZE wouldn't do it).  And for some reason the standby hasn't
> picked up that change in the pg_class row.  I suspect the explanation
> is that your technique for setting up the standby is flawed.  You can't
> just rsync and have a valid snapshot of the DB --- you need to be sure
> that enough WAL gets replayed to fix any inconsistencies arising from
> the time-extended nature of the rsync operation.  But you didn't say
> exactly how you did that.
>

Definitely no CLUSTER commands were issued, and there should have been
no ALTER commands issued (yesterday was a holiday, no one was here).
Would a TRUNCATE have the same effect though? I grep'd through our
application, and it appears that at least 3 tables get truncated, one
of them several times per hour. The often-truncated table wasn't one
of the bad ones, but the others are the ones I've already identified
as non-existent.


The standby setup process was

On source server:
- start shipping WAL files to standby @ /var/lib/pgsql/WAL_Archive dir
- "select pg_start_backup('standby')"
- rsync -av -e "ssh -p 9001" --progress --partial -z
/var/lib/pgsql/data postgres@standby-tunnel:/var/lib/pgsql/
- "select pg_stop_backup()"

On standby
- recovery.conf file:
restore_command = 'pg_standby -d -s 2 -t /WAL_Archive/go_live
/WAL_Archive %f %p %r 2>>/var/lib/pgsql/data/pg_log/standby.log'

Note: /WAL_Archive is a symlink to /var/lib/pgsql/WAL_Archive, that
didn't get touched though, as it is on the same directory level as the
data dir.

pg_ctl -D /var/lib/pgsql/data start
touch /var/lib/pgsql/WAL_Archive/go_live

Looking over the log of the restore process, there are a couple
messages that might be helpful in pinpointing the problem area.


2010-12-29 08:07:34 PST [16589]: [2-1] (user=) (rhost=) LOG:  starting
archive recovery
2010-12-29 08:07:34 PST [16589]: [3-1] (user=) (rhost=) LOG:
restore_command = 'pg_standby -d -s 2 -t /WAL_Archive/go_live
/WAL_Archive %f %p %r 2>>/var/lib/pgsql/data/pg_log/standby.log'
2010-12-29 08:07:34 PST [16589]: [4-1] (user=) (rhost=) LOG:
recovery_end_command = 'rm -f /WAL_Archive/go_live'
2010-12-29 08:07:46 PST [16589]: [5-1] (user=) (rhost=) LOG:  restored
log file "0000000100001FB400000080" from archive
2010-12-29 08:07:46 PST [16589]: [6-1] (user=) (rhost=) LOG:
automatic recovery in progress
2010-12-29 08:07:46 PST [16589]: [7-1] (user=) (rhost=) LOG:  restored
log file "0000000100001FB40000007B" from archive
2010-12-29 08:07:46 PST [16589]: [8-1] (user=) (rhost=) LOG:  redo
starts at 1FB4/7BD68388
2010-12-29 08:07:46 PST [16589]: [9-1] (user=) (rhost=) LOG:
consistent recovery state reached
2010-12-29 08:07:46 PST [16589]: [10-1] (user=) (rhost=) LOG:  file
"pg_clog/0494" doesn't exist, reading as zeroes
2010-12-29 08:07:46 PST [16589]: [11-1] (user=) (rhost=) CONTEXT:
xlog redo commit: 2010-12-28 22:35:18.152971-08
2010-12-29 08:08:21 PST [16589]: [12-1] (user=) (rhost=) LOG:
restored log file "0000000100001FB40000007C" from archive
2010-12-29 08:08:22 PST [16589]: [13-1] (user=) (rhost=) LOG:
restored log file "0000000100001FB40000007D" from archive
2010-12-29 08:08:22 PST [16589]: [14-1] (user=) (rhost=) LOG:
restored log file "0000000100001FB40000007E" from archive
... thousands of WAL logs later
2010-12-29 08:43:51 PST [16589]: [2529-1] (user=) (rhost=) LOG:
restored log file "0000000100001FBE0000005B" from archive
2010-12-29 08:43:59 PST [16589]: [2530-1] (user=) (rhost=) LOG:
restored log file "0000000100001FBE0000005C" from archive
2010-12-29 08:43:59 PST [16589]: [2531-1] (user=) (rhost=) LOG:  could
not open file "pg_xlog/0000000100001FBE0000005D" (log file 8126,
segment 93): No such file or directory
2010-12-29 08:43:59 PST [16589]: [2532-1] (user=) (rhost=) LOG:  redo
done at 1FBE/5CFFEA58
2010-12-29 08:43:59 PST [16589]: [2533-1] (user=) (rhost=) LOG:  last
completed transaction was at log time 2010-12-29 08:44:14.784874-08
2010-12-29 08:43:59 PST [16589]: [2534-1] (user=) (rhost=) LOG:
restored log file "0000000100001FBE0000005C" from archive
2010-12-29 08:44:12 PST [16589]: [2535-1] (user=) (rhost=) LOG:
selected new timeline ID: 2
2010-12-29 08:44:24 PST [16589]: [2536-1] (user=) (rhost=) LOG:
archive recovery complete
2010-12-29 08:44:24 PST [25155]: [1-1] (user=) (rhost=) LOG:
autovacuum launcher started
2010-12-29 08:44:24 PST [16587]: [2-1] (user=) (rhost=) LOG:  database
system is ready to accept connections

Do those log entries look suspicious?

pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: ERROR: could not open relation base/2757655/6930168: No such file or directory -- during warm standby setup
Next
From: bricklen
Date:
Subject: B-tree + sorting + unique constraint