Thread: Log archiving failing. Seems to be wrong timeline

From: Chris Lewis

Hello,

We have two PostgreSQL servers (v9.4.2), a master and a slave, in
streaming replication. The overall cluster is controlled using Pacemaker
and Corosync, with the pgsql cluster agent handling failover to, and
promotion of, the slave.

Recently a failover occurred and I noticed that log archiving was failing
on the master:

cp: cannot stat 'pg_xlog/000000020000000000000002': No such file or directory
2016-06-30 11:49:48 BST [13816]: [1235-1] db=,user=,client= LOG: archive command failed with exit code 1
2016-06-30 11:49:48 BST [13816]: [1236-1] db=,user=,client= DETAIL: The failed archive command was: cp pg_xlog/000000020000000000000002 /mnt/pgsql/data/pg_archive/000000020000000000000002
cp: cannot stat 'pg_xlog/000000020000000000000002': No such file or directory
2016-06-30 11:49:49 BST [13816]: [1237-1] db=,user=,client= LOG: archive command failed with exit code 1
2016-06-30 11:49:49 BST [13816]: [1238-1] db=,user=,client= DETAIL: The failed archive command was: cp pg_xlog/000000020000000000000002 /mnt/pgsql/data/pg_archive/000000020000000000000002
2016-06-30 11:49:49 BST [13816]: [1239-1] db=,user=,client= WARNING: archiving transaction log file "000000020000000000000002" failed too many times, will try again later


But the timeline we're on is different:

# /usr/lib/postgresql/9.4/bin/pg_controldata /mnt/pgsql/data
pg_control version number:            942
Catalog version number:               201409291
Database system identifier:           6198394727571912088
Database cluster state:               in production
pg_control last modified:             Thu 30 Jun 2016 11:42:42 BST
Latest checkpoint location:           2/EEE842E8
Prior checkpoint location:            2/EED64F68
Latest checkpoint's REDO location:    2/EEE4B610
Latest checkpoint's REDO WAL file:    0000002C00000002000000EE
Latest checkpoint's TimeLineID:       44
Latest checkpoint's PrevTimeLineID:   44
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/2947680
Latest checkpoint's NextOID:          74375
Latest checkpoint's NextMultiXactId:  464
Latest checkpoint's NextMultiOffset:  929
Latest checkpoint's oldestXID:        677
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  2947680
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Time of latest checkpoint:            Thu 30 Jun 2016 11:42:27 BST
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
Current wal_level setting:            hot_standby
Current wal_log_hints setting:        off
Current max_connections setting:      250
Current max_worker_processes setting: 8
Current max_prepared_xacts setting:   10
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0


Why are we trying to archive logs which belong to an old timeline?
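
For readers comparing the two file names: a WAL segment name is 24 hexadecimal digits, of which the first 8 are the timeline ID, the next 8 the log number, and the last 8 the segment number. A minimal sketch of decoding them follows; the function name parse_wal_segment_name is made up for illustration and is not part of PostgreSQL.

```python
import re

# 24 upper-case hex digits: timeline (8) + log (8) + segment (8).
WAL_NAME = re.compile(r"[0-9A-F]{24}")


def parse_wal_segment_name(name):
    """Split a WAL segment file name into (timeline, log, segment) integers.

    Hypothetical helper for illustration only.
    """
    if not WAL_NAME.fullmatch(name):
        raise ValueError("not a WAL segment file name: %r" % name)
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return timeline, log, segment


# The segment the archiver keeps retrying, from the log output above.
print(parse_wal_segment_name("000000020000000000000002"))  # (2, 0, 2)
# The REDO WAL file reported by pg_controldata.
print(parse_wal_segment_name("0000002C00000002000000EE"))  # (44, 2, 238)
```

So the file the archiver is asking for belongs to timeline 2, while the running cluster is on timeline 44 (0x2C), which matches the TimeLineID shown by pg_controldata.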

Any thoughts much appreciated.

Regards

Chris






Re: Log archiving failing. Seems to be wrong timeline

From: Jeff Janes

On Thu, Jun 30, 2016 at 3:53 AM, Chris Lewis <clewis@inview.co.uk> wrote:
> Hello,
>
> We have 2 postgresql servers (v 9.4.2)  master and slave in streaming
> replication. The overall cluster is controlled using pacemaker & corosync
> and the pgsql cluster agent which handles failover to, and promotion of, the
> slave.
>
> Recently a failover occurred and I noticed that log archiving was failing on
> the master:

...

>
> Why are we trying to archive logs which belong to an old timeline?

Just because the timeline is old doesn't mean we want to destroy it.
After all, the reason for having timelines in the first place is to
preserve, not to destroy.

It sounds like someone removed the old timeline's log files from
pg_xlog, but did not remove the corresponding .ready files from
pg_xlog/archive_status.

If the old timeline's files are truly lost, then you will have to
carefully remove those corresponding .ready files.
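
The archiver decides what to archive from the .ready files in pg_xlog/archive_status, so a .ready file whose segment is gone will be retried indefinitely. Before removing anything by hand, it may help to list exactly which .ready files have no matching segment. The following is only a sketch under that assumption: the function name find_orphaned_ready_files is hypothetical, it only reports and never deletes, and it deliberately ignores non-segment entries such as .history or .backup files.

```python
import os
import re
import sys

# A WAL segment file name is exactly 24 upper-case hex digits.
WAL_NAME = re.compile(r"[0-9A-F]{24}")


def find_orphaned_ready_files(pg_xlog_dir):
    """Return names of .ready files whose WAL segment is missing.

    Hypothetical helper for illustration; it does not modify anything.
    """
    status_dir = os.path.join(pg_xlog_dir, "archive_status")
    orphans = []
    for entry in sorted(os.listdir(status_dir)):
        if not entry.endswith(".ready"):
            continue  # .done files are already archived
        segment = entry[: -len(".ready")]
        if not WAL_NAME.fullmatch(segment):
            continue  # skip timeline history and backup label entries
        if not os.path.exists(os.path.join(pg_xlog_dir, segment)):
            orphans.append(entry)
    return orphans


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python find_orphans.py /mnt/pgsql/data/pg_xlog
    for name in find_orphaned_ready_files(sys.argv[1]):
        print(name)
```

Reviewing the printed list first, and checking that those segments really are absent from the archive as well, keeps the manual cleanup limited to files that can no longer be archived.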

Cheers,

Jeff


Re: Log archiving failing. Seems to be wrong timeline

From: Chris Lewis

Hi Jeff,

Done as you advised and now things are working again.

Many thanks

Chris

