Re: BUG #7500: hot-standby replica crash after an initial rsync - Mailing list pgsql-bugs

From Maxim Boguk
Subject Re: BUG #7500: hot-standby replica crash after an initial rsync
Date
Msg-id CAK-MWwQdxsE3fvi2HzHeiAr9sGNp3ESx=B6gD=TXqZW6e2mf8A@mail.gmail.com
Whole thread Raw
In response to Re: BUG #7500: hot-standby replica crash after an initial rsync  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: BUG #7500: hot-standby replica crash after an initial rsync
List pgsql-bugs
> > Does that mean that file was damaged during rsync?
> Not necessarily. When did you initially set up that cluster? Normally the
> file
> should get zeroed out before its being used. If the cluster was copied
> improperly (i.e. no pg_start/stop backup or such) it could easily happen.
> But
> I would guess that you did it properly?
>

That server was master before planned switchover to replica server was
performed.
Procedure:
1)stop former master
2)trigger replica failover (thus server become new master)

There are unusual part because human error from system administrator:
2.1)start former master
2.2)stop former master

3)empty pg_xlog directory on former master
4)pg_start_backup on new master
5)re-resync datafiles to former master using:
rsync -aq --delete --exclude pg_xlog /db/data [former master]:/db/
6)pg_stop_backup on new master
7)start former master as replica...

that procedure used more than 10 times without any problem in the past.
May be I missing something in it.



>
> Could you show pg_controldata output for primary/replica? To see how new
> that
> multixact offset is.
>

On failing replica:
[pgsql@db1 ~/tmp/postgresql-9.0.8]$ pg_controldata /db/data_back
pg_control version number:            903
Catalog version number:               201008051
Database system identifier:           5687760461964967531
Database cluster state:               in archive recovery
pg_control last modified:             Wed Aug 22 13:47:35 2012
Latest checkpoint location:           925/2D41CAA0
Prior checkpoint location:            925/2D41CAA0
Latest checkpoint's REDO location:    925/2C0004C8
Latest checkpoint's TimeLineID:       5
Latest checkpoint's NextXID:          0/1768511856
Latest checkpoint's NextOID:          59880938
Latest checkpoint's NextMultiXactId:  361788644
Latest checkpoint's NextMultiOffset:  822580965
Latest checkpoint's oldestXID:        1569182525
Latest checkpoint's oldestXID's DB:   16424
Latest checkpoint's oldestActiveXID:  1768498919
Time of latest checkpoint:            Tue Aug 21 08:21:42 2012
Minimum recovery ending location:     925/2C0004C8
Backup start location:                925/2C0004C8
Current wal_level setting:            hot_standby
Current max_connections setting:      300
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   2048
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value

On master no sense to look into it because master 1 day ahead of that
replica.
It's a production system.



>
> > What next I should examine?
> If you could give me a DEBUG3 log from startup till it crashes it would be
> helpful. Please #define MULTIXACT_DEBUG ontop of multixact.c before doing
> so.
> Thats probably going to be to big to go through the mailinglist.
>

I see nothing new after I rebuit postgres with that define   (
/usr/local/pgsql/tmp/postgresql/bin/postgres -d 3 -D /db/data_back/ ).
Just:

LOCATION:  readRecoveryCommandFile, xlog.c:5269
DEBUG:  00000: primary_conninfo = 'host=10.12.22.19 port=5433 user=replica
password=acilper'
LOCATION:  readRecoveryCommandFile, xlog.c:5276
DEBUG:  00000: trigger_file = '/db/data/failover'
LOCATION:  readRecoveryCommandFile, xlog.c:5283
LOG:  00000: entering standby mode
LOCATION:  StartupXLOG, xlog.c:5827
DEBUG:  00000: checkpoint record is at 925/2D41CAA0
LOCATION:  StartupXLOG, xlog.c:5894
DEBUG:  00000: redo record is at 925/2C0004C8; shutdown FALSE
LOCATION:  StartupXLOG, xlog.c:5929
DEBUG:  00000: next transaction ID: 0/1768511856; next OID: 59880938
LOCATION:  StartupXLOG, xlog.c:5933
DEBUG:  00000: next MultiXactId: 361788644; next MultiXactOffset: 822580965
LOCATION:  StartupXLOG, xlog.c:5936
DEBUG:  00000: oldest unfrozen transaction ID: 1569182525, in database 16424
LOCATION:  StartupXLOG, xlog.c:5939
DEBUG:  00000: MultiXact: setting next multi to 361788644 offset 822580965
LOCATION:  MultiXactSetNextMXact, multixact.c:1668
DEBUG:  00000: transaction ID wrap limit is 3716666172, limited by database
with OID 16424
LOCATION:  SetTransactionIdLimit, varsup.c:334
WARNING:  01000: pid of startup is: 16919, sleeping for 60s
LOCATION:  StartupXLOG, xlog.c:5983
DEBUG:  00000: initializing for hot standby
LOCATION:  StartupXLOG, xlog.c:6073
LOG:  00000: redo starts at 925/2C0004C8
LOCATION:  StartupXLOG, xlog.c:6202
FATAL:  XX000: could not access status of transaction 361788644
DETAIL:  Could not read from file "pg_multixact/offsets/1590" at offset
114688: Unknown error: 0.
CONTEXT:  xlog redo create multixact 361788644 offset 822580965: 1768508061
1768508078
LOCATION:  SlruReportIOError, slru.c:840




>
> If youre feeling adventurous you could use the xlogdump project
> (https://github.com/snaga/xlogdump) to see what xlog records happen from
> the
> time it starts up till it crashes. I don't immediately see a way a
> XLOG_MULTIXACT_CREATE_ID can be generated without a preceding
> XLOG_MULTIXACT_ZERO_OFF_PAGE zeroing the page.
> But most of that should be already be visible in the log with the above
> define.
>
> Did you have any crashes, -m immediate shutdowns, or similar on the
> primary or
> replica?
>

I don't think. But I'm not sure because task was performed by system
administrator using provided instruction.
And I was informed only after replica doesn't start.



Kind Regards,
Maksym

pgsql-bugs by date:

Previous
From: Sachin Srivastava
Date:
Subject: Re: postgresql-9.1.4 macosx-10.6.8 (hostssl) psql -h hostname crashes
Next
From: Andres Freund
Date:
Subject: Re: BUG #7500: hot-standby replica crash after an initial rsync