Re: BUG #7500: hot-standby replica crash after an initial rsync - Mailing list pgsql-bugs

From Maxim Boguk
Subject Re: BUG #7500: hot-standby replica crash after an initial rsync
Date
Msg-id CAK-MWwR-FBf3bFVtD8vqPq6hErAujE447mniS6Us1ORQ6Z3XdA@mail.gmail.com
Whole thread Raw
In response to Re: BUG #7500: hot-standby replica crash after an initial rsync  (Maxim Boguk <maxim.boguk@gmail.com>)
Responses Re: BUG #7500: hot-standby replica crash after an initial rsync  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-bugs
On Tue, Aug 21, 2012 at 8:08 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:

>
> >
>> >
>> > I have kept all that database files for the future investigation.
>> >
>> > What I should look into first?
>> Could you reproduce the error with log_error_verbosity=verbose? Or even
>> better
>> provide a backtrace with gdb?
>>
>>
> There log with log_error_verbosity=verbose:
>
>
> 2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []LOG:  00000:
> database system was interrupted while in recovery at log time 2012-08-21
> 08:21:42 MSK
> 2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []HINT:  If this
> has occurred more than once some data might be corrupted and you might need
> to choose an earlier recovery target.
> 2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []LOCATION:
> StartupXLOG, xlog.c:5757
> 2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []LOG:  00000:
> creating missing WAL directory "pg_xlog/archive_status"
> 2012-08-21 14:04:18.973 MSK 33950 @ from  [vxid: txid:0] []LOCATION:
> ValidateXLOGDirectoryStructure, xlog.c:3452
> 2012-08-21 14:04:19.009 MSK 33950 @ from  [vxid: txid:0] []LOG:  00000:
> entering standby mode
> 2012-08-21 14:04:19.009 MSK 33950 @ from  [vxid: txid:0] []LOCATION:
> StartupXLOG, xlog.c:5827
> 2012-08-21 14:04:19.041 MSK 33951 @ from  [vxid: txid:0] []LOG:  00000:
> streaming replication successfully connected to primary
> 2012-08-21 14:04:19.041 MSK 33951 @ from  [vxid: txid:0] []LOCATION:
> libpqrcv_connect, libpqwalreceiver.c:169
> 2012-08-21 14:04:20.140 MSK 33950 @ from  [vxid:1/0 txid:0] []LOG:  00000:
> redo starts at 925/2C0004C8
> 2012-08-21 14:04:20.140 MSK 33950 @ from  [vxid:1/0 txid:0] []LOCATION:
> StartupXLOG, xlog.c:6199
> 2012-08-21 14:04:20.160 MSK 33950 @ from  [vxid:1/0 txid:0] []FATAL:
> XX000: could not access status of transaction 361788644
> 2012-08-21 14:04:20.161 MSK 33950 @ from  [vxid:1/0 txid:0] []DETAIL:
> Could not read from file "pg_multixact/offsets/1590" at offset 114688:
> Unknown error: 0.
> 2012-08-21 14:04:20.161 MSK 33950 @ from  [vxid:1/0 txid:0] []CONTEXT:
> xlog redo create multixact 361788644 offset 822580965: 1768508061 1768508078
> 2012-08-21 14:04:20.161 MSK 33950 @ from  [vxid:1/0 txid:0] []LOCATION:
> SlruReportIOError, slru.c:840
> 2012-08-21 14:04:20.169 MSK 33948 @ from  [vxid: txid:0] []LOG:  00000:
> startup process (PID 33950) exited with exit code 1
> 2012-08-21 14:04:20.169 MSK 33948 @ from  [vxid: txid:0] []LOCATION:
> LogChildExit, postmaster.c:2840
> 2012-08-21 14:04:20.169 MSK 33948 @ from  [vxid: txid:0] []LOG:  00000:
> terminating any other active server processes
> 2012-08-21 14:04:20.169 MSK 33948 @ from  [vxid: txid:0] []LOCATION:
> HandleChildCrash, postmaster.c:2674
>
>
I tried A bit play with more detailed debug info and got mode detailed log:

LOG:  00000: database system was interrupted while in recovery at log time
2012-08-21 08:21:42 MSK
HINT:  If this has occurred more than once some data might be corrupted and
you might need to choose an earlier recovery target.
LOCATION:  StartupXLOG, xlog.c:5757
DEBUG:  00000: standby_mode = 'on'
LOCATION:  readRecoveryCommandFile, xlog.c:5269
DEBUG:  00000: primary_conninfo = 'host=10.12.22.19 port=5433 user=replica
password=acilper'
LOCATION:  readRecoveryCommandFile, xlog.c:5276
DEBUG:  00000: trigger_file = '/db/data/failover'
LOCATION:  readRecoveryCommandFile, xlog.c:5283
LOG:  00000: entering standby mode
LOCATION:  StartupXLOG, xlog.c:5827
DEBUG:  00000: checkpoint record is at 925/2D41CAA0
LOCATION:  StartupXLOG, xlog.c:5894
DEBUG:  00000: redo record is at 925/2C0004C8; shutdown FALSE
LOCATION:  StartupXLOG, xlog.c:5929
DEBUG:  00000: next transaction ID: 0/1768511856; next OID: 59880938
LOCATION:  StartupXLOG, xlog.c:5933
DEBUG:  00000: next MultiXactId: 361788644; next MultiXactOffset: 822580965
LOCATION:  StartupXLOG, xlog.c:5936
DEBUG:  00000: oldest unfrozen transaction ID: 1569182525, in database 16424
LOCATION:  StartupXLOG, xlog.c:5939
DEBUG:  00000: transaction ID wrap limit is 3716666172, limited by database
with OID 16424
LOCATION:  SetTransactionIdLimit, varsup.c:334
DEBUG:  00000: initializing for hot standby
LOCATION:  StartupXLOG, xlog.c:6070
DEBUG:  00000: my backend id is 1
LOCATION:  SharedInvalBackendInit, sinvaladt.c:326
LOG:  00000: redo starts at 925/2C0004C8
LOCATION:  StartupXLOG, xlog.c:6199
DEBUG:  00000: record known xact 1768508078 latestObservedXid 0
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508078 latestObservedXid 0
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: remove KnownAssignedXid 1768508078
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.079655+04
LOCATION:  KnownAssignedXidsRemove, procarray.c:2869
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo exclusive_lock: xid 1768508080 rel 1663/16424/17282;
tid 725829/8
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo update: rel 1663/16424/17282; tid 725829/8; new 726017/7
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508076 latestObservedXid 0
CONTEXT:  xlog redo exclusive_lock: xid 1768508076 rel 1663/16424/47760635;
tid 3075/136
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/47747; tid 43866/197
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/52546872; tid 44700/276
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508061 latestObservedXid 0
CONTEXT:  xlog redo log: rel 1663/16424/19211
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508082 latestObservedXid 0
CONTEXT:  xlog redo exclusive_lock: xid 1768508082 rel 1663/16424/4759107;
tid 1098/7
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508076 latestObservedXid 0
CONTEXT:  xlog redo hot_update: rel 1663/16424/47760635; tid 3075/136; new
3075/140
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo exclusive_lock: xid 1768508080 rel 1663/16424/17282;
tid 725918/2
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508082 latestObservedXid 0
CONTEXT:  xlog redo hot_update: rel 1663/16424/4759107; tid 1098/7; new
1098/39
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo update: rel 1663/16424/17282; tid 725918/2; new 725918/6
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/47747; tid 43866/201
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/52546872; tid 44700/276
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.082508+04
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508080 latestObservedXid 0
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.082508+04
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: remove KnownAssignedXid 1768508080
CONTEXT:  xlog redo commit: 2012-08-21 08:21:42.082508+04
LOCATION:  KnownAssignedXidsRemove, procarray.c:2869
DEBUG:  00000: record known xact 1768508061 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/64654; tid 178183/44
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508061 latestObservedXid 0
CONTEXT:  xlog redo insert: rel 1663/16424/95419; tid 38650/318
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
DEBUG:  00000: record known xact 1768508061 latestObservedXid 0


CONTEXT:  xlog redo create multixact 361788644 offset 822580965: 1768508061
1768508078
LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2375
FATAL:  XX000: could not access status of transaction 361788644
DETAIL:  Could not read from file "pg_multixact/offsets/1590" at offset
114688: Unknown error: 0.
CONTEXT:  xlog redo create multixact 361788644 offset 822580965: 1768508061
1768508078
LOCATION:  SlruReportIOError, slru.c:840


DEBUG:  00000: shmem_exit(1): 5 callbacks to make



I have some problems with debug startup process with gdb...
I following next sequence of commands (and got no useful results):

[pgsql@db1 ~/tmp/postgresql-9.0.8]$ /usr/local/bin/gdb
/usr/local/bin/postgres
...
Reading symbols from /usr/local/bin/postgres...done.
(gdb) b exit
Breakpoint 1 at 0x44d38c
(gdb) run -D /db/data_back/
Starting program: /usr/local/bin/postgres -D /db/data_back/

Program received signal SIGUSR1, User defined signal 1.
0x00000008018ddadc in select () from /lib/libc.so.7
(gdb) bt
#0  0x00000008018ddadc in select () from /lib/libc.so.7
#1  0x000000000065641c in ServerLoop () at postmaster.c:1396
#2  0x0000000000655d05 in PostmasterMain (argc=3, argv=0x7fffffffead0) at
postmaster.c:1097
#3  0x00000000005d974a in main (argc=3, argv=0x7fffffffead0) at main.c:188
(gdb) cont
Continuing.

Breakpoint 1, 0x000000080186bab0 in exit () from /lib/libc.so.7
(gdb) bt
#0  0x000000080186bab0 in exit () from /lib/libc.so.7
#1  0x000000000067d0a6 in proc_exit (code=1) at ipc.c:137
#2  0x0000000000659933 in ExitPostmaster (status=1) at postmaster.c:4172
#3  0x0000000000657d2d in reaper (postgres_signal_arg=20) at
postmaster.c:2354
#4  <signal handler called>
#5  <signal handler called>
#6  0x00000008018ddadc in select () from /lib/libc.so.7
#7  0x000000000065641c in ServerLoop () at postmaster.c:1396
#8  0x0000000000655d05 in PostmasterMain (argc=3, argv=0x7fffffffead0) at
postmaster.c:1097
#9  0x00000000005d974a in main (argc=3, argv=0x7fffffffead0) at main.c:188
(gdb) cont
Continuing.
[Inferior 1 (process 98857) exited with code 01]
(gdb) quit

pgsql-bugs by date:

Previous
From: Maxim Boguk
Date:
Subject: Re: BUG #7500: hot-standby replica crash after an initial rsync
Next
From: Andres Freund
Date:
Subject: Re: BUG #7500: hot-standby replica crash after an initial rsync