Thread: Warm standby stall -- what debug info would help?
We're running a number of warm standby instances on one server. One of them stalled on Saturday. When I found that this morning, I confirmed that the files were in the directory from which it should be pulling the WAL files. The logs showed normal processing up to the stall, with no messages afterwards. I tried a restart and it resumed warm standby status and caught up quickly. It seems like this is probably a PostgreSQL bug of some sort, although maybe someone can spot a problem in our recovery script. If it happens again, what information should I gather before the restart to help find the cause?

-Kevin

PGBACKUP:/var/pgsql/data/county/ozaukee/data/pg_log # tail postgresql-2007-07-07_000000.log
[2007-07-07 09:44:41.392 CDT] 5962 LOG: restored log file "000000010000000C000000D2" from archive
[2007-07-07 12:24:53.597 CDT] 5962 LOG: restored log file "000000010000000C000000D3" from archive
[2007-07-07 12:24:53.984 CDT] 5962 LOG: restored log file "000000010000000C000000D4" from archive
[2007-07-07 12:24:54.351 CDT] 5962 LOG: restored log file "000000010000000C000000D5" from archive
[2007-07-07 14:10:42.208 CDT] 5962 LOG: restored log file "000000010000000C000000D6" from archive
[2007-07-07 14:10:42.634 CDT] 5962 LOG: restored log file "000000010000000C000000D7" from archive
[2007-07-07 15:23:41.717 CDT] 5962 LOG: restored log file "000000010000000C000000D8" from archive
[2007-07-07 18:24:26.933 CDT] 5962 LOG: restored log file "000000010000000C000000D9" from archive
[2007-07-07 18:24:27.692 CDT] 5962 LOG: restored log file "000000010000000C000000DA" from archive
[2007-07-07 18:24:28.051 CDT] 5962 LOG: restored log file "000000010000000C000000DB" from archive

PGBACKUP:/var/pgsql/data/county/ozaukee/data/pg_log # cat postgresql-2007-07-09_000000.log
[2007-07-09 08:21:50.200 CDT] 5904 LOG: received fast shutdown request
[2007-07-09 08:21:50.201 CDT] 5962 FATAL: could not restore file "000000010000000C000000DC" from archive: return code 15
[2007-07-09 08:21:50.485 CDT] 5904 LOG: startup process (PID 5962) exited with exit code 1
[2007-07-09 08:21:50.485 CDT] 5904 LOG: aborting startup due to startup process failure
[2007-07-09 08:21:50.555 CDT] 5960 LOG: logger shutting down

PGBACKUP:/var/pgsql/data/county/ozaukee/data/pg_log # cat postgresql-2007-07-09_082151.log | grep -v 'starting up'
[2007-07-09 08:21:51.718 CDT] 19076 LOG: database system was interrupted while in recovery at log time 2007-07-07 15:00:02 CDT
[2007-07-09 08:21:51.718 CDT] 19076 HINT: If this has occurred more than once some data may be corrupted and you may need to choose an earlier recovery target.
[2007-07-09 08:21:51.718 CDT] 19076 LOG: starting archive recovery
[2007-07-09 08:21:51.725 CDT] 19076 LOG: restore_command = "/usr/local/backup/recovery.sh %f %p"
[2007-07-09 08:21:52.079 CDT] 19076 LOG: restored log file "000000010000000C000000D9" from archive
[2007-07-09 08:21:52.079 CDT] 19076 LOG: checkpoint record is at C/D9000020
[2007-07-09 08:21:52.079 CDT] 19076 LOG: redo record is at C/D9000020; undo record is at 0/0; shutdown FALSE
[2007-07-09 08:21:52.079 CDT] 19076 LOG: next transaction ID: 0/10700115; next OID: 1387338
[2007-07-09 08:21:52.079 CDT] 19076 LOG: next MultiXactId: 1; next MultiXactOffset: 0
[2007-07-09 08:21:52.079 CDT] 19076 LOG: automatic recovery in progress
[2007-07-09 08:21:52.081 CDT] 19076 LOG: redo starts at C/D9000068
[2007-07-09 08:21:52.429 CDT] 19076 LOG: restored log file "000000010000000C000000DA" from archive
[2007-07-09 08:21:52.860 CDT] 19076 LOG: restored log file "000000010000000C000000DB" from archive

PGBACKUP:/var/pgsql/data/county/ozaukee/data # grep -Ev '^([[:space:]]+)?($|#)' postgresql.conf
listen_addresses = '*'                  # what IP address(es) to listen on;
port = 5445                             # (change requires restart)
max_connections = 50                    # (change requires restart)
shared_buffers = 160MB                  # min 128kB or max_connections*16kB
temp_buffers = 10MB                     # min 800kB
work_mem = 30MB                         # min 64kB
maintenance_work_mem = 160MB            # min 1MB
max_fsm_pages = 204800                  # min max_fsm_relations*16, 6 bytes each
bgwriter_lru_percent = 20.0             # 0-100% of LRU buffers scanned/round
bgwriter_lru_maxpages = 200             # 0-1000 buffers max written/round
bgwriter_all_percent = 10.0             # 0-100% of all buffers scanned/round
bgwriter_all_maxpages = 600             # 0-1000 buffers max written/round
wal_buffers = 1MB                       # min 32kB
checkpoint_segments = 10                # in logfile segments, min 1, 16MB each
checkpoint_timeout = 30min              # range 30s-1h
archive_timeout = 3600                  # force a logfile segment switch after this
seq_page_cost = 0.1                     # measured on an arbitrary scale
random_page_cost = 0.1                  # same scale as above
effective_cache_size = 3GB
redirect_stderr = on                    # Enable capturing of stderr into log
log_line_prefix = '[%m] %p %q<%u %d %r> '   # Special values:
stats_block_level = on
stats_row_level = on
autovacuum = on                         # enable autovacuum subprocess?
autovacuum_naptime = 10s                # time between autovacuum runs
autovacuum_vacuum_threshold = 1         # min # of tuple updates before
autovacuum_analyze_threshold = 1        # min # of tuple updates before
datestyle = 'iso, mdy'
lc_messages = 'C'                       # locale for system error message
lc_monetary = 'C'                       # locale for monetary formatting
lc_numeric = 'C'                        # locale for number formatting
lc_time = 'C'                           # locale for time formatting
escape_string_warning = off
standard_conforming_strings = on
sql_inheritance = off

PGBACKUP:/var/pgsql/data/county/ozaukee/data # pg_controldata /var/pgsql/data/county/ozaukee/data/
pg_control version number:            822
Catalog version number:               200611241
Database system identifier:           5062210045799347511
Database cluster state:               in archive recovery
pg_control last modified:             Mon 09 Jul 2007 08:21:52 AM CDT
Current log file ID:                  8
Next log file segment:                150
Latest checkpoint location:           C/D9000020
Prior checkpoint location:            C/D9000020
Latest checkpoint's REDO location:    C/D9000020
Latest checkpoint's UNDO location:    C/D9000020
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/10700115
Latest checkpoint's NextOID:          1387338
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Time of latest checkpoint:            Sat 07 Jul 2007 03:00:02 PM CDT
Minimum recovery ending location:     8/9500A530
Maximum data alignment:               4
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Date/time type storage:               64-bit integers
Maximum length of locale name:        128
LC_COLLATE:                           C
LC_CTYPE:                             C

PGBACKUP:/var/pgsql/data/county/ozaukee/data # cat /usr/local/backup/recovery.sh
#! /bin/bash

# Pick out county name from the back of the path.
# The value of $PWD will be: /var/pgsql/data/county/<countyName>/data
countyName=`dirname $PWD`
countyName=`basename $countyName`

while [ ! -f /var/pgsql/data/county/$countyName/wal-files/$1.gz \
        -a ! -f /var/pgsql/data/county/$countyName/DONE \
        -o -f /var/pgsql/data/county/$countyName/wal-files/rsync-in-progress ]
do
  if ! ( echo "$1" | grep -Eq '^[0-9A-F]{24}$' ) ; then
    exit 1
  fi
  sleep 10   # wait for ~10 sec
done

gunzip < /var/pgsql/data/county/$countyName/wal-files/$1.gz > "$2"
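[On the "what should I gather before the restart" question, a minimal sketch of the kind of state worth capturing while the standby is still stalled. It assumes the directory layout shown above, that pg_controldata and (optionally) strace are installed, and an arbitrary output file name; it is not a definitive checklist.]

#!/bin/bash
# Hypothetical diagnostic capture for a stalled warm standby; run it BEFORE
# restarting the instance.  Paths follow the layout shown above.
PGDATA=/var/pgsql/data/county/ozaukee/data
COUNTY=/var/pgsql/data/county/ozaukee

{
  date
  # Is the startup process still alive, and is recovery.sh sleeping in its loop?
  ps -ef | grep -E 'postgres|recovery\.sh' | grep -v grep

  # Does the WAL segment the server is waiting for actually exist, and are the
  # DONE or rsync-in-progress flag files present?
  ls -l "$COUNTY/wal-files" | tail -20
  ls -l "$COUNTY/DONE" "$COUNTY/wal-files/rsync-in-progress" 2>&1

  # Where does the control file think recovery stands?
  pg_controldata "$PGDATA"

  # Last lines of the most recent server log before the stall.
  ls -t "$PGDATA"/pg_log/*.log | head -1 | xargs tail -50

  # If strace is installed, a short trace of the recovery.sh process would show
  # which file test or sleep it is stuck in, e.g.:
  #   strace -p <pid-of-recovery.sh> -o /tmp/recovery-strace.txt
} > /tmp/standby-stall-$(date +%Y%m%d%H%M).txt 2>&1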
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > [2007-07-07 18:24:27.692 CDT] 5962 LOG: restored log file "000000010000000= > C000000DA" from archive > [2007-07-07 18:24:28.051 CDT] 5962 LOG: restored log file "000000010000000= > C000000DB" from archive > [2007-07-09 08:21:50.200 CDT] 5904 LOG: received fast shutdown request > [2007-07-09 08:21:50.201 CDT] 5962 FATAL: could not restore file "00000001= > 0000000C000000DC" from archive: return code 15 Evidently it was waiting for the restore_command script to give it back a file. So the problem is within your restore script. Eyeing the script, the only obvious thing that could block it is existence of /var/pgsql/data/county/$countyName/wal-files/rsync-in-progress regards, tom lane
>>> On Mon, Jul 9, 2007 at 10:54 AM, in message <469213FB.EE98.0025.0@wicourts.gov>, "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> We're running a number of warm standby instances on one server. One of them
> stalled on Saturday. When I found that this morning, I confirmed that the
> files were in the directory from which it should be pulling the WAL files.
> The logs showed normal processing up to the stall, with no messages
> afterwards. I tried a restart and it resumed warm standby status and caught
> up quickly.

No, actually it stalled on the same WAL file. I've got another one in the same state that I haven't touched yet. I'll work on gathering what info I can think of, but if there's something in particular you would like to see, let me know. More in a bit.

Should this be on the bugs list instead of hackers?

-Kevin
On Mon, 2007-07-09 at 12:36 -0400, Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > [2007-07-07 18:24:27.692 CDT] 5962 LOG: restored log file "000000010000000C000000DA" from archive
> > [2007-07-07 18:24:28.051 CDT] 5962 LOG: restored log file "000000010000000C000000DB" from archive
> > [2007-07-09 08:21:50.200 CDT] 5904 LOG: received fast shutdown request
> > [2007-07-09 08:21:50.201 CDT] 5962 FATAL: could not restore file "000000010000000C000000DC" from archive: return code 15
>
> Evidently it was waiting for the restore_command script to give it back
> a file. So the problem is within your restore script.

Agreed.

Kevin,

Can you try pg_standby please? It would help me to diagnose problems faster if that doesn't work.

TIA.

--
Simon Riggs
EnterpriseDB  http://www.enterprisedb.com
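[For readers unfamiliar with pg_standby, a minimal sketch of the kind of recovery.conf it expects. The archive path, trigger-file location, and log file name are placeholders, and pg_standby copies or links WAL segments as-is, so the gzipped archive used here would need to be decompressed into the archive directory first.]

#!/bin/bash
# Hypothetical setup: point recovery.conf at pg_standby instead of the
# custom wait loop.  Paths and the trigger file are placeholders.
cat > /var/pgsql/data/county/ozaukee/data/recovery.conf <<'EOF'
# -d writes debug output to stderr, -s 10 polls every 10 seconds,
# -t names the trigger file whose appearance ends standby mode
restore_command = 'pg_standby -d -s 10 -t /tmp/ozaukee.trigger /var/pgsql/data/county/ozaukee/wal-files %f %p 2>> /var/pgsql/data/county/ozaukee/data/pg_standby.log'
EOF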
>>> On Mon, Jul 9, 2007 at 11:36 AM, in message <23173.1183999016@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> [2007-07-07 18:24:27.692 CDT] 5962 LOG: restored log file "000000010000000C000000DA" from archive
>> [2007-07-07 18:24:28.051 CDT] 5962 LOG: restored log file "000000010000000C000000DB" from archive
>> [2007-07-09 08:21:50.200 CDT] 5904 LOG: received fast shutdown request
>> [2007-07-09 08:21:50.201 CDT] 5962 FATAL: could not restore file "000000010000000C000000DC" from archive: return code 15
>
> Evidently it was waiting for the restore_command script to give it back
> a file. So the problem is within your restore script. Eyeing the
> script, the only obvious thing that could block it is existence of
> /var/pgsql/data/county/$countyName/wal-files/rsync-in-progress

Sorry for the noise. It wasn't the rsync file but something even more obvious that I managed to misread. The rsync was failing to copy the files from the counties to the directory where recovery reads them. I could have sworn I checked that, but I clearly messed up.

-Kevin
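[Since the root cause turned out to be the rsync feed silently stopping, a small sketch of a cron check that would flag a stale wal-files directory. The alert address, the 90-minute threshold (archive_timeout of 3600s plus slack), and GNU stat are assumptions, not anything from the thread.]

#!/bin/bash
# Hypothetical freshness check: alert if the newest shipped WAL file is older
# than archive_timeout plus some slack, suggesting the rsync feed has stopped.
WALDIR=/var/pgsql/data/county/ozaukee/wal-files
MAX_AGE=5400   # 90 minutes

newest=$(ls -t "$WALDIR"/*.gz 2>/dev/null | head -1)
if [ -z "$newest" ]; then
    echo "no WAL files in $WALDIR" | mail -s "warm standby feed check" dba@example.com
    exit 1
fi

age=$(( $(date +%s) - $(stat -c %Y "$newest") ))
if [ "$age" -gt "$MAX_AGE" ]; then
    echo "newest WAL file $newest is ${age}s old" | mail -s "warm standby feed check" dba@example.com
fi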