Thread: Warm standby stall -- what debug info would help?
We're running a number of warm standby instances on one server. One of them stalled on Saturday. When I found that this morning, I confirmed that the files were in the directory from which it should be pulling the WAL files. The logs showed normal processing up to the stall, with no messages afterwards. I tried a restart and it resumed warm standby status and caught up quickly. It seems like this is probably a PostgreSQL bug of some sort, although maybe someone can spot a problem in our recovery script. If it happens again, what information should I gather before the restart to help find the cause?

-Kevin

PGBACKUP:/var/pgsql/data/county/ozaukee/data/pg_log # tail postgresql-2007-07-07_000000.log
[2007-07-07 09:44:41.392 CDT] 5962 LOG: restored log file "000000010000000C000000D2" from archive
[2007-07-07 12:24:53.597 CDT] 5962 LOG: restored log file "000000010000000C000000D3" from archive
[2007-07-07 12:24:53.984 CDT] 5962 LOG: restored log file "000000010000000C000000D4" from archive
[2007-07-07 12:24:54.351 CDT] 5962 LOG: restored log file "000000010000000C000000D5" from archive
[2007-07-07 14:10:42.208 CDT] 5962 LOG: restored log file "000000010000000C000000D6" from archive
[2007-07-07 14:10:42.634 CDT] 5962 LOG: restored log file "000000010000000C000000D7" from archive
[2007-07-07 15:23:41.717 CDT] 5962 LOG: restored log file "000000010000000C000000D8" from archive
[2007-07-07 18:24:26.933 CDT] 5962 LOG: restored log file "000000010000000C000000D9" from archive
[2007-07-07 18:24:27.692 CDT] 5962 LOG: restored log file "000000010000000C000000DA" from archive
[2007-07-07 18:24:28.051 CDT] 5962 LOG: restored log file "000000010000000C000000DB" from archive

PGBACKUP:/var/pgsql/data/county/ozaukee/data/pg_log # cat postgresql-2007-07-09_000000.log
[2007-07-09 08:21:50.200 CDT] 5904 LOG: received fast shutdown request
[2007-07-09 08:21:50.201 CDT] 5962 FATAL: could not restore file "000000010000000C000000DC" from archive: return code 15
[2007-07-09 08:21:50.485 CDT] 5904 LOG: startup process (PID 5962) exited with exit code 1
[2007-07-09 08:21:50.485 CDT] 5904 LOG: aborting startup due to startup process failure
[2007-07-09 08:21:50.555 CDT] 5960 LOG: logger shutting down

PGBACKUP:/var/pgsql/data/county/ozaukee/data/pg_log # cat postgresql-2007-07-09_082151.log | grep -v 'starting up'
[2007-07-09 08:21:51.718 CDT] 19076 LOG: database system was interrupted while in recovery at log time 2007-07-07 15:00:02 CDT
[2007-07-09 08:21:51.718 CDT] 19076 HINT: If this has occurred more than once some data may be corrupted and you may need to choose an earlier recovery target.
[2007-07-09 08:21:51.718 CDT] 19076 LOG: starting archive recovery
[2007-07-09 08:21:51.725 CDT] 19076 LOG: restore_command = "/usr/local/backup/recovery.sh %f %p"
[2007-07-09 08:21:52.079 CDT] 19076 LOG: restored log file "000000010000000C000000D9" from archive
[2007-07-09 08:21:52.079 CDT] 19076 LOG: checkpoint record is at C/D9000020
[2007-07-09 08:21:52.079 CDT] 19076 LOG: redo record is at C/D9000020; undo record is at 0/0; shutdown FALSE
[2007-07-09 08:21:52.079 CDT] 19076 LOG: next transaction ID: 0/10700115; next OID: 1387338
[2007-07-09 08:21:52.079 CDT] 19076 LOG: next MultiXactId: 1; next MultiXactOffset: 0
[2007-07-09 08:21:52.079 CDT] 19076 LOG: automatic recovery in progress
[2007-07-09 08:21:52.081 CDT] 19076 LOG: redo starts at C/D9000068
[2007-07-09 08:21:52.429 CDT] 19076 LOG: restored log file "000000010000000C000000DA" from archive
[2007-07-09 08:21:52.860 CDT] 19076 LOG: restored log file "000000010000000C000000DB" from archive

PGBACKUP:/var/pgsql/data/county/ozaukee/data # grep -Ev '^([[:space:]]+)?($|#)' postgresql.conf
listen_addresses = '*'                  # what IP address(es) to listen on;
port = 5445                             # (change requires restart)
max_connections = 50                    # (change requires restart)
shared_buffers = 160MB                  # min 128kB or max_connections*16kB
temp_buffers = 10MB                     # min 800kB
work_mem = 30MB                         # min 64kB
maintenance_work_mem = 160MB            # min 1MB
max_fsm_pages = 204800                  # min max_fsm_relations*16, 6 bytes each
bgwriter_lru_percent = 20.0             # 0-100% of LRU buffers scanned/round
bgwriter_lru_maxpages = 200             # 0-1000 buffers max written/round
bgwriter_all_percent = 10.0             # 0-100% of all buffers scanned/round
bgwriter_all_maxpages = 600             # 0-1000 buffers max written/round
wal_buffers = 1MB                       # min 32kB
checkpoint_segments = 10                # in logfile segments, min 1, 16MB each
checkpoint_timeout = 30min              # range 30s-1h
archive_timeout = 3600                  # force a logfile segment switch after this
seq_page_cost = 0.1                     # measured on an arbitrary scale
random_page_cost = 0.1                  # same scale as above
effective_cache_size = 3GB
redirect_stderr = on                    # Enable capturing of stderr into log
log_line_prefix = '[%m] %p %q<%u %d %r> '   # Special values:
stats_block_level = on
stats_row_level = on
autovacuum = on                         # enable autovacuum subprocess?
autovacuum_naptime = 10s                # time between autovacuum runs
autovacuum_vacuum_threshold = 1         # min # of tuple updates before
autovacuum_analyze_threshold = 1        # min # of tuple updates before
datestyle = 'iso, mdy'
lc_messages = 'C'                       # locale for system error message
lc_monetary = 'C'                       # locale for monetary formatting
lc_numeric = 'C'                        # locale for number formatting
lc_time = 'C'                           # locale for time formatting
escape_string_warning = off
standard_conforming_strings = on
sql_inheritance = off

PGBACKUP:/var/pgsql/data/county/ozaukee/data # pg_controldata /var/pgsql/data/county/ozaukee/data/
pg_control version number:            822
Catalog version number:               200611241
Database system identifier:           5062210045799347511
Database cluster state:               in archive recovery
pg_control last modified:             Mon 09 Jul 2007 08:21:52 AM CDT
Current log file ID:                  8
Next log file segment:                150
Latest checkpoint location:           C/D9000020
Prior checkpoint location:            C/D9000020
Latest checkpoint's REDO location:    C/D9000020
Latest checkpoint's UNDO location:    C/D9000020
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/10700115
Latest checkpoint's NextOID:          1387338
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Time of latest checkpoint:            Sat 07 Jul 2007 03:00:02 PM CDT
Minimum recovery ending location:     8/9500A530
Maximum data alignment:               4
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Date/time type storage:               64-bit integers
Maximum length of locale name:        128
LC_COLLATE:                           C
LC_CTYPE:                             C

PGBACKUP:/var/pgsql/data/county/ozaukee/data # cat /usr/local/backup/recovery.sh
#! /bin/bash

# Pick out county name from the back of the path.
# The value of $PWD will be: /var/pgsql/data/county/<countyName>/data
countyName=`dirname $PWD`
countyName=`basename $countyName`

while [ ! -f /var/pgsql/data/county/$countyName/wal-files/$1.gz \
        -a ! -f /var/pgsql/data/county/$countyName/DONE \
        -o -f /var/pgsql/data/county/$countyName/wal-files/rsync-in-progress ]
do
  if ! ( echo "$1" | grep -Eq '^[0-9A-F]{24}$' ) ; then
    exit 1
  fi
  sleep 10   # wait for ~10 sec
done

gunzip < /var/pgsql/data/county/$countyName/wal-files/$1.gz > "$2"
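[On the "what should I gather before the restart" question, a minimal sketch of the kind of state worth capturing while the standby is still stalled. It assumes the directory layout shown above, that pg_controldata and (optionally) strace are installed, and an arbitrary output file name; it is not a definitive checklist.]

#!/bin/bash
# Hypothetical diagnostic capture for a stalled warm standby; run it BEFORE
# restarting the instance.  Paths follow the layout shown above.
PGDATA=/var/pgsql/data/county/ozaukee/data
COUNTY=/var/pgsql/data/county/ozaukee

{
  date
  # Is the startup process still alive, and is recovery.sh sleeping in its loop?
  ps -ef | grep -E 'postgres|recovery\.sh' | grep -v grep

  # Does the WAL segment the server is waiting for actually exist, and are the
  # DONE or rsync-in-progress flag files present?
  ls -l "$COUNTY/wal-files" | tail -20
  ls -l "$COUNTY/DONE" "$COUNTY/wal-files/rsync-in-progress" 2>&1

  # Where does the control file think recovery stands?
  pg_controldata "$PGDATA"

  # Last lines of the most recent server log before the stall.
  ls -t "$PGDATA"/pg_log/*.log | head -1 | xargs tail -50

  # If strace is installed, a short trace of the recovery.sh process would show
  # which file test or sleep it is stuck in, e.g.:
  #   strace -p <pid-of-recovery.sh> -o /tmp/recovery-strace.txt
} > /tmp/standby-stall-$(date +%Y%m%d%H%M).txt 2>&1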
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > [2007-07-07 18:24:27.692 CDT] 5962 LOG: restored log file "000000010000000= > C000000DA" from archive > [2007-07-07 18:24:28.051 CDT] 5962 LOG: restored log file "000000010000000= > C000000DB" from archive > [2007-07-09 08:21:50.200 CDT] 5904 LOG: received fast shutdown request > [2007-07-09 08:21:50.201 CDT] 5962 FATAL: could not restore file "00000001= > 0000000C000000DC" from archive: return code 15 Evidently it was waiting for the restore_command script to give it back a file. So the problem is within your restore script. Eyeing the script, the only obvious thing that could block it is existence of /var/pgsql/data/county/$countyName/wal-files/rsync-in-progress regards, tom lane
>>> On Mon, Jul 9, 2007 at 10:54 AM, in message <469213FB.EE98.0025.0@wicourts.gov>, "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> We're running a number of warm standby instances on one server. One of them
> stalled on Saturday. When I found that this morning, I confirmed that the
> files were in the directory from which it should be pulling the WAL files.
> The logs showed normal processing up to the stall, with no messages
> afterwards. I tried a restart and it resumed warm standby status and caught
> up quickly.

No, actually it stalled on the same WAL file. I've got another one in the same state that I haven't touched yet. I'll work on gathering what info I can think of, but if there's something in particular you would like to see, let me know. More in a bit.

Should this be on the bugs list instead of hackers?

-Kevin
On Mon, 2007-07-09 at 12:36 -0400, Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > [2007-07-07 18:24:27.692 CDT] 5962 LOG: restored log file "000000010000000C000000DA" from archive
> > [2007-07-07 18:24:28.051 CDT] 5962 LOG: restored log file "000000010000000C000000DB" from archive
> > [2007-07-09 08:21:50.200 CDT] 5904 LOG: received fast shutdown request
> > [2007-07-09 08:21:50.201 CDT] 5962 FATAL: could not restore file "000000010000000C000000DC" from archive: return code 15
>
> Evidently it was waiting for the restore_command script to give it back
> a file. So the problem is within your restore script.

Agreed.

Kevin,

Can you try pg_standby please? It would help me to diagnose problems faster if that doesn't work.

TIA.

--
Simon Riggs
EnterpriseDB  http://www.enterprisedb.com
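[For readers unfamiliar with pg_standby, a minimal sketch of the kind of recovery.conf it expects. The archive path, trigger-file location, and log file name are placeholders, and pg_standby copies or links WAL segments as-is, so the gzipped archive used here would need to be decompressed into the archive directory first.]

#!/bin/bash
# Hypothetical setup: point recovery.conf at pg_standby instead of the
# custom wait loop.  Paths and the trigger file are placeholders.
cat > /var/pgsql/data/county/ozaukee/data/recovery.conf <<'EOF'
# -d writes debug output to stderr, -s 10 polls every 10 seconds,
# -t names the trigger file whose appearance ends standby mode
restore_command = 'pg_standby -d -s 10 -t /tmp/ozaukee.trigger /var/pgsql/data/county/ozaukee/wal-files %f %p 2>> /var/pgsql/data/county/ozaukee/data/pg_standby.log'
EOF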
>>> On Mon, Jul 9, 2007 at 11:36 AM, in message <23173.1183999016@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> [2007-07-07 18:24:27.692 CDT] 5962 LOG: restored log file "000000010000000C000000DA" from archive
>> [2007-07-07 18:24:28.051 CDT] 5962 LOG: restored log file "000000010000000C000000DB" from archive
>> [2007-07-09 08:21:50.200 CDT] 5904 LOG: received fast shutdown request
>> [2007-07-09 08:21:50.201 CDT] 5962 FATAL: could not restore file "000000010000000C000000DC" from archive: return code 15
>
> Evidently it was waiting for the restore_command script to give it back
> a file. So the problem is within your restore script. Eyeing the
> script, the only obvious thing that could block it is existence of
> /var/pgsql/data/county/$countyName/wal-files/rsync-in-progress

Sorry for the noise. It wasn't the rsync file but something even more obvious that I managed to misread. The rsync was failing to copy the files from the counties to the directory where recovery reads them. I could have sworn I checked that, but I clearly messed up.

-Kevin
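[Since the root cause turned out to be the rsync feed silently stopping, a small sketch of a cron check that would flag a stale wal-files directory. The alert address, the 90-minute threshold (archive_timeout of 3600s plus slack), and GNU stat are assumptions, not anything from the thread.]

#!/bin/bash
# Hypothetical freshness check: alert if the newest shipped WAL file is older
# than archive_timeout plus some slack, suggesting the rsync feed has stopped.
WALDIR=/var/pgsql/data/county/ozaukee/wal-files
MAX_AGE=5400   # 90 minutes

newest=$(ls -t "$WALDIR"/*.gz 2>/dev/null | head -1)
if [ -z "$newest" ]; then
    echo "no WAL files in $WALDIR" | mail -s "warm standby feed check" dba@example.com
    exit 1
fi

age=$(( $(date +%s) - $(stat -c %Y "$newest") ))
if [ "$age" -gt "$MAX_AGE" ]; then
    echo "newest WAL file $newest is ${age}s old" | mail -s "warm standby feed check" dba@example.com
fi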