Re: BUG #7753: Cannot promote out of hot standby - Mailing list pgsql-bugs

From Heikki Linnakangas
Subject Re: BUG #7753: Cannot promote out of hot standby
Date
Msg-id 515181AB.8090302@vmware.com
In response to BUG #7753: Cannot promote out of hot standby  (daniel@heroku.com)
List pgsql-bugs
(cleaning up my inbox..)

Did you ever figure out this one?

On 12.12.2012 23:36, daniel@heroku.com wrote:
> The following bug has been logged on the website:
>
> Bug reference:      7753
> Logged by:          Daniel Farina
> Email address:      daniel@heroku.com
> PostgreSQL version: 9.1.6
> Operating system:   Ubuntu 10.04
> Description:
>
> Touching a trigger file will not cause promotion out of hot standby.
> Basically, an apparently normally-working hot-standby database will
> not leave hot standby.  The database emitting WAL is version 9.1.4.
>
> Everything appears normal in the log (downloads and restoring of
> archived segments), and the server seems to take no notice of the
> trigger file.
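
For context, a 9.1 standby of this kind is normally driven by a recovery.conf
along these lines. This is only a minimal sketch: the envdir/wal-e invocation
is an assumption, and the trigger path is taken from the log messages quoted
below, not from the reporter's actual file.

    # recovery.conf (illustrative sketch, not the reporter's real configuration)
    standby_mode    = 'on'
    restore_command = 'envdir /etc/postgresql/wal-e.d/pull-env wal-e wal-fetch "%f" "%p"'
    trigger_file    = '/etc/postgresql/wal-e.d/pull-env/STANDBY_OFF'

    # Promotion is then requested by creating the trigger file:
    #   touch /etc/postgresql/wal-e.d/pull-env/STANDBY_OFF

With a setup like that, the startup process should notice the trigger file as
it cycles through the recovery loop and end recovery on its own.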
>
> To force the issue, I introduced an error into the configuration of
> the restoration program to cause it to exit.  Normally that's no
> problem; postgres would just keep on trying to restore a segment over
> and over until the error is fixed.
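
(The report doesn't say exactly how the error was introduced. With an
envdir-based wal-e setup like the one sketched above, the equivalent would be
something along these lines; the file name is hypothetical and only meant to
make the failure mode concrete.)

    # Remove the credential file from the envdir read by restore_command, so
    # the next wal-e invocation aborts with "no AWS_SECRET_ACCESS_KEY defined":
    rm /etc/postgresql/wal-e.d/pull-env/AWS_SECRET_ACCESS_KEY

    # Putting the file back later lets archive recovery resume as before.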
>
> Instead, the server crashes:
>
>   [413-1]  [COPPER] LOG:  restored log file "000000010000034D00000050" from
> archive
> wal_e.worker.s3_worker INFO     MSG: completed download and
> decompression#012        DETAIL: Downloaded and decompressed
> "s3://archive-root/wal_005/000000010000034D00000051.lzo" to
> "pg_xlog/RECOVERYXLOG"
>   [414-1]  [COPPER] LOG:  restored log file "000000010000034D00000051" from
> archive
> wal_e.worker.s3_worker INFO     MSG: completed download and
> decompression#012        DETAIL: Downloaded and decompressed
> "s3://archive-root/wal_005/000000010000034D00000052.lzo" to
> "pg_xlog/RECOVERYXLOG"
>   [415-1]  [COPPER] LOG:  restored log file "000000010000034D00000052" from
> archive
>
> # I introduce the failure here
>
> wal_e.main   ERROR    MSG: no AWS_SECRET_ACCESS_KEY defined#012        HINT:
> Define the environment variable AWS_SECRET_ACCESS_KEY.
> LOG:  trigger file found: /etc/postgresql/wal-e.d/pull-env/STANDBY_OFF
> LOG:  redo done at 34D/52248590
> LOG:  last completed transaction was at log time 2012-12-10 wal_e.main
> ERROR    MSG: no AWS_SECRET_ACCESS_KEY defined#012        HINT: Define the
> environment variable AWS_SECRET_ACCESS_KEY.
> PANIC:  could not open file "pg_xlog/000000010000034D00000052" (log file
> 845, segment 82): No such file or directory
> LOG:  startup process (PID 7) was terminated by signal 6: Aborted
> LOG:  terminating any other active server processes
> WARNING:  terminating connection because of crash of another server process
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> DETAIL:  The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
>
> I can fix the configuration and restart the server, and everything is
> as fine as before.  Next, I try removing recovery.conf and restarting
> the server as an alternative way of promoting, but to no avail;
> this time, though, with a slightly different error message:
>
> # Server begins starting
> LOG:  loaded library "auto_explain"
> LOG:  loaded library "pg_stat_statements"
> LOG:  database system was interrupted while in recovery at log time
> 2012-12-10 15:20:03 UTC
> HINT:  If this has occurred more than once some data might be corrupted and
> you might need to choose an earlier recovery target.
> LOG:  could not open file "pg_xlog/000000010000034E0000001A" (log file 846,
> segment 26): No such file or directory
> LOG:  invalid primary checkpoint record
> LOG:  could not open file "pg_xlog/000000010000034D000000F2" (log file 845,
> segment 242): No such file or directory
> LOG:  invalid secondary checkpoint record
> PANIC:  could not locate a valid checkpoint record
> LOG:  startup process (PID 7) was terminated by signal 6: Aborted
> LOG:  aborting startup due to startup process failure
> main process (24284) terminated with status 1
>
> pg_control looks like this around the same time, for reference:
>
> pg_control version number:            903
> Catalog version number:               201105231
> Database cluster state:               in archive recovery
> pg_control last modified:             Wed 12 Dec 2012 09:22:30 PM UTC
> Latest checkpoint location:           351/1FE194C0
> Prior checkpoint location:            351/FD64A78
> Latest checkpoint's REDO location:    351/131848C8
> Latest checkpoint's TimeLineID:       1
> Latest checkpoint's NextXID:          0/652342033
> Latest checkpoint's NextOID:          103224
> Latest checkpoint's NextMultiXactId:  1
> Latest checkpoint's NextMultiOffset:  0
> Latest checkpoint's oldestXID:        455900714
> Latest checkpoint's oldestXID's DB:   16385
> Latest checkpoint's oldestActiveXID:  652311442
> Time of latest checkpoint:            Mon 10 Dec 2012 07:19:23 PM UTC
> Minimum recovery ending location:     351/4BFFFE20
> Backup start location:                0/0
> Current wal_level setting:            hot_standby
> Current max_connections setting:      500
> Current max_prepared_xacts setting:   500
> Current max_locks_per_xact setting:   64
> Maximum data alignment:               8
> Database block size:                  8192
> Blocks per segment of large relation: 131072
> WAL block size:                       8192
> Bytes per WAL segment:                16777216
> Maximum length of identifiers:        64
> Maximum columns in an index:          32
> Maximum size of a TOAST chunk:        1996
> Date/time type storage:               64-bit integers
> Float4 argument passing:              by value
> Float8 argument passing:              by value
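
(A summary like the one above can be reproduced at any time by running
pg_controldata against the standby's data directory; the paths below are just
the Debian/Ubuntu defaults for 9.1, an assumption. Note that "Database cluster
state: in archive recovery" confirms the cluster never left recovery.)

    /usr/lib/postgresql/9.1/bin/pg_controldata /var/lib/postgresql/9.1/main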
>
> In the course of all this messing around, it has never been a problem to go
> back to archive recovery.
>
>
>


--
- Heikki
