BUG #7753: Cannot promote out of hot standby

From: daniel@heroku.com
Date: 2012-12-12 23:36
The following bug has been logged on the website:

Bug reference:      7753
Logged by:          Daniel Farina
Email address:      daniel@heroku.com
PostgreSQL version: 9.1.6
Operating system:   Ubuntu 10.04
Description:

Touching a trigger file will not cause promotion out of hot standby.
Basically, an apparently normally-working hot-standby database will
not leave hot standby.  The database emitting WAL is version 9.1.4.

Everything appears normal in the log (downloads and restoring of
archived segments), and the server seems to take no notice of the
trigger file.
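
For context, trigger-file promotion is driven by recovery.conf.  A minimal
sketch of the relevant settings (the trigger_file path is the one that
shows up in the log below; the restore_command is illustrative of a
wal-e setup, not necessarily the exact invocation used here):

    # recovery.conf (sketch; restore_command is illustrative)
    standby_mode = 'on'
    restore_command = 'envdir /etc/postgresql/wal-e.d/pull-env wal-e wal-fetch "%f" "%p"'
    trigger_file = '/etc/postgresql/wal-e.d/pull-env/STANDBY_OFF'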

To force the issue, I introduced an error into the configuration of
the restoration program to cause it to exit.  Normally that's no
problem; postgres would just keep on trying to restore a segment over
and over until the error is fixed.
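
(Assuming the envdir layout sketched above, forcing that failure amounts
to removing the credential file wal-e reads, e.g.:

    # hypothetical; assumes the envdir layout sketched earlier
    rm /etc/postgresql/wal-e.d/pull-env/AWS_SECRET_ACCESS_KEY

so that every subsequent wal-fetch exits with an error.)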

Instead, the server crashes:

 [413-1]  [COPPER] LOG:  restored log file "000000010000034D00000050" from
archive
wal_e.worker.s3_worker INFO     MSG: completed download and
decompression#012        DETAIL: Downloaded and decompressed
"s3://archive-root/wal_005/000000010000034D00000051.lzo" to
"pg_xlog/RECOVERYXLOG"
 [414-1]  [COPPER] LOG:  restored log file "000000010000034D00000051" from
archive
wal_e.worker.s3_worker INFO     MSG: completed download and
decompression#012        DETAIL: Downloaded and decompressed
"s3://archive-root/wal_005/000000010000034D00000052.lzo" to
"pg_xlog/RECOVERYXLOG"
 [415-1]  [COPPER] LOG:  restored log file "000000010000034D00000052" from
archive

# I introduce the failure here

wal_e.main   ERROR    MSG: no AWS_SECRET_ACCESS_KEY defined#012        HINT:
Define the environment variable AWS_SECRET_ACCESS_KEY.
LOG:  trigger file found: /etc/postgresql/wal-e.d/pull-env/STANDBY_OFF
LOG:  redo done at 34D/52248590
LOG:  last completed transaction was at log time 2012-12-10 wal_e.main
ERROR    MSG: no AWS_SECRET_ACCESS_KEY defined#012        HINT: Define the
environment variable AWS_SECRET_ACCESS_KEY.
PANIC:  could not open file "pg_xlog/000000010000034D00000052" (log file
845, segment 82): No such file or directory
LOG:  startup process (PID 7) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.

I can fix the configuration and restart the server, and everything is
as fine as before.  Next, I try removing recovery.conf and restarting
the server as an alternative way of promoting, roughly as sketched
below, but to no avail, and with a slightly different error message.
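
In terms of commands, something like the following (the data directory
path and cluster name are illustrative, not the exact invocation):

    # paraphrased; paths and cluster name are illustrative
    mv /var/lib/postgresql/9.1/main/recovery.conf /tmp/recovery.conf.bak
    pg_ctlcluster 9.1 main restart

The startup attempt then fails like this: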

# Server begins starting
LOG:  loaded library "auto_explain"
LOG:  loaded library "pg_stat_statements"
LOG:  database system was interrupted while in recovery at log time
2012-12-10 15:20:03 UTC
HINT:  If this has occurred more than once some data might be corrupted and
you might need to choose an earlier recovery target.
LOG:  could not open file "pg_xlog/000000010000034E0000001A" (log file 846,
segment 26): No such file or directory
LOG:  invalid primary checkpoint record
LOG:  could not open file "pg_xlog/000000010000034D000000F2" (log file 845,
segment 242): No such file or directory
LOG:  invalid secondary checkpoint record
PANIC:  could not locate a valid checkpoint record
LOG:  startup process (PID 7) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure
main process (24284) terminated with status 1

pg_control looks like this around the same time, for reference:

pg_control version number:            903
Catalog version number:               201105231
Database cluster state:               in archive recovery
pg_control last modified:             Wed 12 Dec 2012 09:22:30 PM UTC
Latest checkpoint location:           351/1FE194C0
Prior checkpoint location:            351/FD64A78
Latest checkpoint's REDO location:    351/131848C8
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/652342033
Latest checkpoint's NextOID:          103224
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        455900714
Latest checkpoint's oldestXID's DB:   16385
Latest checkpoint's oldestActiveXID:  652311442
Time of latest checkpoint:            Mon 10 Dec 2012 07:19:23 PM UTC
Minimum recovery ending location:     351/4BFFFE20
Backup start location:                0/0
Current wal_level setting:            hot_standby
Current max_connections setting:      500
Current max_prepared_xacts setting:   500
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
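
(That listing is the output of pg_controldata, e.g.:

    # the data directory path is illustrative
    pg_controldata /var/lib/postgresql/9.1/main

with the path adjusted to the actual data directory.)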

In the course of all this messing around, it has never been a problem to go
back to archive recovery.
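
(Going back is roughly the reverse of the earlier sketch, again with
illustrative paths:

    # illustrative; put recovery.conf back and restart
    mv /tmp/recovery.conf.bak /var/lib/postgresql/9.1/main/recovery.conf
    pg_ctlcluster 9.1 main restart

after which the standby resumes fetching and replaying segments.)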

Re: BUG #7753: Cannot promote out of hot standby

From: Heikki Linnakangas
Date:
(cleaning up my inbox..)

Did you ever figure out this one?

On 12.12.2012 23:36, daniel@heroku.com wrote:
> Touching a trigger file will not cause promotion out of hot standby.
> Basically, an apparently normally-working hot-standby database will
> not leave hot standby.  The database emitting WAL is version 9.1.4.
> [...]


--
- Heikki