Thread: 9.1 doesn't start when died mid-backup
hi got new 9.1, straight from git. compiled, installed. did initdb, and then set config values using this script: perl -pi -e ' s/\A \s* (?: [#] \s* )? listen_addresses \s* = \s*.*/listen_addresses = \047*\047/x; s/\A \s* (?: [#] \s* )? log_destination \s* = \s*.*/log_destination = \047stderr\047/x; s/\A \s* (?: [#] \s* )? logging_collector \s* = \s*.*/logging_collector = on/x; s/\A \s* (?: [#] \s* )? log_min_duration_statement \s* = \s*.*/log_min_duration_statement = 0/x; s/\A \s* (?: [#] \s* )? log_line_prefix \s* = \s*.*/log_line_prefix = \047\%m \%u\@\%d \%p \%r \047/x; s/\A \s* (?: [#] \s* )? log_temp_files \s* = \s*.*/log_temp_files = 0/x; s/\A \s* (?: [#] \s* )? (log_checkpoints|log_connections|log_disconnections|log_lock_waits) \s* = \s*.*/$1 = on/x; s/\A \s* (?: [#] \s* )? wal_level \s* = \s*.*/wal_level = hot_standby/x; s/\A \s* (?: [#] \s* )? archive_mode \s* = \s*.*/archive_mode = on/x; s/\A \s* (?: [#] \s* )? archive_command \s* = \s*.*/archive_command = \047\/bin\/true\047/x; ' data/postgresql.conf afterwards I did start pg. then, i connected with psql, and issued: $ select pg_start_backup('xx'); pg_start_backup ----------------- 0/2000020 (1 row) $ create database pgdba; CREATE DATABASE $ create user postgres with superuser; CREATE ROLE $ create user depesz with superuser; CREATE ROLE $ create database depesz with owner depesz; CREATE DATABASE Afterwards, I exited psql, and simulated power-failure, with: =$ killall -9 postgres; killall -9 postmaster postmaster: no process found All pg processes were killed. When I then tried to start Pg, I got: =$ pg_ctl start pg_ctl: another server might be running; trying to start server anyway server starting but it didn't start. log looks like this: =$ cat postgresql-2011-04-13_093859.log 2011-04-13 09:38:59.807 CEST @ 3446 LOG: database system was interrupted; last known up at 2011-04-13 09:38:43 CEST 2011-04-13 09:38:59.807 CEST @ 3446 LOG: database system was not properly shut down; automatic recovery in progress 2011-04-13 09:38:59.852 CEST @ 3446 LOG: redo starts at 0/2000020 2011-04-13 09:39:00.304 CEST @ 3446 LOG: record with zero length at 0/20018E8 2011-04-13 09:39:00.304 CEST @ 3446 LOG: redo done at 0/2001898 2011-04-13 09:39:00.304 CEST @ 3446 LOG: last completed transaction was at log time 2011-04-13 09:38:43.563356+02 2011-04-13 09:39:00.304 CEST @ 3446 FATAL: WAL ends before end of online backup 2011-04-13 09:39:00.304 CEST @ 3446 HINT: Online backup started with pg_start_backup() must be ended with pg_stop_backup(),and all WAL up to that point must be available at recovery. 2011-04-13 09:39:00.304 CEST @ 3444 LOG: startup process (PID 3446) exited with exit code 1 2011-04-13 09:39:00.304 CEST @ 3444 LOG: aborting startup due to startup process failure interestingl7y backup_label got renamed to backup_label.old, and contains: START WAL LOCATION: 0/2000020 (file 000000010000000000000002) CHECKPOINT LOCATION: 0/2000058 START TIME: 2011-04-13 09:38:19 CEST LABEL: xx Given the file was renamed, I retried: =$ pg_ctl start server starting but again - it didn't work. Log looks like this: 2011-04-13 09:44:06.844 CEST @ 3783 LOG: database system was interrupted while in recovery at 2011-04-13 09:38:59 CEST 2011-04-13 09:44:06.844 CEST @ 3783 HINT: This probably means that some data is corrupted and you will have to use thelast backup for recovery. 2011-04-13 09:44:06.845 CEST @ 3783 LOG: database system was not properly shut down; automatic recovery in progress 2011-04-13 09:44:06.877 CEST @ 3783 LOG: redo starts at 0/2000020 2011-04-13 09:44:07.316 CEST @ 3783 LOG: record with zero length at 0/20018E8 2011-04-13 09:44:07.316 CEST @ 3783 LOG: redo done at 0/2001898 2011-04-13 09:44:07.316 CEST @ 3783 LOG: last completed transaction was at log time 2011-04-13 09:38:43.563356+02 2011-04-13 09:44:07.316 CEST @ 3783 FATAL: WAL ends before end of online backup 2011-04-13 09:44:07.316 CEST @ 3783 HINT: Online backup started with pg_start_backup() must be ended with pg_stop_backup(),and all WAL up to that point must be available at recovery. 2011-04-13 09:44:07.316 CEST @ 3776 LOG: startup process (PID 3783) exited with exit code 1 2011-04-13 09:44:07.316 CEST @ 3776 LOG: aborting startup due to startup process failure So, it looks to me that by killing postgres mid-backup I made it end up in some state that doesn't let it start again. I *think* it might be related to this: http://archives.postgresql.org/pgsql-hackers/2011-03/msg01490.php and more specifically, with this: http://archives.postgresql.org/pgsql-hackers/2011-03/msg01627.php Best regards, depesz -- The best thing about modern society is how easy it is to avoid contact with it. http://depesz.com/
On 13.04.2011 10:49, hubert depesz lubaczewski wrote: > hi > got new 9.1, straight from git. > compiled, installed. > did initdb, and then set config values using this script: > perl -pi -e ' > s/\A \s* (?: [#] \s* )? listen_addresses \s* = \s*.*/listen_addresses = \047*\047/x; > s/\A \s* (?: [#] \s* )? log_destination \s* = \s*.*/log_destination = \047stderr\047/x; > s/\A \s* (?: [#] \s* )? logging_collector \s* = \s*.*/logging_collector = on/x; > s/\A \s* (?: [#] \s* )? log_min_duration_statement \s* = \s*.*/log_min_duration_statement = 0/x; > s/\A \s* (?: [#] \s* )? log_line_prefix \s* = \s*.*/log_line_prefix = \047\%m \%u\@\%d \%p \%r \047/x; > s/\A \s* (?: [#] \s* )? log_temp_files \s* = \s*.*/log_temp_files = 0/x; > s/\A \s* (?: [#] \s* )? (log_checkpoints|log_connections|log_disconnections|log_lock_waits) \s* = \s*.*/$1 = on/x; > s/\A \s* (?: [#] \s* )? wal_level \s* = \s*.*/wal_level = hot_standby/x; > s/\A \s* (?: [#] \s* )? archive_mode \s* = \s*.*/archive_mode = on/x; > s/\A \s* (?: [#] \s* )? archive_command \s* = \s*.*/archive_command = \047\/bin\/true\047/x; > ' data/postgresql.conf > > afterwards I did start pg. > > then, i connected with psql, and issued: > $ select pg_start_backup('xx'); > pg_start_backup > ----------------- > 0/2000020 > (1 row) > > $ create database pgdba; > CREATE DATABASE > > $ create user postgres with superuser; > CREATE ROLE > > $ create user depesz with superuser; > CREATE ROLE > > $ create database depesz with owner depesz; > CREATE DATABASE > > Afterwards, I exited psql, and simulated power-failure, with: > > =$ killall -9 postgres; killall -9 postmaster > postmaster: no process found > > All pg processes were killed. > > When I then tried to start Pg, I got: > =$ pg_ctl start > pg_ctl: another server might be running; trying to start server anyway > server starting > > but it didn't start. > > log looks like this: > =$ cat postgresql-2011-04-13_093859.log > 2011-04-13 09:38:59.807 CEST @ 3446 LOG: database system was interrupted; last known up at 2011-04-13 09:38:43 CEST > 2011-04-13 09:38:59.807 CEST @ 3446 LOG: database system was not properly shut down; automatic recovery in progress > 2011-04-13 09:38:59.852 CEST @ 3446 LOG: redo starts at 0/2000020 > 2011-04-13 09:39:00.304 CEST @ 3446 LOG: record with zero length at 0/20018E8 > 2011-04-13 09:39:00.304 CEST @ 3446 LOG: redo done at 0/2001898 > 2011-04-13 09:39:00.304 CEST @ 3446 LOG: last completed transaction was at log time 2011-04-13 09:38:43.563356+02 > 2011-04-13 09:39:00.304 CEST @ 3446 FATAL: WAL ends before end of online backup > 2011-04-13 09:39:00.304 CEST @ 3446 HINT: Online backup started with pg_start_backup() must be ended with pg_stop_backup(),and all WAL up to that point must be available at recovery. > 2011-04-13 09:39:00.304 CEST @ 3444 LOG: startup process (PID 3446) exited with exit code 1 > 2011-04-13 09:39:00.304 CEST @ 3444 LOG: aborting startup due to startup process failure > > interestingl7y backup_label got renamed to backup_label.old, and > contains: > START WAL LOCATION: 0/2000020 (file 000000010000000000000002) > CHECKPOINT LOCATION: 0/2000058 > START TIME: 2011-04-13 09:38:19 CEST > LABEL: xx > > Given the file was renamed, I retried: > > =$ pg_ctl start > server starting > > but again - it didn't work. > > Log looks like this: > > 2011-04-13 09:44:06.844 CEST @ 3783 LOG: database system was interrupted while in recovery at 2011-04-13 09:38:59 CEST > 2011-04-13 09:44:06.844 CEST @ 3783 HINT: This probably means that some data is corrupted and you will have to use thelast backup for recovery. > 2011-04-13 09:44:06.845 CEST @ 3783 LOG: database system was not properly shut down; automatic recovery in progress > 2011-04-13 09:44:06.877 CEST @ 3783 LOG: redo starts at 0/2000020 > 2011-04-13 09:44:07.316 CEST @ 3783 LOG: record with zero length at 0/20018E8 > 2011-04-13 09:44:07.316 CEST @ 3783 LOG: redo done at 0/2001898 > 2011-04-13 09:44:07.316 CEST @ 3783 LOG: last completed transaction was at log time 2011-04-13 09:38:43.563356+02 > 2011-04-13 09:44:07.316 CEST @ 3783 FATAL: WAL ends before end of online backup > 2011-04-13 09:44:07.316 CEST @ 3783 HINT: Online backup started with pg_start_backup() must be ended with pg_stop_backup(),and all WAL up to that point must be available at recovery. > 2011-04-13 09:44:07.316 CEST @ 3776 LOG: startup process (PID 3783) exited with exit code 1 > 2011-04-13 09:44:07.316 CEST @ 3776 LOG: aborting startup due to startup process failure > > > So, it looks to me that by killing postgres mid-backup I made it end up in some state that doesn't let it start again. > > I *think* it might be related to this: > http://archives.postgresql.org/pgsql-hackers/2011-03/msg01490.php and more > specifically, with this: > http://archives.postgresql.org/pgsql-hackers/2011-03/msg01627.php Yep :-(. I'm thinking we need to revert that change, now that we know there was a reason for only checking the end-of-WAL when doing archive recovery after all. We can keep it as a warning, though. With a hint explaining that if you're doing crash recovery from a database that crashed while an online backup was in progress, it's ok, but if you're recovering from an online backup, the backup is broken. For backups taken with pg_basebackup in 9.1, we can add a flag to the backup_label, indicating that the backup was taken with pg_basebackup. For such backups, the above scenario really should not happen, and we can still make it a hard error if it does. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 13.04.2011 14:18, Heikki Linnakangas wrote: > On 13.04.2011 10:49, hubert depesz lubaczewski wrote: >> hi >> got new 9.1, straight from git. >> compiled, installed. >> did initdb, and then set config values using this script: >> perl -pi -e ' >> s/\A \s* (?: [#] \s* )? listen_addresses \s* = \s*.*/listen_addresses >> = \047*\047/x; >> s/\A \s* (?: [#] \s* )? log_destination \s* = \s*.*/log_destination = >> \047stderr\047/x; >> s/\A \s* (?: [#] \s* )? logging_collector \s* = >> \s*.*/logging_collector = on/x; >> s/\A \s* (?: [#] \s* )? log_min_duration_statement \s* = >> \s*.*/log_min_duration_statement = 0/x; >> s/\A \s* (?: [#] \s* )? log_line_prefix \s* = \s*.*/log_line_prefix = >> \047\%m \%u\@\%d \%p \%r \047/x; >> s/\A \s* (?: [#] \s* )? log_temp_files \s* = \s*.*/log_temp_files = 0/x; >> s/\A \s* (?: [#] \s* )? >> (log_checkpoints|log_connections|log_disconnections|log_lock_waits) >> \s* = \s*.*/$1 = on/x; >> s/\A \s* (?: [#] \s* )? wal_level \s* = \s*.*/wal_level = hot_standby/x; >> s/\A \s* (?: [#] \s* )? archive_mode \s* = \s*.*/archive_mode = on/x; >> s/\A \s* (?: [#] \s* )? archive_command \s* = \s*.*/archive_command = >> \047\/bin\/true\047/x; >> ' data/postgresql.conf >> >> afterwards I did start pg. >> >> then, i connected with psql, and issued: >> $ select pg_start_backup('xx'); >> pg_start_backup >> ----------------- >> 0/2000020 >> (1 row) >> >> $ create database pgdba; >> CREATE DATABASE >> >> $ create user postgres with superuser; >> CREATE ROLE >> >> $ create user depesz with superuser; >> CREATE ROLE >> >> $ create database depesz with owner depesz; >> CREATE DATABASE >> >> Afterwards, I exited psql, and simulated power-failure, with: >> >> =$ killall -9 postgres; killall -9 postmaster >> postmaster: no process found >> >> All pg processes were killed. >> >> When I then tried to start Pg, I got: >> =$ pg_ctl start >> pg_ctl: another server might be running; trying to start server anyway >> server starting >> >> but it didn't start. >> >> log looks like this: >> =$ cat postgresql-2011-04-13_093859.log >> 2011-04-13 09:38:59.807 CEST @ 3446 LOG: database system was >> interrupted; last known up at 2011-04-13 09:38:43 CEST >> 2011-04-13 09:38:59.807 CEST @ 3446 LOG: database system was not >> properly shut down; automatic recovery in progress >> 2011-04-13 09:38:59.852 CEST @ 3446 LOG: redo starts at 0/2000020 >> 2011-04-13 09:39:00.304 CEST @ 3446 LOG: record with zero length at >> 0/20018E8 >> 2011-04-13 09:39:00.304 CEST @ 3446 LOG: redo done at 0/2001898 >> 2011-04-13 09:39:00.304 CEST @ 3446 LOG: last completed transaction >> was at log time 2011-04-13 09:38:43.563356+02 >> 2011-04-13 09:39:00.304 CEST @ 3446 FATAL: WAL ends before end of >> online backup >> 2011-04-13 09:39:00.304 CEST @ 3446 HINT: Online backup started with >> pg_start_backup() must be ended with pg_stop_backup(), and all WAL up >> to that point must be available at recovery. >> 2011-04-13 09:39:00.304 CEST @ 3444 LOG: startup process (PID 3446) >> exited with exit code 1 >> 2011-04-13 09:39:00.304 CEST @ 3444 LOG: aborting startup due to >> startup process failure >> >> interestingl7y backup_label got renamed to backup_label.old, and >> contains: >> START WAL LOCATION: 0/2000020 (file 000000010000000000000002) >> CHECKPOINT LOCATION: 0/2000058 >> START TIME: 2011-04-13 09:38:19 CEST >> LABEL: xx >> >> Given the file was renamed, I retried: >> >> =$ pg_ctl start >> server starting >> >> but again - it didn't work. >> >> Log looks like this: >> >> 2011-04-13 09:44:06.844 CEST @ 3783 LOG: database system was >> interrupted while in recovery at 2011-04-13 09:38:59 CEST >> 2011-04-13 09:44:06.844 CEST @ 3783 HINT: This probably means that >> some data is corrupted and you will have to use the last backup for >> recovery. >> 2011-04-13 09:44:06.845 CEST @ 3783 LOG: database system was not >> properly shut down; automatic recovery in progress >> 2011-04-13 09:44:06.877 CEST @ 3783 LOG: redo starts at 0/2000020 >> 2011-04-13 09:44:07.316 CEST @ 3783 LOG: record with zero length at >> 0/20018E8 >> 2011-04-13 09:44:07.316 CEST @ 3783 LOG: redo done at 0/2001898 >> 2011-04-13 09:44:07.316 CEST @ 3783 LOG: last completed transaction >> was at log time 2011-04-13 09:38:43.563356+02 >> 2011-04-13 09:44:07.316 CEST @ 3783 FATAL: WAL ends before end of >> online backup >> 2011-04-13 09:44:07.316 CEST @ 3783 HINT: Online backup started with >> pg_start_backup() must be ended with pg_stop_backup(), and all WAL up >> to that point must be available at recovery. >> 2011-04-13 09:44:07.316 CEST @ 3776 LOG: startup process (PID 3783) >> exited with exit code 1 >> 2011-04-13 09:44:07.316 CEST @ 3776 LOG: aborting startup due to >> startup process failure >> >> >> So, it looks to me that by killing postgres mid-backup I made it end >> up in some state that doesn't let it start again. >> >> I *think* it might be related to this: >> http://archives.postgresql.org/pgsql-hackers/2011-03/msg01490.php and >> more >> specifically, with this: >> http://archives.postgresql.org/pgsql-hackers/2011-03/msg01627.php > > Yep :-(. I'm thinking we need to revert that change, now that we know > there was a reason for only checking the end-of-WAL when doing archive > recovery after all. We can keep it as a warning, though. With a hint > explaining that if you're doing crash recovery from a database that > crashed while an online backup was in progress, it's ok, but if you're > recovering from an online backup, the backup is broken. Committed a quick revertion of this for now. > For backups taken with pg_basebackup in 9.1, we can add a flag to the > backup_label, indicating that the backup was taken with pg_basebackup. > For such backups, the above scenario really should not happen, and we > can still make it a hard error if it does. Added this to the open items list. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, Apr 13, 2011 at 8:18 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Yep :-(. I'm thinking we need to revert that change, now that we know there > was a reason for only checking the end-of-WAL when doing archive recovery > after all. We can keep it as a warning, though. With a hint explaining that > if you're doing crash recovery from a database that crashed while an online > backup was in progress, it's ok, but if you're recovering from an online > backup, the backup is broken. Yeah, we cannot distinguish those two cases. I agree to emit such a warning. > For backups taken with pg_basebackup in 9.1, we can add a flag to the > backup_label, indicating that the backup was taken with pg_basebackup. For > such backups, the above scenario really should not happen, and we can still > make it a hard error if it does. Agreed. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center