PostgreSQL with BDR - PANIC: could not create replication identifier checkpoint - Mailing list pgsql-general

From Cameron Smith
Subject PostgreSQL with BDR - PANIC: could not create replication identifier checkpoint
Date
Msg-id CO2PR0801MB221456873A14376748F6C135A04A0@CO2PR0801MB2214.namprd08.prod.outlook.com
Responses Re: PostgreSQL with BDR - PANIC: could not create replication identifier checkpoint  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: PostgreSQL with BDR - PANIC: could not create replication identifier checkpoint  (Christoph Moench-Tegeder <cmt@burggraben.net>)
List pgsql-general

Hi PostgreSQL community:


We have a three-node PostgreSQL BDR setup.  One of our nodes went down due to a power issue.  After we brought the server back online, the OS reported that it needed to repair some files.  Once the repair completed and we restarted the postgresql service, we noticed that it was crashing very quickly.  Checking the logs revealed some panics (please see below).  The other two nodes appear to be up and running and replicating with each other, but our WAL backlog is slowly growing.


A description of what you are trying to achieve and what results you expect:
Repair a damaged node and get it replicating to/from the other nodes in our cluster.

PostgreSQL version number you are running:
PostgreSQL 9.4.4 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-11), 64-bit

How you installed PostgreSQL:
Downloaded the PostgreSQL with BDR RPMs from 2ndQuadrant.

Changes made to the settings in the postgresql.conf file:
"application_name";"pgAdmin III - Query Tool";"client"
"bdr.log_conflicts_to_table";"on";"configuration file"
"bytea_output";"escape";"session"
"client_encoding";"SQL_ASCII";"session"
"client_min_messages";"notice";"session"
"DateStyle";"ISO, MDY";"session"
"default_sequenceam";"bdr";"configuration file"
"default_text_search_config";"pg_catalog.english";"configuration file"
"dynamic_shared_memory_type";"posix";"configuration file"
"lc_messages";"en_US.UTF-8";"configuration file"
"lc_monetary";"en_US.UTF-8";"configuration file"
"lc_numeric";"en_US.UTF-8";"configuration file"
"lc_time";"en_US.UTF-8";"configuration file"
"listen_addresses";"*";"configuration file"
"log_destination";"stderr";"configuration file"
"log_directory";"/var/log/postgresql";"configuration file"
"log_error_verbosity";"default";"configuration file"
"log_filename";"postgresql-%Y-%m-%d_%H%M%S.log";"configuration file"
"log_line_prefix";"t:%m d=%d p=%p a=%a%q ";"configuration file"
"log_min_messages";"info";"configuration file"
"log_rotation_age";"1d";"configuration file"
"log_rotation_size";"0";"configuration file"
"log_timezone";"UTC";"configuration file"
"log_truncate_on_rotation";"on";"configuration file"
"logging_collector";"on";"configuration file"
"max_connections";"100";"configuration file"
"max_replication_slots";"10";"configuration file"
"max_stack_depth";"2MB";"environment variable"
"max_wal_senders";"10";"configuration file"
"max_worker_processes";"10";"configuration file"
"port";"54330";"command line"
"shared_buffers";"128MB";"configuration file"
"shared_preload_libraries";"bdr";"configuration file"
"TimeZone";"UTC";"configuration file"
"track_commit_timestamp";"on";"configuration file"
"wal_level";"logical";"configuration file"

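For readers who want to scan the settings dump above, here is a minimal Python sketch that splits the semicolon-separated lines and keeps only the entries whose source is the configuration file. The function names (parse_settings, from_config_file) are illustrative only, and the sample is a handful of lines copied from the list above rather than the full dump.

```python
import csv
import io

# A few lines copied from the settings dump above: "name";"value";"source".
SETTINGS_DUMP = '''"application_name";"pgAdmin III - Query Tool";"client"
"bdr.log_conflicts_to_table";"on";"configuration file"
"max_replication_slots";"10";"configuration file"
"port";"54330";"command line"
"wal_level";"logical";"configuration file"
'''


def parse_settings(text):
    """Return a list of (name, value, source) tuples from the dump."""
    reader = csv.reader(io.StringIO(text), delimiter=";", quotechar='"')
    return [tuple(row) for row in reader if len(row) == 3]


def from_config_file(settings):
    """Keep only the settings whose source is the configuration file."""
    return {
        name: value
        for name, value, source in settings
        if source == "configuration file"
    }


if __name__ == "__main__":
    config = from_config_file(parse_settings(SETTINGS_DUMP))
    for name, value in sorted(config.items()):
        print(f"{name} = {value}")
```

Session-level and client-level entries (for example the pgAdmin application_name) are filtered out, which leaves the values that actually come from postgresql.conf.
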
Operating system and version:
CentOS release 6.6
Linux <removed server name> 2.6.32-504.el6.x86_64 #1 SMP Wed Oct 15 04:27:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

For questions about any kind of error:

What program you're using to connect to PostgreSQL:
Django 1.7.0

Is there anything relevant or unusual in the PostgreSQL server logs?:
Hit a panic when attempting to create a replication identifier checkpoint.  Subsequent restarts of postgresql failed with a panic complaining that it could not open a replication slot file.

...
t:2016-05-19 01:14:51.668 UTC d= p=144 a=PANIC:  could not create replication identifier checkpoint "pg_logical/checkpoints/8-F3923F98.ckpt.tmp": Invalid argument
t:2016-05-19 01:14:51.671 UTC d= p=9729 a=WARNING:  could not create relation-cache initialization file "global/pg_internal.init.9729": Invalid argument
t:2016-05-19 01:14:51.671 UTC d= p=9729 a=DETAIL:  Continuing anyway, but there's something wrong.
t:2016-05-19 01:14:51.674 UTC d= p=133 a=LOG:  checkpointer process (PID 144) was terminated by signal 6: Aborted
t:2016-05-19 01:14:51.674 UTC d= p=133 a=LOG:  terminating any other active server processes
t:2016-05-19 01:14:51.675 UTC d= p=147 a=WARNING:  terminating connection because of crash of another server process
t:2016-05-19 01:14:51.675 UTC d= p=147 a=DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
t:2016-05-19 01:14:51.675 UTC d= p=147 a=HINT:  In a moment you should be able to reconnect to the database and repeat your command.
t:2016-05-19 01:14:51.675 UTC d= p=148 a=LOG:  could not open temporary statistics file "pg_stat/global.tmp": Invalid argument
t:2016-05-19 01:14:51.694 UTC d= p=9729 a=WARNING:  terminating connection because of crash of another server process
t:2016-05-19 01:14:51.694 UTC d= p=9729 a=DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
t:2016-05-19 01:14:51.694 UTC d= p=9729 a=HINT:  In a moment you should be able to reconnect to the database and repeat your command.
t:2016-05-19 01:14:51.786 UTC d= p=9730 a=WARNING:  terminating connection because of crash of another server process
t:2016-05-19 01:14:51.786 UTC d= p=9730 a=DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
t:2016-05-19 01:14:51.786 UTC d= p=9730 a=HINT:  In a moment you should be able to reconnect to the database and repeat your command.
t:2016-05-19 01:14:51.787 UTC d= p=9731 a=WARNING:  terminating connection because of crash of another server process
t:2016-05-19 01:14:51.787 UTC d= p=9731 a=DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
t:2016-05-19 01:14:51.787 UTC d= p=9731 a=HINT:  In a moment you should be able to reconnect to the database and repeat your command.
t:2016-05-19 01:14:51.819 UTC d= p=133 a=LOG:  all server processes terminated; reinitializing
t:2016-05-19 01:14:51.819 UTC d= p=133 a=FATAL:  could not open "global/bdr.stat.tmp": Invalid argument
< end of the log file>

<start of next log file>
t:2016-05-19 15:01:58.989 UTC d= p=40 a=LOG:  database system was interrupted; last known up at 2016-05-19 00:58:43 UTC
t:2016-05-19 15:01:59.251 UTC d=postgres p=41 a=[unknown] FATAL:  the database system is starting up
t:2016-05-19 15:01:59.390 UTC d= p=40 a=PANIC:  could not open file "pg_replslot/bdr_19814_6258399326068244492_2_19814__/state": No such file or directory
t:2016-05-19 15:01:59.390 UTC d= p=38 a=LOG:  startup process (PID 40) was terminated by signal 6: Aborted
t:2016-05-19 15:01:59.390 UTC d= p=38 a=LOG:  aborting startup due to startup process failure
< end of the log file>

<start of next log file>
t:2016-05-19 15:04:15.139 UTC d= p=40 a=LOG:  database system was interrupted; last known up at 2016-05-19 00:58:43 UTC
t:2016-05-19 15:04:15.289 UTC d= p=40 a=PANIC:  could not open file "pg_replslot/bdr_19814_6258399326068244492_2_19814__/state": No such file or directory
t:2016-05-19 15:04:15.289 UTC d= p=38 a=LOG:  startup process (PID 40) was terminated by signal 6: Aborted
t:2016-05-19 15:04:15.289 UTC d= p=38 a=LOG:  aborting startup due to startup process failure
< end of the log file>

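To pull just the PANIC and FATAL entries out of logs like the ones above, a small Python sketch follows. It assumes the log_line_prefix shown in the settings ("t:%m d=%d p=%p a=%a%q "); the regular expression and the function name severe_entries are illustrative, and the sample text is a few lines copied from the logs above.

```python
import re

# Matches lines written with log_line_prefix "t:%m d=%d p=%p a=%a%q ".
# The application name may be empty (background processes) or present
# (for example "[unknown]"), so it is matched lazily before the severity.
LINE_RE = re.compile(
    r"^t:(?P<ts>\S+ \S+ \S+) d=(?P<db>\S*) p=(?P<pid>\d+) a=(?P<app>.*?)\s?"
    r"(?P<sev>PANIC|FATAL|ERROR|WARNING|LOG|DETAIL|HINT):\s+(?P<msg>.*)$"
)

# Sample lines copied from the logs above.
SAMPLE = '''t:2016-05-19 01:14:51.668 UTC d= p=144 a=PANIC:  could not create replication identifier checkpoint "pg_logical/checkpoints/8-F3923F98.ckpt.tmp": Invalid argument
t:2016-05-19 01:14:51.674 UTC d= p=133 a=LOG:  checkpointer process (PID 144) was terminated by signal 6: Aborted
t:2016-05-19 01:14:51.819 UTC d= p=133 a=FATAL:  could not open "global/bdr.stat.tmp": Invalid argument
t:2016-05-19 15:01:59.251 UTC d=postgres p=41 a=[unknown] FATAL:  the database system is starting up
t:2016-05-19 15:01:59.390 UTC d= p=40 a=PANIC:  could not open file "pg_replslot/bdr_19814_6258399326068244492_2_19814__/state": No such file or directory
'''


def severe_entries(text, levels=("PANIC", "FATAL")):
    """Return (timestamp, pid, severity, message) for the selected levels."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group("sev") in levels:
            entries.append(
                (
                    match.group("ts"),
                    int(match.group("pid")),
                    match.group("sev"),
                    match.group("msg"),
                )
            )
    return entries


if __name__ == "__main__":
    for ts, pid, sev, msg in severe_entries(SAMPLE):
        print(f"{ts} pid={pid} {sev}: {msg}")
```

Lines that do not match the prefix (such as the "end of the log file" markers) are simply skipped.
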

What you were doing when the error happened / how to cause the error:
A power surge may have caused the server to go offline.  On restart, the OS found problems with files that needed to be repaired.  Starting postgresql seems to complete, but then the process crashes.
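
Since the startup PANIC above complains about a missing "state" file under pg_replslot, one read-only check is to list which slot directories lack that file. The following Python sketch only reports; it does not repair or remove anything. The function name slots_missing_state is illustrative, and the data directory path is an assumption that must be supplied by whoever runs it.

```python
import os
import sys


def slots_missing_state(data_dir):
    """Return names of pg_replslot subdirectories that have no 'state' file."""
    slot_root = os.path.join(data_dir, "pg_replslot")
    missing = []
    for name in sorted(os.listdir(slot_root)):
        slot_dir = os.path.join(slot_root, name)
        state_file = os.path.join(slot_dir, "state")
        if os.path.isdir(slot_dir) and not os.path.isfile(state_file):
            missing.append(name)
    return missing


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the PostgreSQL data directory as the first argument.
    for slot in slots_missing_state(sys.argv[1]):
        print(f"missing state file: pg_replslot/{slot}/state")
```

Run it with the server stopped and with an account that can read the data directory, passing the data directory as the only argument.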


Cameron Smith


