Thread: PostgreSQL with BDR - PANIC: could not create replication identifier checkpoint

Hi PostgreSQL community:


We have a three-node PostgreSQL BDR setup.  One of our nodes went down due to a power issue.  After bringing the server back online, the OS reported that some files needed repair.  Once that completed and we restarted the PostgreSQL service, we noticed it was crashing very quickly.  Checking the logs revealed some panics (please see below).  The other two nodes appear to be up, running, and replicating with each other, but our WAL backlog is slowly growing.


A description of what you are trying to achieve and what results you expect:
Repair a damaged node and get it replicating to/from the other nodes in our cluster.

PostgreSQL version number you are running:
PostgreSQL 9.4.4 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-11), 64-bit

How you installed PostgreSQL:
Downloaded the PostgreSQL-with-BDR RPMs from 2ndQuadrant.

Changes made to the settings in the postgresql.conf file:
"application_name";"pgAdmin III - Query Tool";"client"
"bdr.log_conflicts_to_table";"on";"configuration file"
"bytea_output";"escape";"session"
"client_encoding";"SQL_ASCII";"session"
"client_min_messages";"notice";"session"
"DateStyle";"ISO, MDY";"session"
"default_sequenceam";"bdr";"configuration file"
"default_text_search_config";"pg_catalog.english";"configuration file"
"dynamic_shared_memory_type";"posix";"configuration file"
"lc_messages";"en_US.UTF-8";"configuration file"
"lc_monetary";"en_US.UTF-8";"configuration file"
"lc_numeric";"en_US.UTF-8";"configuration file"
"lc_time";"en_US.UTF-8";"configuration file"
"listen_addresses";"*";"configuration file"
"log_destination";"stderr";"configuration file"
"log_directory";"/var/log/postgresql";"configuration file"
"log_error_verbosity";"default";"configuration file"
"log_filename";"postgresql-%Y-%m-%d_%H%M%S.log";"configuration file"
"log_line_prefix";"t:%m d=%d p=%p a=%a%q ";"configuration file"
"log_min_messages";"info";"configuration file"
"log_rotation_age";"1d";"configuration file"
"log_rotation_size";"0";"configuration file"
"log_timezone";"UTC";"configuration file"
"log_truncate_on_rotation";"on";"configuration file"
"logging_collector";"on";"configuration file"
"max_connections";"100";"configuration file"
"max_replication_slots";"10";"configuration file"
"max_stack_depth";"2MB";"environment variable"
"max_wal_senders";"10";"configuration file"
"max_worker_processes";"10";"configuration file"
"port";"54330";"command line"
"shared_buffers";"128MB";"configuration file"
"shared_preload_libraries";"bdr";"configuration file"
"TimeZone";"UTC";"configuration file"
"track_commit_timestamp";"on";"configuration file"
"wal_level";"logical";"configuration file"

Operating system and version:
CentOS release 6.6
Linux <removed server name> 2.6.32-504.el6.x86_64 #1 SMP Wed Oct 15 04:27:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

For questions about any kind of error:

What program you're using to connect to PostgreSQL:
Django 1.7.0

Is there anything relevant or unusual in the PostgreSQL server logs?:
We hit a panic when attempting to create a replication identifier checkpoint.  Subsequent restarts of PostgreSQL failed with a panic complaining that it could not open a replication slot file.

...
t:2016-05-19 01:14:51.668 UTC d= p=144 a=PANIC:  could not create replication identifier checkpoint "pg_logical/checkpoints/8-F3923F98.ckpt.tmp": Invalid argument
t:2016-05-19 01:14:51.671 UTC d= p=9729 a=WARNING:  could not create relation-cache initialization file "global/pg_internal.init.9729": Invalid argument
t:2016-05-19 01:14:51.671 UTC d= p=9729 a=DETAIL:  Continuing anyway, but there's something wrong.
t:2016-05-19 01:14:51.674 UTC d= p=133 a=LOG:  checkpointer process (PID 144) was terminated by signal 6: Aborted
t:2016-05-19 01:14:51.674 UTC d= p=133 a=LOG:  terminating any other active server processes
t:2016-05-19 01:14:51.675 UTC d= p=147 a=WARNING:  terminating connection because of crash of another server process
t:2016-05-19 01:14:51.675 UTC d= p=147 a=DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
t:2016-05-19 01:14:51.675 UTC d= p=147 a=HINT:  In a moment you should be able to reconnect to the database and repeat your command.
t:2016-05-19 01:14:51.675 UTC d= p=148 a=LOG:  could not open temporary statistics file "pg_stat/global.tmp": Invalid argument
t:2016-05-19 01:14:51.694 UTC d= p=9729 a=WARNING:  terminating connection because of crash of another server process
t:2016-05-19 01:14:51.694 UTC d= p=9729 a=DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
t:2016-05-19 01:14:51.694 UTC d= p=9729 a=HINT:  In a moment you should be able to reconnect to the database and repeat your command.
t:2016-05-19 01:14:51.786 UTC d= p=9730 a=WARNING:  terminating connection because of crash of another server process
t:2016-05-19 01:14:51.786 UTC d= p=9730 a=DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
t:2016-05-19 01:14:51.786 UTC d= p=9730 a=HINT:  In a moment you should be able to reconnect to the database and repeat your command.
t:2016-05-19 01:14:51.787 UTC d= p=9731 a=WARNING:  terminating connection because of crash of another server process
t:2016-05-19 01:14:51.787 UTC d= p=9731 a=DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
t:2016-05-19 01:14:51.787 UTC d= p=9731 a=HINT:  In a moment you should be able to reconnect to the database and repeat your command.
t:2016-05-19 01:14:51.819 UTC d= p=133 a=LOG:  all server processes terminated; reinitializing
t:2016-05-19 01:14:51.819 UTC d= p=133 a=FATAL:  could not open "global/bdr.stat.tmp": Invalid argument
<end of the log file>

<start of next log file>
t:2016-05-19 15:01:58.989 UTC d= p=40 a=LOG:  database system was interrupted; last known up at 2016-05-19 00:58:43 UTC
t:2016-05-19 15:01:59.251 UTC d=postgres p=41 a=[unknown] FATAL:  the database system is starting up
t:2016-05-19 15:01:59.390 UTC d= p=40 a=PANIC:  could not open file "pg_replslot/bdr_19814_6258399326068244492_2_19814__/state": No such file or directory
t:2016-05-19 15:01:59.390 UTC d= p=38 a=LOG:  startup process (PID 40) was terminated by signal 6: Aborted
t:2016-05-19 15:01:59.390 UTC d= p=38 a=LOG:  aborting startup due to startup process failure
<end of the log file>

<start of next log file>
t:2016-05-19 15:04:15.139 UTC d= p=40 a=LOG:  database system was interrupted; last known up at 2016-05-19 00:58:43 UTC
t:2016-05-19 15:04:15.289 UTC d= p=40 a=PANIC:  could not open file "pg_replslot/bdr_19814_6258399326068244492_2_19814__/state": No such file or directory
t:2016-05-19 15:04:15.289 UTC d= p=38 a=LOG:  startup process (PID 40) was terminated by signal 6: Aborted
t:2016-05-19 15:04:15.289 UTC d= p=38 a=LOG:  aborting startup due to startup process failure
<end of the log file>


What you were doing when the error happened / how to cause the error:
There may have been a power surge that caused the server to go offline.  On restart, the OS found problems with files that needed to be repaired.  Starting PostgreSQL appears to succeed, but the process then crashes.


Cameron Smith


Re: PostgreSQL with BDR - PANIC: could not create replication identifier checkpoint

From: Álvaro Herrera
Cameron Smith wrote:

> t:2016-05-19 01:14:51.668 UTC d= p=144 a=PANIC:  could not create replication identifier checkpoint
> "pg_logical/checkpoints/8-F3923F98.ckpt.tmp": Invalid argument

This line corresponds to the following code in BDR's 9.4.4
src/backend/replication/logical/replication_identifier.c:

    /*
     * no other backend can perform this at the same time, we're protected by
     * CheckpointLock.
     */
    tmpfd = OpenTransientFile(tmppath,
                              O_CREAT | O_EXCL | O_WRONLY | PG_BINARY,
                              S_IRUSR | S_IWUSR);
    if (tmpfd < 0)
        ereport(PANIC,
                (errcode_for_file_access(),
                 errmsg("could not create replication identifier checkpoint \"%s\": %m",
                        tmppath)));

This file does not exist in 9.5, but instead we have
src/backend/replication/logical/origin.c which has identical code.

OpenTransientFile calls BasicOpenFile, which in turn calls open() and
propagates the errno.  My manpage doesn't list any possible reasons for
open() to return EINVAL, so I'm at a loss about what is happening here.
Maybe this is a filesystem problem?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: PostgreSQL with BDR - PANIC: could not create replication identifier checkpoint

From: Christoph Moench-Tegeder
## Cameron Smith (csmith@stereodllc.com):

> t:2016-05-19 01:14:51.668 UTC d= p=144 a=PANIC:  could not create replication identifier checkpoint
> "pg_logical/checkpoints/8-F3923F98.ckpt.tmp": Invalid argument
> t:2016-05-19 01:14:51.671 UTC d= p=9729 a=WARNING:  could not create relation-cache initialization file
> "global/pg_internal.init.9729": Invalid argument

This and the other "Invalid argument" errors give the impression that
your "power issue" has wrecked your filesystem - as Álvaro already
pointed out, there's no sane way for open() to return EINVAL to
PostgreSQL on any "recommended" filesystem.
This shouldn't happen with any decently modern filesystem and "safe"
hardware. Did you tweak anything on your filesystem? Did you check
your storage stack against https://wiki.postgresql.org/wiki/Reliable_Writes?

Regards,
Christoph

--
Spare Space


I'd agree: most likely a filesystem problem.  Is there any hope that this file could be rebuilt?

My current plan is to use bdr_part_by_node_names to remove the failing node and then rebuild it from a fresh backup
(and probably on a new server).

Thank you for your help!

Cameron Smith


________________________________________
From: Alvaro Herrera <alvherre@2ndquadrant.com>
Sent: May 19, 2016 2:56 PM
To: Cameron Smith
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] PostgreSQL with BDR - PANIC:  could not create replication identifier checkpoint



On 19/05/16 at 16:15, Cameron Smith wrote:
> I'd agree:  most likely a file system problem.  Is there any hope that this file could be re-built?
>
> My current plan is to use bdr_part_by_node_names to remove the failing node and then rebuild it from a fresh backup
> (and probably on a new server).

I think the most sensible plan is to remove the node from the BDR
cluster with bdr_part_by_node_names(), maybe clean up the bdr_nodes table
(some won't be happy with me suggesting this :)), remove the data
directory on the failed node, and rejoin with bdr_init_copy.

I'd also suggest following Christoph's advice and checking that you
have a sane filesystem configuration.

Also check whether you ended up with a damaged disk (run some stress
tests on the hardware).

If this is a production system (not a toy installation), I would suggest
replacing the disks altogether.

Regards,

--
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services