Thread: bdr_init_copy fails when starting 2nd BDR node

bdr_init_copy fails when starting 2nd BDR node

From
John Casey
Date:

I’ve been having issues while attempting to begin BDR replication. If I set up the main node, then use bdr_init_copy, it always fails on the second node, as shown below.

 

postgres$ rm -Rf $PGDATA
postgres$ echo db_password | pg_basebackup -X stream -h main_node_ip -p 5432 -U username -D $PGDATA
postgres$ cp $HOME/backup/postgresql.conf $PGDATA
postgres$ bdr_init_copy -U username -D $PGDATA
bdr_init_copy: starting...
Assigning new system identifier: 6098464173726284030...
Creating primary replication slots...
Creating restore point...
Could not connect to the remote server: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/tmp/.s.PGSQL.5432"?

 

If I start both servers simply with pg_ctl, using a conf set up for replication, I get the following error on the main node:

 

LOG:  starting background worker process "bdr (6098483684958107256,1,16384,): dr: apply"
CONTEXT:  slot "bdr_16384_6098483684958107256_1_16384__", output plugin "bdr", in the startup callback
ERROR:  data stream ended
LOG:  worker process: bdr (6098483684958107256,1,16384,): dr: apply (PID 6294) exited with exit code 1

 

... and I get the following error on the second node:

 

ERROR:  bdr output plugin: slot creation rejected, bdr.bdr_nodes entry for local node (sysid=6098483778037269710, timelineid=1, dboid=16384): status='i', bdr still starting up: applying initial dump of remote node
HINT:  Monitor pg_stat_activity and the logs, wait until the node has caught up
CONTEXT:  slot "bdr_16384_6098483684958107256_1_16384__", output plugin "bdr", in the startup callback
LOG:  could not receive data from client: Connection reset by peer

 

It will keep cycling these errors indefinitely.

 

I have gotten this working off and on, but I keep running into this issue. I am on CentOS 6.5. Both servers can execute psql against the databases on the other node when not configured for replication, so it is not a connectivity or firewall issue. I have installed using the beta2 rpm, and I have also built it from source for rc1 (bdr stable).

 

Any ideas?

 

Re: bdr_init_copy fails when starting 2nd BDR node

From
Andres Freund
Date:
Hi,

On 2014-12-29 23:51:05 -0500, John Casey wrote:
> I've been having issues while attempting to begin BDR replication. If I set
> up the main node, then use bdr_init_copy, it always fails on second node, as
> shown below.
>
>
>
> postgres$ rm -Rf $PGDATA
> postgres$ echo db_password | pg_basebackup -X stream -h main_node_ip -p 5432 -U username -D $PGDATA
> postgres$ cp $HOME/backup/postgresql.conf $PGDATA
> postgres$ bdr_init_copy -U username -D $PGDATA
> bdr_init_copy: starting...
> Assigning new system identifier: 6098464173726284030...
> Creating primary replication slots...
> Creating restore point...
> Could not connect to the remote server: could not connect to server: No such file or directory
>         Is the server running locally and accepting
>         connections on Unix domain socket "/tmp/.s.PGSQL.5432"?

What was your bdr config at this point? The error message indicates that
it tries to connect to port 5432 on localhost - but the copy was taken
from 'main_node_ip'. Perhaps you forgot to specify th ehost in the
config?
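
For readers following along, a minimal sketch of why a missing host leads to that message: libpq falls back to a Unix-domain socket in its default directory when a connection string names no host. The helper name dsn_transport and the example DSNs are illustrative only and are not part of BDR. The sketch assumes PGHOST and PGHOSTADDR are unset and does not handle quoted values that contain spaces.

```shell
# Classify how libpq would reach the server for a keyword/value DSN.
# Assumption: PGHOST/PGHOSTADDR are not set in the environment, so a DSN
# without host/hostaddr falls back to the default Unix-domain socket.
dsn_transport() {
    for kv in $1; do                       # split the DSN on whitespace
        case "$kv" in
            host=/*)             echo "unix-socket"; return ;;  # host is a socket directory
            host=?*|hostaddr=?*) echo "tcp"; return ;;          # host name or IP address
        esac
    done
    echo "unix-socket"                     # no host given at all
}

dsn_transport 'dbname=my_db host=main_node_ip port=5432'   # prints: tcp
dsn_transport 'dbname=my_db port=5432'                      # prints: unix-socket
```

A DSN that reports unix-socket here is one that depends on the client library and the server agreeing on the socket directory.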

What does 'git describe --tags' return?

Greetings,

Andres Freund


Re: bdr_init_copy fails when starting 2nd BDR node

From
John Casey
Date:
> What was your bdr config at this point? The error message indicates that
> it tries to connect to port 5432 on localhost - but the copy was taken from
> 'main_node_ip'. Perhaps you forgot to specify the ehost in the config?

# Here is my conf on the DR server (where I am running bdr_init_copy)
bdr.connections = 'primary'
bdr.primary_dsn = 'dbname=my_db host=primary_ip user=my_username  port=5432'
bdr.primary_init_replica = on
bdr.primary_replica_local_dsn = 'dbname=my_db user=my_username port=5432'

# For reference, here is the conf on my Primary server:
bdr.connections = 'dr'
bdr.dr_dsn = 'dbname=my_db host=dr_ip user=my_username  port=5432'

> What does 'git describe --tags' return?

bdr-pg/REL9_4beta3-1-120-ga2725dd




Re: bdr_init_copy fails when starting 2nd BDR node

From
'Andres Freund'
Date:
Hi,

On 2014-12-30 21:12:17 -0500, John Casey wrote:
>
> > What was your bdr config at this point? The error message indicates that
> > it tries to connect to port 5432 on localhost - but the copy was taken from
> > 'main_node_ip'. Perhaps you forgot to specify the ehost in the config?
>
> # Here is my conf on the DR server (where I am running bdr_init_copy)
> bdr.connections = 'primary'
> bdr.primary_dsn = 'dbname=my_db host=primary_ip user=my_username  port=5432'
> bdr.primary_init_replica = on
> bdr.primary_replica_local_dsn = 'dbname=my_db user=my_username port=5432'

My guess is that this is the source of the problem - you probably have
one system libpq and one self-compiled libpq around, or something similar,
and they disagree about the location of the unix socket directory. It
complains about:

> >         connections on Unix domain socket "/tmp/.s.PGSQL.5432"?

which means, given the above configuration, it has to be
primary_replica_local_dsn. Could you try to explicitly set
unix_socket_directories=/tmp in postgresql.conf and host=/tmp in the above
config?
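
As a rough way to see whether the server and the client library agree, the sketch below builds the socket file name for a directory and port (the .s.PGSQL.<port> form from the error message) and reports which candidate directories actually contain it. The function names are hypothetical, and /var/run/postgresql is only an assumed alternative location; substitute whatever your packages or build use.

```shell
# Build the socket file name PostgreSQL uses for a directory and port.
socket_path() {
    printf '%s/.s.PGSQL.%s\n' "$1" "$2"
}

# Report, for each candidate directory, whether a socket exists for the port.
# Assumption: the directory list passed in is a guess, not an exhaustive list.
find_sockets() {
    port=$1; shift
    for dir in "$@"; do
        path=$(socket_path "$dir" "$port")
        if [ -S "$path" ]; then            # -S is true only for socket files
            echo "found: $path"
        else
            echo "missing: $path"
        fi
    done
}

# Run this on the new node while its server is up.
find_sockets 5432 /tmp /var/run/postgresql
```

If the socket turns up somewhere other than /tmp, either point host= in the local DSN at that directory or move the server's socket directory, so both sides use the same path.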

Also, please attach postgresql.conf.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: bdr_init_copy fails when starting 2nd BDR node

From
Craig Ringer
Date:
On 12/30/2014 12:51 PM, John Casey wrote:
> I’ve been having issues while attempting to begin BDR replication. If I
> set up the main node, then use bdr_init_copy, it always fails on second
> node, as shown below.

On a semi-related note, all the BDR node configuration and join support
is currently being rewritten. The coming 0.8.0 release will use
bdr.connections in the config file, but the major release after that
will switch to a new SQL-based node addition scheme.

>> What does 'git describe --tags' return?
>
> bdr-pg/REL9_4beta3-1-120-ga2725dd

What about for your bdr-plugin tree?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: bdr_init_copy fails when starting 2nd BDR node

From
"John Casey"
Date:
I'm still experiencing similar problems. I'm not certain what parameter you
are referring to when you say 'ehost'. Otherwise, I did want to clarify a
couple of things. I have tried several combinations, and each one fails in
a different way. So ...

(1) What is the exact syntax when calling bdr_init_copy from new nodes when
your database name is not 'postgres' and your user name is not 'postgres'?
Please note whether you supply the local or remote host/port in the command.
(2) Should you do a pg_ctl start on the new node before trying to execute
bdr_init_copy? If I don't, I get the error I posted earlier.

I've attached the new node's (dr) postgresql.conf file.


Attachment

Re: bdr_init_copy fails when starting 2nd BDR node

From
Craig Ringer
Date:


On 4 January 2015 at 02:52, John Casey <john.casey@innovisors.com> wrote:
> I'm still experiencing similar problems. I'm not certain what parameter you
> are referring to when you say 'ehost'. Otherwise, I did want to clarify a
> couple of things. I have tried several combinations, and each one fails in
> a different way. So ...
>
> (1) What is the exact syntax when calling bdr_init_copy from new nodes when
> your database name is not 'postgres' and your user name is not 'postgres'?
> Please note whether you supply the local or remote host/port in the command.

Use a connection string to identify the remote and the local ends. e.g.:

  bdr_init_copy --remote-dbname="host=node1 dbname=mydb" \
                --local-dbname="dbname=mydb" \
                -D datadir

Both --remote-dbname and --local-dbname are libpq connection strings.
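
To connect this with the settings posted earlier in the thread, here is a dry-run sketch that only prints the command line for a given pair of connection strings. The helper init_copy_cmd is hypothetical, and the DSN values are placeholders modelled on the bdr.primary_dsn and bdr.primary_replica_local_dsn values above, with host=/tmp added to the local one as an assumption. Nothing is executed against a server.

```shell
# Print (do not run) a bdr_init_copy invocation for the given DSNs.
# $1 = remote (existing node) DSN, $2 = local (new node) DSN, $3 = data directory
init_copy_cmd() {
    printf 'bdr_init_copy --remote-dbname="%s" --local-dbname="%s" -D %s\n' "$1" "$2" "$3"
}

# Placeholder values; replace with your own host, database and user names.
init_copy_cmd \
    'host=primary_ip port=5432 dbname=my_db user=my_username' \
    'host=/tmp port=5432 dbname=my_db user=my_username' \
    '/path/to/datadir'
```

Printing the command first makes it easy to check that the database name and user appear in both connection strings before running anything.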


 
> (2) Should you do a pg_ctl start on the new node before trying to execute
> bdr_init_copy? If I don't, I get the error I posted earlier.

No, you should not and must not start the server before running bdr_init_copy.
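
A small pre-flight check along these lines can guard against starting the server by accident. It is a sketch that assumes a postmaster.pid file in the data directory means a server has been started from it; a stale file left by a crash would trip it too. The function names are hypothetical.

```shell
# Succeed only if the data directory shows no sign of a started server.
# Assumption: postmaster.pid present => a server was started from this directory.
datadir_is_idle() {
    [ ! -e "$1/postmaster.pid" ]
}

# Hypothetical wrapper: print a verdict and fail if the server looks started.
preflight() {
    if datadir_is_idle "$1"; then
        echo "ok: no postmaster.pid in $1"
    else
        echo "refusing: $1 has a postmaster.pid; stop the server first" >&2
        return 1
    fi
}

# Example use (not executed here):
#   preflight "$PGDATA" && bdr_init_copy ... -D "$PGDATA"
```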

(In the current development version of BDR this has all gone away, and bdr_init_copy will make a base backup for you).