Thread: BDR Rejoin of failed node, hangs.

BDR Rejoin of failed node, hangs.

From
Steve Pribyl
Date:
Good Afternoon,

I am in the process of testing out BDR and am having problems rejoining a node after a simulated loss of that node. The join
hangs while waiting to complete, and there are interesting errors in the logs.


Simulation process
After creating the database on both nodes (see Database Creation below):

Shut down postgres and reset the db on node 2
   rm -rf /var/lib/postgresql/9.4/main/*
   /usr/lib/postgresql/9.4/bin/initdb -D /var/lib/postgresql/9.4/main -A trust -U postgre

Clean up node 1.
   select bdr.bdr_part_by_node_names('{node2}');
   delete from bdr.bdr_nodes where node_status='k';

When I try to re-add node2 using the NODE 2 creation SQL again, the final step hangs:
   SELECT bdr.bdr_node_join_wait_for_ready();

The log on node2 has errors:
http://pastebin.com/8ZsTe5cG
55000: System identification mismatch between connection and slot
00000: worker process: bdr db: bdrdemo (PID 12042) exited with exit code 1

The log on node1 does not have any errors.
http://pastebin.com/njVJ9WX7

Both nodes show up in the output of select * from bdr.bdr_nodes; on both node1 and node2.

Database Creation:
NODE 1
create database bdrdemo;
\connect bdrdemo
CREATE EXTENSION btree_gist;
CREATE EXTENSION bdr;
select bdr.bdr_group_create(local_node_name := 'node1', node_external_dsn := 'host=192.168.101.41 port=5432 dbname=bdrdemo');
SELECT bdr.bdr_node_join_wait_for_ready();

NODE 2
create database bdrdemo;
\connect bdrdemo
CREATE EXTENSION btree_gist;
CREATE EXTENSION bdr;
select bdr.bdr_group_join(local_node_name := 'node2', node_external_dsn := 'host=192.168.101.42 port=5432 dbname=bdrdemo', join_using_dsn := 'host=192.168.101.41 port=5432 dbname=bdrdemo');
SELECT bdr.bdr_node_join_wait_for_ready();


Ubuntu 14.04 packages
ii  postgresql-bdr-9.4                  9.4.4-1trusty                 amd64        object-relational SQL database, version 9.4 server
ii  postgresql-bdr-9.4-bdr-plugin       0.9.2-1trusty                 amd64        BDR Plugin for PostgreSQL-BDR 9.4
ii  postgresql-bdr-client-9.4           9.4.4-1trusty                 amd64        front-end programs for PostgreSQL-BDR 9.4
ii  postgresql-bdr-contrib-9.4          9.4.4-1trusty                 amd64        additional facilities for PostgreSQL
ii  postgresql-client-common            154                           all          manager for multiple PostgreSQL client versions
ii  postgresql-common                   154                           all          PostgreSQL database-cluster manager

TIA,
Steve
________________________________
 [http://www.akunacapital.com/images/akuna.png]
Steve Pribyl | Senior Systems Engineer
Akuna Capital LLC
36 S Wabash, Suite 310 Chicago IL 60603 USA | www.akunacapital.com <http://www.akunacapital.com>
p: +1 312 994 4646 | m:  | f: +1 312 750 1667 | spribyl@akunacapital.com

Please consider the environment, before printing this email.

This electronic message contains information from Akuna Capital LLC that may be confidential, legally privileged or
otherwise protected from disclosure. This information is intended for the use of the addressee only and is not offered
as investment advice to be relied upon for personal or professional use. Additionally, all electronic messages are
recorded and stored in compliance pursuant to applicable SEC rules. If you are not the intended recipient, you are
hereby notified that any disclosure, copying, distribution, printing or any other use of, or any action in reliance on,
the contents of this electronic message is strictly prohibited. If you have received this communication in error, please
notify us by telephone at (312)994-4640 and destroy the original message.


Re: BDR Rejoin of failed node, hangs.

From
Jim Nasby
Date:
On 10/1/15 12:27 PM, Steve Pribyl wrote:
> I am in the process of testing out BDR

Please send BDR requests to the BDR mailing list. Thanks!
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: BDR Rejoin of failed node, hangs.

From
Alvaro Herrera
Date:
Jim Nasby wrote:
> On 10/1/15 12:27 PM, Steve Pribyl wrote:
> >I am in the process of testing out BDR
>
> Please send BDR requests to the BDR mailing list. Thanks!

pgsql-general remains the BDR list.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BDR Rejoin of failed node, hangs.

From
Steve Pribyl
Date:
Good Morning,

Has anyone had a moment to look at this?
It is a bit of a show stopper.

Thanks Steve


Re: BDR Rejoin of failed node, hangs.

From
Craig Ringer
Date:
On 5 October 2015 at 20:58, Steve Pribyl <spribyl@akunacapital.com> wrote:

> Clean up node 1.
>    select bdr.bdr_part_by_node_names('{node2}');
>    delete from bdr.bdr_nodes where node_status='k';

You need to delete the bdr.bdr_connections entry too.

0.9.3 will fix that, so orphan connections entries and those
associated with terminated nodes are ignored.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
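
The cleanup described above could be sketched as follows, run on node 1 after bdr.bdr_part_by_node_names() has completed. This is only a sketch for BDR 0.9.2: the column names (conn_sysid, conn_timeline, conn_dboid, node_sysid, node_timeline, node_dboid) are assumed from the 0.9 catalog layout and are not given anywhere in this thread, so check them with \d bdr.bdr_connections and \d bdr.bdr_nodes before running anything.

```sql
-- Run on node 1, connected to the bdrdemo database.
-- ASSUMPTION: column names below match the BDR 0.9 catalogs; verify first.
BEGIN;

-- Remove connection entries that belong to parted nodes (node_status = 'k').
-- This must happen before the bdr_nodes rows are deleted, because the join
-- below needs those rows to identify which connections are orphaned.
DELETE FROM bdr.bdr_connections c
USING bdr.bdr_nodes n
WHERE n.node_status = 'k'
  AND c.conn_sysid    = n.node_sysid
  AND c.conn_timeline = n.node_timeline
  AND c.conn_dboid    = n.node_dboid;

-- Then remove the parted node entries themselves, as in the original steps.
DELETE FROM bdr.bdr_nodes WHERE node_status = 'k';

COMMIT;

-- Sanity check: only entries for live nodes should remain.
SELECT node_name, node_status FROM bdr.bdr_nodes;
SELECT conn_sysid, conn_dsn FROM bdr.bdr_connections;
```

Deleting the connection rows first keeps the two deletes consistent with each other. According to the reply above, 0.9.3 ignores orphaned connection entries, so the first DELETE should only be needed on earlier releases.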


Re: BDR Rejoin of failed node, hangs.

From
Steve Pribyl
Date:
That was it, thanks.

Steve Pribyl
Sr. Systems Engineer
steve.pribyl@akunacapital.com
Desk: 312-994-4646


________________________________________
From: Craig Ringer <craig@2ndquadrant.com>
Sent: Tuesday, October 6, 2015 12:35 AM
To: Steve Pribyl
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] BDR Rejoin of failed node, hangs.

On 5 October 2015 at 20:58, Steve Pribyl <spribyl@akunacapital.com> wrote:

> Clean up node 1.
>    select bdr.bdr_part_by_node_names('{node2}');
>    delete from bdr.bdr_nodes where node_status='k';

You need to delete the bdr.bdr_connections entry too.

0.9.3 will fix that, so orphan connections entries and those
associated with terminated nodes are ignored.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services