Thread: [BDR] Node Join Question

[BDR] Node Join Question

From
"Wayne E. Seguin"
Date:
To the wonderful BDR folks,

I have a 5 node pg+bdr cluster that I've spun up. I have verified that each node can connect via psql to each other node. I am having an issue creating the bdr group and having the nodes join each other. 

The steps I am taking are as follows (I used pg compiled using getting started guide instructions):
1. initdb
2. start postgres
3. create database pgbdr
4. create extension btree_gist (actually I'm doing this on template1 for my purposes)
5. create extension bdr
6. on node 0 I am doing:
    SELECT bdr.bdr_group_create(
    local_node_name := 'pgbdr0',
    node_external_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr'
    );
7. on nodes 1-4 I am doing (adjusted for the nodes IP):
    SELECT bdr.bdr_group_join(
    local_node_name := 'pgbdr1',
    node_external_dsn := 'host=10.244.2.6 port=5432 user=postgres dbname=pgbdr',
    join_using_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr'
    );

Sometimes this works and all nodes are joined, but it isn't consistent and I'm not sure at the moment how to debug why.

From the most recent attempt on node 0:

pgbdr=#     SELECT * FROM bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name |                    node_local_dsn                    | node_init_from_dsn
---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+--------------------
 6147857338297172162 |             1 |      16908 | r           | rdpg0     | host=10.244.2.2 port=5432 user=postgres dbname=pgbdr |
(1 row)

From the most recent attempt on  node 1:

pgbdr=#     SELECT * FROM bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name |                    node_local_dsn                    |                  node_init_from_d
sn
---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+----------------------------------
--------------------
 6147857345042854082 |             1 |      16908 | i           | rdpg1     | host=10.244.2.6 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2 port=5432 user=po
stgres dbname=pgbdr
(1 row)

I have pasted some relevant logs from node 0 further below.

Even if I use 'SELECT bdr.bdr_node_join_wait_for_ready();' it will usually just hang.

How can I debug why the nodes are not connecting to each other find out where they getting stuck trying to join?




d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM                                         [171/773]
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857345042854082_1_16908__"
 LOGICAL bdr
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18E87A0
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18E87A0
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DETAIL:  There are no running transactions.
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1271
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG:  00000: exported logical decoding snapshot: "0000071E-1" with 0 transaction IDs
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857344392790209_1_16908__"
 LOGICAL bdr
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18EC970
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  00000: logical decoding found initial starting point at 0/18EC970
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL:  1 transaction needs to finish.
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857340000055490_1_16908__"
 LOGICAL bdr
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  SocketBackend, postgres.c:344
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18EC9F0
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  00000: logical decoding found initial starting point at 0/18EC9F0
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL:  1 transaction needs to finish.
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18ECA30
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL:  Transaction ID 1822 finished; no more running transactions.
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18ECA30
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL:  Transaction ID 1822 finished; no more running transactions.
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  00000: exported logical decoding snapshot: "00000720-1" with 0 transaction IDs
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  00000: exported logical decoding snapshot: "0000071F-1" with 1 transaction ID
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857340923527359_1_16908__"
 LOGICAL bdr
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d= p=392 a=DEBUG:  00000: started 0 elections
d= p=392 a=LOCATION:  bdr_sequencer_start_elections, bdr_seq.c:688
d= p=392 a=DEBUG:  00000: started 0 votes
d= p=392 a=LOCATION:  bdr_sequencer_vote, bdr_seq.c:629
d= p=392 a=DEBUG:  00000: tallied 0 elections
d= p=392 a=LOCATION:  bdr_sequencer_tally, bdr_seq.c:755
d= p=392 a=DEBUG:  00000: checked 0 sequences for filling
d= p=392 a=LOCATION:  bdr_sequencer_fill_sequences, bdr_seq.c:1034
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SocketBackend, postgres.c:344
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SocketBackend, postgres.c:344
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18ECA60
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  00000: logical decoding found initial starting point at 0/18ECA60
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL:  2 transactions need to finish.
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18ECAD0
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL:  Transaction ID 1823 finished; no more running transactions.
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  00000: exported logical decoding snapshot: "00000721-1" with 0 transaction IDs
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SocketBackend, postgres.c:344

Re: [BDR] Node Join Question

From
"Wayne E. Seguin"
Date:
Also,

Is there a way to remove these things from the init target node easier?

d= p=504 a=ERROR:  55000: previous init failed, manual cleanup is required
d= p=504 a=DETAIL:  Found bdr.bdr_nodes entry for bdr (6147869128174526660,1,16908,) with state=i in remote bdr.bdr_nodes
d= p=504 a=HINT:  Remove all replication identifiers and slots corresponding to this node from the init target node then drop and recreate this database and try again



On Mon, May 11, 2015 at 11:33 PM, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
To the wonderful BDR folks,

I have a 5 node pg+bdr cluster that I've spun up. I have verified that each node can connect via psql to each other node. I am having an issue creating the bdr group and having the nodes join each other. 

The steps I am taking are as follows (I used pg compiled using getting started guide instructions):
1. initdb
2. start postgres
3. create database pgbdr
4. create extension btree_gist (actually I'm doing this on template1 for my purposes)
5. create extension bdr
6. on node 0 I am doing:
    SELECT bdr.bdr_group_create(
    local_node_name := 'pgbdr0',
    node_external_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr'
    );
7. on nodes 1-4 I am doing (adjusted for the nodes IP):
    SELECT bdr.bdr_group_join(
    local_node_name := 'pgbdr1',
    node_external_dsn := 'host=10.244.2.6 port=5432 user=postgres dbname=pgbdr',
    join_using_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr'
    );

Sometimes this works and all nodes are joined, but it isn't consistent and I'm not sure at the moment how to debug why.

From the most recent attempt on node 0:

pgbdr=#     SELECT * FROM bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name |                    node_local_dsn                    | node_init_from_dsn
---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+--------------------
 6147857338297172162 |             1 |      16908 | r           | rdpg0     | host=10.244.2.2 port=5432 user=postgres dbname=pgbdr |
(1 row)

From the most recent attempt on  node 1:

pgbdr=#     SELECT * FROM bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name |                    node_local_dsn                    |                  node_init_from_d
sn
---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+----------------------------------
--------------------
 6147857345042854082 |             1 |      16908 | i           | rdpg1     | host=10.244.2.6 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2 port=5432 user=po
stgres dbname=pgbdr
(1 row)

I have pasted some relevant logs from node 0 further below.

Even if I use 'SELECT bdr.bdr_node_join_wait_for_ready();' it will usually just hang.

How can I debug why the nodes are not connecting to each other find out where they getting stuck trying to join?




d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM                                         [171/773]
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857345042854082_1_16908__"
 LOGICAL bdr
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18E87A0
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18E87A0
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DETAIL:  There are no running transactions.
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1271
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG:  00000: exported logical decoding snapshot: "0000071E-1" with 0 transaction IDs
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857344392790209_1_16908__"
 LOGICAL bdr
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18EC970
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  00000: logical decoding found initial starting point at 0/18EC970
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL:  1 transaction needs to finish.
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857340000055490_1_16908__"
 LOGICAL bdr
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  SocketBackend, postgres.c:344
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18EC9F0
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  00000: logical decoding found initial starting point at 0/18EC9F0
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL:  1 transaction needs to finish.
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18ECA30
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL:  Transaction ID 1822 finished; no more running transactions.
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18ECA30
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL:  Transaction ID 1822 finished; no more running transactions.
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  00000: exported logical decoding snapshot: "00000720-1" with 0 transaction IDs
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  00000: exported logical decoding snapshot: "0000071F-1" with 1 transaction ID
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857340923527359_1_16908__"
 LOGICAL bdr
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d= p=392 a=DEBUG:  00000: started 0 elections
d= p=392 a=LOCATION:  bdr_sequencer_start_elections, bdr_seq.c:688
d= p=392 a=DEBUG:  00000: started 0 votes
d= p=392 a=LOCATION:  bdr_sequencer_vote, bdr_seq.c:629
d= p=392 a=DEBUG:  00000: tallied 0 elections
d= p=392 a=LOCATION:  bdr_sequencer_tally, bdr_seq.c:755
d= p=392 a=DEBUG:  00000: checked 0 sequences for filling
d= p=392 a=LOCATION:  bdr_sequencer_fill_sequences, bdr_seq.c:1034
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SocketBackend, postgres.c:344
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SocketBackend, postgres.c:344
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18ECA60
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  00000: logical decoding found initial starting point at 0/18ECA60
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL:  2 transactions need to finish.
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18ECAD0
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL:  Transaction ID 1823 finished; no more running transactions.
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  00000: exported logical decoding snapshot: "00000721-1" with 0 transaction IDs
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SocketBackend, postgres.c:344



--

Re: [BDR] Node Join Question

From
"Wayne E. Seguin"
Date:
Also, what is the proper way to locate and remove these replication slots as mentioned in this log?

On Mon, May 11, 2015 at 11:36 PM, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
Also,

Is there a way to remove these things from the init target node easier?

d= p=504 a=ERROR:  55000: previous init failed, manual cleanup is required
d= p=504 a=DETAIL:  Found bdr.bdr_nodes entry for bdr (6147869128174526660,1,16908,) with state=i in remote bdr.bdr_nodes
d= p=504 a=HINT:  Remove all replication identifiers and slots corresponding to this node from the init target node then drop and recreate this database and try again



On Mon, May 11, 2015 at 11:33 PM, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
To the wonderful BDR folks,

I have a 5 node pg+bdr cluster that I've spun up. I have verified that each node can connect via psql to each other node. I am having an issue creating the bdr group and having the nodes join each other. 

The steps I am taking are as follows (I used pg compiled using getting started guide instructions):
1. initdb
2. start postgres
3. create database pgbdr
4. create extension btree_gist (actually I'm doing this on template1 for my purposes)
5. create extension bdr
6. on node 0 I am doing:
    SELECT bdr.bdr_group_create(
    local_node_name := 'pgbdr0',
    node_external_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr'
    );
7. on nodes 1-4 I am doing (adjusted for the nodes IP):
    SELECT bdr.bdr_group_join(
    local_node_name := 'pgbdr1',
    node_external_dsn := 'host=10.244.2.6 port=5432 user=postgres dbname=pgbdr',
    join_using_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr'
    );

Sometimes this works and all nodes are joined, but it isn't consistent and I'm not sure at the moment how to debug why.

From the most recent attempt on node 0:

pgbdr=#     SELECT * FROM bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name |                    node_local_dsn                    | node_init_from_dsn
---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+--------------------
 6147857338297172162 |             1 |      16908 | r           | rdpg0     | host=10.244.2.2 port=5432 user=postgres dbname=pgbdr |
(1 row)

From the most recent attempt on  node 1:

pgbdr=#     SELECT * FROM bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name |                    node_local_dsn                    |                  node_init_from_d
sn
---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+----------------------------------
--------------------
 6147857345042854082 |             1 |      16908 | i           | rdpg1     | host=10.244.2.6 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2 port=5432 user=po
stgres dbname=pgbdr
(1 row)

I have pasted some relevant logs from node 0 further below.

Even if I use 'SELECT bdr.bdr_node_join_wait_for_ready();' it will usually just hang.

How can I debug why the nodes are not connecting to each other find out where they getting stuck trying to join?




d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM                                         [171/773]
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857345042854082_1_16908__"
 LOGICAL bdr
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18E87A0
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18E87A0
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DETAIL:  There are no running transactions.
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1271
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG:  00000: exported logical decoding snapshot: "0000071E-1" with 0 transaction IDs
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857344392790209_1_16908__"
 LOGICAL bdr
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18EC970
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  00000: logical decoding found initial starting point at 0/18EC970
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL:  1 transaction needs to finish.
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857340000055490_1_16908__"
 LOGICAL bdr
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  SocketBackend, postgres.c:344
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18EC9F0
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  00000: logical decoding found initial starting point at 0/18EC9F0
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL:  1 transaction needs to finish.
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18ECA30
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL:  Transaction ID 1822 finished; no more running transactions.
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18ECA30
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL:  Transaction ID 1822 finished; no more running transactions.
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  00000: exported logical decoding snapshot: "00000720-1" with 0 transaction IDs
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  00000: exported logical decoding snapshot: "0000071F-1" with 1 transaction ID
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857340923527359_1_16908__"
 LOGICAL bdr
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d= p=392 a=DEBUG:  00000: started 0 elections
d= p=392 a=LOCATION:  bdr_sequencer_start_elections, bdr_seq.c:688
d= p=392 a=DEBUG:  00000: started 0 votes
d= p=392 a=LOCATION:  bdr_sequencer_vote, bdr_seq.c:629
d= p=392 a=DEBUG:  00000: tallied 0 elections
d= p=392 a=LOCATION:  bdr_sequencer_tally, bdr_seq.c:755
d= p=392 a=DEBUG:  00000: checked 0 sequences for filling
d= p=392 a=LOCATION:  bdr_sequencer_fill_sequences, bdr_seq.c:1034
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SocketBackend, postgres.c:344
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SocketBackend, postgres.c:344
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18ECA60
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  00000: logical decoding found initial starting point at 0/18ECA60
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL:  2 transactions need to finish.
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18ECAD0
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL:  Transaction ID 1823 finished; no more running transactions.
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  00000: exported logical decoding snapshot: "00000721-1" with 0 transaction IDs
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SocketBackend, postgres.c:344



--



--

Re: [BDR] Node Join Question

From
Craig Ringer
Date:


On 12 May 2015 at 14:33, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
 
7. on nodes 1-4 I am doing (adjusted for the nodes IP):
    SELECT bdr.bdr_group_join(
    local_node_name := 'pgbdr1',
    node_external_dsn := 'host=10.244.2.6 port=5432 user=postgres dbname=pgbdr',
    join_using_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr'
    );

At a guess you're probably not waiting between joins to ensure that each new node has finished joining before starting another node join.

BDR really needs to be enhanced to either support parallel join of multiple nodes or identify and reject it.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [BDR] Node Join Question

From
Craig Ringer
Date:


On 12 May 2015 at 14:36, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
Also,

Is there a way to remove these things from the init target node easier?

d= p=504 a=ERROR:  55000: previous init failed, manual cleanup is required
d= p=504 a=DETAIL:  Found bdr.bdr_nodes entry for bdr (6147869128174526660,1,16908,) with state=i in remote bdr.bdr_nodes
d= p=504 a=HINT:  Remove all replication identifiers and slots corresponding to this node from the init target node then drop and recreate this database and try again

Now that we have SQL-level join it'd probably make sense to provide a cleanup function for failed node joins. At this point there's no such function.


Take note of the node identity given in the error as it corresponds to the replication identifier name and slot name.

You need to, on the join target node:

     SELECT pg_drop_replication_slot(slot_name) 
     FROM pg_replication_slots 
     WHERE slot_name = bdr.bdr_format_slot_name('6147869128174526660',1,16908)

where the sysid, timeline ID and database OID are those given in the error. You must run this from the target node's database, as it'll only consider slots for the current database.

Then 

    SELECT pg_replication_identifier_drop(...)

the replication identifier used, after looking up the replication identifier from pg_catalog.pg_replication_identifier. There isn't an equivalent of  bdr.bdr_format_slot_name for replication identifiers; I'll look at adding one. Look it up visually or write a simple function to format the string in the mean time.

Then delete the bdr.bdr_nodes entry for the failed-to-join node and any bdr.bdr_connections entries for it.

You *must* drop and re-create the database on the failed-to-join node, making a new blank db (preferably from template0).




Re: [BDR] Node Join Question

From
"Wayne E. Seguin"
Date:
Craig, thank you so much for the quick response! 

Adding these cleanup functions sounds wonderful, thank you for looking into that.

One question, why template0 vs template1 ? (My guess is because you want it to be devoid of pretty much everything?)

On Tue, May 12, 2015 at 1:31 AM, Craig Ringer <craig@2ndquadrant.com> wrote:


On 12 May 2015 at 14:36, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
Also,

Is there a way to remove these things from the init target node easier?

d= p=504 a=ERROR:  55000: previous init failed, manual cleanup is required
d= p=504 a=DETAIL:  Found bdr.bdr_nodes entry for bdr (6147869128174526660,1,16908,) with state=i in remote bdr.bdr_nodes
d= p=504 a=HINT:  Remove all replication identifiers and slots corresponding to this node from the init target node then drop and recreate this database and try again

Now that we have SQL-level join it'd probably make sense to provide a cleanup function for failed node joins. At this point there's no such function.


Take note of the node identity given in the error as it corresponds to the replication identifier name and slot name.

You need to, on the join target node:

     SELECT pg_drop_replication_slot(slot_name) 
     FROM pg_replication_slots 
     WHERE slot_name = bdr.bdr_format_slot_name('6147869128174526660',1,16908)

where the sysid, timeline ID and database OID are those given in the error. You must run this from the target node's database, as it'll only consider slots for the current database.

Then 

    SELECT pg_replication_identifier_drop(...)

the replication identifier used, after looking up the replication identifier from pg_catalog.pg_replication_identifier. There isn't an equivalent of  bdr.bdr_format_slot_name for replication identifiers; I'll look at adding one. Look it up visually or write a simple function to format the string in the mean time.

Then delete the bdr.bdr_nodes entry for the failed-to-join node and any bdr.bdr_connections entries for it.

You *must* drop and re-create the database on the failed-to-join node, making a new blank db (preferably from template0).







--

Re: [BDR] Node Join Question

From
"Wayne E. Seguin"
Date:
Craig, 

I was starting the first node then letting all of the other nodes join as quick as they could which clearly won't work. It also explains why it worked when I did it manually, I can only do it sequentially myself ;) I had suspected a race condition and it seems I was in the right area :)

Thank you for this, I will alter what I am doing to start sequentially.

On Tue, May 12, 2015 at 1:19 AM, Craig Ringer <craig@2ndquadrant.com> wrote:


On 12 May 2015 at 14:33, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
 
7. on nodes 1-4 I am doing (adjusted for the nodes IP):
    SELECT bdr.bdr_group_join(
    local_node_name := 'pgbdr1',
    node_external_dsn := 'host=10.244.2.6 port=5432 user=postgres dbname=pgbdr',
    join_using_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr'
    );

At a guess you're probably not waiting between joins to ensure that each new node has finished joining before starting another node join.

BDR really needs to be enhanced to either support parallel join of multiple nodes or identify and reject it.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



--

Re: [BDR] Node Join Question

From
"Wayne E. Seguin"
Date:
Craig,

It's alive!!! 

One more question on this thread, where can I find the meanings of node_status in the documentation?

pgbdr=#     SELECT * FROM bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name |                    node_local_dsn                     |                  node_init_from_
dsn
---------------------+---------------+------------+-------------+-----------+-------------------------------------------------------+---------------------------------
---------------------
 6147988955199598785 |             1 |      16386 | r           | rdpg0     | host=10.244.2.2 port=5432 user=postgres dbname=pgbdr  |
 6147988955568627909 |             1 |      16386 | i           | rdpg1     | host=10.244.2.6 port=5432 user=postgres dbname=pgbdr  | host=10.244.2.2 port=5432 user=p
ostgres dbname=pgbdr
 6147988955597271233 |             1 |      16386 | r           | rdpg3     | host=10.244.2.14 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2 port=5432 user=p
ostgres dbname=pgbdr
 6147988963356274882 |             1 |      16386 | r           | rdpg4     | host=10.244.2.18 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2 port=5432 user=p
ostgres dbname=pgbdr
 6147988963438956739 |             1 |      16386 | r           | rdpg2     | host=10.244.2.10 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2 port=5432 user=p
ostgres dbname=pgbdr
(5 rows)

Re: [BDR] Node Join Question

From
Craig Ringer
Date:


On 12 May 2015 at 22:21, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
Craig,

It's alive!!! 

One more question on this thread, where can I find the meanings of node_status in the documentation?

node_status is really internal, but it's covered briefly in the docs:

Re: [BDR] Node Join Question

From
"Wayne E. Seguin"
Date:
*awesome*, the question was for my own curiosity so thanks for this!
On Tue, May 12, 2015 at 18:27 Craig Ringer <craig@2ndquadrant.com> wrote:
On 12 May 2015 at 22:21, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
Craig,

It's alive!!! 

One more question on this thread, where can I find the meanings of node_status in the documentation?

node_status is really internal, but it's covered briefly in the docs:

Re: [BDR] Node Join Question

From
Craig Ringer
Date:

On 13 May 2015 at 09:29, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
*awesome*, the question was for my own curiosity so thanks for this!


No worries.

I know it's trite, but for internal-ish detail like that the best reference remains the source code. I'd like to think the sources are fairly sane.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services