Re: [BDR] Node Join Question - Mailing list pgsql-general

From Wayne E. Seguin
Subject Re: [BDR] Node Join Question
Date
Msg-id CANf8RLu1AZoemPrRVShQ29yJt1Wzq9WY-PEP-LPBx3Th_PqE+Q@mail.gmail.com
Whole thread Raw
In response to [BDR] Node Join Question  ("Wayne E. Seguin" <wayneeseguin@gmail.com>)
Responses Re: [BDR] Node Join Question  ("Wayne E. Seguin" <wayneeseguin@gmail.com>)
Re: [BDR] Node Join Question  (Craig Ringer <craig@2ndquadrant.com>)
List pgsql-general
Also,

Is there a way to remove these things from the init target node easier?

d= p=504 a=ERROR:  55000: previous init failed, manual cleanup is required
d= p=504 a=DETAIL:  Found bdr.bdr_nodes entry for bdr (6147869128174526660,1,16908,) with state=i in remote bdr.bdr_nodes
d= p=504 a=HINT:  Remove all replication identifiers and slots corresponding to this node from the init target node then drop and recreate this database and try again



On Mon, May 11, 2015 at 11:33 PM, Wayne E. Seguin <wayneeseguin@gmail.com> wrote:
To the wonderful BDR folks,

I have a 5 node pg+bdr cluster that I've spun up. I have verified that each node can connect via psql to each other node. I am having an issue creating the bdr group and having the nodes join each other. 

The steps I am taking are as follows (I used pg compiled using getting started guide instructions):
1. initdb
2. start postgres
3. create database pgbdr
4. create extension btree_gist (actually I'm doing this on template1 for my purposes)
5. create extension bdr
6. on node 0 I am doing:
    SELECT bdr.bdr_group_create(
    local_node_name := 'pgbdr0',
    node_external_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr'
    );
7. on nodes 1-4 I am doing (adjusted for the nodes IP):
    SELECT bdr.bdr_group_join(
    local_node_name := 'pgbdr1',
    node_external_dsn := 'host=10.244.2.6 port=5432 user=postgres dbname=pgbdr',
    join_using_dsn := 'host=10.244.2.2 port=5432 user=postgres dbname=pgbdr'
    );

Sometimes this works and all nodes are joined, but it isn't consistent and I'm not sure at the moment how to debug why.

From the most recent attempt on node 0:

pgbdr=#     SELECT * FROM bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name |                    node_local_dsn                    | node_init_from_dsn
---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+--------------------
 6147857338297172162 |             1 |      16908 | r           | rdpg0     | host=10.244.2.2 port=5432 user=postgres dbname=pgbdr |
(1 row)

From the most recent attempt on  node 1:

pgbdr=#     SELECT * FROM bdr.bdr_nodes;
     node_sysid      | node_timeline | node_dboid | node_status | node_name |                    node_local_dsn                    |                  node_init_from_d
sn
---------------------+---------------+------------+-------------+-----------+------------------------------------------------------+----------------------------------
--------------------
 6147857345042854082 |             1 |      16908 | i           | rdpg1     | host=10.244.2.6 port=5432 user=postgres dbname=pgbdr | host=10.244.2.2 port=5432 user=po
stgres dbname=pgbdr
(1 row)

I have pasted some relevant logs from node 0 further below.

Even if I use 'SELECT bdr.bdr_node_join_wait_for_ready();' it will usually just hang.

How can I debug why the nodes are not connecting to each other find out where they getting stuck trying to join?




d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM                                         [171/773]
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857345042854082_1_16908__"
 LOGICAL bdr
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18E87A0
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18E87A0
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init DETAIL:  There are no running transactions.
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1271
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG:  00000: exported logical decoding snapshot: "0000071E-1" with 0 transaction IDs
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857344392790209_1_16908__"
 LOGICAL bdr
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18EC970
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  00000: logical decoding found initial starting point at 0/18EC970
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL:  1 transaction needs to finish.
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857340000055490_1_16908__"
 LOGICAL bdr
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=412 a=bdr (6147857345042854082,1,16908,):init LOCATION:  SocketBackend, postgres.c:344
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18EC9F0
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  00000: logical decoding found initial starting point at 0/18EC9F0
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL:  1 transaction needs to finish.
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18ECA30
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init DETAIL:  Transaction ID 1822 finished; no more running transactions.
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18ECA30
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init DETAIL:  Transaction ID 1822 finished; no more running transactions.
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  00000: exported logical decoding snapshot: "00000720-1" with 0 transaction IDs
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  00000: exported logical decoding snapshot: "0000071F-1" with 1 transaction ID
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG:  00000: received replication command: IDENTIFY_SYSTEM
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG:  00000: received replication command: CREATE_REPLICATION_SLOT "bdr_16908_6147857340923527359_1_16908__"
 LOGICAL bdr
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  exec_replication_command, walsender.c:1291
d= p=392 a=DEBUG:  00000: started 0 elections
d= p=392 a=LOCATION:  bdr_sequencer_start_elections, bdr_seq.c:688
d= p=392 a=DEBUG:  00000: started 0 votes
d= p=392 a=LOCATION:  bdr_sequencer_vote, bdr_seq.c:629
d= p=392 a=DEBUG:  00000: tallied 0 elections
d= p=392 a=LOCATION:  bdr_sequencer_tally, bdr_seq.c:755
d= p=392 a=DEBUG:  00000: checked 0 sequences for filling
d= p=392 a=LOCATION:  bdr_sequencer_fill_sequences, bdr_seq.c:1034
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=422 a=bdr (6147857340000055490,1,16908,):init LOCATION:  SocketBackend, postgres.c:344
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=417 a=bdr (6147857344392790209,1,16908,):init LOCATION:  SocketBackend, postgres.c:344
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DEBUG:  00000: searching for logical decoding starting point, starting at 0/18ECA60
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  DecodingContextFindStartpoint, logical.c:469
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  00000: logical decoding found initial starting point at 0/18ECA60
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL:  2 transactions need to finish.
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SnapBuildFindSnapshot, snapbuild.c:1338
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  00000: logical decoding found consistent point at 0/18ECAD0
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init DETAIL:  Transaction ID 1823 finished; no more running transactions.
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SnapBuildEndTxn, snapbuild.c:907
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  00000: exported logical decoding snapshot: "00000721-1" with 0 transaction IDs
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SnapBuildExportSnapshot, snapbuild.c:604
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOG:  08006: unexpected EOF on client connection with an open transaction
d=pgbdr p=426 a=bdr (6147857340923527359,1,16908,):init LOCATION:  SocketBackend, postgres.c:344



--

pgsql-general by date:

Previous
From: "Wayne E. Seguin"
Date:
Subject: [BDR] Node Join Question
Next
From: "Wayne E. Seguin"
Date:
Subject: Re: [BDR] Node Join Question