Replication (BDR) problem: won't catch up after connection timeout

From: Suomela Tero
Date:

Hi there,

 

We have some problems with BDR and would appreciate any hints and advice with it. Here’s the short story:

 

We are testing BDR with PostgreSQL 9.4. Once up and running it seems to work quite well, but we have run into a rather disturbing weakness: a basic two-node cluster breaks simply by running a transaction that takes a while to process. Here's an example:

 

On node1 the application processes some data, issuing around 100k select and insert statements within a single transaction. This takes some time, and replication reports a timeout (presumably there is a keep-alive mechanism that simply isn't serviced while the transaction is being processed). After the transaction is committed on node1, the node1 log says:

 

< 2016-11-02 13:06:29.117 EET >LOG:  terminating walsender process due to replication timeout

< 2016-11-02 13:06:34.168 EET >LOG:  starting logical decoding for slot "bdr_64344_6300833630798326204_1_77037__"

< 2016-11-02 13:06:34.168 EET >DETAIL:  streaming transactions committing after 0/117FCE38, reading WAL from 0/117FCE38

< 2016-11-02 13:06:34.172 EET >LOG:  logical decoding found consistent point at 0/117FCE38

< 2016-11-02 13:06:34.172 EET >DETAIL:  There are no running transactions.

< 2016-11-02 13:09:09.196 EET >ERROR:  data stream ended

< 2016-11-02 13:09:09.206 EET >LOG:  worker process: bdr (6300843528307178977,1,64344,)->bdr (6300833630798326204,1, (PID 28195) exited with exit code 1

< 2016-11-02 13:09:14.209 EET >LOG:  starting background worker process "bdr (6300843528307178977,1,64344,)->bdr (6300833630798326204,1,"

< 2016-11-02 13:09:14.217 EET >NOTICE:  version "1.0" of extension "btree_gist" is already installed

< 2016-11-02 13:09:14.219 EET >NOTICE:  version "1.0.1.0" of extension "bdr" is already installed

< 2016-11-02 13:09:14.241 EET >INFO:  starting up replication from 5 at 0/D038EC8
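
For reference, the "replication timeout" in the walsender message above is governed by wal_sender_timeout (60 seconds by default); the value in effect can be checked with:

SHOW wal_sender_timeout;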

 

Checking the BDR status:

 

< 2016-11-02 13:09:29.632 EET >LOG:  statement: SELECT node_name, node_status FROM bdr.bdr_nodes;

< 2016-11-02 13:09:29.633 EET >LOG:  statement: SELECT conn_sysid, conn_dboid, conn_dsn FROM bdr.bdr_connections;

< 2016-11-02 13:09:29.633 EET >LOG:  statement: SELECT slot_name, database, active, pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes FROM pg_replication_slots WHERE plugin = 'bdr';

< 2016-11-02 13:09:29.635 EET >LOG:  statement: SELECT pid, application_name, client_addr, state, pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes FROM pg_stat_replication;

 

Result:

 

node_name | node_status

-----------+-------------

node1     | r

node2     | r

(2 rows)

 

     conn_sysid      | conn_dboid |                                    conn_dsn                                    

---------------------+------------+---------------------------------------------------------------------------------

6300843528307178977 |      64344 | host=192.168.150.11 port=5432 dbname=test user=test password=test

6300833630798326204 |      77037 | host=192.168.150.12 port=5432 dbname=test user=test password=test

(2 rows)

 

                slot_name                | database  | active | retained_bytes

-----------------------------------------+-----------+--------+----------------

bdr_64344_6300833630798326204_1_77037__ | test      | t      |       64939984

(1 row)

 

  pid  |              application_name              |  client_addr   |  state  | lag_bytes

-------+--------------------------------------------+----------------+---------+-----------

28825 | bdr (6300833630798326204,1,77037,):receive | 192.168.150.12 | catchup |  64939984

(1 row)

 

The node2 state is 'catchup' with a lot to catch up, but nothing happens; it stays like this even though the connection looks ok. So data is no longer being replicated.
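
For reference, here are the same status queries from the log above in plain form, in case anyone wants to re-run them; if retained_bytes never shrinks and restart_lsn never advances, the downstream is presumably not confirming any progress at all:

SELECT node_name, node_status FROM bdr.bdr_nodes;

SELECT conn_sysid, conn_dboid, conn_dsn FROM bdr.bdr_connections;

SELECT slot_name, database, active,
       pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes
FROM pg_replication_slots WHERE plugin = 'bdr';

SELECT pid, application_name, client_addr, state,
       pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes
FROM pg_stat_replication;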

 

Then, if we restart node2, the node1 log starts saying:

 

< 2016-11-02 13:11:06.656 EET >ERROR:  replication slot "bdr_64344_6300833630798326204_1_77037__" is already active for pid 28517

 

Then, if we restart both nodes (this requires kill -9 for the walsender process, otherwise it won't stop), the node1 log shows:

 

< 2016-11-02 13:17:01.318 EET >LOG:  shutting down

< 2016-11-02 13:17:01.343 EET >LOG:  database system is shut down

< 2016-11-02 13:18:03.288 EET >LOG:  server process (PID 28682) was terminated by signal 9: Killed

< 2016-11-02 13:18:03.288 EET >LOG:  terminating any other active server processes

< 2016-11-02 13:18:03.288 EET >LOG:  abnormal database system shutdown

< 2016-11-02 13:18:25.067 EET >LOG:  database system was shut down at 2016-11-02 13:17:01 EET

< 2016-11-02 13:18:25.087 EET >LOG:  starting up replication identifier with ckpt at 0/155EB520

< 2016-11-02 13:18:25.087 EET >LOG:  recovered replication state of node 1 to 0/1BC8620

< 2016-11-02 13:18:25.087 EET >LOG:  recovered replication state of node 2 to 0/1E932C8

< 2016-11-02 13:18:25.087 EET >LOG:  recovered replication state of node 3 to 0/252FBB8

< 2016-11-02 13:18:25.087 EET >LOG:  recovered replication state of node 4 to 0/294BA20

< 2016-11-02 13:18:25.087 EET >LOG:  recovered replication state of node 5 to 0/D039310

< 2016-11-02 13:18:25.092 EET >LOG:  MultiXact member wraparound protections are now enabled

< 2016-11-02 13:18:25.094 EET >LOG:  starting background worker process "bdr supervisor"

< 2016-11-02 13:18:25.094 EET >LOG:  database system is ready to accept connections

< 2016-11-02 13:18:25.094 EET >LOG:  autovacuum launcher started

< 2016-11-02 13:18:25.130 EET >LOG:  worker process: bdr supervisor (PID 28818) exited with exit code 1

< 2016-11-02 13:18:26.130 EET >LOG:  starting background worker process "bdr supervisor"

< 2016-11-02 13:18:26.174 EET >LOG:  registering background worker "bdr db: test"

< 2016-11-02 13:18:26.174 EET >LOG:  starting background worker process "bdr db: test"

< 2016-11-02 13:18:26.218 EET >NOTICE:  version "1.0" of extension "btree_gist" is already installed

< 2016-11-02 13:18:26.218 EET >NOTICE:  version "1.0.1.0" of extension "bdr" is already installed

< 2016-11-02 13:18:26.270 EET >LOG:  registering background worker "bdr (6300843528307178977,1,64344,)->bdr (6300833630798326204,1,"

< 2016-11-02 13:18:26.270 EET >LOG:  starting background worker process "bdr (6300843528307178977,1,64344,)->bdr (6300833630798326204,1,"

< 2016-11-02 13:18:26.272 EET >NOTICE:  version "1.0" of extension "btree_gist" is already installed

< 2016-11-02 13:18:26.272 EET >NOTICE:  version "1.0.1.0" of extension "bdr" is already installed

< 2016-11-02 13:18:26.278 EET >DETAIL:  Connection string is 'replication=database fallback_application_name='bdr (6300843528307178977,1,64344,):receive' connect_timeout=30 keepalives=1 keepalives_idle=20 keepalives_interval=20 keepalives_count=5   host=192.168.150.12 port=5432 dbname=test user=test password=test'

< 2016-11-02 13:18:26.281 EET >LOG:  worker process: bdr (6300843528307178977,1,64344,)->bdr (6300833630798326204,1, (PID 28821) exited with exit code 1

< 2016-11-02 13:18:29.956 EET >LOG:  starting logical decoding for slot "bdr_64344_6300833630798326204_1_77037__"

< 2016-11-02 13:18:29.956 EET >DETAIL:  streaming transactions committing after 0/117FCE38, reading WAL from 0/117FCE38

< 2016-11-02 13:18:29.956 EET >LOG:  logical decoding found consistent point at 0/117FCE38

< 2016-11-02 13:18:29.956 EET >DETAIL:  There are no running transactions.

< 2016-11-02 13:18:31.283 EET >LOG:  starting background worker process "bdr (6300843528307178977,1,64344,)->bdr (6300833630798326204,1,"

< 2016-11-02 13:18:31.284 EET >NOTICE:  version "1.0" of extension "btree_gist" is already installed

< 2016-11-02 13:18:31.284 EET >NOTICE:  version "1.0.1.0" of extension "bdr" is already installed

< 2016-11-02 13:18:31.296 EET >INFO:  starting up replication from 5 at 0/D039310

 

Check status again:

 

< 2016-11-02 13:19:00.246 EET >LOG:  statement: SELECT node_name, node_status FROM bdr.bdr_nodes;

< 2016-11-02 13:19:00.247 EET >LOG:  statement: SELECT conn_sysid, conn_dboid, conn_dsn FROM bdr.bdr_connections;

< 2016-11-02 13:19:00.247 EET >LOG:  statement: SELECT slot_name, database, active, pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes FROM pg_replication_slots WHERE plugin = 'bdr';

< 2016-11-02 13:19:00.251 EET >LOG:  statement: SELECT pid, application_name, client_addr, state, pg_xlog_location_diff(pg_current_xlog_insert_location(), flush_location) AS lag_bytes FROM pg_stat_replication;

 

Results:

 

node_name | node_status

-----------+-------------

node1     | r

node2     | r

(2 rows)

 

     conn_sysid      | conn_dboid |                                    conn_dsn                                    

---------------------+------------+---------------------------------------------------------------------------------

6300843528307178977 |      64344 | host=192.168.150.11 port=5432 dbname=test user=test password=test

6300833630798326204 |      77037 | host=192.168.150.12 port=5432 dbname=test user=test password=test

(2 rows)

 

                slot_name                | database  | active | retained_bytes

-----------------------------------------+-----------+--------+----------------

bdr_64344_6300833630798326204_1_77037__ | test      | t      |       64939984

(1 row)

 

  pid  |              application_name              |  client_addr   |  state  | lag_bytes

-------+--------------------------------------------+----------------+---------+-----------

28825 | bdr (6300833630798326204,1,77037,):receive | 192.168.150.12 | catchup |  64939984

(1 row)

 

And the situation is still the same: replication does not work.

 

The log on node2 does not indicate much beyond the connection breaking:

 

< 2016-11-02 12:53:49.299 EET >DETAIL:  Connection string is 'replication=database fallback_application_name='bdr (6300833630798326204,1,77037,):receive' connect_timeout=30 keepalives=1 keepalives_idle=20 keepalives_interval=20 keepalives_count=5   host=192.168.150.11 port=5432 dbname=test user=test password=test'

< 2016-11-02 12:53:49.301 EET >LOG:  worker process: bdr (6300833630798326204,1,77037,)->bdr (6300843528307178977,1, (PID 22629) exited with exit code 1

< 2016-11-02 12:53:53.218 EET >LOG:  starting logical decoding for slot "bdr_77037_6300843528307178977_1_64344__"

< 2016-11-02 12:53:53.218 EET >DETAIL:  streaming transactions committing after 0/D038F00, reading WAL from 0/D038EC8

< 2016-11-02 12:53:53.218 EET >LOG:  logical decoding found consistent point at 0/D038EC8

< 2016-11-02 12:53:53.218 EET >DETAIL:  There are no running transactions.

< 2016-11-02 12:53:54.302 EET >LOG:  starting background worker process "bdr (6300833630798326204,1,77037,)->bdr (6300843528307178977,1,"

< 2016-11-02 13:02:12.314 EET >ERROR:  connection to other side has died

< 2016-11-02 13:02:12.316 EET >LOG:  worker process: bdr (6300833630798326204,1,77037,)->bdr (6300843528307178977,1, (PID 22655) exited with exit code 1

< 2016-11-02 13:02:17.317 EET >LOG:  starting background worker process "bdr (6300833630798326204,1,77037,)->bdr (6300843528307178977,1,"

< 2016-11-02 13:06:29.145 EET >ERROR:  connection to other side has died

< 2016-11-02 13:06:29.146 EET >LOG:  worker process: bdr (6300833630798326204,1,77037,)->bdr (6300843528307178977,1, (PID 22698) exited with exit code 1

< 2016-11-02 13:06:34.148 EET >LOG:  starting background worker process "bdr (6300833630798326204,1,77037,)->bdr (6300843528307178977,1,"

 

And after the restart, the node2 log shows:

 

< 2016-11-02 13:18:28.888 EET >LOG:  starting up replication identifier with ckpt at 0/D039420

< 2016-11-02 13:18:28.888 EET >LOG:  recovered replication state of node 1 to 0/1CAF260

< 2016-11-02 13:18:28.888 EET >LOG:  recovered replication state of node 2 to 0/1FBE168

< 2016-11-02 13:18:28.888 EET >LOG:  recovered replication state of node 4 to 0/23B2DF8

< 2016-11-02 13:18:28.888 EET >LOG:  recovered replication state of node 5 to 0/2CFD4A0

< 2016-11-02 13:18:28.888 EET >LOG:  recovered replication state of node 14 to 0/117FCBB0

< 2016-11-02 13:18:28.891 EET >LOG:  MultiXact member wraparound protections are now enabled

< 2016-11-02 13:18:28.893 EET >LOG:  starting background worker process "bdr supervisor"

< 2016-11-02 13:18:28.893 EET >LOG:  database system is ready to accept connections

< 2016-11-02 13:18:28.894 EET >LOG:  autovacuum launcher started

< 2016-11-02 13:18:28.902 EET >LOG:  worker process: bdr supervisor (PID 23072) exited with exit code 1

< 2016-11-02 13:18:29.905 EET >LOG:  starting background worker process "bdr supervisor"

< 2016-11-02 13:18:29.919 EET >LOG:  registering background worker "bdr db: test"

< 2016-11-02 13:18:29.919 EET >LOG:  starting background worker process "bdr db: test"

< 2016-11-02 13:18:29.945 EET >LOG:  registering background worker "bdr (6300833630798326204,1,77037,)->bdr (6300843528307178977,1,"

< 2016-11-02 13:18:29.945 EET >LOG:  starting background worker process "bdr (6300833630798326204,1,77037,)->bdr (6300843528307178977,1,"

< 2016-11-02 13:18:31.314 EET >LOG:  starting logical decoding for slot "bdr_77037_6300843528307178977_1_64344__"

< 2016-11-02 13:18:31.314 EET >DETAIL:  streaming transactions committing after 0/D0393E8, reading WAL from 0/D0393E8

< 2016-11-02 13:18:31.314 EET >LOG:  logical decoding found consistent point at 0/D0393E8

< 2016-11-02 13:18:31.314 EET >DETAIL:  There are no running transactions.

 

It is bad enough that replication stops so easily, but what is really disturbing is that it is unable to recover. Are we doing something wrong? How is replication supposed to behave in this kind of situation? What do we have to do to make node2 catch up after a timed-out connection?

 

Regards,

Tero S

 

Re: Replication (BDR) problem: won't catch up after connection timeout

From: Craig Ringer
Date:

Increase wal_sender_timeout to resolve the issue.
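
A minimal sketch of doing that on 9.4, assuming the default 60 s timeout is what is being hit (the value is illustrative; pick something comfortably above your longest transaction, and apply it on both nodes):

-- Raise the walsender timeout; a reload is enough, no restart needed.
ALTER SYSTEM SET wal_sender_timeout = '10min';
SELECT pg_reload_conf();

(Editing postgresql.conf directly and reloading works just as well.)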

I've been investigating just this issue recently; see
https://www.postgresql.org/message-id/CAMsr+YE2dSfHVr7iEv1GSPZihitWX-PMkD9QALEGcTYa+sdsgg@mail.gmail.com

It would be very useful to me to know more about the transaction that
caused this problem.