Thread: [GENERAL] Replication stops under certain circumstances

[GENERAL] Replication stops under certain circumstances

From
Kim Rose Carlsen
Date:
20 October 2017, 23:39:26
Hi

I have some strange issues with a postgresql read replica that seems to stop replicating under certain circumstances.

Whenever we have changes to our views we have script that drops all views and reload them from scratch with the new
definitions.The reloading of the views happens in a transaction to avoid confusion for everyone using the database.
Whenthis update gets to the slave it seems there is a chance for a deadlock to occur that doesn't get detected.  

As I was trying to reproduce this behavior, I ran into another weird situation that I don't entirely understand. The
symptomis the same that replication stops, but it looks quite different. This example won't reproduce the issue
reliably,but after a few hours I get a slave that won't continue to replicate until I restart it. The queries in the
examplewon't make much sense, and I don't know if they can be simplified further and still cause the "desired" effect. 

Setup:
Launch a new RDS psql instance (9.6.2) on AWS (will be referred to as db-master) and create a read replica (will be
referredto as db-slave). The following options are changed from AWS default: 
 
max_standby_streaming_delay=-1
hot_standby_feedback=1


On the master create 2 dummy tables:
create table a  (id serial primary key);
create table b  (id serial primary key);

Setup thread 1 to do work on master:

while true; do psql -h db-master -U postgres db -c 'begin; drop view if exists view_a cascade; drop view if exists
view_b;drop view if exists view_c; create view view_a as select * from a; create view view_b as select * from b; create
viewview_c as select * from view_a join view_b using (id); insert into a values (default); insert into b values
(default);commit;'; done 

Setup thread 2 to do work on Slave:
while true; do psql -h  db-slave -U postgres db -c 'begin; select * from view_c order by random() limit 10; select *
fromview_a order by random() limit 10;'; done 

Setup thread 3 to do more work on slave:
while true; do psql -h  db-slave -U postgres db -c 'begin; select * from view_b order by random() limit 10; select *
fromview_a order by random() limit 10;'; done 

Every now and then a deadlock is detected and one connection is aborted, this works as expected. But After a
while(serveralhours) it becomes impossible to connect to db on db-slave and thread 2 and 3 stops producing output. When
tryingto connect the psql client just hangs. However it is possible connect to template1 database to get a look on what
isgoing on. 



template1=> select * from pg_stat_activity;
-[ RECORD 1 ]----+--------------------------------
datid            | 16384
datname          | rdsadmin
pid              | 7891
usesysid         | 10
usename          | rdsadmin
application_name |
client_addr      |
client_hostname  |
client_port      |
backend_start    |
xact_start       |
query_start      |
state_change     |
wait_event_type  |
wait_event       |
state            |
backend_xid      |
backend_xmin     |
query            | <insufficient privilege>
-[ RECORD 2 ]----+--------------------------------
datid            | 1
datname          | template1
pid              | 11949
usesysid         | 16388
usename          | hiper
application_name | psql
client_addr      | 192.168.10.166
client_hostname  |
client_port      | 41002
backend_start    | 2017-10-20 16:30:26.032745+02
xact_start       | 2017-10-20 16:30:34.306418+02
query_start      | 2017-10-20 16:30:34.306418+02
state_change     | 2017-10-20 16:30:34.306421+02
wait_event_type  |
wait_event       |
state            | active
backend_xid      |
backend_xmin     | 26891
query            | select * from pg_stat_activity;





There are no active connection except rdsadmin from aws.



template1=> select * from pg_locks;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid |
virtualtransaction|  pid  |        mode         | granted | fastpath  

------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+---------------------+---------+----------
 virtualxid |          |          |      |       | 3/929      |               |         |       |          |
3/929             |  9640 | ExclusiveLock       | t       | t 
 relation   |    16390 |     2659 |      |       |            |               |         |       |          |
4/829             |  9639 | AccessShareLock     | t       | t 
 relation   |    16390 |     1249 |      |       |            |               |         |       |          |
4/829             |  9639 | AccessShareLock     | t       | t 
 virtualxid |          |          |      |       | 4/829      |               |         |       |          |
4/829             |  9639 | ExclusiveLock       | t       | t 
 relation   |        1 |    11695 |      |       |            |               |         |       |          |
5/148             | 11949 | AccessShareLock     | t       | t 
 virtualxid |          |          |      |       | 5/148      |               |         |       |          |
5/148             | 11949 | ExclusiveLock       | t       | t 
 virtualxid |          |          |      |       | 1/1        |               |         |       |          |
1/0               |  7593 | ExclusiveLock       | t       | t 
 object     |        0 |          |      |       |            |               |    1262 | 16390 |        0 |
4/829             |  9639 | RowExclusiveLock    | t       | f 
 relation   |    16390 |     1259 |      |       |            |               |         |       |          |
4/829             |  9639 | AccessShareLock     | f       | f 
 relation   |    16390 |     1259 |      |       |            |               |         |       |          |
1/0               |  7593 | AccessExclusiveLock | t       | f 
 object     |        0 |          |      |       |            |               |    1262 | 16390 |        0 |
3/929             |  9640 | RowExclusiveLock    | t       | f 
 relation   |    16390 |     1259 |      |       |            |               |         |       |          |
3/929             |  9640 | AccessShareLock     | f       | f 
(12 rows)

Here there seems to be many threads that all wait for a lock on relation 1259, however I have no idea what the process
thathas the lock waits for (pid: 7593). I can't use pg_terminate_backend(7593) to free the lock, so my only option is
torestart it. 

I don't know if anyone can reproduce this effect?
why are the pids still holding locks even after the have been disconnected in pg_stat_activity?
I assume that one of the pids is the slave thread applying updates from the master?

This is not the exact problem Im trying to reproduce, in that example all connection end up in LOCKED state and there
appearto be a deadlock when inspecting the pg_locks table that don't get discovered. I hope I can reproduce that
examplealso at some time. 

But have anyone else experienced problem with slave stopping to replicate because everything ends up being locked?

Cheers
Kim Carlsen




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general