[GENERAL] Replication stops under certain circumstances - Mailing list pgsql-general
From | Kim Rose Carlsen |
---|---|
Subject | [GENERAL] Replication stops under certain circumstances |
Date | |
Msg-id | VI1PR05MB170908E04964213E8C2F3837C7430@VI1PR05MB1709.eurprd05.prod.outlook.com Whole thread Raw |
List | pgsql-general |
Hi I have some strange issues with a postgresql read replica that seems to stop replicating under certain circumstances. Whenever we have changes to our views we have script that drops all views and reload them from scratch with the new definitions.The reloading of the views happens in a transaction to avoid confusion for everyone using the database. Whenthis update gets to the slave it seems there is a chance for a deadlock to occur that doesn't get detected. As I was trying to reproduce this behavior, I ran into another weird situation that I don't entirely understand. The symptomis the same that replication stops, but it looks quite different. This example won't reproduce the issue reliably,but after a few hours I get a slave that won't continue to replicate until I restart it. The queries in the examplewon't make much sense, and I don't know if they can be simplified further and still cause the "desired" effect. Setup: Launch a new RDS psql instance (9.6.2) on AWS (will be referred to as db-master) and create a read replica (will be referredto as db-slave). The following options are changed from AWS default: max_standby_streaming_delay=-1 hot_standby_feedback=1 On the master create 2 dummy tables: create table a (id serial primary key); create table b (id serial primary key); Setup thread 1 to do work on master: while true; do psql -h db-master -U postgres db -c 'begin; drop view if exists view_a cascade; drop view if exists view_b;drop view if exists view_c; create view view_a as select * from a; create view view_b as select * from b; create viewview_c as select * from view_a join view_b using (id); insert into a values (default); insert into b values (default);commit;'; done Setup thread 2 to do work on Slave: while true; do psql -h db-slave -U postgres db -c 'begin; select * from view_c order by random() limit 10; select * fromview_a order by random() limit 10;'; done Setup thread 3 to do more work on slave: while true; do psql -h db-slave -U postgres db -c 'begin; select * from view_b order by random() limit 10; select * fromview_a order by random() limit 10;'; done Every now and then a deadlock is detected and one connection is aborted, this works as expected. But After a while(serveralhours) it becomes impossible to connect to db on db-slave and thread 2 and 3 stops producing output. When tryingto connect the psql client just hangs. However it is possible connect to template1 database to get a look on what isgoing on. template1=> select * from pg_stat_activity; -[ RECORD 1 ]----+-------------------------------- datid | 16384 datname | rdsadmin pid | 7891 usesysid | 10 usename | rdsadmin application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query | <insufficient privilege> -[ RECORD 2 ]----+-------------------------------- datid | 1 datname | template1 pid | 11949 usesysid | 16388 usename | hiper application_name | psql client_addr | 192.168.10.166 client_hostname | client_port | 41002 backend_start | 2017-10-20 16:30:26.032745+02 xact_start | 2017-10-20 16:30:34.306418+02 query_start | 2017-10-20 16:30:34.306418+02 state_change | 2017-10-20 16:30:34.306421+02 wait_event_type | wait_event | state | active backend_xid | backend_xmin | 26891 query | select * from pg_stat_activity; There are no active connection except rdsadmin from aws. template1=> select * from pg_locks; locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction| pid | mode | granted | fastpath ------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+---------------------+---------+---------- virtualxid | | | | | 3/929 | | | | | 3/929 | 9640 | ExclusiveLock | t | t relation | 16390 | 2659 | | | | | | | | 4/829 | 9639 | AccessShareLock | t | t relation | 16390 | 1249 | | | | | | | | 4/829 | 9639 | AccessShareLock | t | t virtualxid | | | | | 4/829 | | | | | 4/829 | 9639 | ExclusiveLock | t | t relation | 1 | 11695 | | | | | | | | 5/148 | 11949 | AccessShareLock | t | t virtualxid | | | | | 5/148 | | | | | 5/148 | 11949 | ExclusiveLock | t | t virtualxid | | | | | 1/1 | | | | | 1/0 | 7593 | ExclusiveLock | t | t object | 0 | | | | | | 1262 | 16390 | 0 | 4/829 | 9639 | RowExclusiveLock | t | f relation | 16390 | 1259 | | | | | | | | 4/829 | 9639 | AccessShareLock | f | f relation | 16390 | 1259 | | | | | | | | 1/0 | 7593 | AccessExclusiveLock | t | f object | 0 | | | | | | 1262 | 16390 | 0 | 3/929 | 9640 | RowExclusiveLock | t | f relation | 16390 | 1259 | | | | | | | | 3/929 | 9640 | AccessShareLock | f | f (12 rows) Here there seems to be many threads that all wait for a lock on relation 1259, however I have no idea what the process thathas the lock waits for (pid: 7593). I can't use pg_terminate_backend(7593) to free the lock, so my only option is torestart it. I don't know if anyone can reproduce this effect? why are the pids still holding locks even after the have been disconnected in pg_stat_activity? I assume that one of the pids is the slave thread applying updates from the master? This is not the exact problem Im trying to reproduce, in that example all connection end up in LOCKED state and there appearto be a deadlock when inspecting the pg_locks table that don't get discovered. I hope I can reproduce that examplealso at some time. But have anyone else experienced problem with slave stopping to replicate because everything ends up being locked? Cheers Kim Carlsen -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
pgsql-general by date: