Re: Logical replication hangs up. - Mailing list pgsql-general

From Achilleas Mantzios
Subject Re: Logical replication hangs up.
Date
Msg-id 02fc0ca8-3b46-3c5d-9cf6-46d75fed7ec5@matrix.gatewaynet.com
In response to Re: Logical replication hangs up.  (Jerry Sievers <gsievers19@comcast.net>)
Responses Re: Logical replication hangs up.
List pgsql-general
On 3/11/18 6:02 p.m., Jerry Sievers wrote:
> Aleš Zelený <zeleny.ales@gmail.com> writes:
>
>> Hello,
>>
>> we are using logical replication on 10.4 and it now hangs. After
>> some timeout it is restarted again, replaying 18 GB of data and then
>> hangs (while 7 GB of WAL remains to be processed).
> Timeout...
>
> Have a look at the two settings wal_sender_timeout/wal_receiver_timeout; you
> probably need to raise the sender timeout value.
That's been the case for me as well, setting both values to '5 mins' solved my issues.
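For reference, a minimal sketch of what that looks like (assuming ALTER SYSTEM is acceptable in your setup; editing postgresql.conf and reloading works just as well, and on 10.x both settings are picked up on a reload):

-- on the publication (master) instance:
ALTER SYSTEM SET wal_sender_timeout = '5min';
SELECT pg_reload_conf();

-- on each subscription instance:
ALTER SYSTEM SET wal_receiver_timeout = '5min';
SELECT pg_reload_conf();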
>
> HTH
>
>> The backlog of 18 GB comes from a failed migration that added a new table
>> to the replication while the replication user had not been granted SELECT
>> on the table. This was clear from the log files, and once it was resolved
>> by adding the SELECT privilege, I thought everything would work as usual
>> (the same happened in the test environment many times, and adding the
>> missing SELECT grant was sufficient to get it working... those were
>> issues on tests).
>>
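For readers hitting the same situation, the repair described above would look roughly like the following sketch; the table name is a placeholder, repusr and sub_eur are the role and subscription names that appear in the logs below, and the REFRESH step may not be needed if the initial sync of the table had already started:

-- on the publisher: let the replication role read the newly published table
GRANT SELECT ON public.some_new_table TO repusr;   -- hypothetical table name

-- on the subscriber, if the table still does not start syncing:
ALTER SUBSCRIPTION sub_eur REFRESH PUBLICATION;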
>> RDBMS Version:
>> PostgreSQL 10.4 (Ubuntu 10.4-2.pgdg16.04+1) on x86_64-pc-linux-gnu,
>> compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609,
>> 64-bit
>>
>> Publication (master) instance error message from log:
>>
>> 2018-11-02 18:34:14 UTC 7974 5bdc8d27.1f26 7 192.168.23.11(58884)
>> master_prod repusr 0 5/0 sub_eur idle [00000]:LOG:  terminating
>> walsender process due to replication timeout
>>
>> Subscription instance log:
>>
>> 2018-11-02 18:34:14 UTC 8657 5bdc8d26.21d1 2    0 3/0
>> [XX000]:ERROR:  could not receive data from WAL stream: SSL
>> connection has been closed unexpectedly
>> 2018-11-02 18:34:14 UTC 24699 5b923b1c.607b 1209397    0
>> [00000]:LOG:  worker process: logical replication worker for
>> subscription 37932 (PID 8657) exited with exit code 1
>>
>> While it seems to be a network issue, it may well not be - we have
>> checked the network and the monitoring; some packets were being
>> exchanged the whole time.
>>
>> We have 3 subscriptions (thus replication slots) for one publication;
>> one subscriber instance is within the same datacenter as the master,
>> the remaining subscribers are remote.
>>
>>
>>>> select * from pg_replication_slots
>>
>> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
>> | slot_name  | plugin   | slot_type | datoid | database    | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn |
>> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
>> | sub_usd    | pgoutput | logical   | 16421  | master_prod | False     | True   | 16604      |      | 5536488      | 426/AAE55A68 | 426/AAE55A68        |
>> | sub_cad    | pgoutput | logical   | 16421  | master_prod | False     | True   | 22875      |      | 5536488      | 426/AAE55A68 | 426/AAE55A68        |
>> | sub_eur    | pgoutput | logical   | 16421  | master_prod | False     | True   | 16605      |      | 5536488      | 426/AAE55A68 | 426/AAE55A68        |
>> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
>>
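To put a number on how far behind those slots are, a query along these lines can be run on the publisher (a sketch only; pg_wal_lsn_diff returns the gap in bytes):

SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)         AS restart_lag_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS confirmed_lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';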
>>
>> Once the error occurs, the wal senders are restarted and they read
>> WAL files until they reach the restart_lsn WAL file:
>>
>>>> select pg_walfile_name('426/AAE55A68')
>> +--------------------------+
>> | pg_walfile_name          |
>> +--------------------------+
>> | 0000000100000426000000AA |
>> +--------------------------+
>>
>> # the open WAL file names change until they reach this one:
>> root@master-db:/pgsql/pgcluster/10/master_prod# lsof -p 1560 -p 5758 -p 5790 | grep pg_wal
>> postgres 1560 postgres   10r      REG  259,3  16777216 115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
>> postgres 5758 postgres   10r      REG  259,3  16777216 115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
>> postgres 5790 postgres   10r      REG  259,3  16777216 115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
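The same mapping can also be done from SQL instead of lsof; a sketch run on the publisher, which should roughly correspond to the segment the wal senders have open:

-- which WAL segment each wal sender is currently positioned at
SELECT application_name, sent_lsn, pg_walfile_name(sent_lsn) AS sent_walfile
FROM pg_stat_replication;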
>>
>>
>> At this moment sent_lsn stops growing and nothing is happening for a
>> while.
>>
>> select * from pg_stat_replication;
>>
>>   pid  | usesysid | usename |  application_name  | client_addr   | client_hostname | client_port |         backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority | sync_state
>> -------+----------+---------+--------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
>>  16604 |    37868 | repusr  | sub_usd            | 192.168.21.11 |                 |       35010 | 2018-11-02 23:52:22.059157+00 |              | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 | 426/AAE55A68 |           |           |            |             0 | async
>>  16605 |    37868 | repusr  | sub_eur            | 192.168.23.11 |                 |       36388 | 2018-11-02 23:52:24.75038+00  |              | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 | 426/AAE55A68 |           |           |            |             0 | async
>>  12387 |    37868 | repusr  | sub_cad            | 192.168.22.11 |                 |       51982 | 2018-11-02 23:30:31.372404+00 |              | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 | 426/AAE55A68 |           |           |            |             0 | async
>> (3 rows)
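When sent_lsn is stuck like that, it can also be worth checking what the wal sender backends are waiting on; a sketch for 10.x, which exposes backend_type in pg_stat_activity:

SELECT pid, application_name, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE backend_type = 'walsender';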
>>
>> What I found being done behind the scenes while, according to
>> pg_stat_replication, "nothing is happening":
>>
>> On the master, strace of the wal senders looks like this:
>>
>> open("pg_replslot/sub_usd/xid-6040508-lsn-429-57000000.snap",
>> O_WRONLY|O_CREAT|O_APPEND, 0600) = 15
>> write(15, "\334\0\0\0\0\0\0\0H\3\234W)\4\0\0\0\0\0\0\0\0\0\0\177\6\0\
>> 0%@\0\0"..., 220) = 220
>> close(15)                               = 0
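Those xid-*-lsn-*.snap files under pg_replslot look like the changes of a large transaction being spilled to disk while it is decoded, i.e. the senders are busy even though pg_stat_replication shows no progress. A rough way to watch that from SQL (superuser only; the path is relative to the data directory, and sub_usd is the slot from the strace above):

-- count the spill files accumulated for one slot
SELECT count(*) AS spill_files
FROM pg_ls_dir('pg_replslot/sub_usd') AS f(name)
WHERE f.name LIKE 'xid-%';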
>>
>> On the subscription instance, strace of the wal receiver looks like this:
>> recvfrom(7, 0x55f56a3783c3, 5, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
>> epoll_create1(EPOLL_CLOEXEC)            = 10
>> epoll_ctl(10, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282352, u64=94512536630832}}) = 0
>> epoll_ctl(10, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282376, u64=94512536630856}}) = 0
>> epoll_ctl(10, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282400, u64=94512536630880}}) = 0
>> epoll_wait(10, [], 1, 1000)             = 0
>>
>> tcpdump:
>> 21:33:16.811590 IP (tos 0x0, ttl 64, id 24478, offset 0, flags [DF], proto TCP (6), length 52)
>>     192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xe2ab (correct), ack 1360, win 235, options [nop,nop,TS val 3699603 ecr 2060668460], length 0
>> 21:33:26.821860 IP (tos 0x0, ttl 64, id 64522, offset 0, flags [DF], proto TCP (6), length 120)
>>     192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum 0x3ecf (incorrect -> 0xbaf0), seq 1360:1428, ack 1, win 265, options [nop,nop,TS val 2060670963 ecr 3699603], length 68
>> 21:33:26.822156 IP (tos 0x0, ttl 64, id 24479, offset 0, flags [DF], proto TCP (6), length 52)
>>     192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xceda (correct), ack 1428, win 235, options [nop,nop,TS val 3702105 ecr 2060670963], length 0
>> 21:33:36.832445 IP (tos 0x0, ttl 64, id 64523, offset 0, flags [DF], proto TCP (6), length 120)
>>     192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum 0x3ecf (incorrect -> 0x3881), seq 1428:1496, ack 1, win 265, options [nop,nop,TS val 2060673465 ecr 3702105], length 68
>> 21:33:36.832752 IP (tos 0x0, ttl 64, id 24480, offset 0, flags [DF], proto TCP (6), length 52)
>>     192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xbb09 (correct), ack 1496, win 235, options [nop,nop,TS val 3704608 ecr 2060673465], length 0
>> 21:33:46.843063 IP (tos 0x0, ttl 64, id 64524, offset 0, flags [DF], proto TCP (6), length 120)
>>     192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum 0x3ecf (incorrect -> 0xcf34), seq 1496:1564, ack 1, win 265, options [nop,nop,TS val 2060675968 ecr 3704608], length 68
>> 21:33:46.843362 IP (tos 0x0, ttl 64, id 24481, offset 0, flags [DF], proto TCP (6), length 52)
>>     192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xa737 (correct), ack 1564, win 235, options [nop,nop,TS val 3707111 ecr 2060675968], length 0
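The small 68-byte packets from the subscriber roughly every 10 seconds are consistent with the wal receiver's periodic status messages (wal_receiver_status_interval defaults to 10s), so the connection itself looks alive; a quick sanity check on the subscriber:

SHOW wal_receiver_status_interval;
SHOW wal_receiver_timeout;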
>>
>> Some checksum issues, but the packets are finally ack-ed. I tried to
>> open a psql session from the host running the subscription instance to
>> the master instance and it worked like a charm.
>>
>>
>> I've tried pg_waldump, but found nothing that I would consider an
>> error...
>>>> select pg_walfile_name('426/AAE55A68')
>> +--------------------------+
>> | pg_walfile_name          |
>> +--------------------------+
>> | 0000000100000426000000AA |
>> +--------------------------+
>> /usr/lib/postgresql/10/bin/pg_waldump -s 426/AAE55A68
>> 0000000100000426000000AA | less
>>
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55A68, prev 426/AAE559C0, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55AA0, prev 426/AAE55A68, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55AD8, prev 426/AAE55AA0, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55B10, prev 426/AAE55AD8, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55B48, prev 426/AAE55B10, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55B80, prev 426/AAE55B48, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn: 426/AAE55BB8, prev 426/AAE55B80, desc: SWITCH
>>
>> I've also done the same for sent_lsn:
>>>> select pg_walfile_name('429/69E9CC60');
>> +--------------------------+
>> | pg_walfile_name          |
>> +--------------------------+
>> | 000000010000042900000069 |
>> +--------------------------+
>>
>> /usr/lib/postgresql/10/bin/pg_waldump -s 429/69E9CC60
>> 000000010000042900000069 | less
>> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn: 429/69E9CC60, prev 429/69E9CC10, desc: DELETE off 81 KEYS_UPDATED , blkref #0: rel 1663/16421/38572 blk 40
>> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn: 429/69E9CCB0, prev 429/69E9CC60, desc: DELETE off 82 KEYS_UPDATED , blkref #0: rel 1663/16421/38572 blk 40
>> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn: 429/69E9CD00, prev 429/69E9CCB0, desc: DELETE off 83 KEYS_UPDATED , blkref #0: rel 1663/16421/38572 blk 40
>>
>> The relation 38572 from the list above is an ordinary non-replicated
>> table.
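For anyone wanting to reproduce that check: the blkref 1663/16421/38572 is tablespace/database/filenode, and the filenode can be mapped back to a table inside the master_prod database with a query like this sketch (only valid as long as the relation has not been rewritten since that WAL was generated):

SELECT oid::regclass AS relation
FROM pg_class
WHERE relfilenode = 38572;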
>>
>> Any help or advice on how to diagnose this and get it working again is
>> highly appreciated.
>>
>> Kind regards Ales Zeleny
>>
>>


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt


