Thread: Logical replication hangs up.

Logical replication hangs up.

From
Aleš Zelený
Date:
Hello,

we are using logical replication on 10.4 and it now hangs. After some timeout it is restarted again, replaying 18 GB of data and then hanging again (while 7 GB of WAL remains to be processed).

The 18 GB backlog comes from a failed migration that added a new table to the replication while the replication user had not been granted SELECT on that table. This was clear from the log files, and once it was resolved by adding the SELECT privilege, I expected everything to work as usual (the same thing happened many times in the test environment, and adding the missing SELECT grant was sufficient to get it working... those were issues on test).

RDBMS Version:
PostgreSQL 10.4 (Ubuntu 10.4-2.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit

Publication (master) instance error message from log:

2018-11-02 18:34:14 UTC 7974 5bdc8d27.1f26 7 192.168.23.11(58884) master_prod repusr 0 5/0 sub_eur idle [00000]:LOG:  terminating walsender process due to replication timeout

Subscription instance log:

2018-11-02 18:34:14 UTC 8657 5bdc8d26.21d1 2    0 3/0   [XX000]:ERROR:  could not receive data from WAL stream: SSL connection has been closed unexpectedly
2018-11-02 18:34:14 UTC 24699 5b923b1c.607b 1209397    0    [00000]:LOG:  worker process: logical replication worker for subscription 37932 (PID 8657) exited with exit code 1

While it looks like a network issue, it may not be one - we checked the network and its monitoring, and packets were being exchanged the whole time.

We have 3 subscriptions (thus 3 replication slots) for one publication; one subscriber instance is in the same datacenter as the master, the remaining subscribers are remote.


>> select * from pg_replication_slots
+------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
| slot_name  | plugin   | slot_type | datoid | database    | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn |
+------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
| sub_usd    | pgoutput | logical   | 16421  | master_prod | False     | True   | 16604      |      | 5536488      | 426/AAE55A68 | 426/AAE55A68        |
| sub_cad    | pgoutput | logical   | 16421  | master_prod | False     | True   | 22875      |      | 5536488      | 426/AAE55A68 | 426/AAE55A68        |
| sub_eur    | pgoutput | logical   | 16421  | master_prod | False     | True   | 16605      |      | 5536488      | 426/AAE55A68 | 426/AAE55A68        |
+------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+


After the error occurs, the WAL senders are restarted and they read WAL files until they reach the restart_lsn WAL file:

>> select pg_walfile_name('426/AAE55A68')
+--------------------------+
| pg_walfile_name          |
+--------------------------+
| 0000000100000426000000AA |
+--------------------------+
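As a side note, the mapping pg_walfile_name performs can be reproduced by hand; a minimal sketch, assuming the default 16 MB segment size and timeline 1 (this is only an illustration, not a substitute for the SQL function):

```python
# Sketch of the LSN-to-WAL-filename mapping pg_walfile_name performs,
# assuming the default 16 MB segment size and timeline 1.
WAL_SEG_SIZE = 16 * 1024 * 1024

def walfile_name(lsn: str, timeline: int = 1) -> str:
    # An LSN prints as <high32>/<low32> in hex; the segment number is
    # the low half divided by the segment size.
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    segno = lo // WAL_SEG_SIZE
    return f"{timeline:08X}{hi:08X}{segno:08X}"

print(walfile_name("426/AAE55A68"))  # 0000000100000426000000AA
```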

# file names change until they reach this one:
root@master-db:/pgsql/pgcluster/10/master_prod# lsof -p 1560 -p 5758 -p 5790| grep pg_wal
postgres 1560 postgres   10r      REG  259,3  16777216 115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
postgres 5758 postgres   10r      REG  259,3  16777216 115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
postgres 5790 postgres   10r      REG  259,3  16777216 115766007 /pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069



At this moment sent_lsn stops growing and nothing is happening for a while.

select * from pg_stat_replication;

  pid  | usesysid | usename |  application_name  | client_addr   | client_hostname | client_port |         backend_start         | backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |  replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+--------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
 16604 |    37868 | repusr  | sub_usd            | 192.168.21.11 |                 |       35010 | 2018-11-02 23:52:22.059157+00 |              | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 | 426/AAE55A68 |           |           |            |             0 | async
 16605 |    37868 | repusr  | sub_eur            | 192.168.23.11 |                 |       36388 | 2018-11-02 23:52:24.75038+00  |              | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 | 426/AAE55A68 |           |           |            |             0 | async
 12387 |    37868 | repusr  | sub_cad            | 192.168.22.11 |                 |       51982 | 2018-11-02 23:30:31.372404+00 |              | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 | 426/AAE55A68 |           |           |            |             0 | async
(3 rows)
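The gap between sent_lsn (429/69E9CC60) and flush_lsn (426/AAE55A68) above can be quantified; on the server pg_wal_lsn_diff() does this, and the same arithmetic works as a standalone sketch (an LSN is a 64-bit position printed as two hex halves):

```python
# Byte distance between two LSNs, mirroring what pg_wal_lsn_diff()
# computes on the server.
def lsn_to_int(lsn: str) -> int:
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    return (hi << 32) | lo

gap = lsn_to_int("429/69E9CC60") - lsn_to_int("426/AAE55A68")
print(f"{gap} bytes = {gap / 1024**3:.1f} GiB")  # about 11 GiB unflushed
```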


What I found happening behind the scenes while, according to pg_stat_replication, "nothing is happening":

On the master, the WAL senders' strace output looks like:

open("pg_replslot/sub_usd/xid-6040508-lsn-429-57000000.snap", O_WRONLY|O_CREAT|O_APPEND, 0600) = 15
write(15, "\334\0\0\0\0\0\0\0H\3\234W)\4\0\0\0\0\0\0\0\0\0\0\177\6\0\0%@\0\0"..., 220) = 220
close(15)                               = 0
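Those pg_replslot/<slot>/xid-*.snap files appear to be reorder-buffer spill files, i.e. the walsender serializing a large in-progress transaction to disk; the filename itself encodes the XID and the LSN. An illustrative parser (the filename format is inferred from the strace output above, not from any official API):

```python
import re

# Decode a spill filename like the one seen in the strace above,
# assumed format: "xid-<xid>-lsn-<hi>-<lo>.snap".
def parse_snap(name: str):
    m = re.fullmatch(r"xid-(\d+)-lsn-([0-9A-Fa-f]+)-([0-9A-Fa-f]+)\.snap", name)
    if not m:
        raise ValueError(f"unexpected spill filename: {name}")
    xid, hi, lo = m.groups()
    return int(xid), f"{hi.upper()}/{lo.upper()}"

print(parse_snap("xid-6040508-lsn-429-57000000.snap"))
# (6040508, '429/57000000')
```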


On the subscription instance, strace for the WAL receiver looks like this:
recvfrom(7, 0x55f56a3783c3, 5, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_create1(EPOLL_CLOEXEC)            = 10
epoll_ctl(10, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282352, u64=94512536630832}}) = 0
epoll_ctl(10, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282376, u64=94512536630856}}) = 0
epoll_ctl(10, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1781282400, u64=94512536630880}}) = 0
epoll_wait(10, [], 1, 1000)             = 0


tcpdump:
21:33:16.811590 IP (tos 0x0, ttl 64, id 24478, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xe2ab (correct), ack 1360, win 235, options [nop,nop,TS val 3699603 ecr 2060668460], length 0
21:33:26.821860 IP (tos 0x0, ttl 64, id 64522, offset 0, flags [DF], proto TCP (6), length 120)
    192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum 0x3ecf (incorrect -> 0xbaf0), seq 1360:1428, ack 1, win 265, options [nop,nop,TS val 2060670963 ecr 3699603], length 68
21:33:26.822156 IP (tos 0x0, ttl 64, id 24479, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xceda (correct), ack 1428, win 235, options [nop,nop,TS val 3702105 ecr 2060670963], length 0
21:33:36.832445 IP (tos 0x0, ttl 64, id 64523, offset 0, flags [DF], proto TCP (6), length 120)
    192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum 0x3ecf (incorrect -> 0x3881), seq 1428:1496, ack 1, win 265, options [nop,nop,TS val 2060673465 ecr 3702105], length 68
21:33:36.832752 IP (tos 0x0, ttl 64, id 24480, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xbb09 (correct), ack 1496, win 235, options [nop,nop,TS val 3704608 ecr 2060673465], length 0
21:33:46.843063 IP (tos 0x0, ttl 64, id 64524, offset 0, flags [DF], proto TCP (6), length 120)
    192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum 0x3ecf (incorrect -> 0xcf34), seq 1496:1564, ack 1, win 265, options [nop,nop,TS val 2060675968 ecr 3704608], length 68
21:33:46.843362 IP (tos 0x0, ttl 64, id 24481, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xa737 (correct), ack 1564, win 235, options [nop,nop,TS val 3707111 ecr 2060675968], length 0


The "incorrect" checksums are only on outgoing packets, which is expected when the capturing host uses TCP checksum offloading, and everything is eventually ACKed. I tried opening a psql session from the host running the subscription instance to the master instance and it worked like a charm.


I've tried pg_waldump, but found nothing I'd consider an error...
>> select pg_walfile_name('426/AAE55A68')
+--------------------------+
| pg_walfile_name          |
+--------------------------+
| 0000000100000426000000AA |
+--------------------------+

/usr/lib/postgresql/10/bin/pg_waldump -s 426/AAE55A68 0000000100000426000000AA | less

rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55A68, prev 426/AAE559C0, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55AA0, prev 426/AAE55A68, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55AD8, prev 426/AAE55AA0, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55B10, prev 426/AAE55AD8, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55B48, prev 426/AAE55B10, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 426/AAE55B80, prev 426/AAE55B48, desc: RUNNING_XACTS nextXid 5536488 latestCompletedXid 5536487 oldestRunningXid 5536488
rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn: 426/AAE55BB8, prev 426/AAE55B80, desc: SWITCH


I also did the same for sent_lsn:
>> select pg_walfile_name('429/69E9CC60');
+--------------------------+
| pg_walfile_name          |
+--------------------------+
| 000000010000042900000069 |
+--------------------------+


/usr/lib/postgresql/10/bin/pg_waldump -s 429/69E9CC60 000000010000042900000069 | less
rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn: 429/69E9CC60, prev 429/69E9CC10, desc: DELETE off 81 KEYS_UPDATED , blkref #0: rel 1663/16421/38572 blk 40
rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn: 429/69E9CCB0, prev 429/69E9CC60, desc: DELETE off 82 KEYS_UPDATED , blkref #0: rel 1663/16421/38572 blk 40
rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn: 429/69E9CD00, prev 429/69E9CCB0, desc: DELETE off 83 KEYS_UPDATED , blkref #0: rel 1663/16421/38572 blk 40

Relation 38572 from the listing above is an ordinary, non-replicated table.
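To double-check which relation a pg_waldump blkref such as rel 1663/16421/38572 points at, the filenode can be resolved in SQL; a sketch (run while connected to the database with OID 16421; pg_filenode_relation takes the tablespace OID, with 0 meaning the default tablespace):

```sql
-- Resolve the relfilenode 38572 seen in the pg_waldump output above.
SELECT pg_filenode_relation(0, 38572);  -- 0 = pg_default tablespace (1663)
-- Equivalent lookup via the catalog:
SELECT relname FROM pg_class WHERE relfilenode = 38572;
```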

Any help or advice how to diagnose/get it working is highly appreciated.

Kind regards Ales Zeleny

Re: Logical replication hangs up.

From
Jerry Sievers
Date:
Aleš Zelený <zeleny.ales@gmail.com> writes:

> Hello,
>
> we are using logical replication on 10.4 and it now hangs. After
> some timeout it is restarted again, replaying 18 GB of data and then
> hanging again (while 7 GB of WAL remains to be processed).

Timeout...

Have a look at the two settings, wal_sender_timeout and
wal_receiver_timeout; you probably need to raise the sender timeout
value.
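For reference, both timeouts are ordinary server settings; a hedged sketch of raising them (the 5min value is purely illustrative, tune it to your network):

```sql
-- On the publisher:
ALTER SYSTEM SET wal_sender_timeout = '5min';
-- On each subscriber:
ALTER SYSTEM SET wal_receiver_timeout = '5min';
-- Then reload the configuration on each side:
SELECT pg_reload_conf();
```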

HTH


--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800


Re: Logical replication hangs up.

From
Aleš Zelený
Date:
Hello,

wal_sender_timeout did not help (raised from the default 60s to 600s), but I haven't tested the receiver timeout, since it was the sender complaining while the receiver loops over recvfrom with a "Resource temporarily unavailable" result, so I assumed it was waiting for the master's sender to provide some data - thanks for the tip.

Since the behavior is deterministic every time - it works as expected until the sender reaches 429/69E9CC60 (WAL file 000000010000042900000069) and then gets stuck - it seems to me that the issue is somehow related to the content of this file.

Another question is why restart_lsn (flush_lsn / replay_lsn on the master) never gets past 426/AAE55A68. Maybe it is a gap in my knowledge, but I'd expect that within almost 11 GB of WAL data there would be some committed changes, and it lags far behind the maximum sent_lsn reported by the WAL sender - I'd expect replay_lsn to more or less follow sent_lsn, but I have no clue what the root cause is (a sender stuck at a certain sent_lsn, or the logical replication worker not applying/reporting and thus keeping restart_lsn constant). That is why I tried to find anything suspicious with pg_waldump in the WAL files corresponding to restart_lsn and sent_lsn.

Kind regards Ales

On Sat, 3 Nov 2018 at 17:02, Jerry Sievers <gsievers19@comcast.net> wrote:
Aleš Zelený <zeleny.ales@gmail.com> writes:

> Hello,
>
> we are using logical replication on 10.4 and it now hangs. After
> some timeout it is restarted again, replaying 18 GB of data and then
> hanging again (while 7 GB of WAL remains to be processed).

Timeout...

Have a look at the two settings, wal_sender_timeout and
wal_receiver_timeout; you probably need to raise the sender timeout
value.

HTH

>
> The backlog of 18GB comes from a failed migration adding new table to
> replication while replication user was not granted to select the
> table. This was clear from log files and once resolved by adding
> select privilege, I thought that all will work as usual (same
> happened in test env. many times and adding missing grant for select
> was sufficient to get it working... these were issues on tests).
>
> RDBMS Version:
> PostgreSQL 10.4 (Ubuntu 10.4-2.pgdg16.04+1) on x86_64-pc-linux-gnu,
> compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609,
> 64-bit
>
> Publication (master) instance error message from log:
>
> 2018-11-02 18:34:14 UTC 7974 5bdc8d27.1f26 7 192.168.23.11(58884)
> master_prod repusr 0 5/0 sub_eur idle [00000]:LOG:  terminating
> walsender process due to replication timeout
>
> Subscription instance log:
>
> 2018-11-02 18:34:14 UTC 8657 5bdc8d26.21d1 2    0 3/0  
> [XX000]:ERROR:  could not receive data from WAL stream: SSL
> connection has been closed unexpectedly
> 2018-11-02 18:34:14 UTC 24699 5b923b1c.607b 1209397    0   
> [00000]:LOG:  worker process: logical replication worker for
> subscription 37932 (PID 8657) exited with exit code 1
>
> While it seems to be a network issue, it is may be not - we have
> checked the network and even monitoring, all the time some packets
> were exchanged.
>
> We do have 3 subscriptions(thus repl. slots) for one publication, one
> subscriber instance is within same datacenter as master, remainig
> subscribers are remote.
>
>
>>> select * from pg_replication_slots
> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
> | slot_name  | plugin   | slot_type | datoid | database    |
> temporary | active | active_pid | xmin | catalog_xmin | restart_lsn 
> | confirmed_flush_lsn |
> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
> | sub_usd    | pgoutput | logical   | 16421  | master_prod |
> False     | True   | 16604      |      | 5536488      | 426/AAE55A68
> | 426/AAE55A68        |
> | sub_cad    | pgoutput | logical   | 16421  | master_prod |
> False     | True   | 22875      |      | 5536488      | 426/AAE55A68
> | 426/AAE55A68        |
> | sub_eur    | pgoutput | logical   | 16421  | master_prod |
> False     | True   | 16605      |      | 5536488      | 426/AAE55A68
> | 426/AAE55A68        |
> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
>
>
> Once  after the error occurs, wal senders are re/started and they
> reads wal files until they reach restart_lsn wal file:
>
>>> select pg_walfile_name('426/AAE55A68')
> +--------------------------+
> | pg_walfile_name          |
> +--------------------------+
> | 0000000100000426000000AA |
> +--------------------------+
>
> # changing file names until they reach this one:
> root@master-db:/pgsql/pgcluster/10/master_prod# lsof -p 1560 -p 5758
> -p 5790| grep pg_wal
> postgres 1560 postgres   10r      REG  259,3  16777216 115766007 /
> pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
> postgres 5758 postgres   10r      REG  259,3  16777216 115766007 /
> pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
> postgres 5790 postgres   10r      REG  259,3  16777216 115766007 /
> pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
>
>
> At this moment sent_lsn stops growing and nothing is happening for a
> while.
>
> select * from pg_stat_replication;
>
>   pid  | usesysid | usename |  application_name  | client_addr   |
> client_hostname | client_port |         backend_start         |
> backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |
>   replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority |
> sync_state
> -------+----------+---------+--------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
>  16604 |    37868 | repusr  | sub_usd            | 192.168.21.11 |
>                  |       35010 | 2018-11-02 23:52:22.059157+00 |
>               | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68
> | 426/AAE55A68 |           |           |            |             0 |
> async
>  16605 |    37868 | repusr  | sub_eur            | 192.168.23.11 |
>                  |       36388 | 2018-11-02 23:52:24.75038+00  |
>               | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68
> | 426/AAE55A68 |           |           |            |             0 |
> async
>  12387 |    37868 | repusr  | sub_cad            | 192.168.22.11 |
>                  |       51982 | 2018-11-02 23:30:31.372404+00 |
>               | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68
> | 426/AAE55A68 |           |           |            |             0 |
> async
> (3 rows)
>
> What I found being done behind the scenes while from
> pg_stat_replication "nothing is happening":
>
> On Master wal snaders strace look like:
>
> open("pg_replslot/sub_usd/xid-6040508-lsn-429-57000000.snap",
> O_WRONLY|O_CREAT|O_APPEND, 0600) = 15
> write(15, "\334\0\0\0\0\0\0\0H\3\234W)\4\0\0\0\0\0\0\0\0\0\0\177\6\0\
> 0%@\0\0"..., 220) = 220
> close(15)                               = 0
>
> On subscription instance, strice for wal reciever looks like this:
> recvfrom(7, 0x55f56a3783c3, 5, 0, NULL, NULL) = -1 EAGAIN (Resource
> temporarily unavailable)
> epoll_create1(EPOLL_CLOEXEC)            = 10
> epoll_ctl(10, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=
> 1781282352, u64=94512536630832}}) = 0
> epoll_ctl(10, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=
> 1781282376, u64=94512536630856}}) = 0
> epoll_ctl(10, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=
> 1781282400, u64=94512536630880}}) = 0
> epoll_wait(10, [], 1, 1000)             = 0
>
> tcpdump:
> 21:33:16.811590 IP (tos 0x0, ttl 64, id 24478, offset 0, flags [DF],
> proto TCP (6), length 52)
>     192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xe2ab
> (correct), ack 1360, win 235, options [nop,nop,TS val 3699603 ecr
> 2060668460], length 0
> 21:33:26.821860 IP (tos 0x0, ttl 64, id 64522, offset 0, flags [DF],
> proto TCP (6), length 120)
>     192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum
> 0x3ecf (incorrect -> 0xbaf0), seq 1360:1428, ack 1, win 265, options
> [nop,nop,TS val 2060670963 ecr 3699603], length 68
> 21:33:26.822156 IP (tos 0x0, ttl 64, id 24479, offset 0, flags [DF],
> proto TCP (6), length 52)
>     192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xceda
> (correct), ack 1428, win 235, options [nop,nop,TS val 3702105 ecr
> 2060670963], length 0
> 21:33:36.832445 IP (tos 0x0, ttl 64, id 64523, offset 0, flags [DF],
> proto TCP (6), length 120)
>     192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum
> 0x3ecf (incorrect -> 0x3881), seq 1428:1496, ack 1, win 265, options
> [nop,nop,TS val 2060673465 ecr 3702105], length 68
> 21:33:36.832752 IP (tos 0x0, ttl 64, id 24480, offset 0, flags [DF],
> proto TCP (6), length 52)
>     192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xbb09
> (correct), ack 1496, win 235, options [nop,nop,TS val 3704608 ecr
> 2060673465], length 0
> 21:33:46.843063 IP (tos 0x0, ttl 64, id 64524, offset 0, flags [DF],
> proto TCP (6), length 120)
>     192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum
> 0x3ecf (incorrect -> 0xcf34), seq 1496:1564, ack 1, win 265, options
> [nop,nop,TS val 2060675968 ecr 3704608], length 68
> 21:33:46.843362 IP (tos 0x0, ttl 64, id 24481, offset 0, flags [DF],
> proto TCP (6), length 52)
>     192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xa737
> (correct), ack 1564, win 235, options [nop,nop,TS val 3707111 ecr
> 2060675968], length 0
>
> Some checksum issues, but finally ACKed. I tried to open a psql session
> from the host running the subscription instance to the master instance
> and it worked like a charm.
>
>
> I've tried pg_waldump, but found nothing that I'd consider an
> error...
>>> select pg_walfile_name('426/AAE55A68')
> +--------------------------+
> | pg_walfile_name          |
> +--------------------------+
> | 0000000100000426000000AA |
> +--------------------------+
> /usr/lib/postgresql/10/bin/pg_waldump -s 426/AAE55A68
> 0000000100000426000000AA | less
>
> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
> 426/AAE55A68, prev 426/AAE559C0, desc: RUNNING_XACTS nextXid 5536488
> latestCompletedXid 5536487 oldestRunningXid 5536488
> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
> 426/AAE55AA0, prev 426/AAE55A68, desc: RUNNING_XACTS nextXid 5536488
> latestCompletedXid 5536487 oldestRunningXid 5536488
> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
> 426/AAE55AD8, prev 426/AAE55AA0, desc: RUNNING_XACTS nextXid 5536488
> latestCompletedXid 5536487 oldestRunningXid 5536488
> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
> 426/AAE55B10, prev 426/AAE55AD8, desc: RUNNING_XACTS nextXid 5536488
> latestCompletedXid 5536487 oldestRunningXid 5536488
> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
> 426/AAE55B48, prev 426/AAE55B10, desc: RUNNING_XACTS nextXid 5536488
> latestCompletedXid 5536487 oldestRunningXid 5536488
> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
> 426/AAE55B80, prev 426/AAE55B48, desc: RUNNING_XACTS nextXid 5536488
> latestCompletedXid 5536487 oldestRunningXid 5536488
> rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn:
> 426/AAE55BB8, prev 426/AAE55B80, desc: SWITCH
>
>
> I also did the same for sent_lsn:
>>> select pg_walfile_name('429/69E9CC60');
> +--------------------------+
> | pg_walfile_name          |
> +--------------------------+
> | 000000010000042900000069 |
> +--------------------------+
>
> /usr/lib/postgresql/10/bin/pg_waldump -s 429/69E9CC60
> 000000010000042900000069 | less
> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn:
> 429/69E9CC60, prev 429/69E9CC10, desc: DELETE off 81 KEYS_UPDATED ,
> blkref #0: rel 1663/16421/38572 blk 40
> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn:
> 429/69E9CCB0, prev 429/69E9CC60, desc: DELETE off 82 KEYS_UPDATED ,
> blkref #0: rel 1663/16421/38572 blk 40
> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn:
> 429/69E9CD00, prev 429/69E9CCB0, desc: DELETE off 83 KEYS_UPDATED ,
> blkref #0: rel 1663/16421/38572 blk 40
>
> The relation 38572 from the list above is an ordinary non-replicated
> table.
>
> Any help or advice how to diagnose/get it working is highly
> appreciated.
>
> Kind regards Ales Zeleny
>
>

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800

Unsubscribe

From
Christophe TAVERNE
Date:
Hi,

I would like to unsubscribe, but I can't recover my account.

How do I proceed?

Please, admin, could you remove my email from List?


Thanks

regards,

Chris




Re: Logical replication hangs up.

From
Achilleas Mantzios
Date:
On 3/11/18 6:02 μ.μ., Jerry Sievers wrote:
> Aleš Zelený <zeleny.ales@gmail.com> writes:
>
>> Hello,
>>
>> we are using logical replication on 10.4 and it now hangs. After
>> some timeout it is restarted again, replaying 18 GB of data and then
>> hangs (while 7 GB of WAL remains to be processed).
> Timeout...
>
> Have a look at the 2 setting wal sender/receiver timeout and you
> probably need to raise the sender timeout value.
That's been the case for me as well, setting both values to '5 mins' solved my issues.
>
> HTH
>
>> The backlog of 18GB comes from a failed migration adding new table to
>> replication while replication user was not granted to select the
>> table. This was clear from log files and once resolved by adding
>> select privilege, I thought that all will work as usual (same
>> happened in test env. many times and adding missing grant for select
>> was sufficient to get it working... these were issues on tests).
>>
>> RDBMS Version:
>> PostgreSQL 10.4 (Ubuntu 10.4-2.pgdg16.04+1) on x86_64-pc-linux-gnu,
>> compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609,
>> 64-bit
>>
>> Publication (master) instance error message from log:
>>
>> 2018-11-02 18:34:14 UTC 7974 5bdc8d27.1f26 7 192.168.23.11(58884)
>> master_prod repusr 0 5/0 sub_eur idle [00000]:LOG:  terminating
>> walsender process due to replication timeout
>>
>> Subscription instance log:
>>
>> 2018-11-02 18:34:14 UTC 8657 5bdc8d26.21d1 2    0 3/0
>> [XX000]:ERROR:  could not receive data from WAL stream: SSL
>> connection has been closed unexpectedly
>> 2018-11-02 18:34:14 UTC 24699 5b923b1c.607b 1209397    0
>> [00000]:LOG:  worker process: logical replication worker for
>> subscription 37932 (PID 8657) exited with exit code 1
>>
>> While it seems to be a network issue, it is may be not - we have
>> checked the network and even monitoring, all the time some packets
>> were exchanged.
>>
>> We have 3 subscriptions (thus replication slots) for one publication; one
>> subscriber instance is within the same datacenter as the master, the
>> remaining subscribers are remote.
>>
>>
>>>> select * from pg_replication_slots
>>
+------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
>> | slot_name  | plugin   | slot_type | datoid | database    |
>> temporary | active | active_pid | xmin | catalog_xmin | restart_lsn
>> | confirmed_flush_lsn |
>>
+------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
>> | sub_usd    | pgoutput | logical   | 16421  | master_prod |
>> False     | True   | 16604      |      | 5536488      | 426/AAE55A68
>> | 426/AAE55A68        |
>> | sub_cad    | pgoutput | logical   | 16421  | master_prod |
>> False     | True   | 22875      |      | 5536488      | 426/AAE55A68
>> | 426/AAE55A68        |
>> | sub_eur    | pgoutput | logical   | 16421  | master_prod |
>> False     | True   | 16605      |      | 5536488      | 426/AAE55A68
>> | 426/AAE55A68        |
>>
+------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
>>
>>
>> Once the error occurs, the WAL senders are restarted and they
>> read WAL files until they reach the restart_lsn WAL file:
>>
>>>> select pg_walfile_name('426/AAE55A68')
>> +--------------------------+
>> | pg_walfile_name          |
>> +--------------------------+
>> | 0000000100000426000000AA |
>> +--------------------------+
>>
>> # changing file names until they reach this one:
>> root@master-db:/pgsql/pgcluster/10/master_prod# lsof -p 1560 -p 5758
>> -p 5790| grep pg_wal
>> postgres 1560 postgres   10r      REG  259,3  16777216 115766007 /
>> pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
>> postgres 5758 postgres   10r      REG  259,3  16777216 115766007 /
>> pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
>> postgres 5790 postgres   10r      REG  259,3  16777216 115766007 /
>> pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
>>
>>
>> At this moment sent_lsn stops growing and nothing is happening for a
>> while.
>>
>> select * from pg_stat_replication;
>>
>>    pid  | usesysid | usename |  application_name  | client_addr   |
>> client_hostname | client_port |         backend_start         |
>> backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |
>>    replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority |
>> sync_state
>>
-------+----------+---------+--------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
>>   16604 |    37868 | repusr  | sub_usd            | 192.168.21.11 |
>>                   |       35010 | 2018-11-02 23:52:22.059157+00 |
>>                | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68
>> | 426/AAE55A68 |           |           |            |             0 |
>> async
>>   16605 |    37868 | repusr  | sub_eur            | 192.168.23.11 |
>>                   |       36388 | 2018-11-02 23:52:24.75038+00  |
>>                | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68
>> | 426/AAE55A68 |           |           |            |             0 |
>> async
>>   12387 |    37868 | repusr  | sub_cad            | 192.168.22.11 |
>>                   |       51982 | 2018-11-02 23:30:31.372404+00 |
>>                | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68
>> | 426/AAE55A68 |           |           |            |             0 |
>> async
>> (3 rows)
>>
>> What I found being done behind the scenes while from
>> pg_stat_replication "nothing is happening":
>>
>> On the master, the WAL senders' strace looks like:
>>
>> open("pg_replslot/sub_usd/xid-6040508-lsn-429-57000000.snap",
>> O_WRONLY|O_CREAT|O_APPEND, 0600) = 15
>> write(15, "\334\0\0\0\0\0\0\0H\3\234W)\4\0\0\0\0\0\0\0\0\0\0\177\6\0\
>> 0%@\0\0"..., 220) = 220
>> close(15)                               = 0
>>
>> On the subscription instance, strace for the WAL receiver looks like this:
>> recvfrom(7, 0x55f56a3783c3, 5, 0, NULL, NULL) = -1 EAGAIN (Resource
>> temporarily unavailable)
>> epoll_create1(EPOLL_CLOEXEC)            = 10
>> epoll_ctl(10, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=
>> 1781282352, u64=94512536630832}}) = 0
>> epoll_ctl(10, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=
>> 1781282376, u64=94512536630856}}) = 0
>> epoll_ctl(10, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=
>> 1781282400, u64=94512536630880}}) = 0
>> epoll_wait(10, [], 1, 1000)             = 0
>>
>> tcpdump:
>> 21:33:16.811590 IP (tos 0x0, ttl 64, id 24478, offset 0, flags [DF],
>> proto TCP (6), length 52)
>>      192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xe2ab
>> (correct), ack 1360, win 235, options [nop,nop,TS val 3699603 ecr
>> 2060668460], length 0
>> 21:33:26.821860 IP (tos 0x0, ttl 64, id 64522, offset 0, flags [DF],
>> proto TCP (6), length 120)
>>      192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum
>> 0x3ecf (incorrect -> 0xbaf0), seq 1360:1428, ack 1, win 265, options
>> [nop,nop,TS val 2060670963 ecr 3699603], length 68
>> 21:33:26.822156 IP (tos 0x0, ttl 64, id 24479, offset 0, flags [DF],
>> proto TCP (6), length 52)
>>      192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xceda
>> (correct), ack 1428, win 235, options [nop,nop,TS val 3702105 ecr
>> 2060670963], length 0
>> 21:33:36.832445 IP (tos 0x0, ttl 64, id 64523, offset 0, flags [DF],
>> proto TCP (6), length 120)
>>      192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum
>> 0x3ecf (incorrect -> 0x3881), seq 1428:1496, ack 1, win 265, options
>> [nop,nop,TS val 2060673465 ecr 3702105], length 68
>> 21:33:36.832752 IP (tos 0x0, ttl 64, id 24480, offset 0, flags [DF],
>> proto TCP (6), length 52)
>>      192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xbb09
>> (correct), ack 1496, win 235, options [nop,nop,TS val 3704608 ecr
>> 2060673465], length 0
>> 21:33:46.843063 IP (tos 0x0, ttl 64, id 64524, offset 0, flags [DF],
>> proto TCP (6), length 120)
>>      192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum
>> 0x3ecf (incorrect -> 0xcf34), seq 1496:1564, ack 1, win 265, options
>> [nop,nop,TS val 2060675968 ecr 3704608], length 68
>> 21:33:46.843362 IP (tos 0x0, ttl 64, id 24481, offset 0, flags [DF],
>> proto TCP (6), length 52)
>>      192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xa737
>> (correct), ack 1564, win 235, options [nop,nop,TS val 3707111 ecr
>> 2060675968], length 0
>>
>> Some checksum issues, but finally ack-ed. Tried to open psql session
>> from host running subscription instance to master instance and it
>> worked like a charm.
>>
>>
>> I've tried pg_waldump, but nothing what I'll be able to consider as
>> error...
>>>> select pg_walfile_name('426/AAE55A68')
>> +--------------------------+
>> | pg_walfile_name          |
>> +--------------------------+
>> | 0000000100000426000000AA |
>> +--------------------------+
>> /usr/lib/postgresql/10/bin/pg_waldump -s 426/AAE55A68
>> 0000000100000426000000AA | less
>>
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
>> 426/AAE55A68, prev 426/AAE559C0, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
>> 426/AAE55AA0, prev 426/AAE55A68, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
>> 426/AAE55AD8, prev 426/AAE55AA0, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
>> 426/AAE55B10, prev 426/AAE55AD8, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
>> 426/AAE55B48, prev 426/AAE55B10, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
>> 426/AAE55B80, prev 426/AAE55B48, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn:
>> 426/AAE55BB8, prev 426/AAE55B80, desc: SWITCH
>>
>>
>> I've also did the same for sent_lsn:
>>>> select pg_walfile_name('429/69E9CC60');
>> +--------------------------+
>> | pg_walfile_name          |
>> +--------------------------+
>> | 000000010000042900000069 |
>> +--------------------------+
>>
>> /usr/lib/postgresql/10/bin/pg_waldump -s 429/69E9CC60
>> 000000010000042900000069 | less
>> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn:
>> 429/69E9CC60, prev 429/69E9CC10, desc: DELETE off 81 KEYS_UPDATED ,
>> blkref #0: rel 1663/16421/38572 blk 40
>> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn:
>> 429/69E9CCB0, prev 429/69E9CC60, desc: DELETE off 82 KEYS_UPDATED ,
>> blkref #0: rel 1663/16421/38572 blk 40
>> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn:
>> 429/69E9CD00, prev 429/69E9CCB0, desc: DELETE off 83 KEYS_UPDATED ,
>> blkref #0: rel 1663/16421/38572 blk 40
>>
>> The relation 38572 from list above is an ordinary non replicated
>> table.
>>
>> Any help or advice how to diagnose/get it working is highly
>> appreciated.
>>
>> Kind regards Ales Zeleny
>>
>>


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt



Re: Logical replication hangs up.

From
Aleš Zelený
Date:
Hello, thanks for sharing the hint,

I've tested with 10-minute timeouts, but that alone did not fix it.

Having more time while waiting for the timeout, I ran strace on the WAL sender once more, and now the pattern is different (the first time I probably caught it before it reached the problem point).

There are repeated failed attempts to open snapshot files for the slot that are not there:

open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-62000000.snap", O_RDONLY) = 28
read(28, "\334\0\0\0\0\0\0\0H]Tb)\4\0\0\0\0\0\0\0\0\0\0\177\6\0\0%@\0\0"..., 72) = 72
read(28, "|\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0\320\346\37\17\316U\0\0\0\0\0\0\0\0\0\0"..., 148) = 148
read(28, "", 72)                        = 0
close(28)                               = 0
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-63000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-64000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-65000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-66000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-67000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-68000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-69000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-6A000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-6B000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-6C000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-6D000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-6E000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-6F000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-70000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-71000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-72000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-73000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-74000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-75000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-76000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-77000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-78000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489824-lsn-429-79000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-62000000.snap", O_RDONLY) = 28
read(28, "\334\0\0\0\0\0\0\0 ^Tb)\4\0\0\0\0\0\0\0\0\0\0\177\6\0\0%@\0\0"..., 72) = 72
read(28, "|\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0 \7 \17\316U\0\0\0\0\0\0\0\0\0\0"..., 148) = 148
read(28, "", 72)                        = 0
close(28)                               = 0
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-63000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-64000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-65000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-66000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-67000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-68000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-69000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-6A000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-6B000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-6C000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-6D000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-6E000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-6F000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-70000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-71000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-72000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-73000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-74000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-75000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-76000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-77000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-78000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489825-lsn-429-79000000.snap", O_RDONLY) = -1 ENOENT (No such file or directory)
open("pg_replslot/sub_anl_master_usd/xid-6489826-lsn-429-62000000.snap", O_RDONLY) = 28
read(28, "\334\0\0\0\0\0\0\0\370^Tb)\4\0\0\0\0\0\0\0\0\0\0\177\6\0\0%@\0\0"..., 72) = 72
read(28, "|\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0p' \17\316U\0\0\0\0\0\0\0\0\0\0"..., 148) = 148
read(28, "", 72)                        = 0
close(28)                               = 0

There are no error messages in the Postgres log indicating missing WAL to be decoded... and as I've mentioned:

I've copied the databases to a different machine for investigation, and they behave the same way there:

LOG:  replication connection authorized: user=repusr SSL enabled
(protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)
LOG:  starting logical decoding for slot "sub_usd"
DETAIL:  streaming transactions committing after 426/AAE55A68, reading WAL from 426/AAE55A68
LOG:  logical decoding found consistent point at 426/AAE55A68
DETAIL:  There are no running transactions.

So it looks like it takes too long to check all the xid/lsn snap files, and that seems to be the reason.

And at least after upgrading Postgres to 10.5, with both timeouts set to 10 minutes, replication was able to cross the "magic freezing" LSN.
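
For reference, a minimal sketch of what raising those two settings might look like (setting names per the PostgreSQL 10 documentation; the 10-minute value is the one used above):

```sql
-- On the publication (master) instance: raise the walsender timeout.
ALTER SYSTEM SET wal_sender_timeout = '10min';

-- On each subscriber instance: raise the walreceiver timeout to match.
ALTER SYSTEM SET wal_receiver_timeout = '10min';

-- Both settings take effect on a configuration reload, no restart needed.
SELECT pg_reload_conf();
```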

If there is a chance, I'll try to test it with 10.4 (what we have in prod) to see whether the timeouts alone were the solution or whether the upgrade to 10.5 was also necessary.

The WAL sender consumes a considerable amount of memory:
db:~$ ps -u -p 4232
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres  4232 12.2 23.9 16202616 7636624 ?    Ds   21:38   8:32 postgres: anl_master_prod: wal sender process repusr 10.30.100.77(60866) idle

From the logs we know that the server starts swapping (there are usually 4 GB of free memory), so the load rises and overall response times degrade, which could be why we start hitting timeouts (together with the high volume of WAL).

It terminated after the 10-minute timeout, but it was able to confirm a higher restart_lsn, so the next round did not start with 18 GB of WAL but only about 4 GB, and it finally reached streaming state.
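
To watch a slot work through such a backlog on the publisher, a query along these lines can help (a sketch; pg_wal_lsn_diff and pg_current_wal_lsn are the PostgreSQL 10 function names):

```sql
-- How far behind the current WAL position each slot's restart point is.
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS backlog
  FROM pg_replication_slots;
```

If the backlog shrinks across walsender restarts, the slot is making progress even while pg_stat_replication appears stuck in catchup.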

After the restart, the WAL sender (having much less WAL to process) succeeded in reaching streaming state, and it consumes a reasonable amount of memory:

db:~$ ps -u -p 4232
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres  4232 12.2 23.9 16202616 7636624 ?    Ds   21:38   8:32 postgres: anl_master_prod: wal sender process repusr 10.30.100.77(60866) idle

Thanks a lot for all the input! 10.5 plus raising both timeouts seems to be the solution.

Kind regards Ales Zeleny

On Mon, 5 Nov 2018 at 10:26, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
On 3/11/18 6:02 μ.μ., Jerry Sievers wrote:
> Aleš Zelený <zeleny.ales@gmail.com> writes:
>
>> Hello,
>>
>> we are using logical replication on 10.4 and it now hangs. After
>> some timeout it is restarted again, replaying 18 GB of data and then
>> hangs (while 7 GB of WAL remains to be processed).
> Timeout...
>
> Have a look at the 2 setting wal sender/receiver timeout and you
> probably need to raise the sender timeout value.
That's been the case for me as well, setting both values to '5 mins' solved my issues.
>
> HTH
>
>> The backlog of 18GB comes from a failed migration adding new table to
>> replication while replication user was not granted to select the
>> table. This was clear from log files and once resolved by adding
>> select privilege, I thought that all will work as usual (same
>> happened in test env. many times and adding missing grant for select
>> was sufficient to get it working... these were issues on tests).
>>
>> RDBMS Version:
>> PostgreSQL 10.4 (Ubuntu 10.4-2.pgdg16.04+1) on x86_64-pc-linux-gnu,
>> compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609,
>> 64-bit
>>
>> Publication (master) instance error message from log:
>>
>> 2018-11-02 18:34:14 UTC 7974 5bdc8d27.1f26 7 192.168.23.11(58884)
>> master_prod repusr 0 5/0 sub_eur idle [00000]:LOG:  terminating
>> walsender process due to replication timeout
>>
>> Subscription instance log:
>>
>> 2018-11-02 18:34:14 UTC 8657 5bdc8d26.21d1 2    0 3/0
>> [XX000]:ERROR:  could not receive data from WAL stream: SSL
>> connection has been closed unexpectedly
>> 2018-11-02 18:34:14 UTC 24699 5b923b1c.607b 1209397    0
>> [00000]:LOG:  worker process: logical replication worker for
>> subscription 37932 (PID 8657) exited with exit code 1
>>
>> While it seems to be a network issue, it is may be not - we have
>> checked the network and even monitoring, all the time some packets
>> were exchanged.
>>
>> We do have 3 subscriptions(thus repl. slots) for one publication, one
>> subscriber instance is within same datacenter as master, remainig
>> subscribers are remote.
>>
>>
>>>> select * from pg_replication_slots
>> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
>> | slot_name  | plugin   | slot_type | datoid | database    |
>> temporary | active | active_pid | xmin | catalog_xmin | restart_lsn
>> | confirmed_flush_lsn |
>> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
>> | sub_usd    | pgoutput | logical   | 16421  | master_prod |
>> False     | True   | 16604      |      | 5536488      | 426/AAE55A68
>> | 426/AAE55A68        |
>> | sub_cad    | pgoutput | logical   | 16421  | master_prod |
>> False     | True   | 22875      |      | 5536488      | 426/AAE55A68
>> | 426/AAE55A68        |
>> | sub_eur    | pgoutput | logical   | 16421  | master_prod |
>> False     | True   | 16605      |      | 5536488      | 426/AAE55A68
>> | 426/AAE55A68        |
>> +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+
>>
>>
>> Once  after the error occurs, wal senders are re/started and they
>> reads wal files until they reach restart_lsn wal file:
>>
>>>> select pg_walfile_name('426/AAE55A68')
>> +--------------------------+
>> | pg_walfile_name          |
>> +--------------------------+
>> | 0000000100000426000000AA |
>> +--------------------------+
>>
>> # changing file names until they reach this one:
>> root@master-db:/pgsql/pgcluster/10/master_prod# lsof -p 1560 -p 5758
>> -p 5790| grep pg_wal
>> postgres 1560 postgres   10r      REG  259,3  16777216 115766007 /
>> pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
>> postgres 5758 postgres   10r      REG  259,3  16777216 115766007 /
>> pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
>> postgres 5790 postgres   10r      REG  259,3  16777216 115766007 /
>> pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069
>>
>>
>> At this moment sent_lsn stops growing and nothing is happening for a
>> while.
>>
>> select * from pg_stat_replication;
>>
>>    pid  | usesysid | usename |  application_name  | client_addr   |
>> client_hostname | client_port |         backend_start         |
>> backend_xmin |  state  |   sent_lsn   |  write_lsn   |  flush_lsn   |
>>    replay_lsn  | write_lag | flush_lag | replay_lag | sync_priority |
>> sync_state
>> -------+----------+---------+--------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------
>>   16604 |    37868 | repusr  | sub_usd            | 192.168.21.11 |
>>                   |       35010 | 2018-11-02 23:52:22.059157+00 |
>>                | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68
>> | 426/AAE55A68 |           |           |            |             0 |
>> async
>>   16605 |    37868 | repusr  | sub_eur            | 192.168.23.11 |
>>                   |       36388 | 2018-11-02 23:52:24.75038+00  |
>>                | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68
>> | 426/AAE55A68 |           |           |            |             0 |
>> async
>>   12387 |    37868 | repusr  | sub_cad            | 192.168.22.11 |
>>                   |       51982 | 2018-11-02 23:30:31.372404+00 |
>>                | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68
>> | 426/AAE55A68 |           |           |            |             0 |
>> async
>> (3 rows)
>>
>> What I found being done behind the scenes while from
>> pg_stat_replication "nothing is happening":
>>
>> On Master wal snaders strace look like:
>>
>> open("pg_replslot/sub_usd/xid-6040508-lsn-429-57000000.snap",
>> O_WRONLY|O_CREAT|O_APPEND, 0600) = 15
>> write(15, "\334\0\0\0\0\0\0\0H\3\234W)\4\0\0\0\0\0\0\0\0\0\0\177\6\0\
>> 0%@\0\0"..., 220) = 220
>> close(15)                               = 0
>>
>> On subscription instance, strice for wal reciever looks like this:
>> recvfrom(7, 0x55f56a3783c3, 5, 0, NULL, NULL) = -1 EAGAIN (Resource
>> temporarily unavailable)
>> epoll_create1(EPOLL_CLOEXEC)            = 10
>> epoll_ctl(10, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=
>> 1781282352, u64=94512536630832}}) = 0
>> epoll_ctl(10, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=
>> 1781282376, u64=94512536630856}}) = 0
>> epoll_ctl(10, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=
>> 1781282400, u64=94512536630880}}) = 0
>> epoll_wait(10, [], 1, 1000)             = 0
>>
>> tcpdump:
>> 21:33:16.811590 IP (tos 0x0, ttl 64, id 24478, offset 0, flags [DF],
>> proto TCP (6), length 52)
>>      192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xe2ab
>> (correct), ack 1360, win 235, options [nop,nop,TS val 3699603 ecr
>> 2060668460], length 0
>> 21:33:26.821860 IP (tos 0x0, ttl 64, id 64522, offset 0, flags [DF],
>> proto TCP (6), length 120)
>>      192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum
>> 0x3ecf (incorrect -> 0xbaf0), seq 1360:1428, ack 1, win 265, options
>> [nop,nop,TS val 2060670963 ecr 3699603], length 68
>> 21:33:26.822156 IP (tos 0x0, ttl 64, id 24479, offset 0, flags [DF],
>> proto TCP (6), length 52)
>>      192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xceda
>> (correct), ack 1428, win 235, options [nop,nop,TS val 3702105 ecr
>> 2060670963], length 0
>> 21:33:36.832445 IP (tos 0x0, ttl 64, id 64523, offset 0, flags [DF],
>> proto TCP (6), length 120)
>>      192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum
>> 0x3ecf (incorrect -> 0x3881), seq 1428:1496, ack 1, win 265, options
>> [nop,nop,TS val 2060673465 ecr 3702105], length 68
>> 21:33:36.832752 IP (tos 0x0, ttl 64, id 24480, offset 0, flags [DF],
>> proto TCP (6), length 52)
>>      192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xbb09
>> (correct), ack 1496, win 235, options [nop,nop,TS val 3704608 ecr
>> 2060673465], length 0
>> 21:33:46.843063 IP (tos 0x0, ttl 64, id 64524, offset 0, flags [DF],
>> proto TCP (6), length 120)
>>      192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum
>> 0x3ecf (incorrect -> 0xcf34), seq 1496:1564, ack 1, win 265, options
>> [nop,nop,TS val 2060675968 ecr 3704608], length 68
>> 21:33:46.843362 IP (tos 0x0, ttl 64, id 24481, offset 0, flags [DF],
>> proto TCP (6), length 52)
>>      192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xa737
>> (correct), ack 1564, win 235, options [nop,nop,TS val 3707111 ecr
>> 2060675968], length 0
>>
>> Some checksums are reported as incorrect, but only on outgoing packets,
>> which is usually just TX checksum offload (tcpdump captures the packet
>> before the NIC fills the checksum in), and every segment is eventually
>> ack-ed. I also tried opening a psql session from the subscriber host to
>> the master instance, and it worked like a charm.
>>
>>
>> I've tried pg_waldump, but found nothing that I would consider an
>> error...
>>>> select pg_walfile_name('426/AAE55A68')
>> +--------------------------+
>> | pg_walfile_name          |
>> +--------------------------+
>> | 0000000100000426000000AA |
>> +--------------------------+
>> /usr/lib/postgresql/10/bin/pg_waldump -s 426/AAE55A68
>> 0000000100000426000000AA | less
>>
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
>> 426/AAE55A68, prev 426/AAE559C0, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
>> 426/AAE55AA0, prev 426/AAE55A68, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
>> 426/AAE55AD8, prev 426/AAE55AA0, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
>> 426/AAE55B10, prev 426/AAE55AD8, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
>> 426/AAE55B48, prev 426/AAE55B10, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn:
>> 426/AAE55B80, prev 426/AAE55B48, desc: RUNNING_XACTS nextXid 5536488
>> latestCompletedXid 5536487 oldestRunningXid 5536488
>> rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn:
>> 426/AAE55BB8, prev 426/AAE55B80, desc: SWITCH
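To see at a glance which record types dominate a WAL range like this one, pg_waldump offers --stats (-z), which reports totals per resource manager natively. As a sketch, the same summary can also be produced over saved pg_waldump output with awk:

```shell
# Sketch: tally pg_waldump output lines by resource manager name
# (the second field of each "rmgr: ..." line).
summarize_rmgrs() {
    awk '/^rmgr:/ { count[$2]++ } END { for (r in count) print r, count[r] }'
}

# usage: pg_waldump -s 426/AAE55A68 0000000100000426000000AA | summarize_rmgrs
```

A range consisting almost entirely of RUNNING_XACTS records, as above, carries nothing for logical decoding to send.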
>>
>>
>> I've also done the same for sent_lsn:
>>>> select pg_walfile_name('429/69E9CC60');
>> +--------------------------+
>> | pg_walfile_name          |
>> +--------------------------+
>> | 000000010000042900000069 |
>> +--------------------------+
>>
>> /usr/lib/postgresql/10/bin/pg_waldump -s 429/69E9CC60
>> 000000010000042900000069 | less
>> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn:
>> 429/69E9CC60, prev 429/69E9CC10, desc: DELETE off 81 KEYS_UPDATED ,
>> blkref #0: rel 1663/16421/38572 blk 40
>> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn:
>> 429/69E9CCB0, prev 429/69E9CC60, desc: DELETE off 82 KEYS_UPDATED ,
>> blkref #0: rel 1663/16421/38572 blk 40
>> rmgr: Heap        len (rec/tot):     76/    76, tx:    5536495, lsn:
>> 429/69E9CD00, prev 429/69E9CCB0, desc: DELETE off 83 KEYS_UPDATED ,
>> blkref #0: rel 1663/16421/38572 blk 40
>>
>> The relation 38572 from the list above is an ordinary, non-replicated
>> table.
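For reference, the "rel 1663/16421/38572" triple in the blkref is tablespace/database/relfilenode, and on 10.x the filenode can be mapped back to a relation name with pg_filenode_relation(). A sketch, to be run while connected to the database with OID 16421:

```sql
-- Sketch: map the relfilenode from the pg_waldump blkref back to a table.
-- Passing 0 means "the database's default tablespace" (OID 1663 above
-- is pg_default).
SELECT pg_filenode_relation(0, 38572)::regclass;
```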
>>
>> Any help or advice how to diagnose/get it working is highly
>> appreciated.
>>
>> Kind regards Ales Zeleny
>>
>>


--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt