Thread: Logical replication stuck and no data being copied

Logical replication stuck and no data being copied

From
anonymous001
Date:
I'm connecting from an aurora postgresql 12.4 instance to a postgresql 12.6
community edition.
The postgresql 12.6 is the publisher and the aurora db is the subscriber.
Most of the subscriptions and publishing worked. 

Except for one table, which I've created a separate publication for.

One table is stuck and not copying data

I checked with the following query.

WITH active_pub_sub AS
(
  select relname from pg_stat_all_tables a
  join pg_subscription_rel sr on sr.srrelid = a.relid
  where srsubstate!='r'
)
select relname as table_name,
       pg_size_pretty(pg_total_relation_size(relid)) as total_size,
       pg_size_pretty(pg_relation_size(relid)) as table_size,
       pg_size_pretty(pg_total_relation_size(relid) -
pg_relation_size(relid)) as index_size
from pg_catalog.pg_statio_user_tables where relname IN (SELECT * FROM
active_pub_sub)
 order by pg_relation_size(relid) desc;
   table_name    | total_size | table_size | index_size 
-----------------+------------+------------+------------
 translationitem | 16 kB      | 0 bytes    | 16 kB


This is what i see from the logs

2021-02-17 14:38:42 SAST [19462]: [3-1]
user=replication,db=prodza,app=sub_prodza_translationitem,client=13.x.x.x
LOG:  starting logical decoding for slot "sub_prodza_translationitem"
2021-02-17 14:38:42 SAST [19462]: [4-1]
user=replication,db=prodza,app=sub_prodza_translationitem,client=13.x.x.x
DETAIL:  Streaming transactions committing after 7AB/9CD78D60, reading WAL
from 7AB/9CD78D28.
2021-02-17 14:38:42 SAST [19462]: [5-1]
user=replication,db=prodza,app=sub_prodza_translationitem,client=13.x.x.x
LOG:  logical decoding found consistent point at 7AB/9CD78D28
2021-02-17 14:38:42 SAST [19462]: [6-1]
user=replication,db=prodza,app=sub_prodza_translationitem,client=13.x.x.x
DETAIL:  There are no running transactions.

from aurora
prodza=> select * from pg_subscription_rel r join pg_stat_all_tables b on
b.relid = r.srrelid and r.srsubstate='i';;
-[ RECORD 1 ]-------+------------------------------
srsubid             | 7768746
srrelid             | 3553053
srsubstate          | i
srsublsn            | 
relid               | 3553053
schemaname          | prodza
relname             | translationitem
seq_scan            | 6
seq_tup_read        | 0
idx_scan            | 0
idx_tup_fetch       | 0
n_tup_ins           | 15021981
n_tup_upd           | 0
n_tup_del           | 0
n_tup_hot_upd       | 0
n_live_tup          | 0
n_dead_tup          | 0
n_mod_since_analyze | 0
last_vacuum         | 2021-02-17 14:23:43.23275+02
last_autovacuum     | 2021-02-17 14:09:05.482797+02
last_analyze        | 2021-02-17 14:23:43.232893+02
last_autoanalyze    | 
vacuum_count        | 1
autovacuum_count    | 42
analyze_count       | 1
autoanalyze_count   | 0


I have tried:
ALTER SUBSCRIPTION sub_prodza_translationitem REFRESH PUBLICATION WITH
(COPY_DATA);
alter subscription sub_prodza_translationitem REFRESH PUBLICATION;

On the publisher.
# select * from pg_stat_activity where application_name
='sub_prodza_translationitem';
-[ RECORD 1 ]----+--------------------------------------------------------
datid            | 221550750
datname          | prodza
pid              | 19462
usesysid         | 401960256
usename          | replication
application_name | sub_prodza_translationitem
client_addr      | 13.x.x.x.x
client_hostname  | 
client_port      | 53162
backend_start    | 2021-02-17 14:38:42.299105+02
xact_start       | 
query_start      | 2021-02-17 14:38:42.382807+02
state_change     | 2021-02-17 14:38:42.43461+02
wait_event_type  | Client
wait_event       | WalSenderWaitForWAL
state            | active
backend_xid      | 
backend_xmin     | 
query            | SELECT pg_catalog.set_config('search_path', '', false);
backend_type     | walsender


I'm not certain why the snapshot and replication is not working. I'm
guessing the the snapshot of the table completed but the wal process is
stuck.






--
Sent from: https://www.postgresql-archive.org/PostgreSQL-general-f1843780.html



Re: Logical replication stuck and no data being copied

From
anonymous001
Date:
After further investigation, I discovered the following.

On aurora ( the subscriber).
select now()::timestamp(0),a.subname,b.srsubstate,count(*) from
pg_subscription_rel b, pg_subscription a where  b.srsubid=a.oid group by
1,2,3 order by 2,3;

         now         |         subname         | srsubstate | count 
---------------------+-------------------------+------------+-------
 2021-02-18 08:53:31 | sub_prodza_big          | r          |    11
 2021-02-18 08:53:31 | sub_prodza_default      | r          |   108
 2021-02-18 08:53:31 | sub_prodza_trade        | r          |     1
 2021-02-18 08:53:31 | sub_prodza_tradearchive | i          |     1


For the table where the snapshot isn't being taken and the state remains in
initialize. I drop the other subscriptions and publications.

Then I get the following:
prodza=# select * from pg_replication_slots ;
-[ RECORD 1 ]-------+---------------------------------------------
slot_name           | test_slot
plugin              | test_decoding
slot_type           | logical
datoid              | 221550750
database            | prodza
temporary           | f
active              | f
active_pid          | 
xmin                | 
catalog_xmin        | 798588827
restart_lsn         | 7A7/520BAD48
confirmed_flush_lsn | 7A7/520BAD80
-[ RECORD 2 ]-------+---------------------------------------------
slot_name           | sub_prodza_tradearchive_7775427_sync_3552943
plugin              | pgoutput
slot_type           | logical
datoid              | 221550750
database            | prodza
temporary           | t
active              | t
active_pid          | 25001
xmin                | 
catalog_xmin        | 799048174
restart_lsn         | 7AC/6003D048
confirmed_flush_lsn | 7AC/6003D080
-[ RECORD 3 ]-------+---------------------------------------------
slot_name           | sub_prodza_tradearchive
plugin              | pgoutput
slot_type           | logical
datoid              | 221550750
database            | prodza
temporary           | f
active              | t
active_pid          | 24835
xmin                | 
catalog_xmin        | 799048182
restart_lsn         | 7AC/60058488
confirmed_flush_lsn | 7AC/600584C0

Now it seems that a snapshot is being taken for the table, but when I create
another publication and subscription, both are in state 'd'.
on aurora (subscriber)

prodza=> select now()::timestamp(0),a.subname,b.srsubstate,count(*) from
pg_subscription_rel b, pg_subscription a where  b.srsubid=a.oid group by
1,2,3 order by 2,3;

         now         |         subname         | srsubstate | count 
---------------------+-------------------------+------------+-------
 2021-02-18 09:38:24 | sub_prodza_trade        | d          |     1
 2021-02-18 09:38:24 | sub_prodza_tradearchive | d          |     1

Which is not correct, since the table trade's data is already synced it
should be in srsubstate = 'r'.
Does this mean that the tradearchive has to complete before data will be
published to the subscriber for the table trade?

Also, why do the replication slots share the same restart_lsn and
confirmed_flush lsn?.

-[ RECORD 1 ]-------+---------------------------------------------
slot_name           | sub_prodza_trade
plugin              | pgoutput
slot_type           | logical
datoid              | 221550750
database            | prodza
temporary           | f
active              | t
active_pid          | 25381
xmin                | 
catalog_xmin        | 799051115
restart_lsn         | 7AC/60CD9F10
confirmed_flush_lsn | 7AC/60CD9F80
-[ RECORD 2 ]-------+---------------------------------------------
slot_name           | sub_prodza_tradearchive_7775427_sync_3552943
plugin              | pgoutput
slot_type           | logical
datoid              | 221550750
database            | prodza
temporary           | t
active              | t
active_pid          | 25001
xmin                | 
catalog_xmin        | 799048174
restart_lsn         | 7AC/6003D048
confirmed_flush_lsn | 7AC/6003D080
-[ RECORD 3 ]-------+---------------------------------------------
slot_name           | sub_prodza_tradearchive
plugin              | pgoutput
slot_type           | logical
datoid              | 221550750
database            | prodza
temporary           | f
active              | t
active_pid          | 24835
xmin                | 
catalog_xmin        | 799051115
restart_lsn         | 7AC/60CD9F10
confirmed_flush_lsn | 7AC/60CD9F80








--
Sent from: https://www.postgresql-archive.org/PostgreSQL-general-f1843780.html