Thread: Logical replication stuck and no data being copied
I'm replicating from a PostgreSQL 12.6 community edition instance (the publisher) to an Aurora PostgreSQL 12.4 instance (the subscriber). Most of the subscriptions and publications work fine, except for one table, for which I've created a separate publication. That one table is stuck and no data is being copied. I checked with the following query:

WITH active_pub_sub AS (
    select relname
    from pg_stat_all_tables a
    join pg_subscription_rel sr on sr.srrelid = a.relid
    where sr.srsubstate != 'r'
)
select relname as table_name,
       pg_size_pretty(pg_total_relation_size(relid)) as total_size,
       pg_size_pretty(pg_relation_size(relid)) as table_size,
       pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as index_size
from pg_catalog.pg_statio_user_tables
where relname IN (SELECT * FROM active_pub_sub)
order by pg_relation_size(relid) desc;

   table_name    | total_size | table_size | index_size
-----------------+------------+------------+------------
 translationitem | 16 kB      | 0 bytes    | 16 kB

This is what I see in the publisher's logs:

2021-02-17 14:38:42 SAST [19462]: [3-1] user=replication,db=prodza,app=sub_prodza_translationitem,client=13.x.x.x LOG: starting logical decoding for slot "sub_prodza_translationitem"
2021-02-17 14:38:42 SAST [19462]: [4-1] user=replication,db=prodza,app=sub_prodza_translationitem,client=13.x.x.x DETAIL: Streaming transactions committing after 7AB/9CD78D60, reading WAL from 7AB/9CD78D28.
2021-02-17 14:38:42 SAST [19462]: [5-1] user=replication,db=prodza,app=sub_prodza_translationitem,client=13.x.x.x LOG: logical decoding found consistent point at 7AB/9CD78D28
2021-02-17 14:38:42 SAST [19462]: [6-1] user=replication,db=prodza,app=sub_prodza_translationitem,client=13.x.x.x DETAIL: There are no running transactions.

From Aurora (the subscriber):

prodza=> select * from pg_subscription_rel r
         join pg_stat_all_tables b
           on b.relid = r.srrelid and r.srsubstate = 'i';
-[ RECORD 1 ]-------+------------------------------
srsubid             | 7768746
srrelid             | 3553053
srsubstate          | i
srsublsn            |
relid               | 3553053
schemaname          | prodza
relname             | translationitem
seq_scan            | 6
seq_tup_read        | 0
idx_scan            | 0
idx_tup_fetch       | 0
n_tup_ins           | 15021981
n_tup_upd           | 0
n_tup_del           | 0
n_tup_hot_upd       | 0
n_live_tup          | 0
n_dead_tup          | 0
n_mod_since_analyze | 0
last_vacuum         | 2021-02-17 14:23:43.23275+02
last_autovacuum     | 2021-02-17 14:09:05.482797+02
last_analyze        | 2021-02-17 14:23:43.232893+02
last_autoanalyze    |
vacuum_count        | 1
autovacuum_count    | 42
analyze_count       | 1
autoanalyze_count   | 0

I have tried:

ALTER SUBSCRIPTION sub_prodza_translationitem REFRESH PUBLICATION WITH (COPY_DATA);
ALTER SUBSCRIPTION sub_prodza_translationitem REFRESH PUBLICATION;

On the publisher:

# select * from pg_stat_activity where application_name = 'sub_prodza_translationitem';
-[ RECORD 1 ]----+--------------------------------------------------------
datid            | 221550750
datname          | prodza
pid              | 19462
usesysid         | 401960256
usename          | replication
application_name | sub_prodza_translationitem
client_addr      | 13.x.x.x.x
client_hostname  |
client_port      | 53162
backend_start    | 2021-02-17 14:38:42.299105+02
xact_start       |
query_start      | 2021-02-17 14:38:42.382807+02
state_change     | 2021-02-17 14:38:42.43461+02
wait_event_type  | Client
wait_event       | WalSenderWaitForWAL
state            | active
backend_xid      |
backend_xmin     |
query            | SELECT pg_catalog.set_config('search_path', '', false);
backend_type     | walsender

I'm not certain why the snapshot and replication are not working. My guess is that the snapshot of the table completed but the WAL streaming is stuck.
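A check I'd run next on the subscriber (a sketch, using the subscription name above): pg_stat_subscription shows one row per worker, and a tablesync worker for the stuck table would appear as an extra row with relid set, while the main apply worker has relid NULL.

-- On the subscriber: is a tablesync worker active for this subscription?
-- Sync workers have relid set; the main apply worker has relid NULL.
select subname, pid, relid::regclass as syncing_table,
       received_lsn, latest_end_lsn, latest_end_time
from pg_stat_subscription
where subname = 'sub_prodza_translationitem';

If no row with a non-null relid ever shows up, the initial copy never started, which would match the table sitting in srsubstate = 'i'.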
After further investigation, I discovered the following. On Aurora (the subscriber):

select now()::timestamp(0), a.subname, b.srsubstate, count(*)
from pg_subscription_rel b, pg_subscription a
where b.srsubid = a.oid
group by 1, 2, 3
order by 2, 3;

         now         |         subname         | srsubstate | count
---------------------+-------------------------+------------+-------
 2021-02-18 08:53:31 | sub_prodza_big          | r          |    11
 2021-02-18 08:53:31 | sub_prodza_default      | r          |   108
 2021-02-18 08:53:31 | sub_prodza_trade        | r          |     1
 2021-02-18 08:53:31 | sub_prodza_tradearchive | i          |     1

For the table where the snapshot isn't being taken, the state remains 'i' (initialize). I dropped the other subscriptions and publications, and then I see the following on the publisher:

prodza=# select * from pg_replication_slots;
-[ RECORD 1 ]-------+---------------------------------------------
slot_name           | test_slot
plugin              | test_decoding
slot_type           | logical
datoid              | 221550750
database            | prodza
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        | 798588827
restart_lsn         | 7A7/520BAD48
confirmed_flush_lsn | 7A7/520BAD80
-[ RECORD 2 ]-------+---------------------------------------------
slot_name           | sub_prodza_tradearchive_7775427_sync_3552943
plugin              | pgoutput
slot_type           | logical
datoid              | 221550750
database            | prodza
temporary           | t
active              | t
active_pid          | 25001
xmin                |
catalog_xmin        | 799048174
restart_lsn         | 7AC/6003D048
confirmed_flush_lsn | 7AC/6003D080
-[ RECORD 3 ]-------+---------------------------------------------
slot_name           | sub_prodza_tradearchive
plugin              | pgoutput
slot_type           | logical
datoid              | 221550750
database            | prodza
temporary           | f
active              | t
active_pid          | 24835
xmin                |
catalog_xmin        | 799048182
restart_lsn         | 7AC/60058488
confirmed_flush_lsn | 7AC/600584C0

Now it seems that a snapshot is being taken for that table, but when I create another publication and subscription, both end up in state 'd'. On Aurora (the subscriber):

prodza=> select now()::timestamp(0), a.subname, b.srsubstate, count(*)
         from pg_subscription_rel b, pg_subscription a
         where b.srsubid = a.oid
         group by 1, 2, 3
         order by 2, 3;

         now         |         subname         | srsubstate | count
---------------------+-------------------------+------------+-------
 2021-02-18 09:38:24 | sub_prodza_trade        | d          |     1
 2021-02-18 09:38:24 | sub_prodza_tradearchive | d          |     1

That doesn't look right: the trade table's data is already synced, so it should be in srsubstate = 'r'. Does this mean that tradearchive has to complete before data will be published to the subscriber for the trade table?
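One thing I'd still rule out here (a sketch, assuming stock settings; the values in the comments are just the documented defaults): the initial copy of each table is performed by tablesync workers, whose count is capped per subscription, so a long-running copy of one table can leave other tables waiting in 'i' or 'd' until a worker frees up.

-- On the subscriber: worker limits that gate the initial table copy.
show max_sync_workers_per_subscription;   -- 2 by default
show max_logical_replication_workers;     -- 4 by default

-- Per-table sync state across all subscriptions:
select s.subname, sr.srrelid::regclass as table_name, sr.srsubstate
from pg_subscription s
join pg_subscription_rel sr on sr.srsubid = s.oid
order by 1, 2;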
Also, why do the replication slots share the same restart_lsn and confirmed_flush_lsn?

-[ RECORD 1 ]-------+---------------------------------------------
slot_name           | sub_prodza_trade
plugin              | pgoutput
slot_type           | logical
datoid              | 221550750
database            | prodza
temporary           | f
active              | t
active_pid          | 25381
xmin                |
catalog_xmin        | 799051115
restart_lsn         | 7AC/60CD9F10
confirmed_flush_lsn | 7AC/60CD9F80
-[ RECORD 2 ]-------+---------------------------------------------
slot_name           | sub_prodza_tradearchive_7775427_sync_3552943
plugin              | pgoutput
slot_type           | logical
datoid              | 221550750
database            | prodza
temporary           | t
active              | t
active_pid          | 25001
xmin                |
catalog_xmin        | 799048174
restart_lsn         | 7AC/6003D048
confirmed_flush_lsn | 7AC/6003D080
-[ RECORD 3 ]-------+---------------------------------------------
slot_name           | sub_prodza_tradearchive
plugin              | pgoutput
slot_type           | logical
datoid              | 221550750
database            | prodza
temporary           | f
active              | t
active_pid          | 24835
xmin                |
catalog_xmin        | 799051115
restart_lsn         | 7AC/60CD9F10
confirmed_flush_lsn | 7AC/60CD9F80
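For reference, this is how I'd measure how far each slot lags behind the publisher's current WAL position (a sketch; pg_current_wal_lsn() and pg_wal_lsn_diff() are standard built-ins). If two slots have confirmed up to the same point, they would show identical lag here as well:

-- On the publisher: byte lag of each logical slot behind current WAL.
select slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as restart_lag,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as flush_lag
from pg_replication_slots
where slot_type = 'logical';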