Thread: Help diagnosing replication (copy) error

Help diagnosing replication (copy) error

From
Steve Baldwin
Date:
Hi,

I'm in the process of migrating a cluster from 15.3 to 16.2. We have a 'zero downtime' requirement so I'm using logical replication to create the new cluster and then perform the switch in the application.

I have a situation where all but one table have done their initial copy. The remaining table is the largest (of course), and the replication slot that is assigned for the copy (pg_378075177_sync_60067_7343845372910323059) is showing as 'active=false' if I select from pg_replication_slots on the publisher.

I've checked the recent logs for both the publishing cluster and the subscribing cluster but I can't see any replication errors. I guess I could have missed them, but it doesn't seem like anything is being 'retried' like I've seen in the past with replication errors.

I've used this mechanism for zero-downtime upgrades multiple times in the past, and have recently used it to upgrade smaller clusters from 15.x to 16.2 without issue.

The clusters are hosted on AWS RDS, so I have no access to the servers, but if that's the only way to diagnose the issue, I can create a support case.

Does anyone have any suggestions as to where I should look for the issue?

Thanks,

Steve

Re: Help diagnosing replication (copy) error

From
Adrian Klaver
Date:
On 3/8/24 13:50, Steve Baldwin wrote:
> Hi,
> 
> I'm in the process of migrating a cluster from 15.3 to 16.2. We have a 
> 'zero downtime' requirement so I'm using logical replication to create 
> the new cluster and then perform the switch in the application.
> 
> I have a situation where all but one table have done their initial copy. 
> The remaining table is the largest (of course), and the replication slot 
> that is assigned for the copy 
> (pg_378075177_sync_60067_7343845372910323059) is showing as 
> 'active=false' if I select from pg_replication_slots on the publisher.

What are the rest of the values in pg_replication_slots?

Is there data in the subscriber side table?

What are the publisher and subscriber configurations?


> 
> I've checked the recent logs for both the publishing cluster and the 
> subscribing cluster but I can't see any replication errors. I guess I 
> could have missed them, but it doesn't seem like anything is being 
> 'retried' like I've seen in the past with replication errors.
> 
> I've used this mechanism for zero-downtime upgrades multiple times in 
> the past, and have recently used it to upgrade smaller clusters from 
> 15.x to 16.2 without issue.
> 
> The clusters are hosted on AWS RDS, so I have no access to the servers, 
> but if that's the only way to diagnose the issue, I can create a support 
> case.
> 
> Does anyone have any suggestions as to where I should look for the issue?
> 
> Thanks,
> 
> Steve

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Help diagnosing replication (copy) error

From
Steve Baldwin
Date:


On Sat, Mar 9, 2024 at 8:56 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

What are the rest of the values in pg_replication_slots?

b2bcreditonline=> select * from pg_replication_slots;
                  slot_name                  |  plugin  | slot_type | datoid |    database     | temporary | active | active_pid | xmin | catalog_xmin |  restart_lsn  | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
---------------------------------------------+----------+-----------+--------+-----------------+-----------+--------+------------+------+--------------+---------------+---------------------+------------+---------------+-----------
 b2bcreditonline_prod_e_master               | pgoutput | logical   |  16404 | b2bcreditonline | f         | t      |      13700 |      |    456150233 | 2E2F/AC07B760 | 2E2F/AEEEA478       | reserved   |               | f
 b2bcreditonline_prod_sandbox_d_master       | pgoutput | logical   |  16404 | b2bcreditonline | f         | t      |       9232 |      |    456150233 | 2E2F/AC07B760 | 2E2F/AEEEA560       | reserved   |               | f
 b2bcreditonline_prod_e_master_only          | pgoutput | logical   |  16404 | b2bcreditonline | f         | t      |      13710 |      |    456150233 | 2E2F/AC07B760 | 2E2F/AEEEA560       | reserved   |               | f
 pg_378075177_sync_60067_7343845372910323059 | pgoutput | logical   |  16404 | b2bcreditonline | f         | f      |            |      |    455719618 | 2E2A/1C0972E0 | 2E2A/1C097318       | extended   |               | f
 b2bcreditonline_prod_e_shard                | pgoutput | logical   |  16404 | b2bcreditonline | f         | t      |      13718 |      |    456150233 | 2E2F/AC07B760 | 2E2F/AEEEA560       | reserved   |               | f
(5 rows)
 
Is there data in the subscriber side table?

No there is not, although when I did a 'select count(*) from {table}' it took several minutes to return 0 rows.
 
What are the publisher and subscriber configurations?

Not sure which settings, but here's a few.

(publisher - pg 15.3)
b2bcreditonline=> select name, setting from pg_settings where name like 'max\_%' escape '\';
               name                | setting
-----------------------------------+---------
 max_connections                   | 5000
 max_files_per_process             | 1000
 max_function_args                 | 100
 max_identifier_length             | 63
 max_index_keys                    | 32
 max_locks_per_transaction         | 64
 max_logical_replication_workers   | 16
 max_parallel_maintenance_workers  | 2
 max_parallel_workers              | 8
 max_parallel_workers_per_gather   | 2
 max_pred_locks_per_page           | 2
 max_pred_locks_per_relation       | -2
 max_pred_locks_per_transaction    | 64
 max_prepared_transactions         | 0
 max_replication_slots             | 64
 max_slot_wal_keep_size            | -1
 max_stack_depth                   | 6144
 max_standby_archive_delay         | 30000
 max_standby_streaming_delay       | 30000
 max_sync_workers_per_subscription | 4
 max_wal_senders                   | 96
 max_wal_size                      | 4096
 max_worker_processes              | 32
(23 rows)

(subscriber - pg 16.2)
b2bcreditonline=> select name, setting from pg_settings where name like 'max\_%' escape '\';
                    name                     | setting
---------------------------------------------+---------
 max_connections                             | 5000
 max_files_per_process                       | 1000
 max_function_args                           | 100
 max_identifier_length                       | 63
 max_index_keys                              | 32
 max_locks_per_transaction                   | 64
 max_logical_replication_workers             | 16
 max_parallel_apply_workers_per_subscription | 2
 max_parallel_maintenance_workers            | 2
 max_parallel_workers                        | 8
 max_parallel_workers_per_gather             | 2
 max_pred_locks_per_page                     | 2
 max_pred_locks_per_relation                 | -2
 max_pred_locks_per_transaction              | 64
 max_prepared_transactions                   | 0
 max_replication_slots                       | 64
 max_slot_wal_keep_size                      | -1
 max_stack_depth                             | 6144
 max_standby_archive_delay                   | 30000
 max_standby_streaming_delay                 | 30000
 max_sync_workers_per_subscription           | 4
 max_wal_senders                             | 96
 max_wal_size                                | 4096
 max_worker_processes                        | 32
(24 rows)
 

>
> I've checked the recent logs for both the publishing cluster and the
> subscribing cluster but I can't see any replication errors. I guess I
> could have missed them, but it doesn't seem like anything is being
> 'retried' like I've seen in the past with replication errors.
>
> I've used this mechanism for zero-downtime upgrades multiple times in
> the past, and have recently used it to upgrade smaller clusters from
> 15.x to 16.2 without issue.
>
> The clusters are hosted on AWS RDS, so I have no access to the servers,
> but if that's the only way to diagnose the issue, I can create a support
> case.
>
> Does anyone have any suggestions as to where I should look for the issue?
>
> Thanks,
>
> Steve

--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Help diagnosing replication (copy) error

From
Adrian Klaver
Date:
On 3/8/24 14:04, Steve Baldwin wrote:
> 
> 
> On Sat, Mar 9, 2024 at 8:56 AM Adrian Klaver <adrian.klaver@aklaver.com 
> <mailto:adrian.klaver@aklaver.com>> wrote:
> 
> 
>     What are the rest of the values in pg_replication_slots?
> 
> b2bcreditonline=> select * from pg_replication_slots;
>                    slot_name                  |  plugin  | slot_type | 
> datoid |    database     | temporary | active | active_pid | xmin | 
> catalog_xmin |  restart_lsn  | confirmed_flush_lsn | wal_status | 
> safe_wal_size | two_phase
>
---------------------------------------------+----------+-----------+--------+-----------------+-----------+--------+------------+------+--------------+---------------+---------------------+------------+---------------+-----------
>   b2bcreditonline_prod_e_master               | pgoutput | logical   | 
>   16404 | b2bcreditonline | f         | t      |      13700 |      |   
>   456150233 | 2E2F/AC07B760 | 2E2F/AEEEA478       | reserved   |         
>        | f
>   b2bcreditonline_prod_sandbox_d_master       | pgoutput | logical   | 
>   16404 | b2bcreditonline | f         | t      |       9232 |      |   
>   456150233 | 2E2F/AC07B760 | 2E2F/AEEEA560       | reserved   |         
>        | f
>   b2bcreditonline_prod_e_master_only          | pgoutput | logical   | 
>   16404 | b2bcreditonline | f         | t      |      13710 |      |   
>   456150233 | 2E2F/AC07B760 | 2E2F/AEEEA560       | reserved   |         
>        | f
>   pg_378075177_sync_60067_7343845372910323059 | pgoutput | logical   | 
>   16404 | b2bcreditonline | f         | f      |            |      |   
>   455719618 | 2E2A/1C0972E0 | 2E2A/1C097318       | extended   |         
>        | f
>   b2bcreditonline_prod_e_shard                | pgoutput | logical   | 
>   16404 | b2bcreditonline | f         | t      |      13718 |      |   
>   456150233 | 2E2F/AC07B760 | 2E2F/AEEEA560       | reserved   |         
>        | f
> (5 rows)
> 
>     Is there data in the subscriber side table?
> 
> No there is not, although when I did a 'select count(*) from {table}' it 
> took several minutes to return 0 rows.
> 
>     What are the publisher and subscriber configurations?
> 
> Not sure which settings, but here's a few.


I should been clearer.

What are the CREATE PUBLICATION and CREATE SUBSCRIPTION statements?

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Help diagnosing replication (copy) error

From
Adrian Klaver
Date:
On 3/8/24 14:04, Steve Baldwin wrote:
> 
> 
> On Sat, Mar 9, 2024 at 8:56 AM Adrian Klaver <adrian.klaver@aklaver.com 
> <mailto:adrian.klaver@aklaver.com>> wrote:
> 
> 
>     What are the rest of the values in pg_replication_slots?
> 
> b2bcreditonline=> select * from pg_replication_slots;
>                    slot_name                  |  plugin  | slot_type | 
> datoid |    database     | temporary | active | active_pid | xmin | 
> catalog_xmin |  restart_lsn  | confirmed_flush_lsn | wal_status | 
> safe_wal_size | two_phase
>
---------------------------------------------+----------+-----------+--------+-----------------+-----------+--------+------------+------+--------------+---------------+---------------------+------------+---------------+-----------
>   b2bcreditonline_prod_e_master               | pgoutput | logical   | 
>   16404 | b2bcreditonline | f         | t      |      13700 |      |   
>   456150233 | 2E2F/AC07B760 | 2E2F/AEEEA478       | reserved   |         
>        | f
>   b2bcreditonline_prod_sandbox_d_master       | pgoutput | logical   | 
>   16404 | b2bcreditonline | f         | t      |       9232 |      |   
>   456150233 | 2E2F/AC07B760 | 2E2F/AEEEA560       | reserved   |         
>        | f
>   b2bcreditonline_prod_e_master_only          | pgoutput | logical   | 
>   16404 | b2bcreditonline | f         | t      |      13710 |      |   
>   456150233 | 2E2F/AC07B760 | 2E2F/AEEEA560       | reserved   |         
>        | f
>   pg_378075177_sync_60067_7343845372910323059 | pgoutput | logical   | 
>   16404 | b2bcreditonline | f         | f      |            |      |   
>   455719618 | 2E2A/1C0972E0 | 2E2A/1C097318       | extended   |         
>        | f
>   b2bcreditonline_prod_e_shard                | pgoutput | logical   | 
>   16404 | b2bcreditonline | f         | t      |      13718 |      |   
>   456150233 | 2E2F/AC07B760 | 2E2F/AEEEA560       | reserved   |         
>        | f
> (5 rows)
> 

For future reference the above would be easier to follow if you did \x 
before the select * from pg_replication_slots;

Someone with more experience will need to comment further but to me:

restart_lsn     confirmed_flush_lsn
2E2A/1C0972E0    2E2A/1C097318

does not look like enough data was transferred before the slot stopped 
transmitting.

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Help diagnosing replication (copy) error

From
Steve Baldwin
Date:


On Sat, Mar 9, 2024 at 9:13 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

I should been clearer.

What are the CREATE PUBLICATION and CREATE SUBSCRIPTION statements?

The publications were created a while ago. Does this help:

b2bcreditonline=> select * from pg_publication;
-[ RECORD 1 ]+-----------------
oid          | 18829
pubname      | b2bc_master
pubowner     | 16760
puballtables | f
pubinsert    | t
pubupdate    | t
pubdelete    | t
pubtruncate  | t
pubviaroot   | f
-[ RECORD 2 ]+-----------------
oid          | 18830
pubname      | b2bc_master_only
pubowner     | 16760
puballtables | f
pubinsert    | t
pubupdate    | t
pubdelete    | t
pubtruncate  | t
pubviaroot   | f
-[ RECORD 3 ]+-----------------
oid          | 18831
pubname      | b2bc_shard
pubowner     | 16760
puballtables | f
pubinsert    | t
pubupdate    | t
pubdelete    | t
pubtruncate  | t
pubviaroot   | f

The publication getting 'stuck' is b2bc_shard. It defines a bunch of tables, one of which is audit.log_row. This table is quite large (600+ GB).

b2bcreditonline=> select * from pg_publication_rel where prrelid = 'audit.log_row'::regclass;
-[ RECORD 1 ]--
oid     | 18832
prpubid | 18831
prrelid | 60067
prqual  |
prattrs |

Here's the subscription info:

b2bcreditonline=> select * from pg_subscription;
-[ RECORD 1 ]-------+-------------------------------------------------------------------------------------------------------------------------------------------
oid                 | 378075175
subdbid             | 16404
subskiplsn          | 0/0
subname             | b2bcreditonline_prod_e_master
subowner            | 16388
subenabled          | t
subbinary           | f
substream           | f
subtwophasestate    | d
subdisableonerr     | f
subpasswordrequired | t
subrunasowner       | f
subconninfo         | host=b2bcreditonline-prod-d.cliwvveusxrd.us-east-1.rds.amazonaws.com user=xxx password=xxx dbname=b2bcreditonline
subslotname         | b2bcreditonline_prod_e_master
subsynccommit       | off
subpublications     | {b2bc_master}
suborigin           | any
-[ RECORD 2 ]-------+-------------------------------------------------------------------------------------------------------------------------------------------
oid                 | 378075176
subdbid             | 16404
subskiplsn          | 0/0
subname             | b2bcreditonline_prod_e_master_only
subowner            | 16388
subenabled          | t
subbinary           | f
substream           | f
subtwophasestate    | d
subdisableonerr     | f
subpasswordrequired | t
subrunasowner       | f
subconninfo         | host=b2bcreditonline-prod-d.cliwvveusxrd.us-east-1.rds.amazonaws.com user=xxx password=xxx dbname=b2bcreditonline
subslotname         | b2bcreditonline_prod_e_master_only
subsynccommit       | off
subpublications     | {b2bc_master_only}
suborigin           | any
-[ RECORD 3 ]-------+-------------------------------------------------------------------------------------------------------------------------------------------
oid                 | 378075177
subdbid             | 16404
subskiplsn          | 0/0
subname             | b2bcreditonline_prod_e_shard
subowner            | 16388
subenabled          | t
subbinary           | f
substream           | f
subtwophasestate    | d
subdisableonerr     | f
subpasswordrequired | t
subrunasowner       | f
subconninfo         | host=b2bcreditonline-prod-d.cliwvveusxrd.us-east-1.rds.amazonaws.com user=xxx password=xxx dbname=b2bcreditonline
subslotname         | b2bcreditonline_prod_e_shard
subsynccommit       | off
subpublications     | {b2bc_shard}
suborigin           | any

--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Help diagnosing replication (copy) error

From
Jeff Ross
Date:
On 3/8/24 14:50, Steve Baldwin wrote:

> Hi,
>
> I'm in the process of migrating a cluster from 15.3 to 16.2. We have a 
> 'zero downtime' requirement so I'm using logical replication to create 
> the new cluster and then perform the switch in the application.
>
> I have a situation where all but one table have done their initial 
> copy. The remaining table is the largest (of course), and the 
> replication slot that is assigned for the copy 
> (pg_378075177_sync_60067_7343845372910323059) is showing as 
> 'active=false' if I select from pg_replication_slots on the publisher.
>
> I've checked the recent logs for both the publishing cluster and the 
> subscribing cluster but I can't see any replication errors. I guess I 
> could have missed them, but it doesn't seem like anything is being 
> 'retried' like I've seen in the past with replication errors.
>
> I've used this mechanism for zero-downtime upgrades multiple times in 
> the past, and have recently used it to upgrade smaller clusters from 
> 15.x to 16.2 without issue.
>
> The clusters are hosted on AWS RDS, so I have no access to the 
> servers, but if that's the only way to diagnose the issue, I can 
> create a support case.
>
> Does anyone have any suggestions as to where I should look for the issue?
>
> Thanks,
>
> Steve

In our setup we're logically replicating a 450G database hosted on real 
hardware to an RDS instance.

Multiple times we've had replication simply stop and we could never find 
any reason for that on either publisher or subscriber.

The *only* solution that ever worked in these cases was dropping the 
subscription in RDS and re-creating it with (copy_data = false).

At that point replication picks right up again for new transactions 
*but* at the expense of losing all of the WAL that should have been 
replicated during the outage.  I wrote a python based "logical 
replication fixer" to fill in those gaps.

Given that the subscriber is the one that initiates the connection to 
the publisher and that as soon as the subscription is dropped and 
restarted replication resumes my hunch is that this is squarely on RDS.  
With both publisher and subscriber on RDS as in your case YMMV.

RDS is a black box--who knows what's really going on there?  It would be 
interesting to see what the response is after you open a support case.  
I hope you'll be able to share that with the list.

Jeff







Re: Help diagnosing replication (copy) error

From
Steve Baldwin
Date:


On Sat, Mar 9, 2024 at 11:06 AM Jeff Ross <jross@openvistas.net> wrote:

RDS is a black box--who knows what's really going on there?  It would be
interesting to see what the response is after you open a support case. 
I hope you'll be able to share that with the list.

This is very mysterious. I logged the case, and before it had been picked up by an analyst, the issue somehow resolved itself without me doing anything.

I now have 418M+ rows in the table that it got stuck on.

:shrug:

Thanks Adrian and Jeff for responding.

Steve
Jeff






Re: Help diagnosing replication (copy) error

From
Adrian Klaver
Date:
On 3/8/24 22:26, Steve Baldwin wrote:
> 
> 
> On Sat, Mar 9, 2024 at 11:06 AM Jeff Ross <jross@openvistas.net 
> <mailto:jross@openvistas.net>> wrote:
> 
> 
>     RDS is a black box--who knows what's really going on there?  It
>     would be
>     interesting to see what the response is after you open a support case.
>     I hope you'll be able to share that with the list.
> 
> This is very mysterious. I logged the case, and before it had been 
> picked up by an analyst, the issue somehow resolved itself without me 
> doing anything.

Is the case still active?
Can you get information from them about what they saw?

I am not big believer in coincidences, that you reported a problem and 
then the problem disappeared.

> 
> I now have 418M+ rows in the table that it got stuck on.
> 
> :shrug:
> 
> Thanks Adrian and Jeff for responding.
> 
> Steve
> 
>     Jeff
> 
> 
> 
> 
> 
> 

-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Help diagnosing replication (copy) error

From
Steve Baldwin
Date:


On Sun, Mar 10, 2024 at 3:16 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

Is the case still active?
Can you get information from them about what they saw?

I've re-opened the case and asked for an explanation of the error and what they did to resolve it.

Hopefully they shed some light on the 'mystery'.

Steve