The following bug has been logged on the website:
Bug reference: 18898
Logged by: Jaime Casanova
Email address: jcasanov@systemguards.com.ec
PostgreSQL version: 16.6
Operating system: Ubuntu 22.04.4 LTS
Description:
Hi everyone,
A customer has a pyramidal structure: a central server, 3 nodes acting as
hubs, and many points of sale (POS) grouped under the hubs (around 80 per
hub).
In this structure there is a set of tables whose data is generated on the
central server and replicated all the way down to the POS (there is a
subscription on every hub pulling the data from the central server, and a
subscription on each POS pulling data from its corresponding hub). This is
the down subscription.
There is also a set of tables whose data is generated on the POS and goes
all the way up to the central server (there is one subscription per POS on
each hub, every hub only has the info of the POS connected to it, and the
central server pulls the aggregated data from the hubs). This is the up
subscription.
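As a rough sketch of the layout described above (the publication,
subscription, table, host and database names are placeholders made up for
illustration, not the real ones, and options such as copy_data are left
out here):

```sql
-- Down direction: central -> hub -> POS.
-- On the central server:
CREATE PUBLICATION pub_down FOR TABLE products;

-- On each hub: pull from the central server, then republish to its POS.
CREATE SUBSCRIPTION sub_down_hub1
    CONNECTION 'host=central dbname=shop'
    PUBLICATION pub_down;
CREATE PUBLICATION pub_down FOR TABLE products;

-- On each POS: pull from its hub.
CREATE SUBSCRIPTION sub_down_pos001
    CONNECTION 'host=hub1 dbname=shop'
    PUBLICATION pub_down;

-- Up direction: POS -> hub -> central.
-- On each POS:
CREATE PUBLICATION pub_up FOR TABLE sales;

-- On the hub: one subscription per POS, then republish upwards.
CREATE SUBSCRIPTION sub_up_pos001
    CONNECTION 'host=pos001 dbname=shop'
    PUBLICATION pub_up;
CREATE PUBLICATION pub_up FOR TABLE sales;

-- On the central server: one subscription per hub.
CREATE SUBSCRIPTION sub_up_hub1
    CONNECTION 'host=hub1 dbname=shop'
    PUBLICATION pub_up;
```

The subscription names are kept distinct per subscriber because, by
default, each subscription creates a replication slot with the same name on
its publisher, and slot names must be unique there.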
All of this uses native logical replication. The central server and the
hubs are on 16.6; the POS are a mix of 16.6, 16.7 and 16.8.
It has this structure for historical reasons related to the way the info
was read from WAL to be sent to the subscribers.
Anyway, the down subscription (which pulls data from the hub to the POS;
the data was originally generated on the central server) is created first,
with copy_data=true. After all data has been copied to the POS, we create
the up subscription, which generally has no initial data to copy, so we
create it with copy_data=false. We have seen cases in which the up
subscription looks active (pg_stat_subscription.pid is not null on the hub,
pg_replication_slots.active_pid is not null, and there is a row in
pg_stat_replication) but no data generated on the POS has been sent to the
hub. When this happens, just dropping the subscription on the hub and
creating it again with copy_data=true solves the problem.
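For reference, the checks mentioned above look roughly like this. This is
only a sketch: 'sub_up_pos001' is a placeholder subscription name, and it
assumes the default behaviour where the replication slot and the
application_name on the publisher match the subscription name.

```sql
-- On the hub (the subscriber side of the up subscription):
SELECT subname, pid, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription
WHERE subname = 'sub_up_pos001';

-- On the POS (the publisher side of the up subscription):
SELECT slot_name, active, active_pid, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'sub_up_pos001';

SELECT application_name, state, sent_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication
WHERE application_name = 'sub_up_pos001';

-- Current WAL position on the POS, to compare with the LSNs above.
SELECT pg_current_wal_lsn();
```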
Is there any way to debug the real problem here? Or is there something I'm
missing?