Thread: BUG #18898: replication process not working... sometimes

BUG #18898: replication process not working... sometimes

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18898
Logged by:          Jaime Casanova
Email address:      jcasanov@systemguards.com.ec
PostgreSQL version: 16.6
Operating system:   Ubuntu 22.04.4 LTS
Description:

Hi everyone,

A customer has a pyramidal structure: a central server, 3 nodes acting as
hubs, and many points of sale (POS) grouped under the hubs (around 80 per
hub).

In this structure there is a set of tables whose data is generated on the
central server and replicated all the way down to the POS (there is a
subscription on every hub, pulling the data from the central server, and a
subscription on each POS pulling data from its corresponding hub). We call
this the down subscription.

There is also a set of tables whose data is generated on the POS and goes
all the way up to the central server (there is one subscription per POS on
each hub, every hub only has the info of the POS connected to it, and the
central server pulls the aggregated data from the hubs). We call this the up
subscription.

All of this uses native logical replication. The central server and the hubs
are 16.6; the POS are a mix of 16.6, 16.7 and 16.8.
It has this structure for historical reasons related to the way the info was
read from WAL to be sent to the subscribers.

Anyway, the down subscription (which pulls data from the hub to the POS; the
data was originally generated on the central server) is created with
copy_data=true, and all data is copied to the POS. Then we create the up
subscription, which generally has no initial data to copy, so we create it
with copy_data=false. We have seen cases in which the up subscription looks
active (pg_stat_subscription.pid is not null on the hub, and on the POS
pg_replication_slots.active_pid is not null and there is a row in
pg_stat_replication), but no data generated on the POS has been sent to the
hub. When this happens, just dropping the subscription on the hub and
creating it again with copy_data=true solves the problem.
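
[Editor's note] A first step in a case like this is to check whether the
logical slot on the POS is advancing at all. The sketch below is only an
illustration, not something from the report. The two query strings use
standard catalog views (pg_replication_slots, pg_stat_subscription). The
helper names parse_lsn and lag_bytes are hypothetical. A pg_lsn value is
printed as two hexadecimal numbers separated by a slash, so the distance
between two positions can be computed by hand:

```python
"""Sketch: compare pg_lsn values to see whether a logical slot is advancing."""

# Run on the POS (the publisher side of the up subscription).
SLOT_QUERY = """
SELECT slot_name, active_pid, confirmed_flush_lsn, pg_current_wal_lsn()
FROM pg_replication_slots
WHERE slot_type = 'logical';
"""

# Run on the hub (the subscriber side of the up subscription).
SUBSCRIPTION_QUERY = """
SELECT subname, pid, received_lsn, latest_end_lsn
FROM pg_stat_subscription;
"""


def parse_lsn(text: str) -> int:
    """Convert a pg_lsn string such as '16/B374D848' to a byte position."""
    high, low = text.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL the subscriber has not yet confirmed (hypothetical helper)."""
    return parse_lsn(current_wal_lsn) - parse_lsn(confirmed_flush_lsn)


if __name__ == "__main__":
    # Example values only; substitute the output of SLOT_QUERY.
    print(lag_bytes("0/3000060", "0/3000000"))  # prints 96
```

If confirmed_flush_lsn keeps advancing while no rows reach the hub, the
connection itself is probably fine. The changes are more likely being
filtered out on the POS or skipped on the hub. If it does not advance, the
apply worker or the connection is the more likely suspect.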

Any way to debug the real problem here? Or is there something I'm missing?


RE: BUG #18898: replication process not working... sometimes

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Jaime,

> The following bug has been logged on the website:

Thanks for reporting the issue. Could you please provide scripts which can
reproduce the issue? That would be very helpful for analyzing and fixing it.

I roughly emulated the situation as in the attachment, but I could not
reproduce the issue. Maybe I missed something.

> Any way to debug the real problem here? or is there something i'm missing?

I am not sure they are related, but [1] and [2] come to mind. That issue could
happen when ALTER PUBLICATION or ALTER TYPE commands are run during replication.

[1]: https://www.postgresql.org/message-id/flat/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com
[2]: https://github.com/postgres/postgres/commit/4909b38af034fa4d2c67c5c71fd8509f870c1695
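
[Editor's note] A related check, independent of [1] and [2], is whether
every table in the POS publication is known to the hub's subscription and in
the ready state. A table added to a publication after the subscription was
created is not picked up until ALTER SUBSCRIPTION ... REFRESH PUBLICATION is
run. The sketch below is only an illustration. The query strings use standard
catalogs (pg_publication_tables, pg_subscription_rel). The function
find_unsynced and the table names are hypothetical:

```python
"""Sketch: compare published tables with the subscription's relation states."""

# Run on the POS: tables that the publication exposes.
PUBLISHED_QUERY = """
SELECT schemaname || '.' || tablename
FROM pg_publication_tables;
"""

# Run on the hub: per-table state tracked by the subscription.
# srsubstate: i = initialize, d = data copy, f = finished copy,
#             s = synchronized, r = ready (normal replication).
SUBSCRIBED_QUERY = """
SELECT n.nspname || '.' || c.relname, sr.srsubstate
FROM pg_subscription_rel sr
JOIN pg_class c ON c.oid = sr.srrelid
JOIN pg_namespace n ON n.oid = c.relnamespace;
"""

READY = "r"


def find_unsynced(published, subscribed_states):
    """Return (missing, not_ready) lists of schema-qualified table names.

    published: set of names from PUBLISHED_QUERY.
    subscribed_states: dict of name -> srsubstate from SUBSCRIBED_QUERY.
    """
    missing = sorted(t for t in published if t not in subscribed_states)
    not_ready = sorted(
        t for t, state in subscribed_states.items()
        if t in published and state != READY
    )
    return missing, not_ready


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    published = {"public.sales", "public.tickets", "public.returns"}
    states = {"public.sales": "r", "public.tickets": "d"}
    print(find_unsynced(published, states))
```

Tables reported as missing would need a REFRESH PUBLICATION on the hub.
Tables stuck in a state other than r suggest looking at the table
synchronization workers in the hub's log.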

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment