Hello,
While giving assistance to some customer with their broker procedure, I found a
scenario where the subscription is failing but the table are sync'ed anyway.
Here is bash script to reproduce it with versions 10, 11 and 12 (make sure
to set PATH correctly):
# env
PUB=/tmp/pub
SUB=/tmp/sub
unset PGPORT PGHOST PGDATABASE PGDATA
export PGUSER=postgres
# cleanup
kill %1
pg_ctl -w -s -D "$PUB" -m immediate stop; echo $?
pg_ctl -w -s -D "$SUB" -m immediate stop; echo $?
rm -r "$PUB" "$SUB"
# cluster
initdb -U postgres -N "$PUB" &>/dev/null; echo $?
initdb -U postgres -N "$SUB" &>/dev/null; echo $?
echo "wal_level=logical" >> "$PUB"/postgresql.conf
echo "port=5433" >> "$SUB"/postgresql.conf
pg_ctl -w -s -D $PUB -l "$PUB"-"$(date +%FT%T)".log start; echo $?
pg_ctl -w -s -D $SUB -l "$SUB"-"$(date +%FT%T)".log start; echo $?
pgbench -p 5432 -qi
pg_dump -p 5432 -s | psql -qXp 5433
# fake activity
pgbench -p 5432 -T 60 -c 2 &
# replication setup
cat<<SQL | psql -p 5432 -X
SELECT * FROM pg_create_logical_replication_slot('sub','pgoutput');
CREATE PUBLICATION prov FOR ALL TABLES;
SQL
cat<<SQL | psql -p 5433 -X
CREATE SUBSCRIPTION sub
CONNECTION 'port=5432'
PUBLICATION prov
WITH (create_slot=false, slot_name=sub)
SQL
Here are part of the logs from the subscriber:
LOG: logical replication apply worker for subscription "sub" has started
LOG: logical replication table synchronization worker for subscription
"sub", table "pgbench_accounts" has started
LOG: logical replication table synchronization worker for subscription
"sub", table "pgbench_branches" has started
ERROR: could not receive data from WAL stream: ERROR: publication "prov"
does not exist
CONTEXT: slot "sub", output plugin "pgoutput", in the change callback,
associated LSN 0/22C0138
LOG: logical replication table synchronization worker for subscription
"sub", table "pgbench_branches" has finished
LOG: logical replication table synchronization worker for subscription
"sub", table "pgbench_accounts" has finished
LOG: logical replication apply worker for subscription "sub" has started
ERROR: could not receive data from WAL stream: ERROR: publication "prov"
does not exist
CONTEXT: slot "sub", output plugin "pgoutput", in the change callback,
associated LSN 0/22C0138
All tables are synch'ed while the main worker for subscription is spawned again
and again with the same failure.
As far as I could find out, the problem here is that the slot is created
manually before the publication is created. When the subscriber subscribe,
it builds a catalog cache from the slot by the time it has been created.
Then, it couldn't find the publication, because it didn't exists in this
old version of the catalog. Is my understanding correct?
Sadly, I couldn't find any documentation (neither official or in sources) about
the fact the slot must be created after the related publication.
Moreover, it's quite illogical to find some error about a non
existing publication when the data are being synched AND the publication
actually exists on the other side.
I suppose this should be documented in user documentation.
Plus, what about forbidding the data sync if the main worker for
subscription fails?
Regards,
PS: the customer hit the following issue as well while messing around but I
hadn't time to find out how they did yet:
https://www.postgresql.org/message-id/flat/a9139c29-7ddd-973b-aa7f-71fed9c38d75%40minerva.info