Logical replication dead but synching - Mailing list pgsql-hackers

From Jehan-Guillaume de Rorthais
Subject Logical replication dead but synching
Date
Msg-id 20191010115752.2d0f27af@firost
List pgsql-hackers
Hello,

While assisting a customer with their broker procedure, I found a
scenario where the subscription fails but the tables are synced anyway.
Here is a bash script to reproduce it with versions 10, 11 and 12 (make sure
to set PATH correctly):

  # env
  PUB=/tmp/pub
  SUB=/tmp/sub
  unset PGPORT PGHOST PGDATABASE PGDATA
  export PGUSER=postgres

  # cleanup
  kill %1
  pg_ctl -w -s -D "$PUB" -m immediate stop; echo $?
  pg_ctl -w -s -D "$SUB" -m immediate stop; echo $?
  rm -r "$PUB" "$SUB"

  # cluster
  initdb -U postgres -N "$PUB" &>/dev/null; echo $?
  initdb -U postgres -N "$SUB" &>/dev/null; echo $?
  echo "wal_level=logical" >> "$PUB"/postgresql.conf
  echo "port=5433" >> "$SUB"/postgresql.conf
  pg_ctl -w -s -D "$PUB" -l "$PUB"-"$(date +%FT%T)".log start; echo $?
  pg_ctl -w -s -D "$SUB" -l "$SUB"-"$(date +%FT%T)".log start; echo $?
  pgbench -p 5432 -qi
  pg_dump -p 5432 -s | psql -qXp 5433

  # fake activity
  pgbench -p 5432 -T 60 -c 2 &

  # replication setup
  cat<<SQL | psql -p 5432 -X
  SELECT * FROM pg_create_logical_replication_slot('sub','pgoutput');
  CREATE PUBLICATION prov FOR ALL TABLES;
  SQL

  cat<<SQL | psql -p 5433 -X
  CREATE SUBSCRIPTION sub
  CONNECTION 'port=5432'
  PUBLICATION prov
  WITH (create_slot=false, slot_name=sub);
  SQL

Here is part of the logs from the subscriber:

    LOG:  logical replication apply worker for subscription "sub" has started
    LOG:  logical replication table synchronization worker for subscription
          "sub", table "pgbench_accounts" has started
    LOG:  logical replication table synchronization worker for subscription
          "sub", table "pgbench_branches" has started
  ERROR:  could not receive data from WAL stream: ERROR: publication "prov"
          does not exist
CONTEXT:  slot "sub", output plugin "pgoutput", in the change callback,
          associated LSN 0/22C0138
    LOG:  logical replication table synchronization worker for subscription
          "sub", table "pgbench_branches" has finished
    LOG:  logical replication table synchronization worker for subscription
          "sub", table "pgbench_accounts" has finished
    LOG:  logical replication apply worker for subscription "sub" has started
  ERROR:  could not receive data from WAL stream: ERROR:  publication "prov"
          does not exist 
CONTEXT:  slot "sub", output plugin "pgoutput", in the change callback,
          associated LSN 0/22C0138

All tables are synced, while the main worker for the subscription is spawned
again and again with the same failure.
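
This state can also be observed from the subscriber's catalogs rather than from the logs. The following is a minimal sketch, assuming the subscriber from the script above listens on port 5433; the helper name `subscriber_status_sql` is hypothetical and only prints the queries.

```shell
# Hypothetical helper: prints catalog queries to run on the subscriber.
subscriber_status_sql() {
  cat <<'SQL'
-- Per-table sync state: i = initialize, d = data copy,
-- s = synchronized, r = ready.
SELECT srrelid::regclass AS tbl, srsubstate
  FROM pg_subscription_rel
 ORDER BY 1;
-- Main apply worker: the row with relid IS NULL; pid is NULL
-- while the worker is not running.
SELECT subname, pid, relid, received_lsn, latest_end_lsn
  FROM pg_stat_subscription;
SQL
}
```

It could be run as `subscriber_status_sql | psql -p 5433 -X`. In the scenario described here, one would expect the tables to reach a synced state while the apply worker row keeps losing its pid.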

As far as I could find out, the problem here is that the slot is created
manually before the publication is created. When the subscriber subscribes,
a catalog cache is built from the slot as of the time the slot was created.
The publication then cannot be found, because it did not exist in this
old version of the catalog. Is my understanding correct?

Sadly, I couldn't find any documentation (neither official nor in the sources)
stating that the slot must be created after the related publication.
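
For comparison, here is a minimal sketch of the publisher-side setup with the two statements swapped, so that the publication exists before the slot is created. It assumes the ordering really is what matters, as suspected above; the helper name `publisher_setup_sql` is hypothetical, while the slot and publication names are those of the reproduction script.

```shell
# Hypothetical helper: publisher-side setup with the publication
# created BEFORE the slot (assumption: the order is what matters).
publisher_setup_sql() {
  cat <<'SQL'
CREATE PUBLICATION prov FOR ALL TABLES;
SELECT * FROM pg_create_logical_replication_slot('sub','pgoutput');
SQL
}
```

It would replace the first heredoc of the script, e.g. `publisher_setup_sql | psql -p 5432 -X`. Leaving out `create_slot=false` in CREATE SUBSCRIPTION, so that the subscription creates the slot itself, should give the same ordering.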

Moreover, it's quite illogical to get an error about a nonexistent
publication while the data are being synced AND the publication
actually exists on the other side.

I suppose this should be covered in the user documentation.

Also, what about forbidding the data sync if the main worker for the
subscription fails?

Regards,

PS: the customer hit the following issue as well while messing around, but I
haven't had time yet to find out how they did it:
https://www.postgresql.org/message-id/flat/a9139c29-7ddd-973b-aa7f-71fed9c38d75%40minerva.info


