Thread: Logical replication troubles
Hi,
I have run into a (to me) weird issue with logical replication. We are running Zalandos postgres-operator in our Kubernetes clusters and have recently had a use-case where we wanted to start doing logical replication of select tables to a data warehouse, also running postgres. It worked as expected at first, but then after a pod-restart in Kubernetes, the replication slots that were created for the subscription were gone. A bit of reading later, and I learn we need to tell Patroni which slots should be permanently available, so we specify a slot and try to set this up, but then run into an error which says the publication does not exist, even though we can verify that it does. At first I suspected Patroni handling the replication slots to be the cause of the problem, but about a week's worth of learning and experimenting later, I can now reliably replicate the problem in pure postgres. Patroni is kind of the catalyst, since my findings are that if the replication slot is created before data is inserted into the source database, and a publication is created, then it breaks. If the replication slot is created after data is inserted and the publication is created, then it works. We just can't tell Patroni to not create it until some arbitrary point in time. I am guessing this is either a bug or a case of us not knowing what we are doing...
I have created a Github gist demonstrating the problem: https://gist.github.com/drzero42/02b4082ce002c1d90ddd64f5fe03aee0
Anybody able to help? :)
--
Anders Bøgh Bruun
Infrastructure Architect
CellPoint digital
cellpointdigital.com
WE MAKE TRAVEL EASIER™
M: +45 31 14 87 41
E: anders@cellpointdigital.com
Chicago | Copenhagen | Dubai | London | Miami | Pune | Singapore
On 2020-05-19 09:22, Anders Bøgh Bruun wrote: > I have run into a (to me) weird issue with logical replication. We are > running Zalandos postgres-operator in our Kubernetes clusters and have > recently had a use-case where we wanted to start doing logical > replication of select tables to a data warehouse, also running postgres. > It worked as expected at first, but then after a pod-restart in > Kubernetes, the replication slots that were created for the subscription > were gone. A bit of reading later, and I learn we need to tell Patroni > which slots should be permanently available, so we specify a slot and > try to set this up, but then run into an error which says the > publication does not exist, even though we can verify that it does. At > first I suspected Patroni handling the replication slots to be the cause > of the problem, but about a week's worth of learning and experimenting > later, I can now reliably replicate the problem in pure postgres. > Patroni is kind of the catalyst, since my findings are that if the > replication slot is created before data is inserted into the source > database, and a publication is created, then it breaks. If the > replication slot is created after data is inserted and the publication > is created, then it works. We just can't tell Patroni to not create it > until some arbitrary point in time. I am guessing this is either a bug > or a case of us not knowing what we are doing... What does "works" and "breaks" mean? Are there error messages shown from commands or in the server logs? Is replication progressing, but doing something your are not expecting? etc. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Fair question, I should have been more elaborate.
When I have the creation of a replication slot, before I insert any data into the table in my example (the SQL-file named "broken" in my gist), after I then create the subscription on the receiving side, it does the initial sync of data and then the replication fails with these messages repeating on the receiving server:
[67] LOG: logical replication apply worker for subscription "widgets_sub" has started
[67] DEBUG: connecting to publisher using connection string "dbname=testdb host=master port=5432 user=repuser password=abc123"
[67] ERROR: could not receive data from WAL stream: ERROR: publication "widgets_pub" does not exist
CONTEXT: slot "my_slot", output plugin "pgoutput", in the change callback, associated LSN 0/1674958
[1] DEBUG: unregistering background worker "logical replication worker for subscription 16396"
[1] LOG: background worker "logical replication worker" (PID 67) exited with exit code 1
[67] DEBUG: connecting to publisher using connection string "dbname=testdb host=master port=5432 user=repuser password=abc123"
[67] ERROR: could not receive data from WAL stream: ERROR: publication "widgets_pub" does not exist
CONTEXT: slot "my_slot", output plugin "pgoutput", in the change callback, associated LSN 0/1674958
[1] DEBUG: unregistering background worker "logical replication worker for subscription 16396"
[1] LOG: background worker "logical replication worker" (PID 67) exited with exit code 1
I can verify that the publication called widgets_pub does exist, and I am not seeing any errors on the sending side.
The SQL-file named "works" just has the creation of the replication slot moved down to after I insert some data into the table I want to replicate. And that works as expected.
On Wed, 20 May 2020 at 15:00, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 2020-05-19 09:22, Anders Bøgh Bruun wrote:
> I have run into a (to me) weird issue with logical replication. We are
> running Zalandos postgres-operator in our Kubernetes clusters and have
> recently had a use-case where we wanted to start doing logical
> replication of select tables to a data warehouse, also running postgres.
> It worked as expected at first, but then after a pod-restart in
> Kubernetes, the replication slots that were created for the subscription
> were gone. A bit of reading later, and I learn we need to tell Patroni
> which slots should be permanently available, so we specify a slot and
> try to set this up, but then run into an error which says the
> publication does not exist, even though we can verify that it does. At
> first I suspected Patroni handling the replication slots to be the cause
> of the problem, but about a week's worth of learning and experimenting
> later, I can now reliably replicate the problem in pure postgres.
> Patroni is kind of the catalyst, since my findings are that if the
> replication slot is created before data is inserted into the source
> database, and a publication is created, then it breaks. If the
> replication slot is created after data is inserted and the publication
> is created, then it works. We just can't tell Patroni to not create it
> until some arbitrary point in time. I am guessing this is either a bug
> or a case of us not knowing what we are doing...
What does "works" and "breaks" mean? Are there error messages shown
from commands or in the server logs? Is replication progressing, but
doing something your are not expecting? etc.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Anders Bøgh Bruun
Infrastructure Architect
CellPoint digital
cellpointdigital.com
WE MAKE TRAVEL EASIER™
M: +45 31 14 87 41
E: anders@cellpointdigital.com
Chicago | Copenhagen | Dubai | London | Miami | Pune | Singapore
On 2020-05-20 17:16, Anders Bøgh Bruun wrote: > [67] LOG: logical replication apply worker for subscription > "widgets_sub" has started > [67] DEBUG: connecting to publisher using connection string > "dbname=testdb host=master port=5432 user=repuser password=abc123" > [67] ERROR: could not receive data from WAL stream: ERROR: publication > "widgets_pub" does not exist > CONTEXT: slot "my_slot", output plugin "pgoutput", in the change > callback, associated LSN 0/1674958 > [1] DEBUG: unregistering background worker "logical replication worker > for subscription 16396" > [1] LOG: background worker "logical replication worker" (PID 67) exited > with exit code 1 > > I can verify that the publication called widgets_pub does exist, and I > am not seeing any errors on the sending side. > > The SQL-file named "works" just has the creation of the replication slot > moved down to after I insert some data into the table I want to > replicate. And that works as expected. You need to create the publication before the replication slot. The walsender's view of the world moves along with the WAL it is decoding/sending. So when the subscription worker connects, it initially sees a state as of the creation of the replication slot, when the publication did not exist yet. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you for that clarification. It helps me understand how things work a lot better.
I know this might be a bit off topic, but my challenge here is that we are using Patroni (by using Zalando's postgres-operator for Kubernetes), and any replication slot not created by Patroni, seems to be removed, whenever the master pod restarts. We therefore specify in the Patroni config, that a permanent replication slot should be created for our usage to do logical replication of some select tables, to our data warehouse. That means that the replication slot is created as soon as the database is ready to use, which is also before any tables, data or publications are created. Can you give me a hint as to what the correct way to set this up would be?
Or do I need to try contacting the Patroni devs instead?
On Fri, 22 May 2020 at 11:00, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 2020-05-20 17:16, Anders Bøgh Bruun wrote:
> [67] LOG: logical replication apply worker for subscription
> "widgets_sub" has started
> [67] DEBUG: connecting to publisher using connection string
> "dbname=testdb host=master port=5432 user=repuser password=abc123"
> [67] ERROR: could not receive data from WAL stream: ERROR: publication
> "widgets_pub" does not exist
> CONTEXT: slot "my_slot", output plugin "pgoutput", in the change
> callback, associated LSN 0/1674958
> [1] DEBUG: unregistering background worker "logical replication worker
> for subscription 16396"
> [1] LOG: background worker "logical replication worker" (PID 67) exited
> with exit code 1
>
> I can verify that the publication called widgets_pub does exist, and I
> am not seeing any errors on the sending side.
>
> The SQL-file named "works" just has the creation of the replication slot
> moved down to after I insert some data into the table I want to
> replicate. And that works as expected.
You need to create the publication before the replication slot. The
walsender's view of the world moves along with the WAL it is
decoding/sending. So when the subscription worker connects, it
initially sees a state as of the creation of the replication slot, when
the publication did not exist yet.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Anders Bøgh Bruun
Infrastructure Architect
CellPoint digital
cellpointdigital.com
WE MAKE TRAVEL EASIER™
M: +45 31 14 87 41
E: anders@cellpointdigital.com
Chicago | Copenhagen | Dubai | London | Miami | Pune | Singapore
On 2020-05-25 10:19, Anders Bøgh Bruun wrote: > Thank you for that clarification. It helps me understand how things work > a lot better. > I know this might be a bit off topic, but my challenge here is that we > are using Patroni (by using Zalando's postgres-operator for Kubernetes), > and any replication slot not created by Patroni, seems to be removed, > whenever the master pod restarts. We therefore specify in the Patroni > config, that a permanent replication slot should be created for our > usage to do logical replication of some select tables, to our data > warehouse. That means that the replication slot is created as soon as > the database is ready to use, which is also before any tables, data or > publications are created. Can you give me a hint as to what the correct > way to set this up would be? > Or do I need to try contacting the Patroni devs instead? That would probably be best. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services