Thread: Logical replication - initial data synchronization
The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/16/logical-replication-subscription.html
Description:

I'm reading up on Logical Replication and have been reading the pages in order.

The first 2 pages: https://www.postgresql.org/docs/current/logical-replication.html and https://www.postgresql.org/docs/current/logical-replication-publication.html both speak of the requirement to set up a snapshot and explain that publication will then send further updates as they happen to subscribers.

But the 3rd page, https://www.postgresql.org/docs/current/logical-replication-subscription.html now mentions this: "Additional replication slots may be required for the initial data synchronization of pre-existing table data and those will be dropped at the end of data synchronization."

For me, reading the first 2 pages implied that I would have to perform some manual command that starts the creation of a snapshot of pre-existing table data, and unpack this on the subscriber node somehow.

The text on the "Subscription" page sounds to me like this is actually something the publisher<-> subscriber model of the postgres software can manage on its own. As opposed to a snapshot, which feels more like the concept of a basebackup.

Regardless of that being correct or not, my current impression is that the description isn't consistent across pages. Maybe the text is obvious for people who've performed setup of logical replication before, but I have never done this. To me, the description on the first 2 pages seems inconsistent with the description I just encountered on the 3rd page. I was under the impression there was no such thing as "initial data synchronization of pre-existing table data" in terms of postgres doing this by itself.

Am I missing something extremely simple, or can the description of the involved operations be made more consistent across documentation pages?

Regards,
Koen De Groote
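Hello Bruce, thanks for picking this up.
Personally, I would make explicit mention of the fact that creating the snapshot and copying the data is taken care of by Postgres itself. Those are the points that had me confused early on, wondering if I had to perform the copy once the snapshot was ready.
Having used LR for months now, that seems weird as I write it, but I remember it being part of my initial confusion.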
Instead of:
" Internally logical replication of a table starts by taking a snapshot
of the data on the publisher database and copying that to the subscriber."
I would say:
"When logical replication is started for a table, Postgres internally
takes a snapshot of the table data on the publisher database,
and then copies that data to the subscriber."
Also, I would change:
"Once complete, the changes on the publisher are sent to the subscriber"
To:
"Once complete, any changes on the publisher since the initial copy are sent to the subscriber"
This is more explicit and clear, I feel.
And then to be consistent I'd also use this wording in the last change, changing:
"publisher database. Once complete, changes on the publisher are sent"
to
"publisher database. Once complete, any changes on the publisher since the initial copy are sent"
Hope that's ok.
Thanks for looking into this.
Regards,
Koen De Groote
On Thu, Oct 17, 2024 at 3:20 AM Bruce Momjian <bruce@momjian.us> wrote:
On Sat, May 18, 2024 at 09:02:11PM +0000, PG Doc comments form wrote:
> The following documentation comment has been logged on the website:
>
> Page: https://www.postgresql.org/docs/16/logical-replication-subscription.html
> Description:
>
> I'm reading up on Logical Replication and have been reading the pages in
> order.
>
> The first 2 pages:
> https://www.postgresql.org/docs/current/logical-replication.html and
> https://www.postgresql.org/docs/current/logical-replication-publication.html
> both speak of the requirement to set up a snapshot and explain that
> publication will then send further updates as they happen to subscribers.
>
> But the 3rd page,
> https://www.postgresql.org/docs/current/logical-replication-subscription.html
> now mentions this: "Additional replication slots may be required for the
> initial data synchronization of pre-existing table data and those will be
> dropped at the end of data synchronization."
>
> For me, reading the first 2 pages implied that I would have to perform some
> manual command that starts the creation of a snapshot of pre-existing table
> data, and unpack this on the subscriber node somehow.
>
> The text on the "Subscription" page sounds to me like this is actually
> something the publisher<-> subscriber model of the postgres software can
> manage on its own. As opposed to a snapshot, which feels more like the
> concept of a basebackup.
>
> Regardless of that being correct or not, my current impression is that the
> description isn't consistent across pages. Maybe the text is obvious for
> people who've performed setup of logical replication before, but I have
> never done this. To me, the description on the first 2 pages seems
> inconsistent with the description I just encountered on the 3rd page. I was
> under the impression there was no such thing as "initial data
> synchronization of pre-existing table data" in terms of postgres doing this
> by itself.
>
> Am I missing something extremely simple, or can the description of the
> involved operations be made more consistent across documentation pages?
Is the attached patch an improvement?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
I would still add that the snapshot and data copying is handled by postgres itself. That was the big question I was stuck with: "Who is taking this snapshot? Do I have to do it? Will they explain how?"
Because it's left in the middle who does it. Once you know how logical replication works, it's obvious; re-reading the documentation I know what to expect, but that's only because I've already done it a few times now.
As someone just starting and reading the documentation, it was a stumbling block for me.
To me,
> When logical replication of a table typically starts, a snapshot is
> taken of the table's data on the publisher database and copied to the
> subscriber
Does not clarify that.
It's the reason I created this mail: I would like it stated explicitly that the database process takes care of this for us.
Regards,
Koen De Groote
On Thu, Oct 17, 2024 at 3:08 PM Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Oct 17, 2024 at 10:59:51AM +0200, Koen De Groote wrote:
> Hello Bruce, thanks for picking this up.
>
> Personally, I would make explicit mention of the fact that creating the
> snapshot and copying the data is taken care of by Postgres itself. Those are
> the points that had me confused early on, wondering if I had to perform the
> copy once the snapshot was ready.
Updated patch attached. I tried to tighten up the wording and add more
detail. I didn't see the point in repeating the same paragraph later on
so I removed it.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On Thu, Oct 17, 2024 at 09:57:04PM +0200, Koen De Groote wrote:
> I would still add that the snapshot and data copying is handled by postgres
> itself. That was the big question I was stuck with: "Who is taking this
> snapshot? Do I have to do it? Will they explain how?"
>
> Because it's left in the middle who does it. Once you know how logical
> replication works, it's obvious, re-reading the documentation I know what to
> expect, but that's only because I've already done it a few times now.
>
> As someone just starting and reading the documentation, it was a stumbling
> block for me.
>
> To me,
>
> > When logical replication of a table typically starts, a snapshot is
> > taken of the table's data on the publisher database and copied to the
> > subscriber
>
> Does not clarify that.
>
> It's the reason I created this mail: I would like it stated explicitly that the
> database process takes care of this for us.
Well, you are the first person to report this confusion, and we can't go
around explaining what Postgres does and does not do in each section. I
would need to hear from other people that this is confusing before
making it explicit.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On Thu, 2024-10-17 at 16:00 -0400, Bruce Momjian wrote:
> > To me,
> >
> > > When logical replication of a table typically starts, a snapshot is
> > > taken of the table's data on the publisher database and copied to the
> > > subscriber
> >
> > Does not clarify that.
> >
> > It's the reason I created this mail: I would like it stated explicitly that the
> > database process takes care of this for us.
>
> Well, you are the first person to report this confusion, and we can't go
> around explaining what Postgres does and does not do in each section. I
> would need to hear from other people that this is confusing before
> making it explicit.
I for one would have interpreted the passive voice here as meaning that the
database does that automatically. But perhaps active voice can make it even
clearer:
Ordinarily, when logical replication of a table starts,
<productname>PostgreSQL</productname> takes a snapshot of the table's
data on the publisher database and copies these data to the subscriber
Yours,
Laurenz Albe
> But perhaps active voice can make it even clearer:
To give some context, jobs over the past decade have taught me to work on a default setting of "If it isn't explicitly stated, it cannot be assumed and has to be tested first. Lack of explicit statement is lack of a guarantee, and if you don't test it, you'll run into issues down the line."
To Bruce: I get that concern in general. But I feel like people don't complain because at some point they figure it out, and then it's no longer an issue for them. But it was initially, and it did cost them time. Most people will work with PostgreSQL for their job and their job will demand timely delivery from them. Sending mails like this is probably something most people aren't going to take time for.
At least, that's how I see the situation. It may be presumptive of me.
I realize this has taken a bit of time already, and for my part, I would like the more explicit statement, as shown by Laurenz, but I'm not going to keep going on about it if it doesn't make its way in.
Regards,
Koen De Groote
On Fri, Oct 18, 2024 at 10:11 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Thu, 2024-10-17 at 16:00 -0400, Bruce Momjian wrote:
> > To me,
> >
> > > When logical replication of a table typically starts, a snapshot is
> > > taken of the table's data on the publisher database and copied to the
> > > subscriber
> >
> > Does not clarify that.
> >
> > It's the reason I created this mail: I would like it stated explicitly that the
> > database process takes care of this for us.
>
> Well, you are the first person to report this confusion, and we can't go
> around explaining what Postgres does and does not do in each section. I
> would need to hear from other people that this is confusing before
> making it explicit.
I for one would have interpreted the passive voice here as meaning that the
database does that automatically. But perhaps active voice can make it even
clearer:
Ordinarily, when logical replication of a table starts,
<productname>PostgreSQL</productname> takes a snapshot of the table's
data on the publisher database and copies these data to the subscriber
Yours,
Laurenz Albe
On Fri, 2024-10-18 at 13:05 +0200, Koen De Groote wrote:
> I feel like people don't complain because at some point they figure it out,
> and then it's no longer an issue for them. But it was initially, and it did
> cost them time. Most people will work with PostgreSQL for their job and
> their job will demand timely delivery from them. Sending mails like this
> is probably something most people aren't going to take time for.
It is the goal of the documentation to be factually correct and complete
while still being comprehensible. We don't always succeed in that.
PostgreSQL depends on people that contribute, and your contribution is valued.
Yours,
Laurenz Albe
On Fri, Oct 18, 2024 at 7:11 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Thu, 2024-10-17 at 16:00 -0400, Bruce Momjian wrote:
> > > To me,
> > >
> > > > When logical replication of a table typically starts, a snapshot is
> > > > taken of the table's data on the publisher database and copied to the
> > > > subscriber
> > >
> > > Does not clarify that.
> > >
> > > It's the reason I created this mail: I would like it stated explicitly that the
> > > database process takes care of this for us.
> >
> > Well, you are the first person to report this confusion, and we can't go
> > around explaining what Postgres does and does not do in each section. I
> > would need to hear from other people that this is confusing before
> > making it explicit.
>
> I for one would have interpreted the passive voice here as meaning that the
> database does that automatically. But perhaps active voice can make it even
> clearer:
>
> Ordinarily, when logical replication of a table starts,
> <productname>PostgreSQL</productname> takes a snapshot of the table's
> data on the publisher database and copies these data to the subscriber
>
+1 to say that it is PostgreSQL that does this.
It seems to me the same clarification could be achieved just by adding
1 word ("PostgreSQL") to the original text. e.g.
BEFORE
Logical replication of a table typically starts with taking a snapshot
of the data on the publisher database and copying that to the
subscriber.
AFTER #1 (I added "PostgreSQL")
Logical replication of a table typically starts with PostgreSQL taking
a snapshot of the data on the publisher database and copying that to
the subscriber.
Or, AFTER #2 (I added "PostgreSQL internally")
Logical replication of a table typically starts with PostgreSQL
internally taking a snapshot of the data on the publisher database and
copying that to the subscriber.
======
Kind Regards,
Peter Smith.
Fujitsu Australia
On Fri, 2024-10-25 at 17:22 +1100, Peter Smith wrote:
> It seems to me the same clarification could be achieved just by adding
> 1 word ("PostgreSQL") to the original text. e.g.
>
> BEFORE
> Logical replication of a table typically starts with taking a snapshot
> of the data on the publisher database and copying that to the
> subscriber.
>
> AFTER #1 (I added "PostgreSQL")
> Logical replication of a table typically starts with PostgreSQL taking
> a snapshot of the data on the publisher database and copying that to
> the subscriber.
+1 on this version. Either this way or the way I suggested, it doesn't
matter as far as I am concerned.
Perhaps Koen wants to chime in and say what sounds best to him.
> Or, AFTER #2 (I added "PostgreSQL internally")
> Logical replication of a table typically starts with PostgreSQL
> internally taking a snapshot of the data on the publisher database and
> copying that to the subscriber.
Yours,
Laurenz Albe
Number 2 would be perfect, I think. It leaves no doubt at all.
Regards,
Koen De Groote
On Fri, Oct 25, 2024 at 9:32 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Fri, 2024-10-25 at 17:22 +1100, Peter Smith wrote:
> It seems to me the same clarification could be achieved just by adding
> 1 word ("PostgreSQL") to the original text. e.g.
>
> BEFORE
> Logical replication of a table typically starts with taking a snapshot
> of the data on the publisher database and copying that to the
> subscriber.
>
> AFTER #1 (I added "PostgreSQL")
> Logical replication of a table typically starts with PostgreSQL taking
> a snapshot of the data on the publisher database and copying that to
> the subscriber.
+1 on this version. Either this way or the way I suggested, it doesn't
matter as far as I am concerned.
Perhaps Koen wants to chime in and say what sounds best to him.
> Or, AFTER #2 (I added "PostgreSQL internally")
> Logical replication of a table typically starts with PostgreSQL
> internally taking a snapshot of the data on the publisher database and
> copying that to the subscriber.
Yours,
Laurenz Albe
And my thanks to everyone who took part.
Regards,
Koen De Groote
On Thu, Oct 31, 2024 at 4:16 PM Bruce Momjian <bruce@momjian.us> wrote:
On Fri, Oct 25, 2024 at 11:48:16PM +0200, Koen De Groote wrote:
> Number 2 would be perfect, I think. It leaves no doubt at all.
I went with even more direct wording, patch attached. Glad we could
improve this together --- the feedback from everyone was helpful.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On Fri, Nov 1, 2024 at 2:16 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Fri, Oct 25, 2024 at 11:48:16PM +0200, Koen De Groote wrote:
> > Number 2 would be perfect, I think. It leaves no doubt at all.
>
> I went with even more direct wording, patch attached. Glad we could
> improve this together --- the feedback from everyone was helpful.
>
One small nit -- I think "PostgreSQL" typically would have SGML markup:
<productname>PostgreSQL</productname>
======
Kind Regards,
Peter Smith.
Fujitsu Australia
On Fri, Nov 1, 2024 at 09:07:38AM +1100, Peter Smith wrote:
> On Fri, Nov 1, 2024 at 2:16 AM Bruce Momjian <bruce@momjian.us> wrote:
> >
> > On Fri, Oct 25, 2024 at 11:48:16PM +0200, Koen De Groote wrote:
> > > Number 2 would be perfect, I think. It leaves no doubt at all.
> >
> > I went with even more direct wording, patch attached. Glad we could
> > improve this together --- the feedback from everyone was helpful.
> >
>
> One small nit -- I think "PostgreSQL" typically would have SGML markup:
>
> <productname>PostgreSQL</productname>
Yes, I thought so too, but if you look at the SGML, it uses PostgreSQL
with no markup in many places close to this sentence, e.g. "PostgreSQL
supports both mechanisms concurrently".
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On Thu, Oct 31, 2024 at 06:52:34PM -0400, Bruce Momjian wrote:
> On Fri, Nov 1, 2024 at 09:07:38AM +1100, Peter Smith wrote:
> > On Fri, Nov 1, 2024 at 2:16 AM Bruce Momjian <bruce@momjian.us> wrote:
> > >
> > > On Fri, Oct 25, 2024 at 11:48:16PM +0200, Koen De Groote wrote:
> > > > Number 2 would be perfect, I think. It leaves no doubt at all.
> > >
> > > I went with even more direct wording, patch attached. Glad we could
> > > improve this together --- the feedback from everyone was helpful.
> > >
> >
> > One small nit -- I think "PostgreSQL" typically would have SGML markup:
> >
> > <productname>PostgreSQL</productname>
>
> Yes, I thought so too, but if you look at the SGML, it uses PostgreSQL
> with no markup in many places close to this sentence, e.g. "PostgreSQL
> supports both mechanisms concurrently".
Patch applied to master. :-)
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Thanks for letting us know, will be happy to see it appear.
Regards,
Koen De Groote
On Thu, Nov 21, 2024 at 11:15 PM Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Oct 31, 2024 at 06:52:34PM -0400, Bruce Momjian wrote:
> On Fri, Nov 1, 2024 at 09:07:38AM +1100, Peter Smith wrote:
> > On Fri, Nov 1, 2024 at 2:16 AM Bruce Momjian <bruce@momjian.us> wrote:
> > >
> > > On Fri, Oct 25, 2024 at 11:48:16PM +0200, Koen De Groote wrote:
> > > > Number 2 would be perfect, I think. It leaves no doubt at all.
> > >
> > > I went with even more direct wording, patch attached. Glad we could
> > > improve this together --- the feedback from everyone was helpful.
> > >
> >
> > One small nit -- I think "PostgreSQL" typically would have SGML markup:
> >
> > <productname>PostgreSQL</productname>
>
> Yes, I thought so too, but if you look at the SGML, it uses PostgreSQL
> with no markup in many places close to this sentence, e.g. "PostgreSQL
> supports both mechanisms concurrently".
Patch applied to master. :-)
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On Fri, Nov 22, 2024 at 9:15 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Oct 31, 2024 at 06:52:34PM -0400, Bruce Momjian wrote:
> > On Fri, Nov 1, 2024 at 09:07:38AM +1100, Peter Smith wrote:
> > > On Fri, Nov 1, 2024 at 2:16 AM Bruce Momjian <bruce@momjian.us> wrote:
> > > >
> > > > On Fri, Oct 25, 2024 at 11:48:16PM +0200, Koen De Groote wrote:
> > > > > Number 2 would be perfect, I think. It leaves no doubt at all.
> > > >
> > > > I went with even more direct wording, patch attached. Glad we could
> > > > improve this together --- the feedback from everyone was helpful.
> > > >
> > >
> > > One small nit -- I think "PostgreSQL" typically would have SGML markup:
> > >
> > > <productname>PostgreSQL</productname>
> >
> > Yes, I thought so too, but if you look at the SGML, it uses PostgreSQL
> > with no markup in many places close to this sentence, e.g. "PostgreSQL
> > supports both mechanisms concurrently".
>
> Patch applied to master. :-)
>
Sorry for the late nit, but when I read the re-worded text that was
pushed [1], I felt the word "typically" was misplaced.
Maybe it is a personal opinion, but OTOH Chat-GPT has the same opinion as me:
------
Me:
Is "typically" in the correct place below?
When logical replication of a table typically starts, PostgreSQL takes
a snapshot of the table's data on the publisher database and copies it
to the subscriber.
ChatGPT said:
The placement of "typically" in your sentence is grammatically
correct, but it may cause slight confusion about what "typically" is
modifying. It seems to suggest that the timing of when logical
replication starts varies, which might not be the intended meaning.
If you mean that it is the process of taking a snapshot that typically
happens (not the timing of when replication starts), then "typically"
would be better placed after "PostgreSQL." Here's a revised version
for clarity:
"When logical replication of a table starts, PostgreSQL typically
takes a snapshot of the table's data on the publisher database and
copies it to the subscriber."
This makes it clear that "typically" refers to the snapshot-taking
process, not the timing of replication's start.
------
======
[1] https://github.com/postgres/postgres/commit/4c4aaa19a6fed39e0eb0247625331c3df34d8211
Kind Regards,
Peter Smith.
Fujitsu Australia
On Mon, Nov 25, 2024 at 11:55 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Fri, Nov 22, 2024 at 9:15 AM Bruce Momjian <bruce@momjian.us> wrote:
> >
> > On Thu, Oct 31, 2024 at 06:52:34PM -0400, Bruce Momjian wrote:
> > > On Fri, Nov 1, 2024 at 09:07:38AM +1100, Peter Smith wrote:
> > > > On Fri, Nov 1, 2024 at 2:16 AM Bruce Momjian <bruce@momjian.us> wrote:
> > > > >
> > > > > On Fri, Oct 25, 2024 at 11:48:16PM +0200, Koen De Groote wrote:
> > > > > > Number 2 would be perfect, I think. It leaves no doubt at all.
> > > > >
> > > > > I went with even more direct wording, patch attached. Glad we could
> > > > > improve this together --- the feedback from everyone was helpful.
> > > > >
> > > >
> > > > One small nit -- I think "PostgreSQL" typically would have SGML markup:
> > > >
> > > > <productname>PostgreSQL</productname>
> > >
> > > Yes, I thought so too, but if you look at the SGML, it uses PostgreSQL
> > > with no markup in many places close to this sentence, e.g. "PostgreSQL
> > > supports both mechanisms concurrently".
> >
> > Patch applied to master. :-)
> >
>
> Sorry for the late nit, but when I read the re-worded text that was
> pushed [1], I felt the word "typically" was misplaced.
>
> Maybe it is a personal opinion, but OTOH Chat-GPT has the same opinion as me:
>
> ------
> Me:
> Is "typically" in the correct place below?
> When logical replication of a table typically starts, PostgreSQL takes
> a snapshot of the table's data on the publisher database and copies it
> to the subscriber.
>
> ChatGPT said:
> The placement of "typically" in your sentence is grammatically
> correct, but it may cause slight confusion about what "typically" is
> modifying. It seems to suggest that the timing of when logical
> replication starts varies, which might not be the intended meaning.
>
> If you mean that it is the process of taking a snapshot that typically
> happens (not the timing of when replication starts), then "typically"
> would be better placed after "PostgreSQL." Here's a revised version
> for clarity:
>
> "When logical replication of a table starts, PostgreSQL typically
> takes a snapshot of the table's data on the publisher database and
> copies it to the subscriber."
>
> This makes it clear that "typically" refers to the snapshot-taking
> process, not the timing of replication's start.
> ------
>
> ======
> [1] https://github.com/postgres/postgres/commit/4c4aaa19a6fed39e0eb0247625331c3df34d8211
>
Hi Bruce,
There was no reply yet to my 3-week-old post above regarding (what I
thought was) the misplaced word "typically" so I am bumping this
thread, just in case that post was accidentally overlooked.
OTOH, if you disagree and/or don't plan to modify it, please let me
know so I can take this thread off my watch list.
======
Kind Regards,
Peter Smith.
Fujitsu Australia
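
For illustration of the behavior this thread settled on describing (PostgreSQL itself takes the snapshot and copies the existing rows), here is a minimal sketch of the statements involved. It only builds SQL strings and does not connect to a server. The helper functions, object names, and connection string are hypothetical and not from the thread; CREATE PUBLICATION, CREATE SUBSCRIPTION, the copy_data subscription option, and the pg_subscription_rel catalog are standard PostgreSQL.

```python
# Sketch only: builds the SQL one would run; it does not talk to a database.
# All object names and the connection string below are hypothetical.


def publication_sql(pub_name: str, table: str) -> str:
    """Statement to run on the publisher."""
    return f"CREATE PUBLICATION {pub_name} FOR TABLE {table};"


def subscription_sql(sub_name: str, conninfo: str, pub_name: str,
                     copy_data: bool = True) -> str:
    """Statement to run on the subscriber.

    copy_data already defaults to true in PostgreSQL; it is spelled out here
    only to show that the initial copy is a subscription option rather than
    a manual step.
    """
    flag = "true" if copy_data else "false"
    return (
        f"CREATE SUBSCRIPTION {sub_name} "
        f"CONNECTION '{conninfo}' "
        f"PUBLICATION {pub_name} "
        f"WITH (copy_data = {flag});"
    )


# Query to run on the subscriber to watch per-table synchronization state.
SYNC_STATE_SQL = "SELECT srrelid::regclass, srsubstate FROM pg_subscription_rel;"

if __name__ == "__main__":
    print(publication_sql("demo_pub", "public.accounts"))
    print(subscription_sql("demo_sub", "host=publisher.example dbname=app", "demo_pub"))
    print(SYNC_STATE_SQL)
```

With copy_data left at its default, no separate snapshot or restore command is needed on either node, which is the point the reworded sentence is meant to make explicit.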