Native Logical Replication Initial Import Qs

From: Don Seiler
Good afternoon.

I'm looking at having to move a fleet of PG 12 databases from Ubuntu 18.04 to Ubuntu 22.04. This means crossing the dreaded libc collation change, so we're looking at migrating via pg_dump/restore, or via logical replication for the bigger/busier ones. We're also planning to use PG 15 on the destination (Ubuntu 22.04) side to kill two birds with one stone, as much as I'd prefer to have minimal moving parts.

On the logical replication front, the concern is with the initial data import that happens when the subscription is created (by default). I know that you can tell the subscription to not copy data and instead use pg_dump and a replication slot snapshot to achieve this manually. However I'm unable to explain (to myself) why this is better than just having the subscription do it upon creation. Given that I can create pub/sub sets for individual tables for parallel operations, I'm curious what advantages there are in using pg_dump to do this import.
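
For reference, this is roughly the sequence I have in mind for the manual route (host, database, publication, and slot names below are just placeholders):

    # 1. On the source, create the publication:
    psql -h src -d mydb -c "CREATE PUBLICATION mig_pub FOR ALL TABLES;"

    # 2. Open a replication connection, create the slot, and note the
    #    snapshot_name it returns. Leave this session open so the exported
    #    snapshot stays valid while the dump runs:
    psql "host=src dbname=mydb replication=database"
    #    => CREATE_REPLICATION_SLOT mig_slot LOGICAL pgoutput EXPORT_SNAPSHOT;

    # 3. In another terminal, dump the data as of that snapshot and pipe it
    #    into the destination (the snapshot id below is just an example of
    #    what step 2 prints):
    pg_dump -h src -d mydb --data-only --snapshot=00000003-000001F5-1 \
        | psql -h dst -d mydb

    # 4. On the destination, attach a subscription to the existing slot
    #    without re-copying the data:
    psql -h dst -d mydb -c "CREATE SUBSCRIPTION mig_sub
        CONNECTION 'host=src dbname=mydb'
        PUBLICATION mig_pub
        WITH (copy_data = false, create_slot = false, slot_name = 'mig_slot');"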

I had been planning to have pg_dump pipe directly into the destination database via psql. Is this faster than just having the subscription do the import? I'm curious as to why or why not. I know to use only the minimal indexes required on the destination side (i.e. the identity-related indexes) and to omit other indexes and constraints until after the data is loaded, but that is true for either method.
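
Concretely, I was thinking along these lines (hosts and the table name are just examples): load the schema without secondary indexes, copy the data, then add the rest back afterwards.

    # Schema only, without indexes or constraints (those live in post-data):
    pg_dump -h src -d mydb --section=pre-data | psql -h dst -d mydb

    # (create just the PK / replica-identity indexes by hand at this point)

    # Data, piped directly; one pg_dump per table if I want to parallelize:
    pg_dump -h src -d mydb --data-only --table=big_table | psql -h dst -d mydb

    # After the initial copy, add the remaining indexes, constraints, triggers:
    pg_dump -h src -d mydb --section=post-data | psql -h dst -d mydb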

Thanks,
Don.

--
Don Seiler
www.seiler.us

Re: Native Logical Replication Initial Import Qs

From: Jeremy Schneider
On 6/7/23 2:12 PM, Don Seiler wrote:
> On the logical replication front, the concern is with the initial data
> import that happens when the subscription is created (by default). I
> know that you can tell the subscription to not copy data and instead use
> pg_dump and a replication slot snapshot to achieve this manually.
> However I'm unable to explain (to myself) why this is better than just
> having the subscription do it upon creation. Given that I can create
> pub/sub sets for individual tables for parallel operations, I'm curious
> what advantages there are in using pg_dump to do this import.

FWIW, I think the place this feature shines the most is when you can
safely leverage things like storage-level snapshots. Sometimes that
means you can get a copy of a multi-TB database almost instantly if the
storage or filesystem does copy-on-write, for example database lab
(postgres.ai) which uses ZFS.

Another thing I can think of is that while it's true you can create
multiple pub/sub sets, I'm not sure you can reduce the number of sets
later. So if you were concerned about having too many slots doing
decoding on the source, then you might want the flexibility of pg_dump
(or perhaps restoring a backup) to get more parallelism while having
more control over how many slots will be used later.
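
For example (paths and job counts made up), a parallel directory-format
dump and restore can fan the copy out across several workers while still
hanging off a single slot's exported snapshot:

    pg_dump    -h src -d mydb -Fd -j 8 --snapshot=<name_from_the_slot> -f /dumps/mydb
    pg_restore -h dst -d mydb -j 8 /dumps/mydb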

In your case, the whole setup is hopefully temporary, so maybe these
particular concerns aren't as relevant to you.

This is just what comes to mind... probably there's a few more things
I'm not thinking of and hopefully others will chime in.  :)

-Jeremy


-- 
http://about.me/jeremy_schneider




Re: Native Logical Replication Initial Import Qs

From: Don Seiler
On Wed, Jun 7, 2023 at 4:30 PM Jeremy Schneider <schneider@ardentperf.com> wrote:
> On 6/7/23 2:12 PM, Don Seiler wrote:
>> On the logical replication front, the concern is with the initial data
>> import that happens when the subscription is created (by default). I
>> know that you can tell the subscription to not copy data and instead use
>> pg_dump and a replication slot snapshot to achieve this manually.
>> However I'm unable to explain (to myself) why this is better than just
>> having the subscription do it upon creation. Given that I can create
>> pub/sub sets for individual tables for parallel operations, I'm curious
>> what advantages there are in using pg_dump to do this import.
>
> FWIW, I think the place this feature shines the most is when you can
> safely leverage things like storage-level snapshots. Sometimes that
> means you can get a copy of a multi-TB database almost instantly if the
> storage or filesystem does copy-on-write, for example database lab
> (postgres.ai) which uses ZFS.
>
> Another thing I can think of is that while it's true you can create
> multiple pub/sub sets, I'm not sure you can reduce the number of sets
> later. So if you were concerned about having too many slots doing
> decoding on the source, then you might want the flexibility of pg_dump
> (or perhaps restoring a backup) to get more parallelism while having
> more control over how many slots will be used later.
>
> In your case, the whole setup is hopefully temporary, so maybe these
> particular concerns aren't as relevant to you.
>
> This is just what comes to mind... probably there's a few more things
> I'm not thinking of and hopefully others will chime in.  :)

Yes, my setup would be temporary, lasting just as long as needed to complete the migration to a new host.

One other use case is using pg_dump/restore to move older data in time-based partitions (assuming data isn't modified for older partition periods). Those older partitions would just need to be copied once, but wouldn't need any ongoing pub/sub/replication set up for them. So you'd have both methods in use in that case.
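
For example, something like this (partition and publication names are made up), with a one-time copy of the closed partitions and ongoing replication only for the active ones:

    # One-time copy of the older, no-longer-modified partitions:
    pg_dump -h src -d mydb --data-only --table='events_2022_*' | psql -h dst -d mydb

    # Ongoing replication only for the partitions still receiving writes:
    psql -h src -d mydb -c "CREATE PUBLICATION active_pub
        FOR TABLE events_2023_05, events_2023_06;"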

Thanks,
Don.

--
Don Seiler
www.seiler.us