[HACKERS] Detecting schema changes during logical replication

From: Daniele Varrazzo
Hello,

I'm putting together a replication system based on logical
replication. I would like to send table information only the first
time a table is seen by the 'change_cb' callback, but of course the
schema could change after replication has started. So I wonder: is
there any information in the 'Relation' structure passed to the change
callback that would indicate the table schema may have changed, and
hence that a new schema should be sent to the client?

Thank you very much,

-- Daniele



Re: [HACKERS] Detecting schema changes during logical replication

From: Andres Freund
Hi,

On 2017-05-07 19:27:08 +0100, Daniele Varrazzo wrote:
> I'm putting together a replication system based on logical
> replication.

Interesting.  If you could very briefly recap what it's about... ;)


> I would like to send table information only the first
> time a table is seen by the 'change_cb' callback, but of course the
> schema could change after replication has started. So I wonder: is
> there any information in the 'Relation' structure passed to the change
> callback that would indicate the table schema may have changed, and
> hence that a new schema should be sent to the client?

The best way I can think of - which is also what is implemented in the
in-core replication framework - is to have a small cache on top of the
relcache.  That cache is kept coherent using
CacheRegisterRelcacheCallback().  Then whenever there's a change you
look the relation up in that cache, and send the schema information if
it's been invalidated since you last sent something.  That's also how
the new stuff in v10 essentially works:
src/backend/replication/pgoutput/pgoutput.c

pgoutput_change() looks up its own metadata using get_rel_sync_entry(),
which then checks relentry->schema_sent.  Invalidation unsets
schema_sent in rel_sync_cache_relation_cb().
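
If you want to do the same in your own plugin, a minimal sketch of the
pattern could look like this (MySyncEntry, my_rel_sync_cache etc. are
made-up names for illustration, not existing API; pgoutput.c is the
authoritative version):

#include "postgres.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
#include "utils/rel.h"

/* one cache entry per relation we've decoded changes for */
typedef struct MySyncEntry
{
    Oid     relid;          /* hash key: the relation's OID */
    bool    schema_sent;    /* schema sent since last invalidation? */
} MySyncEntry;

static HTAB *my_rel_sync_cache = NULL;

/*
 * Relcache invalidation callback: flag entries so the schema is
 * re-sent the next time a change for that relation is decoded.
 * relid is InvalidOid when everything is being invalidated.
 */
static void
my_rel_sync_cache_cb(Datum arg, Oid relid)
{
    MySyncEntry *entry;

    if (my_rel_sync_cache == NULL)
        return;

    if (OidIsValid(relid))
    {
        entry = hash_search(my_rel_sync_cache, &relid, HASH_FIND, NULL);
        if (entry != NULL)
            entry->schema_sent = false;
    }
    else
    {
        HASH_SEQ_STATUS status;

        hash_seq_init(&status, my_rel_sync_cache);
        while ((entry = hash_seq_search(&status)) != NULL)
            entry->schema_sent = false;
    }
}

/* call once from the output plugin's startup callback */
static void
my_init_rel_sync_cache(void)
{
    HASHCTL ctl;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(MySyncEntry);
    my_rel_sync_cache = hash_create("my rel sync cache", 128, &ctl,
                                    HASH_ELEM | HASH_BLOBS);

    CacheRegisterRelcacheCallback(my_rel_sync_cache_cb, (Datum) 0);
}

/* in change_cb: send the schema only if needed for this relation */
static void
my_maybe_send_schema(Relation relation)
{
    Oid     relid = RelationGetRelid(relation);
    bool    found;
    MySyncEntry *entry;

    entry = hash_search(my_rel_sync_cache, &relid, HASH_ENTER, &found);
    if (!found)
        entry->schema_sent = false;

    if (!entry->schema_sent)
    {
        /* ... emit column names/types from RelationGetDescr(relation) ... */
        entry->schema_sent = true;
    }
}

Note that the invalidation callback only marks entries; the actual
re-sending happens lazily from change_cb, which keeps the invalidation
path cheap.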

Greetings,

Andres Freund



Re: [HACKERS] Detecting schema changes during logical replication

From: Daniele Varrazzo
On Sun, May 7, 2017 at 8:04 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-05-07 19:27:08 +0100, Daniele Varrazzo wrote:
>> I'm putting together a replication system based on logical
>> replication.
>
> Interesting.  If you could very briefly recap what it's about... ;)

I need to replicate some tables from a central database into the
database that runs a secondary system. For a similar use case we
have used Londiste in the past, which has served us well, but its
usage has not been problem-free. Logical decoding seems much less
invasive on the source database than a trigger-based replication
solution, and has fewer moving parts to care about and maintain.

For the moment I'm hacking on a fork of Euler's project for WAL
decoding into JSON (https://github.com/dvarrazzo/wal2json), mostly
adding configurability, so that we can replicate only the tables we
need, skip certain fields, etc. I'm also looking at minimising the
amount of information produced: sending the column names and types
over and over for every record seems a waste, hence my
question.
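
(In case it helps anybody doing something similar: the options show up
in the startup callback as a list of DefElem in
ctx->output_plugin_options, so the parsing is roughly the following
sketch. "filter-tables" here is just a hypothetical option name I'm
using for illustration, not wal2json's actual interface.)

#include "postgres.h"
#include "nodes/parsenodes.h"
#include "nodes/pg_list.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

/* sketch: parse plugin options passed by the client, e.g.
 * pg_recvlogical ... -o filter-tables=big_schema.audit_log */
static void
my_startup_cb(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
              bool is_init)
{
    ListCell   *option;

    opt->output_type = OUTPUT_PLUGIN_TEXTUAL_OUTPUT;

    foreach(option, ctx->output_plugin_options)
    {
        DefElem *elem = (DefElem *) lfirst(option);

        if (strcmp(elem->defname, "filter-tables") == 0)
        {
            if (elem->arg == NULL)
                ereport(ERROR,
                        (errmsg("option \"%s\" requires a value",
                                elem->defname)));
            /* ... remember strVal(elem->arg) for later filtering ... */
        }
        else
        {
            ereport(ERROR,
                    (errmsg("option \"%s\" is unknown", elem->defname)));
        }
    }
}

The client ends up filling that list via pg_recvlogical's -o flag or
the options argument of pg_logical_slot_get_changes().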

>> I would like to send table information only the first
>> time a table is seen by the 'change_cb' callback, but of course the
>> schema could change after replication has started. So I wonder: is
>> there any information in the 'Relation' structure passed to the change
>> callback that would indicate the table schema may have changed, and
>> hence that a new schema should be sent to the client?
>
> The best way I can think of - which is also what is implemented in the
> in-core replication framework - is to have a small cache on top of the
> relcache.  That cache is kept coherent using
> CacheRegisterRelcacheCallback().  Then whenever there's a change you
> look the relation up in that cache, and send the schema information if
> it's been invalidated since you last sent something.  That's also how
> the new stuff in v10 essentially works:
> src/backend/replication/pgoutput/pgoutput.c
>
> pgoutput_change() looks up its own metadata using get_rel_sync_entry(),
> which then checks relentry->schema_sent.  Invalidation unsets
> schema_sent in rel_sync_cache_relation_cb().

Thank you very much, this seems to be exactly what I need. I'll try
hacking around this callback.

-- Daniele



Re: [HACKERS] Detecting schema changes during logical replication

From: Craig Ringer


On 8 May 2017 05:56, "Daniele Varrazzo" <daniele.varrazzo@gmail.com> wrote:
On Sun, May 7, 2017 at 8:04 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-05-07 19:27:08 +0100, Daniele Varrazzo wrote:
>> I'm putting together a replication system based on logical
>> replication.
>
> Interesting.  If you could very briefly recap what it's about... ;)

I need to replicate some tables from a central database into the
database that runs a secondary system. For a similar use case we
have used Londiste in the past, which has served us well, but its
usage has not been problem-free. Logical decoding seems much less
invasive on the source database than a trigger-based replication
solution, and has fewer moving parts to care about and maintain.

For the moment I'm hacking on a fork of Euler's project for WAL
decoding into JSON (https://github.com/dvarrazzo/wal2json), mostly
adding configurability, so that we can replicate only the tables we
need, skip certain fields, etc. I'm also looking at minimising the
amount of information produced: sending the column names and types
over and over for every record seems a waste, hence my
question.

Sounds like you're reimplementing pglogical (http://2ndquadrant.com/pglogical) on top of a JSON protocol.

Pglogical is open source and hackable to meet your needs. We're also happy to accept patches, with appropriate design discussion and code review to make sure they aid (or at least don't hinder) other users and won't add undue maintenance burden.

Rather than repeat the same work, maybe it's worth picking it up as a starting point. It's the extension-based progenitor of the in-core logical rep code, but portable from 9.4 to 9.6 (and soon 10) with a bunch of features that didn't make it into Pg's version yet.

I have no reason to object to your doing it yourself, and you're welcome to use pglogical as a reference for how to do things (see the license). It just seems like a waste.

Re: [HACKERS] Detecting schema changes during logical replication

From: Daniele Varrazzo
On Mon, May 8, 2017 at 3:48 AM, Craig Ringer
<craig.ringer@2ndquadrant.com> wrote:

> Sounds like you're reimplementing pglogical
> (http://2ndquadrant.com/pglogical) on top of a json protocol.

The fact that the protocol is JSON is more of a detail, but it's a
good starting point, as it's human-readable.

> [...]
> I have no reason to object to your doing it yourself, and you're welcome to
> use pglogical as a reference for how to do things (see the license). It just
> seems like a waste.

Logical replication, for the first time, offers a way to implement a
replication solution that is not several layers away from the
database. Or rather: for the first time it's something I understand.

Using logical replication we can perform the manipulation of the data
that we need (tables not necessarily in the same places, schemas not
necessarily matching). In particular, one not-really-minor annoyance
of the current system is that adding a column on the master regularly
breaks the replica, and pglogical doesn't solve this problem. We
currently use certain features of Londiste (tables living in
different schemas, slightly different data types with the same
textual representation, extra columns on the slave...) that go
against pglogical's requirements, but which can be implemented with
no problem in a customised replication solution built on top of
streaming replication.

All in all, I'm more thrilled by the idea of having a database throw
a stream of changes at me in a format I can reinterpret, allowing me
to write into a target database with a high degree of configurability
in the middle (i.e. I just have a Python script receiving the data,
munging it and then performing queries), than by a complete but
schema-rigid replication solution.


-- Daniele