Thread: Logical Replication Custom Column Expression

Logical Replication Custom Column Expression

From
Stavros Koureas
Date:
Hi all,

Working with PostgreSQL Logical Replication is just great! It helps a lot doing real time replication for analytical purposes without using any other 3d party service. Although all these years working as product architect of reporting i have noted a few requirements which are always a challenge and may help enhance logical replication even better.

To the point:
PostgreSQL14 Logical Replication allows replication of a table to another table that exists in another database or even in another host. It also allows multiple upstream tables using the same structure to downstream into a single table.
CREATE PUBLICATION pb_test FOR TABLE test

PostgreSQL15 Logical Replication allows even better replication options, like selecting subsets of the columns from publisher tables. It also supports plenty of options like disable_on_error etc.
CREATE PUBLICATION pb_test FOR TABLE test ("id", "name")

What does not support is the option for defining custom column expressions, as keys or values, into the upstream (publication). This will give more flexibility into making replication from multiple upstreams into less downstreams adding more logic. For instance, in a project for analytical purposes there is the need to consolidate data from multiple databases into one and at the same time keep the origin of each replicated data identified by a tenanant_id column. In this case we also need the ability to define the new column as an additional key which will participate into the destination table.

Tenant 1 table
id serial pk
description varchar

Tenant 2 table
id integer pk
description varchar

Group table
tenant integer pk
id integer pk
description varchar

Possible syntax to archive that
CREATE PUBLICATION pb_test FOR TABLE test ({value:datatype:iskey:alias} ,"id", "name")

Example
CREATE PUBLICATION pb_test FOR TABLE test ({1:integer:true:tenant} ,"id", "name")

I suppose the column definition should exist in the publication syntax as the publication should know from before the datatype and if is a key before being consumed by a subscriber which may already have the column.

So making an insert or update or delete statement into a tenant 1 database:
INSERT INTO test (id, description) VALUES (5, 'data')
UPDATE test SET description = 'data' WHERE id = 5
DELETE FROM test WHERE id = 5
Will be reflected into subscriber as the following
INSERT INTO test (tenant, id, description) VALUES (1, 5, 'data')
UPDATE test SET description = 'data' WHERE tenant=1 AND id = 5
DELETE FROM test WHERE tenant=1 AND id = 5

For more clarifications please reach me at koureasstavros@gmail.com
Thanks!



Re: Logical Replication Custom Column Expression

From
Ashutosh Bapat
Date:
On Sat, Nov 19, 2022 at 6:47 PM Stavros Koureas
<koureasstavros@gmail.com> wrote:
>
> Hi all,
>
> Working with PostgreSQL Logical Replication is just great! It helps a lot doing real time replication for analytical
purposeswithout using any other 3d party service. Although all these years working as product architect of reporting i
havenoted a few requirements which are always a challenge and may help enhance logical replication even better. 
>
> To the point:
> PostgreSQL14 Logical Replication allows replication of a table to another table that exists in another database or
evenin another host. It also allows multiple upstream tables using the same structure to downstream into a single
table.
> CREATE PUBLICATION pb_test FOR TABLE test
>
> PostgreSQL15 Logical Replication allows even better replication options, like selecting subsets of the columns from
publishertables. It also supports plenty of options like disable_on_error etc. 
> CREATE PUBLICATION pb_test FOR TABLE test ("id", "name")
>
> What does not support is the option for defining custom column expressions, as keys or values, into the upstream
(publication).This will give more flexibility into making replication from multiple upstreams into less downstreams
addingmore logic. For instance, in a project for analytical purposes there is the need to consolidate data from
multipledatabases into one and at the same time keep the origin of each replicated data identified by a tenanant_id
column.In this case we also need the ability to define the new column as an additional key which will participate into
thedestination table. 
>
> Tenant 1 table
> id serial pk
> description varchar
>
> Tenant 2 table
> id integer pk
> description varchar
>
> Group table
> tenant integer pk
> id integer pk
> description varchar
>
> Possible syntax to archive that
> CREATE PUBLICATION pb_test FOR TABLE test ({value:datatype:iskey:alias} ,"id", "name")
>
> Example
> CREATE PUBLICATION pb_test FOR TABLE test ({1:integer:true:tenant} ,"id", "name")

I think that's a valid usecase.

This looks more like a subscription option to me. In multi-subscriber
multi-publisher scenarios, on one subscriber a given upstream may be
tenant 1 but on some other it could be 2. But I don't think we allow
specifying subscription options for a single table. AFAIU, the origin
ids are available as part of the commit record which contained this
change; that's how conflict resolution is supposed to know it. So
somehow the subscriber will need to fetch those from there and set the
tenant.

--
Best Wishes,
Ashutosh Bapat



Re: Logical Replication Custom Column Expression

From
Amit Kapila
Date:
On Mon, Nov 21, 2022 at 5:05 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Sat, Nov 19, 2022 at 6:47 PM Stavros Koureas
> <koureasstavros@gmail.com> wrote:
> >
> > What does not support is the option for defining custom column expressions, as keys or values, into the upstream
(publication).This will give more flexibility into making replication from multiple upstreams into less downstreams
addingmore logic. For instance, in a project for analytical purposes there is the need to consolidate data from
multipledatabases into one and at the same time keep the origin of each replicated data identified by a tenanant_id
column.In this case we also need the ability to define the new column as an additional key which will participate into
thedestination table. 
> >
> > Tenant 1 table
> > id serial pk
> > description varchar
> >
> > Tenant 2 table
> > id integer pk
> > description varchar
> >
> > Group table
> > tenant integer pk
> > id integer pk
> > description varchar
> >
> > Possible syntax to archive that
> > CREATE PUBLICATION pb_test FOR TABLE test ({value:datatype:iskey:alias} ,"id", "name")
> >
> > Example
> > CREATE PUBLICATION pb_test FOR TABLE test ({1:integer:true:tenant} ,"id", "name")
>
> I think that's a valid usecase.
>
> This looks more like a subscription option to me. In multi-subscriber
> multi-publisher scenarios, on one subscriber a given upstream may be
> tenant 1 but on some other it could be 2. But I don't think we allow
> specifying subscription options for a single table. AFAIU, the origin
> ids are available as part of the commit record which contained this
> change; that's how conflict resolution is supposed to know it. So
> somehow the subscriber will need to fetch those from there and set the
> tenant.
>

Yeah, to me also it appears that we can handle it on the subscriber
side. We have the provision of sending origin information in proto.c.
But note that by default publishers won't have any origin associated
with change unless someone has defined it. I think this work needs
more thought but sounds to be an interesting feature.

--
With Regards,
Amit Kapila.



Re: Logical Replication Custom Column Expression

From
Stavros Koureas
Date:
Sure, this can be implemented as a subscription option, and it will cover this use case scenario as each subscriber points only to one database.
I also have some more analytical/reporting use-cases which need additions in logical-replication, I am not sure if I need to open different discussions for each one, all ideas are for publication/subscription.

Στις Τρί 22 Νοε 2022 στις 2:22 μ.μ., ο/η Amit Kapila <amit.kapila16@gmail.com> έγραψε:
On Mon, Nov 21, 2022 at 5:05 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Sat, Nov 19, 2022 at 6:47 PM Stavros Koureas
> <koureasstavros@gmail.com> wrote:
> >
> > What does not support is the option for defining custom column expressions, as keys or values, into the upstream (publication). This will give more flexibility into making replication from multiple upstreams into less downstreams adding more logic. For instance, in a project for analytical purposes there is the need to consolidate data from multiple databases into one and at the same time keep the origin of each replicated data identified by a tenanant_id column. In this case we also need the ability to define the new column as an additional key which will participate into the destination table.
> >
> > Tenant 1 table
> > id serial pk
> > description varchar
> >
> > Tenant 2 table
> > id integer pk
> > description varchar
> >
> > Group table
> > tenant integer pk
> > id integer pk
> > description varchar
> >
> > Possible syntax to archive that
> > CREATE PUBLICATION pb_test FOR TABLE test ({value:datatype:iskey:alias} ,"id", "name")
> >
> > Example
> > CREATE PUBLICATION pb_test FOR TABLE test ({1:integer:true:tenant} ,"id", "name")
>
> I think that's a valid usecase.
>
> This looks more like a subscription option to me. In multi-subscriber
> multi-publisher scenarios, on one subscriber a given upstream may be
> tenant 1 but on some other it could be 2. But I don't think we allow
> specifying subscription options for a single table. AFAIU, the origin
> ids are available as part of the commit record which contained this
> change; that's how conflict resolution is supposed to know it. So
> somehow the subscriber will need to fetch those from there and set the
> tenant.
>

Yeah, to me also it appears that we can handle it on the subscriber
side. We have the provision of sending origin information in proto.c.
But note that by default publishers won't have any origin associated
with change unless someone has defined it. I think this work needs
more thought but sounds to be an interesting feature.

--
With Regards,
Amit Kapila.

Re: Logical Replication Custom Column Expression

From
Stavros Koureas
Date:
Reading more carefully what you described, I think you are interested in getting something you call origin from publishers, probably some metadata from the publications.

This identifier in those metadata maybe does not have business value on the reporting side. The idea is to use a value which has specific meaning to the user at the end.

For example assigning 1 for tenant 1, 2 for tenant 2 and so one, at the end based on a dimension table which holds this mapping the user would be able to filter the data. So programmatically the user can set the id value of the column plus creating the mapping table from an application let’s say and be able to distinguish the data.

In addition this column should have the ability to be part of the primary key on the subscription table in order to not conflict with lines from other tenants having the same keys.


22 Νοε 2022, 14:52, ο χρήστης «Stavros Koureas <koureasstavros@gmail.com>» έγραψε:


Sure, this can be implemented as a subscription option, and it will cover this use case scenario as each subscriber points only to one database.
I also have some more analytical/reporting use-cases which need additions in logical-replication, I am not sure if I need to open different discussions for each one, all ideas are for publication/subscription.

Στις Τρί 22 Νοε 2022 στις 2:22 μ.μ., ο/η Amit Kapila <amit.kapila16@gmail.com> έγραψε:
On Mon, Nov 21, 2022 at 5:05 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Sat, Nov 19, 2022 at 6:47 PM Stavros Koureas
> <koureasstavros@gmail.com> wrote:
> >
> > What does not support is the option for defining custom column expressions, as keys or values, into the upstream (publication). This will give more flexibility into making replication from multiple upstreams into less downstreams adding more logic. For instance, in a project for analytical purposes there is the need to consolidate data from multiple databases into one and at the same time keep the origin of each replicated data identified by a tenanant_id column. In this case we also need the ability to define the new column as an additional key which will participate into the destination table.
> >
> > Tenant 1 table
> > id serial pk
> > description varchar
> >
> > Tenant 2 table
> > id integer pk
> > description varchar
> >
> > Group table
> > tenant integer pk
> > id integer pk
> > description varchar
> >
> > Possible syntax to archive that
> > CREATE PUBLICATION pb_test FOR TABLE test ({value:datatype:iskey:alias} ,"id", "name")
> >
> > Example
> > CREATE PUBLICATION pb_test FOR TABLE test ({1:integer:true:tenant} ,"id", "name")
>
> I think that's a valid usecase.
>
> This looks more like a subscription option to me. In multi-subscriber
> multi-publisher scenarios, on one subscriber a given upstream may be
> tenant 1 but on some other it could be 2. But I don't think we allow
> specifying subscription options for a single table. AFAIU, the origin
> ids are available as part of the commit record which contained this
> change; that's how conflict resolution is supposed to know it. So
> somehow the subscriber will need to fetch those from there and set the
> tenant.
>

Yeah, to me also it appears that we can handle it on the subscriber
side. We have the provision of sending origin information in proto.c.
But note that by default publishers won't have any origin associated
with change unless someone has defined it. I think this work needs
more thought but sounds to be an interesting feature.

--
With Regards,
Amit Kapila.

Re: Logical Replication Custom Column Expression

From
Peter Smith
Date:
On Wed, Nov 23, 2022 at 7:38 AM Stavros Koureas
<koureasstavros@gmail.com> wrote:
>
> Reading more carefully what you described, I think you are interested in getting something you call origin from
publishers,probably some metadata from the publications. 
>
> This identifier in those metadata maybe does not have business value on the reporting side. The idea is to use a
valuewhich has specific meaning to the user at the end. 
>
> For example assigning 1 for tenant 1, 2 for tenant 2 and so one, at the end based on a dimension table which holds
thismapping the user would be able to filter the data. So programmatically the user can set the id value of the column
pluscreating the mapping table from an application let’s say and be able to distinguish the data. 
>
> In addition this column should have the ability to be part of the primary key on the subscription table in order to
notconflict with lines from other tenants having the same keys. 
>
>

I was wondering if a simpler syntax solution might also work here.

Imagine another SUBSCRIPTION parameter that indicates to write the
*name* of the subscription to some pre-defined table column:
e.g. CREATE SUBSCRIPTION subname FOR PUBLICATION pub_tenant_1
CONNECTION '...' WITH (subscription_column);

Logical Replication already allows the subscriber table to have extra
columns, so you just need to manually create the extra 'subscription'
column up-front.

Then...

~~

On Publisher:

test_pub=# CREATE TABLE tab(id int primary key, description varchar);
CREATE TABLE

test_pub=# INSERT INTO tab VALUES (1,'one'),(2,'two'),(3,'three');
INSERT 0 3

test_pub=# CREATE PUBLICATION tenant1 FOR ALL TABLES;
CREATE PUBLICATION

~~

On Subscriber:

test_sub=# CREATE TABLE tab(id int, description varchar, subscription varchar);
CREATE TABLE

test_sub=# CREATE SUBSCRIPTION sub_tenant1 CONNECTION 'host=localhost
dbname=test_pub' PUBLICATION tenant1 WITH (subscription_column);
CREATE SUBSCRIPTION

test_sub=# SELECT * FROM tab;
 id | description | subscription
----+-------------+--------------
  1 | one         | sub_tenant1
  2 | two         | sub_tenant1
  3 | three       | sub_tenant1
(3 rows)

~~

Subscriptions to different tenants would be named differently.

And using other SQL you can map/filter those names however your
application wants.

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: Logical Replication Custom Column Expression

From
Amit Kapila
Date:
On Tue, Nov 22, 2022 at 6:22 PM Stavros Koureas
<koureasstavros@gmail.com> wrote:
>
> Sure, this can be implemented as a subscription option, and it will cover this use case scenario as each subscriber
pointsonly to one database.
 
> I also have some more analytical/reporting use-cases which need additions in logical-replication, I am not sure if I
needto open different discussions for each one, all ideas are for publication/subscription.
 
>

I think to some extent it depends on how unique each idea is but
initially you may want to post here and then we can spin off different
threads for a discussion if required. Are you interested in working on
one or more of those ideas to make them reality or do you want others
to pick up based on their interest?

-- 
With Regards,
Amit Kapila.



Re: Logical Replication Custom Column Expression

From
Amit Kapila
Date:
On Wed, Nov 23, 2022 at 1:40 AM Stavros Koureas
<koureasstavros@gmail.com> wrote:
>
> Reading more carefully what you described, I think you are interested in getting something you call origin from
publishers,probably some metadata from the publications. 
>
> This identifier in those metadata maybe does not have business value on the reporting side. The idea is to use a
valuewhich has specific meaning to the user at the end. 
>
> For example assigning 1 for tenant 1, 2 for tenant 2 and so one, at the end based on a dimension table which holds
thismapping the user would be able to filter the data. So programmatically the user can set the id value of the column
pluscreating the mapping table from an application let’s say and be able to distinguish the data. 
>

In your example, are different tenants represent different publisher
nodes? If so, why can't we have a predefined column and value for the
required tables on each publisher rather than logical replication
generate that value while replicating data?

--
With Regards,
Amit Kapila.



Re: Logical Replication Custom Column Expression

From
Stavros Koureas
Date:
It's easy to answer this question.

Imagine that in a software company who sells the product and also offers reporting solutions, the ERP tables will not have this additional column to all the tables.
Now the reporting department comes and needs to consolidate all that data from different databases (publishers) and create one multitenant database to have all the data.
So in an ERP like NAV or anything else you cannot suggest change all the code to all of the tables plus all functions to add one additional column to this table, even that was possible then you cannot work with integers but you need to work with GUIDs as this column should be predefined to each ERP. Then joining with GUID in the second phase for reporting definitely will slow down the performance.

In summary:
  1. Cannot touch the underlying source (important)
  2. GUID identifier column will slow down the reporting performance

Στις Τετ 23 Νοε 2022 στις 5:19 π.μ., ο/η Amit Kapila <amit.kapila16@gmail.com> έγραψε:
On Wed, Nov 23, 2022 at 1:40 AM Stavros Koureas
<koureasstavros@gmail.com> wrote:
>
> Reading more carefully what you described, I think you are interested in getting something you call origin from publishers, probably some metadata from the publications.
>
> This identifier in those metadata maybe does not have business value on the reporting side. The idea is to use a value which has specific meaning to the user at the end.
>
> For example assigning 1 for tenant 1, 2 for tenant 2 and so one, at the end based on a dimension table which holds this mapping the user would be able to filter the data. So programmatically the user can set the id value of the column plus creating the mapping table from an application let’s say and be able to distinguish the data.
>

In your example, are different tenants represent different publisher
nodes? If so, why can't we have a predefined column and value for the
required tables on each publisher rather than logical replication
generate that value while replicating data?

--
With Regards,
Amit Kapila.

Re: Logical Replication Custom Column Expression

From
Stavros Koureas
Date:
Just one correction for the subscriber
On Subscriber:

test_sub=# CREATE TABLE tab(id int pkey, description varchar, subscription varchar pkey);
CREATE TABLE

The subscription table should have the same primary key columns as the publisher plus one more.
We need to make sure that on update only the same origin data is being updated.

Στις Τετ 23 Νοε 2022 στις 1:24 π.μ., ο/η Peter Smith <smithpb2250@gmail.com> έγραψε:
On Wed, Nov 23, 2022 at 7:38 AM Stavros Koureas
<koureasstavros@gmail.com> wrote:
>
> Reading more carefully what you described, I think you are interested in getting something you call origin from publishers, probably some metadata from the publications.
>
> This identifier in those metadata maybe does not have business value on the reporting side. The idea is to use a value which has specific meaning to the user at the end.
>
> For example assigning 1 for tenant 1, 2 for tenant 2 and so one, at the end based on a dimension table which holds this mapping the user would be able to filter the data. So programmatically the user can set the id value of the column plus creating the mapping table from an application let’s say and be able to distinguish the data.
>
> In addition this column should have the ability to be part of the primary key on the subscription table in order to not conflict with lines from other tenants having the same keys.
>
>

I was wondering if a simpler syntax solution might also work here.

Imagine another SUBSCRIPTION parameter that indicates to write the
*name* of the subscription to some pre-defined table column:
e.g. CREATE SUBSCRIPTION subname FOR PUBLICATION pub_tenant_1
CONNECTION '...' WITH (subscription_column);

Logical Replication already allows the subscriber table to have extra
columns, so you just need to manually create the extra 'subscription'
column up-front.

Then...

~~

On Publisher:

test_pub=# CREATE TABLE tab(id int primary key, description varchar);
CREATE TABLE

test_pub=# INSERT INTO tab VALUES (1,'one'),(2,'two'),(3,'three');
INSERT 0 3

test_pub=# CREATE PUBLICATION tenant1 FOR ALL TABLES;
CREATE PUBLICATION

~~

On Subscriber:

test_sub=# CREATE TABLE tab(id int, description varchar, subscription varchar);
CREATE TABLE

test_sub=# CREATE SUBSCRIPTION sub_tenant1 CONNECTION 'host=localhost
dbname=test_pub' PUBLICATION tenant1 WITH (subscription_column);
CREATE SUBSCRIPTION

test_sub=# SELECT * FROM tab;
 id | description | subscription
----+-------------+--------------
  1 | one         | sub_tenant1
  2 | two         | sub_tenant1
  3 | three       | sub_tenant1
(3 rows)

~~

Subscriptions to different tenants would be named differently.

And using other SQL you can map/filter those names however your
application wants.

------
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Logical Replication Custom Column Expression

From
Ashutosh Bapat
Date:
On Wed, Nov 23, 2022 at 4:54 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, Nov 23, 2022 at 7:38 AM Stavros Koureas
> <koureasstavros@gmail.com> wrote:
> >
> > Reading more carefully what you described, I think you are interested in getting something you call origin from
publishers,probably some metadata from the publications. 
> >
> > This identifier in those metadata maybe does not have business value on the reporting side. The idea is to use a
valuewhich has specific meaning to the user at the end. 
> >
> > For example assigning 1 for tenant 1, 2 for tenant 2 and so one, at the end based on a dimension table which holds
thismapping the user would be able to filter the data. So programmatically the user can set the id value of the column
pluscreating the mapping table from an application let’s say and be able to distinguish the data. 
> >
> > In addition this column should have the ability to be part of the primary key on the subscription table in order to
notconflict with lines from other tenants having the same keys. 
> >
> >
>
> I was wondering if a simpler syntax solution might also work here.
>
> Imagine another SUBSCRIPTION parameter that indicates to write the
> *name* of the subscription to some pre-defined table column:
> e.g. CREATE SUBSCRIPTION subname FOR PUBLICATION pub_tenant_1
> CONNECTION '...' WITH (subscription_column);
>
> Logical Replication already allows the subscriber table to have extra
> columns, so you just need to manually create the extra 'subscription'
> column up-front.
>
> Then...
>
> ~~
>
> On Publisher:
>
> test_pub=# CREATE TABLE tab(id int primary key, description varchar);
> CREATE TABLE
>
> test_pub=# INSERT INTO tab VALUES (1,'one'),(2,'two'),(3,'three');
> INSERT 0 3
>
> test_pub=# CREATE PUBLICATION tenant1 FOR ALL TABLES;
> CREATE PUBLICATION
>
> ~~
>
> On Subscriber:
>
> test_sub=# CREATE TABLE tab(id int, description varchar, subscription varchar);
> CREATE TABLE
>
> test_sub=# CREATE SUBSCRIPTION sub_tenant1 CONNECTION 'host=localhost
> dbname=test_pub' PUBLICATION tenant1 WITH (subscription_column);
> CREATE SUBSCRIPTION
>
> test_sub=# SELECT * FROM tab;
>  id | description | subscription
> ----+-------------+--------------
>   1 | one         | sub_tenant1
>   2 | two         | sub_tenant1
>   3 | three       | sub_tenant1
> (3 rows)
>
> ~~
>
Thanks for the example. This is more concrete than just verbal description.

In this example, do all the tables that a subscription subscribes to
need that additional column or somehow the pglogical receiver will
figure out which tables have that column and populate rows
accordingly?

My further fear is that the subscriber will also need to match the
subscription column along with the rest of PK so as not to update rows
from other subscriptions.
--
Best Wishes,
Ashutosh Bapat



Re: Logical Replication Custom Column Expression

From
Stavros Koureas
Date:
Yes, if the property is on the subscription side then it should be applied for all the tables that the connected publication is exposing.
So if the property is enabled you should be sure that this origin column exists to all of the tables that the publication is exposing...

Sure this is the complete idea, that the subscriber should match the PK of origin, <previous_pkey>
As the subscription table will contain same key values from different origins, for example:

For publisher1 database table
id pk integer | value character varying
1                   | testA1
2                   | testA2

For publisher2 database table
id pk integer | value character varying
1                   | testB1
2                   | testB2

For subscriber database table
origin pk character varying | id pk integer | value character varying
publisher1                           | 1                   | testA1
publisher1                           | 2                   | testA2
publisher2                           | 1                   | testB1
publisher2                           | 2                   | testB2

All statements INSERT, UPDATE, DELETE should always include the predicate of the origin.

Στις Παρ 25 Νοε 2022 στις 12:21 μ.μ., ο/η Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> έγραψε:
On Wed, Nov 23, 2022 at 4:54 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, Nov 23, 2022 at 7:38 AM Stavros Koureas
> <koureasstavros@gmail.com> wrote:
> >
> > Reading more carefully what you described, I think you are interested in getting something you call origin from publishers, probably some metadata from the publications.
> >
> > This identifier in those metadata maybe does not have business value on the reporting side. The idea is to use a value which has specific meaning to the user at the end.
> >
> > For example assigning 1 for tenant 1, 2 for tenant 2 and so one, at the end based on a dimension table which holds this mapping the user would be able to filter the data. So programmatically the user can set the id value of the column plus creating the mapping table from an application let’s say and be able to distinguish the data.
> >
> > In addition this column should have the ability to be part of the primary key on the subscription table in order to not conflict with lines from other tenants having the same keys.
> >
> >
>
> I was wondering if a simpler syntax solution might also work here.
>
> Imagine another SUBSCRIPTION parameter that indicates to write the
> *name* of the subscription to some pre-defined table column:
> e.g. CREATE SUBSCRIPTION subname FOR PUBLICATION pub_tenant_1
> CONNECTION '...' WITH (subscription_column);
>
> Logical Replication already allows the subscriber table to have extra
> columns, so you just need to manually create the extra 'subscription'
> column up-front.
>
> Then...
>
> ~~
>
> On Publisher:
>
> test_pub=# CREATE TABLE tab(id int primary key, description varchar);
> CREATE TABLE
>
> test_pub=# INSERT INTO tab VALUES (1,'one'),(2,'two'),(3,'three');
> INSERT 0 3
>
> test_pub=# CREATE PUBLICATION tenant1 FOR ALL TABLES;
> CREATE PUBLICATION
>
> ~~
>
> On Subscriber:
>
> test_sub=# CREATE TABLE tab(id int, description varchar, subscription varchar);
> CREATE TABLE
>
> test_sub=# CREATE SUBSCRIPTION sub_tenant1 CONNECTION 'host=localhost
> dbname=test_pub' PUBLICATION tenant1 WITH (subscription_column);
> CREATE SUBSCRIPTION
>
> test_sub=# SELECT * FROM tab;
>  id | description | subscription
> ----+-------------+--------------
>   1 | one         | sub_tenant1
>   2 | two         | sub_tenant1
>   3 | three       | sub_tenant1
> (3 rows)
>
> ~~
>
Thanks for the example. This is more concrete than just verbal description.

In this example, do all the tables that a subscription subscribes to
need that additional column or somehow the pglogical receiver will
figure out which tables have that column and populate rows
accordingly?

My further fear is that the subscriber will also need to match the
subscription column along with the rest of PK so as not to update rows
from other subscriptions.
--
Best Wishes,
Ashutosh Bapat

Re: Logical Replication Custom Column Expression

From
Peter Smith
Date:
On Fri, Nov 25, 2022 at 9:43 PM Stavros Koureas
<koureasstavros@gmail.com> wrote:
>
> Yes, if the property is on the subscription side then it should be applied for all the tables that the connected
publicationis exposing.
 
> So if the property is enabled you should be sure that this origin column exists to all of the tables that the
publicationis exposing...
 
>
> Sure this is the complete idea, that the subscriber should match the PK of origin, <previous_pkey>
> As the subscription table will contain same key values from different origins, for example:
>
> For publisher1 database table
> id pk integer | value character varying
> 1                   | testA1
> 2                   | testA2
>
> For publisher2 database table
> id pk integer | value character varying
> 1                   | testB1
> 2                   | testB2
>
> For subscriber database table
> origin pk character varying | id pk integer | value character varying
> publisher1                           | 1                   | testA1
> publisher1                           | 2                   | testA2
> publisher2                           | 1                   | testB1
> publisher2                           | 2                   | testB2
>
> All statements INSERT, UPDATE, DELETE should always include the predicate of the origin.
>

This sounds similar to what I had posted [1] although I was saying the
generated column value might be the *subscriber* name, not the origin
publisher name. (where are you getting that value from -- somehow from
the subscriptions' CONNECTION dbname?)

Anyway, regardless of the details, please note -- my idea was really
intended just as a discussion starting point to demonstrate that
required functionality might be achieved using a simpler syntax than
what had been previously suggested. But in practice there may be some
problems with this approach -- e.g. how will the initial tablesync
COPY efficiently assign these subscriber name column values?

------
[1] https://www.postgresql.org/message-id/CAHut%2BPuZowXd7Aa7t0nqjP6afHMwJarngzeMq%2BQP0vE2KKLOgQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.



Re: Logical Replication Custom Column Expression

From
Stavros Koureas
Date:
Sure I understand and neither do I have good knowledge of what else could be influenced by such a change.
If the value of the column is the subscriber name has no benefit to this idea of merging multiple upstreams with same primary keys, later you describe the "connection dbname", yes this could be a possibility.
I do not fully understand that part "how will the initial tablesync COPY efficiently assign these subscriber name column values?"
Why is difficult that during the initial sync put everywhere the same value for all rows of the same origin?

Στις Δευ 28 Νοε 2022 στις 10:16 π.μ., ο/η Peter Smith <smithpb2250@gmail.com> έγραψε:
On Fri, Nov 25, 2022 at 9:43 PM Stavros Koureas
<koureasstavros@gmail.com> wrote:
>
> Yes, if the property is on the subscription side then it should be applied for all the tables that the connected publication is exposing.
> So if the property is enabled you should be sure that this origin column exists to all of the tables that the publication is exposing...
>
> Sure this is the complete idea, that the subscriber should match the PK of origin, <previous_pkey>
> As the subscription table will contain same key values from different origins, for example:
>
> For publisher1 database table
> id pk integer | value character varying
> 1                   | testA1
> 2                   | testA2
>
> For publisher2 database table
> id pk integer | value character varying
> 1                   | testB1
> 2                   | testB2
>
> For subscriber database table
> origin pk character varying | id pk integer | value character varying
> publisher1                           | 1                   | testA1
> publisher1                           | 2                   | testA2
> publisher2                           | 1                   | testB1
> publisher2                           | 2                   | testB2
>
> All statements INSERT, UPDATE, DELETE should always include the predicate of the origin.
>

This sounds similar to what I had posted [1] although I was saying the
generated column value might be the *subscriber* name, not the origin
publisher name. (where are you getting that value from -- somehow from
the subscriptions' CONNECTION dbname?)

Anyway, regardless of the details, please note -- my idea was really
intended just as a discussion starting point to demonstrate that
required functionality might be achieved using a simpler syntax than
what had been previously suggested. But in practice there may be some
problems with this approach -- e.g. how will the initial tablesync
COPY efficiently assign these subscriber name column values?

------
[1] https://www.postgresql.org/message-id/CAHut%2BPuZowXd7Aa7t0nqjP6afHMwJarngzeMq%2BQP0vE2KKLOgQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.

Re: Logical Replication Custom Column Expression

From
Ashutosh Bapat
Date:
On Fri, Nov 25, 2022 at 4:13 PM Stavros Koureas
<koureasstavros@gmail.com> wrote:
>
> Yes, if the property is on the subscription side then it should be applied for all the tables that the connected
publicationis exposing.
 
> So if the property is enabled you should be sure that this origin column exists to all of the tables that the
publicationis exposing...
 
>

That would be too restrictive - not necessarily in your application
but generally. There could be some tables where consolidating rows
with same PK from different publishers into a single row in subscriber
would be desirable. I think we need to enable the property for every
subscriber that intends to add publisher column to the desired and
subscribed tables. But there should be another option per table which
will indicate that receiver should add publisher when INSERTING row to
that table.


> Sure this is the complete idea, that the subscriber should match the PK of origin, <previous_pkey>
> As the subscription table will contain same key values from different origins, for example:
>

And yes, probably you need to change the way you reply to email on
this list. Top-posting is generally avoided. See
https://wiki.postgresql.org/wiki/Mailing_Lists.

-- 
Best Wishes,
Ashutosh Bapat



Re: Logical Replication Custom Column Expression

From
Stavros Koureas
Date:


Στις Τρί 29 Νοε 2022 στις 3:27 μ.μ., ο/η Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> έγραψε:
> That would be too restrictive - not necessarily in your application
> but generally. There could be some tables where consolidating rows
> with same PK from different publishers into a single row in subscriber
> would be desirable. I think we need to enable the property for every
> subscriber that intends to add publisher column to the desired and
> subscribed tables. But there should be another option per table which
> will indicate that receiver should add publisher when INSERTING row to
> that table.

So we are discussing the scope level of this property, if this property will be implemented on subscriber level or on subscriber table.
In that case I am not sure how this will be implemented as currently postgres subscribers can have multiple tables streamed from a single publisher.
In that case we may have an additional syntax on subscriber, for example:

CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost port=5432 user=postgres password=XXXXXX dbname=publisher1' PUBLICATION pub1 with (enabled = false, create_slot = false, slot_name = NONE, tables = {tableA:union, tableB:none, ....});

Something like this?

> And yes, probably you need to change the way you reply to email on
> this list. Top-posting is generally avoided. See
https://wiki.postgresql.org/wiki/Mailing_Lists.

Thanks for bringing this into the discussion :)

Re: Logical Replication Custom Column Expression

From
Ashutosh Bapat
Date:
On Wed, Nov 30, 2022 at 2:09 PM Stavros Koureas
<koureasstavros@gmail.com> wrote:
>
>
>
> Στις Τρί 29 Νοε 2022 στις 3:27 μ.μ., ο/η Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> έγραψε:
> > That would be too restrictive - not necessarily in your application
> > but generally. There could be some tables where consolidating rows
> > with same PK from different publishers into a single row in subscriber
> > would be desirable. I think we need to enable the property for every
> > subscriber that intends to add publisher column to the desired and
> > subscribed tables. But there should be another option per table which
> > will indicate that receiver should add publisher when INSERTING row to
> > that table.
>
> So we are discussing the scope level of this property, if this property will be implemented on subscriber level or on
subscribertable. 
> In that case I am not sure how this will be implemented as currently postgres subscribers can have multiple tables
streamedfrom a single publisher. 
> In that case we may have an additional syntax on subscriber, for example:
>
> CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost port=5432 user=postgres password=XXXXXX dbname=publisher1'
PUBLICATIONpub1 with (enabled = false, create_slot = false, slot_name = NONE, tables = {tableA:union, tableB:none,
....});
>
> Something like this?

Nope, I think we will need to add a table level property through table
options or receiver can infer it by looking at the table columns -
e.g. existence of origin_id column or some such thing.


--
Best Wishes,
Ashutosh Bapat



Re: Logical Replication Custom Column Expression

From
Stavros Koureas
Date:
>> And yes, probably you need to change the way you reply to email on
>> this list. Top-posting is generally avoided. See
>> https://wiki.postgresql.org/wiki/Mailing_Lists.

>Thanks for bringing this into the discussion :)

Thinking these days more about this topic, subscriber name is not a bad idea, although it makes sense to be able to give your own value even on subscriber level, for example an integer.
Having a custom integer value is better as definitely this integer will participate later in all joins beside the tables and for sure joining with an integer it would be quicker rather than joining on a character varying (plus the rest of the columns).
In addition, discussing with other people and also on Stack Overflow/DBAExchange I have found that other people think it is a great enhancement for analytical purposes.

Στις Τετ 30 Νοε 2022 στις 10:39 π.μ., ο/η Stavros Koureas <koureasstavros@gmail.com> έγραψε:


Στις Τρί 29 Νοε 2022 στις 3:27 μ.μ., ο/η Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> έγραψε:
> That would be too restrictive - not necessarily in your application
> but generally. There could be some tables where consolidating rows
> with same PK from different publishers into a single row in subscriber
> would be desirable. I think we need to enable the property for every
> subscriber that intends to add publisher column to the desired and
> subscribed tables. But there should be another option per table which
> will indicate that receiver should add publisher when INSERTING row to
> that table.

So we are discussing the scope level of this property, if this property will be implemented on subscriber level or on subscriber table.
In that case I am not sure how this will be implemented as currently postgres subscribers can have multiple tables streamed from a single publisher.
In that case we may have an additional syntax on subscriber, for example:

CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost port=5432 user=postgres password=XXXXXX dbname=publisher1' PUBLICATION pub1 with (enabled = false, create_slot = false, slot_name = NONE, tables = {tableA:union, tableB:none, ....});

Something like this?

> And yes, probably you need to change the way you reply to email on
> this list. Top-posting is generally avoided. See
https://wiki.postgresql.org/wiki/Mailing_Lists.

Thanks for bringing this into the discussion :)